1 Introduction

Digital video is an ordered sequence of images captured by a digital camera, typically accompanied by audio and other data. People have become heavily dependent on multimedia content in day-to-day life, particularly on digital video. The surveillance camera, a staple of contemporary technology used at offices, homes, and various public places, has gained enormous popularity as an efficient safety measure. In most nations, video footage is treated as evidence against crimes. At the same time, easy access to advanced editing software and the latest smartphones makes it easy for anyone to manipulate and falsify digital video. An intentional modification made to a digital video for falsification is called a video forgery, and it may be hard for human beings to judge the authenticity of such videos with the naked eye. Hence, it becomes essential to analyze whether video content is original or modified before it is used as a piece of evidence in court. Digital forgery detection techniques are therefore needed to inspect the integrity and authenticity of digital videos.

Digital video forgery detection is the process of validating whether the contents of a digital video have undergone any intentional manipulation. The techniques to detect forgery in a digital video can be broadly categorized as active and passive. Active techniques use pre-embedded information such as a watermark or signature to check the integrity and authenticity of a video. In contrast, passive techniques work in the absence of pre-embedded data. In most cases, however, videos do not contain pre-embedded information such as a watermark or signature, which makes it hard to detect manipulation with an active approach. Consequently, in recent years passive video forgery detection techniques have received considerable attention in the scientific community, as depicted in Fig. 1, which shows the number of publications on passive video forgery detection over the last 15 years (i.e., from 2006 to 2020). The papers were selected by firing queries with keywords such as "video forgery detection" and "video forgery" on standard digital libraries such as IEEE, Springer, and Elsevier.

Fig. 1

Publications over the last 15 years on video forgery detection using passive techniques

Some surveys on video forgery detection have already been published: Rocha et al. [89], Wahab et al. [127], Pandey et al. [84], Sitara et al. [106], Mizher et al. [75], Singh et al. [104], and Johnston et al. [47]. The following observations can be made about these surveys: 1) a critical explanation of the topic is missing; 2) a systematic, easy-to-understand, and comprehensive survey of passive video forgery detection techniques has not yet been done; 3) deepfake detection in video is not discussed in any of the survey papers; 4) the performance parameters used for testing and validating the techniques are not described thoroughly; 5) standard benchmark datasets for video forgery are not discussed; and 6) only limited research paths for future directions are provided.

Our survey differs from those mentioned above in that a systematic method is followed to perform an exhaustive study on video forgery detection, and it delivers an in-depth review of passive video forgery detection techniques categorized by the feature or method used. The major highlights of this study are as follows:

  • Basic terminology related to video forgery detection is introduced.

  • A systematic and detailed survey of passive video forgery detection techniques is presented.

  • Anti-forensic strategies and deepfake detection in video are also discussed.

  • The standard benchmark video forgery datasets are reviewed.

  • The generalized architecture of passive video forgery detection is presented.

  • Open challenges and future research directions in passive video forgery detection are also discussed.

The paper is organized as follows. Section 2 covers the basic terminology required for understanding video forgery detection. Section 3 gives a detailed survey of existing passive video forgery detection techniques. Sections 4 and 5 address anti-forensic strategies and deepfake detection in videos, respectively. Section 6 focuses on a detailed analysis of existing benchmark video forgery datasets. Section 7 presents a generalized architecture for passive video forgery detection. Section 8 presents the discussion and new challenges in passive video forgery detection. Section 9 covers the conclusions.

2 Basic terminology in digital video forgery

This section presents the basic terminology needed to understand this survey.

2.1 Types of video forgeries

Several types of forgery can be present in a digital video; they are commonly divided into two subcategories: intra-frame forgery and inter-frame forgery. These forgeries can be performed using video editing tools such as Adobe Premiere Pro, Adobe Photoshop, etc. Figure 2 shows the types of digital video forgeries.

Fig. 2

Types of digital video forgeries

2.1.1 Intra-frame forgery

In this type of forgery, the original contents of particular frames are manipulated. It is also called spatial-based video tampering. Some intra-frame forgery types are as follows.

  a)

    Copy-Move Forgery: This is one of the most common types of forgery performed on digital images and videos [62]. In this type of forgery, an attacker can insert or delete an object from a video scene. It can also be used to create duplicate objects in the video by copying a portion of a video frame and pasting it to another location, either in the same or a different frame of the video. Therefore, it is also called copy-paste forgery or region manipulation forgery. The operations performed in copy-move forgery can also be used for hiding a desired area in the frame [18, 30]. Figure 3 shows an example of copy-move forgery in a video: in part (a), a frame region (a flower) is copied and pasted to another place in the same video frame (i.e., a new object is created in the video frame), and in part (b), a keyboard is removed from the actual video frame, highlighted by a yellow curve. Copy-move forgery is also called inpainting forgery when it is used to remove certain objects from digital images or videos and fill that area with matching background content; a minimal block-matching sketch for detecting such duplicated regions is given after the figure captions below. Inpainting can be done in one of two ways:

    • Temporal Copy and Paste Inpainting: In Temporal Copy and Paste (TCP) inpainting, the forged area is filled using similar pixels from adjacent regions of the same video frame or with the most coherent blocks from the frames adjacent to the affected frames.

    • Exemplar-Based Texture Synthesis Inpainting: In Exemplar-Based Texture Synthesis (ETS) inpainting, the missing areas of a video frame are filled with sample textures.

  b)

    Splicing: In this type of forgery, a new video frame is formed by copying a piece of one video frame and pasting it into another. Figure 4 shows an example of splicing forgery in a video, in which a new composite video frame is formed by merging the objects of two video frames.

  c)

    Upscale Crop: The outer part of the video frame is cropped out in an upscale crop to remove some region or object [102]. Figure 5 shows an example of upscale crop forgery, wherein (a) shows the original video frame and (b) shows the frame after performing the upscale crop forgery (a walking lady is removed).

Fig. 3

Copy-move forgery in video a A frame region (a flower) is copied and pasted to another place b A keyboard is removed from the actual video frame, marked by a yellow curve (also called video inpainting forgery)

Fig. 4

Splicing forgery in video (two different frames are merged into a single frame)

Fig. 5

Upscale crop forgery in video a) Original video frame b) Frame after upscale crop forgery (a walking lady is removed)
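To make the block-level copy-move idea concrete, the following minimal Python sketch (our illustration, not a published detector) hashes fixed-size blocks of a single grayscale frame and reports exact duplicates; practical detectors match robust block features (e.g., DCT or Zernike moments) instead, since recompression destroys exact pixel equality.

```python
import numpy as np

def duplicated_blocks(frame_gray, block=16, stride=8):
    """Return pairs of block positions inside one frame whose pixel
    content is byte-identical: candidates for copy-move forgery."""
    h, w = frame_gray.shape
    seen, matches = {}, []
    for y in range(0, h - block + 1, stride):
        for x in range(0, w - block + 1, stride):
            key = frame_gray[y:y + block, x:x + block].tobytes()
            if key in seen:
                matches.append((seen[key], (y, x)))
            else:
                seen[key] = (y, x)
    return matches
```

The same matching can be run across frames (copy from one frame, paste into another) by extending the hash table over the whole sequence.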

2.1.2 Inter-frame forgery

These forgeries alter the order of frames in a video in one way or another. Figure 6 shows the inter-frame forgeries in digital video. It is also called temporal tampering. The various types of inter-frame forgery are as follows (a minimal sketch showing how such forgeries can be synthesized for testing follows the figure captions below).

  a)

    Frame Deletion: This type of manipulation purposefully removes some of the frames in a video to produce false evidence, e.g., to hide an unlawful activity. Figure 7 shows frame deletion forgery in a video, wherein part (a) is the original video sequence and part (b) shows the forged video sequence after frame deletion, in which the third and fourth frames are deleted from the original video sequence.

  b)

    Frame Duplication:

    This type of forgery intentionally duplicates some of the frames in a video. Figure 8 shows frame duplication forgery in a video, wherein part (a) is the original video sequence and part (b) shows the forged video sequence after frame duplication, in which the sixth frame is duplicated in place of the third frame.

    Frame mirroring is one form of frame duplication forgery mentioned in [122]; it copies some of the frames from the input video and pastes their mirrored copies into the same video at random locations. Frame mirroring is shown in Fig. 9, wherein part (a) shows the original video sequence and part (b) shows the forged video sequence created after frame mirroring, where the mirrored copy of the 2nd frame is pasted at location 2 (denoted by M2) and the mirrored copy of the 6th frame is pasted at location 5 (denoted by M6).

  c)

    Frame Insertion: In frame insertion forgery, frames from other videos or the same video are intentionally added at some position to support an illegal activity or fake evidence. Figure 10 shows frame insertion forgery in a video, wherein part (a) is the original video sequence and part (b) shows the forged video sequence after insertion, in which frames I1 and I2 from another video are added between the 2nd and 3rd frames of the original video sequence.

  d)

    Frame Shuffling/Replication: This forgery shuffles or alters the original order of video frames, which gives a different meaning to the original video. Figure 11 shows frame shuffling forgery in a video, wherein part (a) is the original video sequence and part (b) shows the forged video sequence after frame shuffling, in which the 4th frame is shuffled with the 2nd frame.

Fig. 6

Inter-frame forgeries in the video a The original video sequence b Frames 4 and 6 are deleted from the original video sequence c Frames 3, 4 & 5 (marked in red) are duplicated d Frames f1 & f2 are inserted into the original video sequence e Frames 5, 6 and 9, 10 (marked in red) are shuffled

Fig. 7

Frame deletion forgery a Original video sequence b Forged video sequence after deletion forgery (3rd and 4th frames are deleted from the video sequence)

Fig. 8

Frame duplication forgery a Original video sequence b Forged video sequence after performing the duplication forgery (6th frame is duplicated in place of 3rd frame)

Fig. 9

Frame Mirroring forgery a Original video sequence b Forged video sequence after performing mirroring forgery

Fig. 10

Frame insertion forgery a Original video sequence b Forged video sequence after insertion forgery (frames I1 and I2 are added between the 2nd and 3rd frames)

Fig. 11

Frame shuffling forgery a Original video sequence b Forged video sequence after performing shuffling forgery (4th frame is shuffled with 2nd frame)
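To make the four inter-frame operations concrete, the following minimal Python sketch (ours, useful for generating test material rather than for detection) treats a decoded video simply as a list of frame arrays:

```python
def delete_frames(frames, start, count):
    """Frame deletion: drop `count` frames beginning at index `start`."""
    return frames[:start] + frames[start + count:]

def insert_frames(frames, foreign, at):
    """Frame insertion: splice frames taken from another video at `at`."""
    return frames[:at] + list(foreign) + frames[at:]

def duplicate_frames(frames, src, dst, count=1):
    """Frame duplication: overwrite frames at `dst` with copies from `src`."""
    out = list(frames)
    out[dst:dst + count] = frames[src:src + count]
    return out

def shuffle_frames(frames, i, j):
    """Frame shuffling: swap the frames at indices `i` and `j`."""
    out = list(frames)
    out[i], out[j] = out[j], out[i]
    return out
```

Frame mirroring can be obtained from duplicate_frames by horizontally flipping the copied frames before pasting them.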

2.2 Performance parameters

This section describes the common measures used by different authors to evaluate the performance of digital video forgery detection techniques.

$$ PR=\frac{TP}{TP+FP} $$
(1)
$$ RR=\frac{TP}{TP+FN} $$
(2)
$$ TNR=\frac{TN}{TN+FP} $$
(3)
$$ FPR=\frac{FP}{FP+TN} $$
(4)
$$ MR=\frac{FP+FN}{TP+FN+TN+FP} $$
(5)
$$ DA=\frac{TP+TN}{TP+FN+TN+FP} $$
(6)
$$ F1\ Score=2\times \frac{RR\times PR}{RR+PR} $$
(7)
$$ PFACC=\frac{Correctly\_classified\_pristine\_frames}{Pristine\_frames} $$
(8)
$$ FFACC=\frac{Correctly\_classified\_forged\_frames}{Forged\_frames} $$
(9)
$$ DFACC=\frac{Correctly\_classified\_double\_compressed\_frames}{Double\_compressed\_frames} $$
(10)
$$ FACC=\frac{Correctly\_classified\_frames}{All\_the\_frames} $$
(11)
$$ VACC=\frac{Correctly\_classified\_video\_clips}{All\_the\_video\_clips} $$
(12)

True Positive (TP) is the count of forged video frames that are correctly categorized as forged, i.e., correct positive detections. False Negative (FN) is the count of forged video frames that are incorrectly categorized as authentic, i.e., incorrect negative detections. True Negative (TN) is the count of genuine video frames that are correctly categorized as authentic, i.e., correct negative detections. False Positive (FP) is the count of genuine video frames that are incorrectly categorized as forged, i.e., incorrect positive detections. PR denotes the Precision Rate, computed as the number of correct positive detections divided by the total number of positive detections. RR denotes the Recall Rate, also called Sensitivity (SN) or True Positive Rate (TPR), computed as the number of correct positive detections divided by the total number of positives. TNR denotes the True Negative Rate, also called Specificity (SP), computed as the number of correct negative detections divided by the total number of negatives. FPR denotes the False Positive Rate, computed as the number of incorrect positive detections divided by the total number of negatives; FPR can also be calculated as 1 − TNR. MR denotes the Misclassification Rate, also called the Error Rate, calculated as the number of incorrect detections divided by the total number of samples in the dataset. DA denotes Detection Accuracy, computed as the number of correct detections divided by the total number of samples in the dataset. The F1 Score is the harmonic mean (a weighted average) of Recall and Precision. Apart from the above parameters, the Pristine Frame Accuracy (PFACC), Forged Frame Accuracy (FFACC), Double-compressed Frame Accuracy (DFACC), Frame Accuracy (FACC) and Video Accuracy (VACC) are parameters defined by Chen et al. [20]. PFACC is the ratio of correctly classified original frames to all original frames. FFACC is the ratio of correctly classified forged frames to all forged frames. DFACC is the ratio of correctly classified double-compressed frames to all double-compressed frames. FACC is the ratio of correctly classified frames to all available frames (forged as well as original). VACC is the ratio of correctly classified videos to all videos. The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate.
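To make the definitions above concrete, the following minimal Python sketch computes the confusion counts and the measures of Eqs. (1)-(7), assuming frame-level binary labels where 1 marks a forged frame and each denominator is non-zero:

```python
import numpy as np

def forgery_metrics(y_true, y_pred):
    """Confusion counts and derived measures for frame-level forgery
    detection; 1 = forged (positive), 0 = authentic (negative)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    pr = tp / (tp + fp)                       # Precision Rate, Eq. (1)
    rr = tp / (tp + fn)                       # Recall Rate, Eq. (2)
    return dict(TP=tp, TN=tn, FP=fp, FN=fn, PR=pr, RR=rr,
                TNR=tn / (tn + fp),           # True Negative Rate, Eq. (3)
                FPR=fp / (fp + tn),           # False Positive Rate, Eq. (4)
                MR=(fp + fn) / len(y_true),   # Misclassification Rate, Eq. (5)
                DA=(tp + tn) / len(y_true),   # Detection Accuracy, Eq. (6)
                F1=2 * rr * pr / (rr + pr))   # F1 Score, Eq. (7)
```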

3 Video forgery detection techniques

The techniques for detecting forgery in a digital video can be broadly categorized as active and passive. The main aim of this section is to study the passive techniques designed for video forgery detection.

3.1 Active techniques

In these techniques, authentication information such as a watermark or signature is inserted into a digital video, which enables verification of the authenticity and integrity of its contents [97]. If someone manipulates the content of the video, the embedded watermark or signature changes, giving a clear indication that the video has been manipulated [96]. The advantage of active techniques is that forgery detection is straightforward due to the presence of information like a watermark or signature. However, in most cases, videos downloaded over the Internet do not contain a watermark or signature, making manipulation hard to detect. The limitation of these techniques is therefore that if a video does not contain pre-embedded information like a watermark or signature, it is not possible to detect the manipulation. Another issue is that the embedded information reduces the quality of the original video.
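As a simple illustration of the active idea (a toy fragile watermark, not any specific published scheme), the sketch below embeds a pseudorandom bit pattern in the least-significant bits of a frame; any later edit of the pixels disturbs the expected pattern:

```python
import numpy as np

def embed_fragile_watermark(frame, seed=7):
    """Embed a keyed pseudorandom bit pattern in the LSB plane of an
    8-bit frame; `seed` plays the role of the secret key."""
    rng = np.random.default_rng(seed)
    wm = rng.integers(0, 2, size=frame.shape, dtype=np.uint8)
    return (frame & 0xFE) | wm

def verify_fragile_watermark(frame, seed=7):
    """Fraction of LSBs still matching the expected pattern; values well
    below 1.0 indicate the content was altered after embedding."""
    rng = np.random.default_rng(seed)
    wm = rng.integers(0, 2, size=frame.shape, dtype=np.uint8)
    return float(((frame & 1) == wm).mean())
```

Note that even benign recompression destroys such an LSB pattern, which is exactly the fragility an active scheme relies on, but also why it cannot be applied to videos that were never watermarked.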

3.2 Passive techniques

Passive techniques depend on the internal characteristics of the digital video itself, rather than on pre-embedded information, to check the originality of a video. They work in the absence of pre-embedded data such as a watermark or signature for checking integrity and authenticity. Without pre-embedded information inside the video, working on passive techniques becomes a challenging task for researchers. Hence, in recent years, passive video forgery detection techniques have attracted noteworthy attention in the scientific community. Passive digital video forgery detection techniques investigate the artifacts left behind by forgeries to distinguish original videos from tampered ones. Passive techniques are alternatively called blind techniques, as they work under the assumption that forgeries produce certain kinds of static and temporal artifacts in a video, which can be checked to identify manipulated videos. Figure 12 shows the categorization of passive video forgery detection techniques on the basis of the features/artifacts used.

Fig. 12

Categorization of passive video forgery detection techniques

3.2.1 Compression artifacts based techniques

Digital videos are generally compressed using the MPEG-1, MPEG-2, MPEG-4 and H.264 coding standards to optimize storage space and transmission time. Compression artifacts-based techniques use the coding clues or artifacts acquired during the compression process to detect forgery in the video. The compression artifacts used in video forgery detection are shown in Fig. 13.

Fig. 13

Compression artifacts used in video forgery detection

Manipulations of digital videos are performed in the uncompressed domain: to forge a video, someone must first decode it, make changes, and then recompress it, which is generally called double compression. Compression-artifact techniques look at specific characteristics of the video such as compression properties, variations in the quantization parameters after double compression, periodic features, variations in the Discrete Cosine Transform (DCT) coefficients, and properties of GOPs (Groups of Pictures). Thus, traces of the existing compression can expose forgery in the video. In compression-artifact techniques, GOP analysis plays a crucial part in detecting falsification. The GOP term is related to MPEG-compressed video; Fig. 14 shows the structure of a GOP. The frames in a GOP are arranged in a specific order of intra-frames (I), predictive frames (P) and bi-directionally predictive frames (B), each having a varying degree of compression [102]. I frames, also called intra-coded or independent frames, need a lot of data storage and offer the least compression ratio, whereas P frames, known as predicted or dependent frames, contain only information that differs from the previous I or P frame and require less space than I frames. During encoding, frames are grouped into GOPs according to a structure that begins with an I-frame followed by a number of P and B frames [52]. Table 1 shows the analysis of video forgery detection techniques based on compression artifacts.
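One classical clue mentioned above is the periodicity that requantization leaves in DCT coefficient histograms. A minimal sketch of that check is given below (our simplification: it recomputes block DCTs from decoded pixels, whereas practical detectors work on the entropy-decoded coefficients and on I-frames specifically):

```python
import numpy as np
from scipy.fftpack import dct

def dct_coeff_histogram(frame_gray, coeff=(1, 1)):
    """Histogram of one DCT coefficient over all 8x8 blocks of a frame."""
    h, w = frame_gray.shape
    vals = []
    for y in range(0, h - 7, 8):
        for x in range(0, w - 7, 8):
            block = frame_gray[y:y + 8, x:x + 8].astype(np.float64)
            d = dct(dct(block.T, norm='ortho').T, norm='ortho')  # 2-D DCT
            vals.append(d[coeff])
    hist, _ = np.histogram(vals, bins=np.arange(-50.5, 51.5))
    return hist

def periodicity_score(hist):
    """Strength of periodic peaks in the histogram's spectrum; unusually
    strong peaks hint at requantization, i.e., double compression."""
    spectrum = np.abs(np.fft.rfft(hist - hist.mean()))
    return float(spectrum[1:].max() / (spectrum[1:].mean() + 1e-9))
```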

Fig. 14

GOP structure

Table 1 Analysis of compression artifacts based forgery detection techniques (QR: Quantization Scale Ratio)

Wang et al. [130] focused on MPEG-compressed videos and exploited the fact that static and temporal features are introduced into a video after double MPEG compression to detect manipulation. The same authors made some modifications and suggested a new technique in [133] to check whether a digital video is doubly MPEG compressed. Subramanyam et al. [115] suggested a passive approach for the detection of spatial and temporal copy-paste forgery using video compression artifacts and Histogram of Oriented Gradients (HOG) features. For spatial forgery, a thresholding algorithm is applied to divide video frames into blocks; HOG features are then collected from each block and matched against other blocks to detect the copy-paste forgery. For temporal forgery, they analyzed changes in GOP structure size and video compression properties. The authors reported detection accuracy as the performance measure: for spatial forgery it is 96 % for 60 × 60 and 80 × 80 forged areas and 93.3 % for a 40 × 40 forged area, whereas for temporal forgery it is 84.5 % for a 60 × 60 forged area and 99 % for an 80 × 80 forged area. Moreover, the same authors proposed a new approach based on estimation theory and double compression in [116]: they detected double quantization and region manipulation in forged videos with the help of variations in DCT coefficients and GOP analysis. Labartino et al. [59] presented a technique to detect and locate region manipulation forgery in video using the analysis of Double Quantization (DQ) traces, histograms of DCT coefficients and the Variation of Prediction Footprint (VPF). A method for the detection and localization of insertion/deletion forgery in videos using double encoding detection is described by Gironi et al. [33]; they used VPF and DCT coefficient analysis to detect the forgery. Liu et al. [70] proposed a technique based on the Sequence of Average Residuals of P-frames (SARP) for the detection of frame deletion forgery. A technique depending on Spatially Constrained Residual Errors (SCREs) of P frames was implemented by Aghamaleki et al. [2] to identify and locate frame insertion/deletion forgery and double compression in a video; the authors investigated the traces of residual error quantization in video frames. The same authors introduced another technique in [3], which consists of three modules: detection of double compression, detection of malicious manipulation, and fusion of decisions. In the double compression detection module, the DCT coefficients of I-frames are used as features and supplied to a Support Vector Machine (SVM) classifier to classify a video as singly or doubly compressed, whereas the malicious tampering detection module performs a time-domain analysis of quantization effects on residual P-frame errors to determine frame insertion or deletion forgeries. Lastly, the outputs of both modules are fed to the decision fusion module to classify videos into three types: single compressed videos, double compressed videos with forgeries, and double compressed videos without forgeries. The benefit of both proposed techniques [2, 3] is that they can work for videos with distinct GOP lengths and structures; however, performance is affected for videos captured with a moving camera and videos with a low compression ratio.
Fadl et al. [29] developed an approach based on the concept of residual frames for the identification and localization of inter-frame duplication in digital video. The entropy of the DCT coefficients of the standard deviation value of each residual frame is calculated, and the similarity among pairs of feature vectors is explored to detect and locate the frame duplication forgery.

3.2.2 Noise artifacts based techniques

Noise is an essential feature and clue in video forensics for identifying various forgeries in a video. Noise artifacts-based techniques take advantage of the sensor artifacts produced by the digital camera. A digital video camera usually leaves a characteristic fingerprint in the form of noise, which researchers can use to expose forgery in a video; for this reason, these are also called camera-based detection techniques. The noise artifacts used in video forgery detection are shown in Fig. 15. Several kinds of noise, such as Photon Shot Noise (PSN), Fixed Pattern Noise (FPN), Sensor Pattern Noise (SPN), Quantization Noise (QN) and Photo Response Non-Uniformity noise (PRNU), are used for the detection of forgery in a video. Table 2 shows the analysis of video forgery detection techniques based on noise artifacts.
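As a concrete illustration of the noise-residue idea (a simplified sketch, not any particular method from Table 2; PRNU-based systems use dedicated wavelet denoising filters rather than a median filter), the residues of consecutive frames can be compared as follows:

```python
import numpy as np
from scipy.ndimage import median_filter

def noise_residue(frame_gray):
    """Approximate the noise residue as the frame minus a denoised copy."""
    f = frame_gray.astype(np.float64)
    return f - median_filter(f, size=3)

def residue_correlation(frame_a, frame_b):
    """Normalized correlation of the noise residues of two frames; sudden
    drops along a video can flag tampered frames or regions."""
    ra = noise_residue(frame_a).ravel()
    rb = noise_residue(frame_b).ravel()
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / (np.linalg.norm(ra) * np.linalg.norm(rb) + 1e-12))
```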

Fig. 15

Noise artifacts for video forgery detection

Table 2 Analysis of video forgery detection techniques based on noise artifacts

Mondaini et al. [76] used FPN, PRNU and a Self-Building Reference Pattern (SBRP) to identify forgeries such as object insertion, copy-move and frame insertion in a video. The noise is extracted from the video frames, and then several correlations among them are computed to detect the forgery. The technique was tested on both compressed and uncompressed video, but it works efficiently only for uncompressed video with a stationary background. Hsu et al. [41] used block-level noise residue correlation to locate inpainting forgeries such as TCP and ETS in a video. The authors worked on the principle that when some frames are tampered with, the correlation values between temporal noise residues change. First, the video is divided into a series of frames and the noise residue is extracted from each frame; the frames are then further partitioned into non-overlapping blocks, and the correlations between every two consecutive frames are calculated. Finally, the forgery is located by analyzing the block-level noise residue correlation using a Gaussian Mixture Model (GMM) and a Bayesian classifier. A method based on noise inconsistencies was introduced by Kobayashi et al. [53] for identifying forged regions in video: photon shot noise is exploited as evidence, and a linear Noise Level Function (NLF) is formulated to analyze the relationship among the extracted noise. The same authors extended this work and suggested another method in [54] based on a nonlinear NLF and noise inconsistencies: the characteristics of photon shot noise are exploited, and the correlations between variance and mean are calculated with the help of the nonlinear NLF to expose manipulations. A framework to handle copy-move tampering in video was presented by Chetty et al. [21]; noise and quantization residue features are obtained from sub-blocks of each video frame and then transformed into a cross-modal subspace to detect the forgery. An SPN-based representation method was proposed by Hyun et al. [45] using a Minimum Average Correlation Energy (MACE) filter to detect forged regions in video; the method is also used for source camera identification. In the first stage, the source camera of a given video is identified; in the second stage, forgeries such as partial manipulation, video alteration and upscale-crop are identified by computing the scale factor and correlation coefficient. A video forgery detection technique was proposed by Ravi et al. [87] for frame deletion and copy-move forgery by identifying double compression. Compression noise is used as a feature, extracted from the video frames by a modified Huber Markov Random Field (HMRF) prior model; the extracted noise is modelled as first-order Markov features, which are then given to an SVM classifier to detect the forgery. Pandey et al. [83]-a designed an approach for the detection of temporal copy-move forgery (i.e., frame duplication) in video using wavelet denoising and noise residue-based techniques. Hu et al. [42] developed a technique to detect region tampering in digital video using the properties of extrinsic camera parameters. First, each video frame is divided into several block areas, followed by the calculation of extrinsic parameters from each block.
Then the differences among these parameters are computed, and finally a certain threshold is chosen to detect the manipulations. Singh et al. [102] proposed techniques to detect intra-frame forgeries such as upscale-crop (in which outer parts of the frames are cropped out) and splicing using pixel-correlation examination and noise-inconsistency investigation; for this, they used resampling detectors referred to as the Modified-Gallagher (MG) detector and the F-MG (Fractional MG) detector. In addition, the authors presented three schemes in [101] to detect and localize copy-paste forgery in digital video. The first scheme uses Sensor Pattern Noise Correlation (SPNC) to detect and locate the manipulation; the second uses Color Filter Array artifacts (CFA-V) to expose manipulations in uncompressed frames; the final scheme is a Duplicate Cluster detection scheme (H-DC) based on Hausdorff distance-based pixel clustering. The presented techniques are able to detect forgery in MPEG-2, MPEG-4, MJPEG and H.264/AVC encoded videos captured with static and moving cameras, independently of the GOP structure length. Using SPN and noise residue correlation, Fayyaz et al. [31] developed a technique to detect temporal copy-paste inpainting forgery: the noise residue patterns are extracted from each video frame and then compared with the collected SPN using adaptive DCT filtering to detect the forgery.

3.2.3 Motion features based techniques

Motion-based features are time-dependent features of digital video that define the relationships among adjacent frames. When a forgery is performed, the motion features and the relations among adjacent frames change, which serves as a clue to identify the forgery. The motion features used for video forgery detection are shown in Fig. 16. Motion-based features are captured in the form of motion residuals, optical flow coefficients, the Motion Vector Pyramid (MVP), and Motion Compensated Edge Artifacts (MCEA). MCEAs are special artifacts that occur in videos compressed using block-based motion-compensated frame prediction coding algorithms. Successive video frames are decoded with the aid of previously decoded frames during motion-compensated frame estimation, which makes successive video frames dependent on each other. Inter-frame forgeries break these associations, making block-boundary artifacts more visible in the video frames; this spike in block-boundary artifacts, known as MCEA, can help detect inter-frame forgeries. Another useful forensic feature that enables the detection of inter-frame forgeries is optical flow, which refers to the pattern of apparent movement of objects, edges, and surfaces within successive video frames. In a genuine video, the optical flow differences between successive frames appear more or less constant; if some inter-frame manipulation is performed on the video, the optical flow starts to show anomalies that can act as a fingerprint. The velocity field relates to the disturbance between neighbouring video frames induced by time separation; it tends to follow a consistent pattern in a genuine video, whereas it is disturbed when a forgery is performed. The analysis of video forgery detection techniques based on motion features is shown in Table 3.
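A minimal sketch of the optical-flow-consistency idea using OpenCV's Farneback estimator is given below (our illustration; the published methods surveyed in Table 3 add windowing, directional analysis and adaptive thresholds):

```python
import cv2
import numpy as np

def optical_flow_magnitudes(gray_frames):
    """Mean optical-flow magnitude between each pair of consecutive
    8-bit grayscale frames."""
    mags = []
    for prev, curr in zip(gray_frames, gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=2).mean())
    return np.array(mags)

def flow_anomalies(mags, z=3.0):
    """Frame indices where the flow magnitude jumps sharply relative to
    its neighbours: possible frame insertion/deletion points."""
    diffs = np.abs(np.diff(mags))
    thr = diffs.mean() + z * diffs.std()
    return np.where(diffs > thr)[0] + 1
```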

Fig. 16

Motion features for video forgery detection

Table 3 Analysis of video forgery detection techniques based on motion features (LA: Localization Accuracy)

Wang et al. [131] suggested an adaptive motion algorithm to identify region manipulation forgery in de-interlaced and interlaced video. They analyzed the changes in correlation introduced by the de-interlacing algorithm to identify forgery in de-interlaced video, whereas for interlaced video they measured the inter-field and inter-frame motion. An MCEA-based technique was presented by Su et al. [114] for frame deletion forgery in digital video; they explained the MCEA error that is produced after frame deletion due to the decrease in temporal correlation. Another MCEA-based technique was designed by Dong et al. [28] to detect inter-frame video forgery such as frame insertion/deletion. The MCEA value of each P frame is extracted, and the Fast Fourier Transform (FFT) is applied to the differences of MCEA values between adjacent P frames; the Fourier spectrum is then checked for spikes (if present, the video is tampered; otherwise it is authentic). Inpainting forgeries such as TCP and ETS are detected by Kancherla et al. [49] using a Markov model on motion-based features extracted from the videos: the salient motion-based features are extracted using a motion extractor and a Markov model, and an SVM is then used to obtain a binary classification on these features. A block-based motion estimation algorithm was presented by Li et al. [61] to detect object removal forgeries in digital videos. They exploited the fact that if a certain object is deleted from a video, the motion vectors change: motion information in the form of motion vectors is extracted from adjacent video sequences as a clue of tampering, and the original region is then differentiated from the manipulated region using the orientation and magnitude of the motion vectors. Based on the analysis of the footprints left on the residual, Bestagini et al. [13] proposed an algorithm that detects tampering such as adding or removing certain objects from videos; they also enlarged the SULFA database [86] by adding more forged videos to it. The authors reported TP, FN, TN and FP values of 0.75, 0.25, 0.97 and 0.03 respectively for video that is not recompressed; 0.71, 0.29, 0.98 and 0.02 for video with quantization parameter QP = 10; 0.58, 0.42, 0.96 and 0.04 for QP = 15; and 0.44, 0.56, 0.84 and 0.16 for QP = 20. Chao et al. [17] presented an inter-frame forgery (insertion and deletion) detection method for digital video using an optical flow consistency algorithm: a window-based rough detection model is designed for insertion forgery, whereas a frame-to-frame mechanism with a double adaptive threshold detection model is designed for frame deletion. Wang et al. [134] also developed an optical flow-based algorithm for forgery detection and localization in digital videos by analyzing discontinuity points in the optical flow sequence; they extracted the optical flow variation sequence from adjacent frames to locate discontinuity points and detected forgeries such as frame insertion, deletion, and duplication. An algorithm for handling frame deletion forgery based on the total motion residual was proposed by Feng et al. [32].
They exploited the distinctive fluctuation of the motion residual to detect deletion forgery and used an adaptive threshold method to locate it. Testing was performed on CBR- and VBR-encoded videos with both fixed and variable-length GOP structures taken from VTL [126]. Wu et al. [137] developed an algorithm to detect forgeries such as frame deletion and duplication in digital video; they used block-based cross-correlation on the video to derive a velocity field sequence, and the generalized Extreme Studentized Deviate (ESD) test to detect and locate the forgery. An inter-frame forgery detection method for digital video was created by Wang et al. [129] using optical flow consistency. The optical flow values between adjacent video frames in both the x and y directions are calculated, and the computed values are then given to an SVM to differentiate forged from original video. The authors reported classification accuracies for a single type of forgery in the x-direction of 98.41 %, 98.20 %, 86.82 %, and 92.61 % for 25 frame insertions, 100 frame insertions, 25 frame deletions, and 100 frame deletions respectively, and in the y-direction of 98.60 %, 98.54 %, 86.02 %, and 88.56 % respectively. For two types of forgery, the classification accuracies for 25 frame insertions and deletions in the x and y directions are 91.72 % and 90 % respectively, and for 100 frame insertions and deletions 89.83 % and 92.63 % respectively. A technique based on GOP structure for detecting object-based manipulation (adding or erasing a moving object) in digital video was proposed by Tan et al. [119]: they created a frame manipulation detector using the motion residual extracted from video frames, utilized the CC-PEV feature set to obtain a feature vector from each motion residual, and fed these feature vectors to two ensemble classifiers that categorize the video as pristine, double compressed or forged. Based on Lucas-Kanade optical flow, Bidokhti et al. [14] developed a technique to expose copy-move and frame duplication forgery in video: the video frames are first separated into two parts, optical flow coefficients are calculated between them, and forgery is identified if any unusual changes are observed in the coefficients. The Motion Vector Pyramid (MVP) consistency and its Variation Factor (VF) are used by Zhang et al. [150] to detect and locate frame deletion and frame duplication forgery, using discontinuity points in the VF sequence as a clue. The method is divided into two stages: 1) feature extraction and 2) discontinuity point detection. In the first stage, the MVP sequence with its associated VF is computed for subsequent frames; in the second stage, forgery is detected and localized using a modified generalized ESD test. Yu et al. [146] proposed an approach for identifying frame deletion forgery by analyzing abrupt changes in video streams; the authors used two features, the magnitude difference in the prediction residual (PR) and the Number of Intra Macroblocks (NIMBs).
Based on these features, a fused index is constructed to detect frame deletion forgery. A passive forgery detection algorithm was developed by Chen et al. [20] to identify and localize object-based tampering (insertion or removal of objects) in video using motion residual features. A frame manipulation detector is used to find the residual motion features left in video frames by the unethical operations. Then the SPAM, CC-PEV, CDF, SRM, CF*, J + SRM, and CC-JRM feature sets are used to create the feature vector obtained from each motion residual.Footnote 1 A ternary (ensemble) classifier then takes these feature vectors as input and categorizes the corresponding video as pristine, double compressed, or forged. Singh et al. [103] developed a forensic system based on optical flow and prediction residuals to handle frame insertion, deletion, and replication forgery in video. An optical flow analysis-based technique is used for frame insertion and deletion detection, focusing on the brightness gradient component of the optical flow, while a prediction residual examination scheme is used to detect and localize replicated frames. A forgery detection technique using optical flow gradient features and prediction residual analysis was presented by Kingra et al. [52]; the technique can identify and locate frame deletion, insertion and duplication in videos. They exploit the fact that the temporal correlations among adjoining frames are disrupted when a video is manipulated, and a window-based concept is used to locate the forgery. The proposed scheme is specifically designed for the H.264 and MPEG-2 codecs; it works well for both slow- and fast-motion video, while detection performance is slightly affected when the video is subject to high illumination. Sitara et al. [107] developed a technique to expose frame deletion, insertion, duplication, and shuffling forgeries in videos using inconsistencies in the velocity field and VPF. A generalized Extreme Studentized Deviate (ESD) algorithm was designed by the authors to locate the forged positions in the video. The technique is capable of identifying forgery even if complete GOP structures are deleted, and it also works for adaptive GOP structures. An approach based on spatial constraints and stable features to expose frame deletion forgery was proposed by Pu et al. [85]. Initially, they obtain a Quantitative Correlation Rich Region (QCRI); then optical flow information is calculated to identify suspicious forged points; finally, Gradient Structure Similarity (GSSIM) features are calculated to confirm the forgery. The proposed approach is independent of the number of deleted frames, and it is robust against attacks such as noise, filtering and blur.

3.2.4 Statistical features based techniques

Statistical feature-based or pixel-based techniques for video forgery detection look at the statistical attributes/properties of objects, pixel-level variance, and correlations among frames. These techniques are also called geometric/physics inconsistency-based techniques, as they deal with inconsistencies (such as lighting, brightness, shadows, etc.) in the video frames. The statistical attributes may change after a forgery is performed in the video, and they are then investigated to detect the manipulations. Figure 17 shows the statistical features used in video forgery detection, and Table 4 shows the analysis of video forgery detection techniques based on statistical features.
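A toy version of the correlation analyses surveyed below is sketched here (our simplification, not any specific cited algorithm, and quadratic in the number of frames): near-perfect correlation between non-adjacent frames flags duplication candidates, while dips between adjacent frames flag possible insertion or deletion points.

```python
import numpy as np

def adjacent_correlations(gray_frames):
    """Pearson correlation between consecutive frames; sudden dips can
    indicate frame insertion or deletion."""
    flat = [f.ravel().astype(np.float64) for f in gray_frames]
    return np.array([np.corrcoef(a, b)[0, 1] for a, b in zip(flat, flat[1:])])

def duplicate_candidates(gray_frames, thresh=0.999):
    """(i, j, corr) triples of non-adjacent frames whose correlation is
    near-perfect: candidate evidence of frame duplication."""
    flat = [f.ravel().astype(np.float64) for f in gray_frames]
    pairs = []
    for i in range(len(flat)):
        for j in range(i + 2, len(flat)):
            c = np.corrcoef(flat[i], flat[j])[0, 1]
            if c > thresh:
                pairs.append((i, j, c))
    return pairs
```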

Fig. 17

Statistical features used in video forgery detection

Table 4 Analysis of video forgery detection techniques based on statistical features

Based on temporal and spatial correlations, Wang et al. [132] exploited the correlation coefficient as a measure to detect forgery in video. Based on the ghost shadow artifact, Zhang et al. [148] presented a technique to identify video inpainting forgeries such as TCP and ETS. The statistical properties of objects based on the Adjustable Width Object Boundary (AWOB) algorithm are used by Chen et al. [19] to identify object insertion or removal forgery in video: contourlet coefficient and gradient information features are extracted from the video frames and then supplied to an SVM to distinguish forged from original objects. A frame duplication detection technique was developed by Hu et al. [43] with the help of video sub-sequence fingerprints. First, the video is divided into a series of frames and the Temporally Informative Representative Images (TIRI) of each frame are formed. Then each TIRI is split into overlapping blocks, and the DCT coefficients are extracted from each block. Finally, the Hamming distance is computed to check the similarity among the frames to detect the forgery. They considered the TPR and FPR parameters to assess the effectiveness of their method: the average TPR and FPR values without post-processing operations are 100 % and 0 % respectively for block sizes 4 and 8, whereas the FPR changes to 55.55 % for block size 16; the average TPR and FPR values for videos with a change in brightness and for MPEG-compressed videos are 94.31 %, 0.33 % and 49.31 %, 0.33 % respectively. A new technique was proposed by Lin et al. [69] for frame duplication detection and localization using spatial and temporal analysis. The technique works in four stages: the first stage is candidate segment selection, where the histogram difference among adjacent frames in the Red, Green and Blue (RGB) color space is used as a forensic feature; the second stage is spatial similarity analysis, where high correlation between two frames is observed using a block-based algorithm; the third stage builds a classifier for detecting the duplication forgery; and the last stage performs post-processing. Liao et al. [66] proposed a technique for identifying and locating frame duplication forgery in digital video using Tamura Texture Features (TTF). First, TTF features (contrast, directionality, and roughness) are extracted from each video frame to generate an eigenvector matrix; the dictionary-ordering concept is then applied to sort these eigenvectors and calculate the variation between each eigenvector and its neighbour vectors; finally, the differences among the eigenvectors are observed to check for duplication forgery. Spatio-temporal slices are extracted and analyzed by Lin et al. [67] to identify and localize inpainting forgeries such as TCP and ETS. The approach is divided into two parts: spatio-temporal slice artifact analysis and refinement. The spatio-temporal slice artifacts are extracted from the video frames, and abnormal regions with high inconsistency or similarity are analyzed; the map of the Whole Spatio-Temporal Slice Artifacts (WSTSA) is then obtained. Finally, a refinement process uses the WSTSA map to match every spatio-temporal slice artifact to detect the forgery. The limitation of this approach is that it is not suitable for multiple-object removal forgery.
To overcome these flaws, the same authors modified the approach in [68] to identify and localize inpainting forgeries such as TCP and ETS in video. They filled the area left after object removal forgery and designed a new approach based on coherence examination to handle the manipulated areas. The technique was tested on a set of 18 test videos.Footnote 2 Although it detects multiple-object removal forgery, its performance degrades as video compression increases. Based on structural similarity, Li et al. [60] suggested a method to detect and locate frame duplication (alternatively called temporal copy-move forgery) in video: the frames are separated into overlapping blocks, and the structural similarity between two consecutive frames is measured to detect the forgery. Zheng et al. [153] presented a technique to detect frame insertion forgery based on the Block-wise Brightness Variance Descriptor (BBVD): the video is divided into a series of frames, these frames are partitioned into overlapping blocks, and BBVD features are extracted and analyzed from each block to detect the forgery. Wang et al. [128] presented a technique based on the Consistency of Correlation Coefficients of Gray Values (CCoGV) to detect frame insertion and deletion forgery: the differences in CCoGV values among adjacent frames are computed to identify the forgery, and an SVM is used to distinguish forged from original video. The authors reported classification accuracies for a single type of forgery of 99.22 %, 99.34 %, 94.19 %, and 97.27 % for 25 frame insertions, 100 frame insertions, 25 frame deletions, and 100 frame deletions respectively, and for two types of forgery of 96.21 % with 25 frame insertions and 25 frame deletions and 95.83 % with 100 frame insertions and 100 frame deletions. Yin et al. [145] proposed a method using Nonnegative Tensor Factorization (NTF) for the detection and localization of frame insertion/deletion forgery, based on the consistency of the time-dimension factor: the video is factorized with the NTF algorithm, the time-dimension factor is extracted, and the correlations among the extracted coefficient elements are compared to detect the forgery. Chittapur et al. [22] designed a method to detect region-level forgery based on the statistical properties of the mean and pixel comparison; the temporal difference among the video frames is examined to identify and locate the forged region. Tralic et al. [120] presented a frame duplication forgery detection method based on Local Binary Patterns (LBP) and Cellular Automata (CA): the video frames are divided into overlapping blocks, a histogram rule is created, and a CA is applied to every block to detect the forgery. Based on the inconsistency of Quotients of Consecutive Correlation Coefficients of LBPs (QCCoLBPs), Zhang et al. [151] presented a video forgery detection algorithm to expose inter-frame forgery (i.e., frame insertion or deletion): the QCCoLBP is calculated between neighbouring frames, and the Tchebyshev inequality concept is then used to detect suspicious abnormal points.
The Precision and Recall parameters were used to measure the performance of the algorithm: the (Precision, Recall) values for a single type of forgery with 25 frame insertions, 100 frame insertions, 25 frame deletions and 100 frame deletions are (98.62 %, 95.33 %), (98.78 %, 94.49 %), (89.27 %, 87.48 %) and (94.31 %, 91.47 %) respectively, whereas the precision and recall values for two types of forgery (insertion and deletion) are 88.16 % and 85.80 % respectively. Singh et al. [105] suggested a method to identify and locate frame duplication forgery using block-based features. They divided each frame into four sub-blocks (B1, B2, B3, B4) and extracted nine features per frame in the form of block means and the ratio and residue of each sub-block. A lexicographic sort is then performed on the extracted features to group similar frames, after which the Root Mean Square Error (RMSE) between adjacent frames is calculated: frames with RMSE below a threshold are rejected, and the remaining frames are kept as doubtful. Finally, the correlation between doubtful frames is computed to identify frame duplication. Pandey et al. [83]-b suggested a forgery detection method to expose copy-move forgery in video frames based on the Scale-Invariant Feature Transform (SIFT) and k-NN matching algorithms. A compressive sensing technique was proposed by Su et al. [110] to identify moving foreground removal from videos with a static background: the feature differences among adjacent frames are collected using the Singular Value Decomposition (SVD) algorithm, random projection is applied to investigate the features in a lower-dimensional space, and these features are then clustered using k-means to detect the manipulations. Bagiwa et al. [10] proposed an approach to detect chroma key forgery in video based on the correlation among extracted blurring artifacts. Chroma keying is a kind of splicing forgery in which two videos are combined, with one video's background color made transparent to expose the other video; they computed the cross-correlation between video foreground blocks and the background to detect the forgery. Xu et al. [139] suggested a technique to detect frame deletion, insertion, and duplication forgery based on histogram intersection: correlation coefficients are calculated using the histogram intersection, and outliers are analyzed to confirm the forgery. Li et al. [65] proposed a method using the uniformity of the Quotient of Mean Structural Similarity (QoMSSIM) to detect frame deletion and insertion forgery. They exploit the fact that QoMSSIM values are consistent for an original video and disturbed for a forged one: the QoMSSIM between every two frames is calculated and inspected for the presence of forgery, and an SVM is used to distinguish original from forged video. The method is robust against recompression and white Gaussian noise. The authors reported classification accuracies for a single type of forgery of 98.62 %, 98.96 %, 90.72 %, and 94.94 % for 25 frame insertions, 100 frame insertions, 25 frame deletions, and 100 frame deletions respectively, and for two types of forgery of 92.27 % with 25 frame insertions and 25 frame deletions and 92.75 % with 100 frame insertions and 100 frame deletions.
Mathai et al. [74] presented an algorithm to detect and localize content duplication forgery (also called temporal copy-move forgery) in video using moment features and cross-correlation: features from the prediction-error array are estimated for every frame block, and the normalized cross-correlation is then checked to find the duplication. Yang et al. [140] proposed an approach to detect and localize frame duplication forgery using a similarity analysis method that works in two steps: in the first step, the features of each frame are collected using the SVD algorithm and the Euclidean distance between the features of every frame and a reference frame is computed; in the second step, the duplications are identified using random block matching. Liu et al. [71] proposed a technique to identify inter-frame forgeries using Zernike Opponent Chromaticity Moments and coarseness analysis (ZOCM). The same authors presented a Three-Stage Foreground Analysis And Tracking algorithm (3FAT) in [72] to identify blue-screen composition video forgery, exploiting irregularities of contrast and luminance between background and foreground. In the first step, foreground blocks are extracted using a multi-pass foreground locating method such as GMM; then, to detect the forged block, a mixture of local features such as luminance and contrast is used to verify the resemblance of the foreground block and the background; finally, the forged block is tracked in subsequent frames with the assistance of a compressive tracking concept using a quick target search algorithm. Bozkur et al. [15] introduced a technique to detect and localize frame duplication forgery in video based on a forgery line: each frame is divided into non-overlapping sub-blocks and the DCT is applied to each sub-block; a row vector containing the averaged DCT values is created for each frame; these row vectors are binarised to compute a correlation matrix and create a correlation frame; finally, the Hough transform is applied to the correlation frame to find the forgery line. Based on binary features, a technique to detect and locate frame duplication and frame mirroring forgery was proposed by Ulutas et al. [122]. First, the video is split into frames and each frame is transformed into binary form; binary features are extracted from these frames, and the Euclidean distance measure is computed to analyze the similarity among adjacent frames. Peak Signal-to-Noise Ratio (PSNR) values among similar frames are then measured to avoid false duplications, and a post-processing operation is finally applied to enhance performance. The same authors designed a method to handle frame duplication forgery using a Bag of Words (BoW) model in [123]: the BoW model generates visual words and constructs a dictionary from SIFT keypoints of the video frames to detect the duplicated parts. A patch-based algorithm to identify and localize copy-move forgery in video with the help of Zernike moment features is described by D'Amiano et al. [24].
A similarity analysis-based scheme is developed by Zhao et al. [152] to detect and localize forgeries such as frame deletion, insertion, and duplication with the help of histograms and Speeded Up Robust Features (SURF). In the first module, an HSV (Hue-Saturation-Value) color histogram comparison algorithm is used to detect the forgeries; the SURF and FLANN (Fast Library for Approximate Nearest Neighbors) algorithms are used in the second module to localize them. Su et al. [111] have suggested a forgery identification method using Exponential-Fourier Moments (EFMs) features to identify region duplication forgery (also called copy-move manipulation) in videos. EFM features are extracted from every block of the current frame to check whether a matching pair exists. Then, a Post-Verification Scheme (PVS) is used to eliminate falsely matched pairs and locate the forged area in the video frame. At last, an Adaptive Parameter-based Fast Compression Tracking (AFCT) method is used to track the forged areas in the corresponding frames. The proposed method works efficiently for forged regions under a mirroring attack (mirror invariant). Furthermore, the same authors presented a technique in [109] for detecting duplication (copy-move) forgery in digital video using Mirror-invariant and Inversion-invariant SIFT (MI-SIFT). The MI-SIFT algorithm is used to extract features from the current video frame, the manipulated regions in the current frame are detected, and a spatio-temporal context learning algorithm then finds the manipulated regions in the other frames. Moreover, the authors developed another algorithm in [112] for videos with variable bit-rate compression to detect foreground removal (also called object removal) forgery. They devised an Energy Factor to detect forged frames and locate the manipulated region in those frames by developing an adaptive parameter-based visual background extractor (AVIBE). The proposed algorithm is robust against post-processing operations such as noise addition, brightness change, shaking screen, and water ripples. Wei et al. [136] developed a detection technique based on multi-scale standardized mutual information to detect inter-frame forgeries such as frame duplication, insertion, and deletion in video. The crucial features are extracted from the frames, and the similarity between adjacent frames is then calculated using a relevant measurement function. Based on correlation coefficients and coefficients of variation, Singh et al. [100] developed two separate algorithms to detect forgery in videos. The first algorithm extracts mean features from each frame and estimates the correlation among frames to detect frame duplication forgery, while the second estimates the similarity among regions within frames to locate copy-move forgery. The algorithms are tested on both static and moving background videos. To detect and localize frame insertion, duplication, and deletion forgery in video, Bakas et al. [12] proposed an approach that analyzes the Haralick correlation inconsistency among frames. The benefit of the proposed approach is that it is independent of the GOP size/structure and the number of deleted frames; it is also suitable for both slow-motion static and moving background videos encoded with MPEG-4, XViD, H.264, and H.265 codecs.
The authors tested the proposed approach on both static and dynamic background videos and reported parametric values such as precision (PR), recall (RR), and F1-score. For videos with a static background, the values for frame insertion/deletion detection and localization are PR = 85 %, RR = 89 %, F1-score = 87 % and PR = 95.8 %, RR = 94.2 %, F1-score = 94.8 % respectively, whereas for frame duplication detection and localization the values are PR = 93 %, RR = 100 %, F1-score = 96 % and PR = 98.8 %, RR = 100 %, F1-score = 99 % respectively. For videos with a dynamic background, the values for frame insertion/deletion/duplication detection and localization are PR = 95.6 %, RR = 82.4 %, F1-score = 88.4 % and PR = 99.4 %, RR = 97.6 %, F1-score = 98.4 % respectively. Bai et al. [11] presented a technique to identify and locate TCP and ETS inpainting forgery in video using spatio-temporal LBP analysis. The proposed method is tested on both static and moving background videos; however, its performance degrades on fast-moving backgrounds. Aparicio et al. [8] presented a technique to detect and locate copy-move and frame duplication forgery in video using a block correlation matrix, which stores both the spatial and temporal information of all the pixels. Based on texture inconsistency, Saddique et al. [94] proposed a new method to detect region manipulation forgery in video. Firstly, the Difference of Consecutive Frames (DOCFs) is calculated from the video sequence. Discriminating features are then extracted via a CCD-DRLBP (Chrominance value of Consecutive Frame Difference and Discriminative Robust Local Binary Pattern) descriptor, which supports the detection and localization of the forgery. The extracted features are supplied to an SVM to classify video clips as authentic or forged. The proposed approach is robust against geometric transformations and post-processing operations; however, it is not suitable for video captured by a moving camera. Aloraini et al. [5] have proposed an approach for detecting object-based forgery (specifically of moving objects) in video. The approach is divided into three stages: spatio-temporal filtering, sequential analysis, and object movement estimation. In the spatio-temporal filtering stage, the video is divided into frames, spatial decomposition is applied with the help of a Laplacian pyramid, and a temporal high-pass filter is used to detect edges. Sequential analysis, the second stage, identifies the pixel changes in video frames. At last, the forged object is estimated by aggregating all the pixel changes across frames. Furthermore, the same authors modified this approach using sequential and patch analysis and developed a new approach in [6] for the identification and localization of object removal forgery in video. In the sequential analysis, video sequences are modelled as stochastic processes, and alterations in the model parameters are explored to detect the forgery, whereas in the patch analysis, video sequences are modelled as a combination of normal and abnormal patches to identify the distribution of each patch. Finally, the forged regions are localized by observing the movement of the removed objects using the abnormal patches.
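As an illustration of texture-based analysis such as the spatio-temporal LBP approach of Bai et al. [11] described earlier in this subsection, the sketch below compares LBP histograms of adjacent frames with a chi-square distance and flags spikes; the parameters, threshold, and input file name are illustrative, not those of the paper.

```python
# Sketch of LBP-based temporal inconsistency analysis (cf. [11]).
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_hist(gray, P=8, R=1):
    # Uniform LBP yields values in [0, P+1], hence P+2 histogram bins.
    lbp = local_binary_pattern(gray, P, R, method="uniform")
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

cap = cv2.VideoCapture("video.mp4")  # hypothetical input file
prev_hist, chi2 = None, []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    h = lbp_hist(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    if prev_hist is not None:
        # Chi-square distance between adjacent LBP histograms.
        chi2.append(np.sum((h - prev_hist) ** 2 / (h + prev_hist + 1e-9)))
    prev_hist = h
cap.release()

# Spikes indicate texture discontinuities left by inpainting or frame edits.
chi2 = np.array(chi2)
print(np.where(chi2 > chi2.mean() + 3 * chi2.std())[0])
```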
Kharat et al. [51] proposed a two-stage algorithm to identify frame duplication forgery in MPEG-4 video. In the first stage, the motion vectors of all frames are analyzed to flag suspicious frames; in the second stage, SIFT features of every flagged frame are computed to take the final decision on duplication. The suggested method works well on both compressed and uncompressed videos at different compression rates.
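A hedged sketch of the SIFT-based confirmation stage used in two-stage schemes such as Kharat et al. [51] is given below: given two suspicious frames from the motion-vector screening stage, keypoints are matched with Lowe's ratio test, and a near-total match suggests duplication. The ratio threshold is an illustrative choice.

```python
# Sketch of SIFT matching to confirm frame duplication (cf. [51]).
import cv2

def sift_match_ratio(frame_a, frame_b, ratio=0.7):
    sift = cv2.SIFT_create()
    ga = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gb = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    ka, da = sift.detectAndCompute(ga, None)
    kb, db = sift.detectAndCompute(gb, None)
    if da is None or db is None or len(kb) < 2:
        return 0.0
    matches = cv2.BFMatcher().knnMatch(da, db, k=2)
    # Lowe's ratio test: keep matches clearly better than the runner-up.
    good = [m[0] for m in matches
            if len(m) == 2 and m[0].distance < ratio * m[1].distance]
    return len(good) / max(len(ka), 1)

# A ratio close to 1.0 for two temporally distant frames indicates that one
# is (nearly) a copy of the other, confirming duplication.
```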

3.2.5 Machine learning-based techniques

The success of machine learning in computer vision has encouraged researchers to apply machine learning (ML) and deep learning (DL) models to video forgery detection. These techniques are data-driven (i.e., they need a large amount of data) and are capable of automatically learning the complex features/artifacts required to detect forgery in video. Different ML/DL models such as SVM, K-Nearest Neighbour (KNN), Logistic Regression (LR), Linear Discriminant Analysis (LDA), Multilayer Perceptron (MLP), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and autoencoders have been used by researchers for this purpose. The analysis of video forgery detection techniques based on machine learning is shown in Table 5. Shanableh et al. [95] have presented a machine learning-based approach for the detection of frame deletion forgery in digital video. They extracted features such as prediction residuals, quantization scales, percentage of intra-coded macroblocks, and PSNR values from the video, and used KNN, LR, and SVM classifiers to detect the deletion forgery. They used 36 MPEG-2 coded videos with Constant Bit Rate (CBR) and Variable Bit Rate (VBR); the presented method works on CBR- and VBR-encoded video with both fixed and variable GOP length structures. Yao et al. [143] have designed a deep CNN model to handle object-based forgery in video. They transformed the input video into image patches using an absolute difference algorithm and generated a training dataset of image patches labelled as positive and negative samples. A five-layer CNN model was then trained on this data, implemented in the Caffe deep learning framework [46], and tested on videos (encoded with the H.264/MPEG-4 codec) taken from the SYSU-OBJFORG dataset [20]. Long et al. [73] have proposed a Convolutional 3D Neural Network (C3D) model to detect and localize frame deletion (dropping) forgery in video by exploiting the spatio-temporal relationships in digital videos. The model is tested and validated on videos taken from the Yahoo Flickr Creative Commons 100 Million (YFCC100m) [144] and Nimble Challenge 2017 [79] datasets, and it is suitable for both stationary and moving background videos. The work by D’Avino et al. [27] used a deep learning model based on an autoencoder and an RNN to detect splicing forgery in video. They extracted frame residual-based features to train the network. The experiment is implemented in TensorFlow using the Adam learning algorithm and tested on a personal dataset available at [34]. The limitation of their model is that training the deep network takes a long time. Based on spatio-temporal consistency, Kono et al. [55] have proposed a Convolutional Long Short-Term Memory (ConvLSTM) model to detect object removal forgery in video. They used a CNN to capture the spatial aspects of the video, whereas an RNN captures the temporal aspect. The method works for both static and dynamic background videos.
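The classical recipe of handcrafted features plus an off-the-shelf classifier (cf. Shanableh et al. [95]) can be sketched as follows. Here simple frame-difference statistics stand in for the codec-level features used in the paper, and the file names are hypothetical.

```python
# Sketch of handcrafted features + SVM for forgery classification (cf. [95]).
import cv2
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def residual_features(path, max_frames=300):
    # Frame-difference statistics as stand-in features; the paper uses
    # codec-level features such as prediction residuals and PSNR.
    cap = cv2.VideoCapture(path)
    prev, feats = None, []
    while len(feats) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is not None:
            r = np.abs(gray - prev)
            feats.append([r.mean(), r.std(), np.percentile(r, 95)])
        prev = gray
    cap.release()
    return np.array(feats)

X, y = [], []
for path, label in [("original.mp4", 0), ("forged.mp4", 1)]:  # hypothetical files
    f = residual_features(path)
    X.append(f)
    y.append(np.full(len(f), label))
X, y = np.vstack(X), np.concatenate(y)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf").fit(Xtr, ytr)
print("held-out accuracy:", clf.score(Xte, yte))
```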
Hong et al. [39] presented a scheme to detect frame deletion forgery in HEVC-encoded video. They concentrated on the sort of frame changes that occur when a frame is deleted, which create subtle differences between the coding patterns of the source and the manipulated video. The proposed scheme consists of two parts: in the first, useful features are extracted from the compressed coding information; in the second, classifiers such as LDA, KNN, and MLP check the genuineness of the video. The benefit of this scheme is that it is designed for video encoded with the latest codec, HEVC. Johnston et al. [48] proposed a framework for the localization of region tampering in video using features learned from original video content. They used a CNN to estimate compression parameters such as the quantization scale, deblocking filter setting, and intra/inter frame type. Zampoglou et al. [147] presented a technique to detect double quantization, frame insertion, and region manipulation in video using a deep CNN. They designed two forensic filters, one based on the DCT and the other on the quantization error; the filter outputs are fed to the deep CNN to differentiate original from forged video.
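To indicate the flavor of the patch-based CNN detectors discussed above (e.g., Yao et al. [143]), the following toy PyTorch model classifies absolute-difference patches as pristine or forged; the layer sizes are illustrative and do not reproduce any published architecture.

```python
# Toy CNN for classifying difference-image patches (cf. the direction of [143]).
import torch
import torch.nn as nn

class PatchCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, 2)  # pristine vs. forged

    def forward(self, x):  # x: (N, 1, H, W) absolute-difference patches
        return self.classifier(self.features(x).flatten(1))

model = PatchCNN()
logits = model(torch.randn(8, 1, 64, 64))  # dummy batch of 64x64 patches
print(logits.shape)                        # torch.Size([8, 2])
```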

Table 5 Analysis of video forgery detection techniques based on machine learning

4 Anti-forensics techniques

Video anti-forensic techniques have been developed to deceive forensic investigation by removing or concealing the traces left after a forgery. Although forensic techniques are useful for identifying digital manipulations in videos, most of them can fail if a forger uses an anti-forensic approach. Counter anti-forensic techniques, in turn, work on the principle that removing or reducing the traces left by a manipulation itself leaves new evidence, which can be further investigated to identify the forgery. Stamm et al. [108] concentrated on the periodic re-compression artifacts left after frame insertion and deletion forgery. They designed an anti-forensic technique targeting the P-frame prediction error in a manipulated digital video, and furthermore designed a counter anti-forensic technique that compares the actual prediction error with the prediction error acquired from the video. Su et al. [113] presented an anti-forensic method in which the inter-frame relationships of coding modes in adjacent frames are analyzed to determine whether intra-prediction can be applied during the re-encoding of the tampered video; after re-encoding, the coding parameters and bit-rates are also examined to predict the targeted distribution of quantization indices. Kang et al. [50] modified the frame deletion detection methodology of [70] and proposed a new methodology that can also detect frame insertion forgery. The authors further designed an anti-forensic method based on the analysis of the P-frame prediction error for frame deletion forgery, along with a counter anti-forensic approach in which the prediction error is estimated and then compared with the stored prediction error. Yao et al. [142] focused on inter-frame interpolation as an anti-forensic operation against the identification of frame deletion forgery in video. The method is tested on video encoded with the H.264 and H.265 codecs with a default GOP size of 250. The analysis of anti-forensics techniques for video forgery detection is shown in Table 6.
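The periodicity analysis underlying re-compression fingerprints (cf. Stamm et al. [108]) can be sketched as follows: frame deletion shifts the GOP alignment, so the prediction-error sequence of the re-encoded video exhibits a periodic spike pattern detectable in the frequency domain. In this illustrative sketch, the mean absolute frame difference stands in for the true P-frame prediction error, and the input file name is hypothetical.

```python
# Sketch of periodicity detection in a prediction-error proxy (cf. [108]).
import cv2
import numpy as np

cap = cv2.VideoCapture("recompressed.mp4")  # hypothetical input file
prev, energy = None, []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
    if prev is not None:
        energy.append(np.abs(gray - prev).mean())  # residual-energy proxy
    prev = gray
cap.release()

e = np.array(energy) - np.mean(energy)
spectrum = np.abs(np.fft.rfft(e))
# A strong peak at a non-zero frequency reveals a periodic residual pattern
# consistent with deletion followed by re-encoding.
peak = np.argmax(spectrum[1:]) + 1
print("dominant period (frames):", len(e) / peak)
```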

Table 6 Analysis of anti-forensics techniques

5 Deepfake detection

Deepfakes are media that use machine learning to take a person in an actual photo or video and replace them with someone else’s likeness. Deepfakes have been used in pornographic pictures and videos to swap in the faces of politicians or celebrities. Hence, deepfake videos can be misused to trigger political or religious instability, fool the public, affect election campaign results, or disrupt financial markets by creating fake news stories [78]. Figure 18 shows an example of a deepfake video wherein the original face is replaced with a new one. The analysis of deepfake detection techniques is shown in Table 7.

Fig. 18
figure 18

Deepfake example [25]

Table 7 Analysis of deepfake detection techniques

Li et al. [64] exploited the fact that a normal human usually blinks once every 2-10 seconds and that a single blink takes 0.1-0.4 seconds, and noted that blinking rates in deepfake videos are considerably lower than in normal videos. Based on this physiological signal (eye blinking), they proposed a Long-term Recurrent Convolutional Network (LRCN) model to detect deepfake videos. A set of eye sequences is provided as input to the LRCN model, which consists of three stages: 1) feature extraction, 2) sequence learning, and 3) state prediction. The same authors proposed a deep learning-based model in [63] to detect deepfake videos with the help of face warping artifacts. CNN models such as VGG16 [99], ResNet152, ResNet101, and ResNet50 [38] are used to detect the deepfake forgery. PRNU analysis is adopted by Koopman et al. [56] to expose deepfakes in video. They divide the video into frames, crop the faces out of those frames, divide the extracted faces into groups, and calculate the PRNU for each group. The mean normalized cross-correlation score is then calculated to distinguish deepfakes from authentic videos. Guera et al. [37] explored the intra-frame and inter-frame consistency between video frames and developed a temporal-aware pipeline using CNN and LSTM models. Frame-level features are extracted by the CNN and fed to the LSTM model to detect the deepfake video; the proposed model is tested on 300 deepfake videos with an average accuracy of 96.96 %. Afchar et al. [1] proposed the MesoNet deep learning network to observe the mesoscopic properties of images/frames for detecting forged face videos, and evaluated it on a fake video dataset with an average detection rate of 98 %. To identify deepfake videos, a Recurrent Convolutional Network (RCN) model is suggested by Sabir et al. [92]. The model integrates CNN features from DenseNet [44] with gated recurrent unit cells [23] to analyze the temporal correlation across frames, and it is tested on the FaceForensics++ dataset [91], which consists of 1,000 videos. Yang et al. [141] presented a deepfake detection method that analyzes the differences between 3D head poses, comprising head orientation and position; the extracted artifacts are given to an SVM classifier to obtain the detection result. Nguyen et al. [77] suggested capsule networks to identify manipulations in images and videos. They used the VGG-19 network [99] to extract latent features from video frames and then fed them to the capsule network (based on the dynamic routing algorithm [93]) for classification. Zhang et al. [149] have presented a transfer learning-based technique to identify deepfake forgery in video, using two neural network models, Inception-v3 and MobileNet V1 [40]. Amerini et al. [7] presented a technique to expose deepfakes in video using optical flow coefficients and a CNN classifier: the video is divided into frames, the optical flow coefficients among these frames are extracted, and the extracted features are fed to the CNN model to classify the video as original or fake.
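A sketch of the PRNU group-correlation idea of Koopman et al. [56] is given below: noise residuals are averaged over groups of face crops, and the groups are compared by normalized cross-correlation, with a low mean correlation suggesting a swapped face. The generic denoiser is a stand-in for a dedicated PRNU extraction filter, and same-size, aligned grayscale face crops are assumed to be available already.

```python
# Sketch of PRNU-style group correlation for deepfake screening (cf. [56]).
import cv2
import numpy as np

def noise_residual(gray):
    # Generic denoiser as a stand-in for a dedicated PRNU filter.
    denoised = cv2.fastNlMeansDenoising(gray, h=5)
    return gray.astype(np.float32) - denoised.astype(np.float32)

def ncc(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def group_correlations(face_crops, n_groups=8):
    # face_crops: list of aligned, same-size uint8 grayscale face images.
    groups = np.array_split(np.asarray(face_crops), n_groups)
    patterns = [np.mean([noise_residual(f) for f in g], axis=0) for g in groups]
    return [ncc(patterns[i], patterns[j])
            for i in range(n_groups) for j in range(i + 1, n_groups)]

# Authentic footage from one camera yields consistently high scores; a
# swapped face depresses the mean correlation across groups.
```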

6 Video forgery datasets

In this section, the existing publicly available video forgery datasets are analyzed; Table 8 shows the analysis. Qadir et al. [86] have created a video dataset for testing video forgery detection techniques named the Surrey University Library for Forensic Analysis (SULFA). It contains only copy-move forged videos. The SULFA dataset consists of 150 videos collected from static cameras and is available online at [117]. Each video is 10 seconds long, with a frame rate of 30 fps and a resolution of 320 × 240. SYSU-OBJFORG is a forged video dataset comprising 100 original and 100 forged video footages, developed by Chen et al. [20]. These video sequences are 11 seconds long, with a resolution of 1280 × 720, compressed by the H.264/MPEG-4 codec at a bit rate of 3 Mbit/s and a frame rate of 25 fps. The REWIND forged video dataset was created by Bestagini et al. [13] on the basis of the SULFA dataset [86]. It consists of 10 original and 10 forged videos with a resolution of 320 × 240 pixels and a frame rate of 30 fps, compressed with the MJPEG and H.264 codecs. REWIND also contains the differences between the frames of the original and forged sequences, which is useful in video forgery detection; the dataset is available at [88]. Ulutas et al. [123] have created a dataset of 31 forged videos (with both static and moving backgrounds) containing frame duplication forgery. They manipulated 25 videos taken from the SULFA dataset [86] and 6 videos from different movie scenes using VirtualDub software; the dataset is available online at [26]. D’Amiano et al. [24] have created a dataset of 15 forged videos with copy-move forgery (10 additive and 5 occlusive forgeries), produced with the After Effects Pro tool and available online at [35]. D’Avino et al. [27] have created a dataset containing forged videos with splicing forgery: 10 forged videos along with the 10 original ones, produced with the Adobe After Effects CC tool and available at [34]. Al-Sanjary et al. [4] created the Video Tampering Dataset (VTD), which contains manipulated videos for testing the performance of video forgery detection techniques. The videos are collected from YouTube and networking websites. The VTD includes 33 videos categorized into three types of forgeries: splicing, copy-move, and frame swapping. Each video is 16 seconds long, with a resolution of 1280 × 720 and a rate of 30 frames per second; the dataset is available at [125]. Ardizzone et al. [9] have created a dataset of tampered videos by cloning objects (copy-move forgery) in video sequences, applying various transformations such as scaling, shearing, rotation, flipping, and luminance and RGB changes. They gathered videos from the SULFA [86] and CANTATA [16] datasets for scenarios related to traffic control and parking surveillance. Their dataset contains 160 forged videos with an average duration of 30 cloned frames.

Table 8 Analysis of video forgery datasets [FPS: Frame per Second]

7 Generalized architecture of passive video forgery detection

Passive video forgery detection is essentially a binary classification task: the aim is to classify a given video into one of two classes, original or forged. Most of the existing passive forgery detection techniques first extract distinct features from videos, then select an appropriate classifier and train it using the extracted feature set to classify the videos. Such techniques are proposed in Chen et al. [20], Aghamaleki et al. [2], Aghamaleki et al. [3], Hsu et al. [41], Ravi et al. [87], Kancherla et al. [49], Wang et al. [129], Tan et al. [119], Chen et al. [19], Lin et al. [69], Wang et al. [128], Li et al. [65], Shanableh et al. [95], Yao et al. [143], Long et al. [73], D’Avino et al. [27], Sabir et al. [92], Yang et al. [141], Guera et al. [37], and Nguyen et al. [77]. The generalized architecture for passive video forgery detection is shown in Fig. 19 and consists of the following important stages (a minimal pipeline sketch follows the list):

  • Pre-processing: - The main objective of pre-processing is to enhance the digital video frames by suppressing unnecessary variation or emphasizing features crucial for later processing. Before the feature extraction stage, operations such as RGB-to-gray conversion, DWT or DCT transformation, and cropping are performed on the video to optimize classification performance.

  • Feature Extraction: - This stage starts from the measured data and derives values, called features, that are intended to be relevant and non-redundant. A collection of features is extracted for every class of video frame to differentiate it from the other classes. In digital video analysis, feature extraction obtains useful artifacts from a video that support further investigation.

  • Feature Pre-processing: - The purpose of this module is to reduce the feature dimensionality without significantly reducing the classification performance.

  • Forgery Detection Technique: - The main aim of this stage is to apply suitable techniques to the extracted and pre-processed features to detect forgery in the digital video.

  • Classification: - The prime use of this module is to determine to which class a new observation belongs, using a training set of videos whose classes are known. Based on the chosen set of extracted features, a suitable classifier is designed to distinguish between original and forged video.

  • Forgery Localization: - The main target of this stage is to locate the exact position of the forgery within the video.
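A minimal skeleton of this generalized pipeline is sketched below; every stage is a placeholder to be filled with one of the concrete techniques surveyed above.

```python
# Skeleton of the generalized pipeline in Fig. 19; all stages are placeholders.
import numpy as np

class PassiveForgeryDetector:
    def preprocess(self, frames):
        # e.g., RGB-to-gray conversion, DCT/DWT transform, cropping
        return frames

    def extract_features(self, frames):
        # e.g., noise residuals, motion vectors, LBP/SIFT descriptors
        return np.zeros((len(frames), 16))

    def reduce_features(self, feats):
        # e.g., PCA to shrink dimensionality before classification
        return feats

    def classify(self, feats):
        # e.g., SVM/CNN trained on labelled original vs. forged videos
        return np.zeros(len(feats), dtype=int)  # 0 = original, 1 = forged

    def localize(self, frames, labels):
        # map frame/region labels back to positions in the video
        return np.nonzero(labels)[0]

    def run(self, frames):
        frames = self.preprocess(frames)
        feats = self.reduce_features(self.extract_features(frames))
        return self.localize(frames, self.classify(feats))
```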

Fig. 19
figure 19

Generalized architecture for passive video forgery detection

8 Discussion and new challenges in video forgery detection

The study of various passive video forgery detection techniques reveals several merits and demerits, illustrated in Table 9.

Table 9 Merits and demerits of various passive video forgery detection techniques

The study of the various existing passive video forgery detection techniques shows that, for a particular scenario, the suitability of a detection method depends on the following essential parameters:

  • Compression: The performance of most of the video forgery detection techniques discussed in the literature relies on the video codec (H.264, MPEG-4, MPEG-3, MPEG-2, or MPEG-1) used for compression. Techniques based on compression artifacts may fail on uncompressed forged videos. The forgery detection accuracy of many existing techniques decreases as the compression ratio increases, and it is also affected by changes in video bit-rate and quantization scale. In most cases, compression artifacts present in the video degrade the performance of the detection system. Many of the techniques proposed so far can detect forgeries only in video compressed with a specific codec. Video recompression using the same encoding parameters and forgery identification in highly compressed videos are issues that still need to be addressed.

  • GOP Structure: Commonly used video encoders such as H.264/AVC now employ an adaptive GOP structure, in which the GOP size can grow up to 250 frames depending on content changes. Many of the mentioned techniques work well only for a fixed GOP structure size; only a few are useful for detecting forgery in videos with a variable GOP structure, and even these are unable to detect the deletion of a complete GOP or of multiple GOPs.

  • Noise: Video noise characteristics have changed considerably over the last 15 years, so devising methodologies that cope with new types of noise remains a challenging task for researchers. It is also observed that noise present in the video degrades the performance of the detection system.

  • Video Background: Most forgery detection techniques designed so far are capable of detecting forgery only in videos with a static background (i.e., they are not suitable for videos with a dynamic or moving background). Exceptionally few techniques have been developed to expose forgery in video with a moving background, so this remains an open issue for researchers.

  • Detection and Localization of Forgery: Most of the stated techniques deal with the identification and localization of a single type of forgery in the video and are not capable of examining multiple forgeries present in the same video. Splicing, frame replication, upscale crop, and frame mirroring are kinds of forgery in digital video that remain little explored.

  • Video Frame Count: Most of the present techniques for detecting inter-frame forgery depend on the number of frames inserted, deleted, or duplicated; they are unable to detect the forgery when the affected frame count is less than a certain threshold.

  • Video Quality and Length of the Video: Many video forgery detection techniques have been designed only for low-resolution, short videos. There is therefore considerable scope for researchers to develop better methods to detect and localize forgery in long videos.

  • Video Forgery Datasets: The foremost concern with the existing techniques discussed in the literature is the lack of video forgery datasets for comparative experimental analysis. The current datasets mostly consist of videos with a single type of forgery, such as copy-move, splicing, or frame duplication, and they mostly contain forged videos with a stationary background; very few datasets reviewed in the literature include forged videos with a moving background. Presently, no video forgery dataset that includes inter-frame forgeries such as frame insertion, frame deletion, and frame shuffling is publicly available on the Internet. Hence, there is ample scope for researchers to create forged video datasets for other types of forgery and with moving backgrounds.

  • Computational Time: Reducing the high computational time needed to detect and locate forgery in video remains a primary task for researchers.

  • Post-processing Operations: Most of the forgery detection techniques presented in this survey have not addressed robustness against post-processing operations such as intentional noise addition, compression, and brightness change.

  • Use of Machine Learning/Deep Learning: Very few techniques developed so far make use of machine learning methods, especially deep learning. There is immense scope for researchers to work with different types of ML/DL models for the detection of both inter- and intra-frame forgery, and to use such models to design automated forgery detection techniques.

  • Inadequate Anti-forensic and Deepfake Detection Strategies: Very few anti-forensic techniques have been developed so far to expose forgery in video, and most of those can handle frame deletion forgery only. This presents a great opportunity for researchers to explore anti-forensic strategies for other types of forgery. Furthermore, deepfake detection in video is one of the hottest areas for further research in the video forensics domain.

  • Audio Aspect of Video: Although the visual content of video helps us in legal matters, the role of audio in making a decision cannot be ignored. All the forgery detection techniques proposed so far focus only on visual content; no attention has been given to the audio component of digital video.

We believe that this study will enable researchers working in the field of video forgery detection to find new useful approaches and ideas. A detailed summary of video forgery detection techniques is presented in Table 10.

Table 10 Summarization of video forgery detection techniques (A: Copy-Move, B: Splicing, C: Region Manipulation (Object insertion or deletion), D: Frame Insertion, E: Frame Deletion, F: Frame Duplication, G: Frame Replication, H: TCP & ETS Inpainting, I: Upscale Crop, J: Mirror Invariant, K: Detection, L: Localization, M: Fixed Size GOP, N: Variable Size GOP, O: Video with Static Background & P: Video with Moving Background)

9 Conclusions

This paper presented a comprehensive analysis of passive video forgery detection techniques. The techniques were analyzed in detail in terms of the features/methods used, the forgeries identified, the datasets used, and the performance parameters, along with their limitations. Emerging topics such as anti-forensic strategies and deepfake detection in video have also been discussed, and the standard benchmark datasets related to video forgery have been reviewed. Some of the critical challenges that can drive significant research in this field have also been mentioned. Although researchers have proposed several techniques for passive video forgery detection, there is still a need for new techniques that overcome the points discussed in Section 8. It is observed that most existing video forgery detection techniques deal with identifying a single type of forgery and are unable to deal with multiple forgeries. Moreover, most current techniques depend on the GOP structure size, the codec used for video compression, the compression rate, noise, the size/length of the video, the video frame count, and the video background. Very few techniques designed so far detect forgeries with the help of machine/deep learning, and anti-forensics and deepfake detection in video are new aspects that need to be explored further. This survey should help the research fraternity to improve passive video forgery detection techniques with new ideas.