1 Introduction

Over the past two decades, video content has become easy to modify or falsify (video forgery) with many commercial multimedia editing tools (Singh and Aggarwal 2015). Such falsification can lead to severe consequences. For example, forged videos of politicians can mislead voters during elections, and video forgery in the military domain may trigger a war crisis. In practice, a carefully crafted video forgery may not be distinguishable even by human experts, raising issues of authenticity, originality, and integrity of video content. For these reasons, effective forensic techniques are urgently needed.

Digital video forgery is mainly divided into two categories: whole frame forgery and object forgery. Whole frame forgery (Li et al. 2016; Liu and Huang 2017; Zhang et al. 2015) modifies video content using an image frame as the forgery unit; existing techniques in this category include frame deletion, frame insertion, and frame duplication. Object forgery, on the other hand, is the insertion or deletion of objects in the video content, e.g., video splicing forgery (Chen et al. 2016) and video copy-move inter/intra-frame forgery (hereinafter referred to as inter/intra-frame forgery).

Whole frame forgery is relatively simple to construct, its result is usually imperfect, and the visual effect often looks unnatural. Therefore, most state-of-the-art detection methods achieve satisfactory results on whole frame forgery, exploiting scene dependency (Li et al. 2016; Liu and Huang 2017; Zhang et al. 2015), optical flow (Bidokhti and Ghaemmaghami 2015; Jia et al. 2018), compression artifacts (Aghamaleki and Behrad 2016; Yu et al. 2016), and deep learning (Bakas and Naskar 2018; Long et al. 2017, 2019). For instance, a coarse-to-fine detection strategy (Jia et al. 2018) based on Optical Flow (OF) addresses frame copy-move forgery, namely frame duplication: the coarse stage analyzes the consistency of the OF sum between consecutive frames to find suspected tampered positions (start or end points of duplicated frame sequences), and the fine stage matches duplicated frame pairs based on OF correlation.

The other forgery type, object forgery, can achieve a realistic result because it relies on more sophisticated and finer techniques such as splicing forgery and copy-move forgery. In splicing forgery, a spliced object and the background elements are first shot with different surveillance cameras and then synthesized together (Chen et al. 2016; Davino et al. 2017). For detecting splicing forgery, a machine learning method (Chen et al. 2016) and a deep learning method (Davino et al. 2017) were proposed to identify the inconsistency of statistical properties between the spliced object and the real background; both report effective and efficient performance.

Video copy-move forgery achieves an excellent visual effect but requires relatively complex manipulation, which can be performed inter-frame or intra-frame (Zhong et al. 2020). Inter-frame forgery pastes objects copied from one frame into other frames of the video, while intra-frame forgery pastes one or more objects copied from a frame back into the same frame. When a video copy-move forgery aims to confuse the viewer by adding objects, it is called additive manipulation; conversely, it is called occlusive manipulation when it aims to hide objects. Figure 1 shows examples of inter/intra-frame forgeries with additive/occlusive manipulations. Notably, it is very difficult to detect a carefully crafted inter/intra-frame forgery with the above-mentioned machine learning or deep learning methods based on the consistency of statistical properties, because the copied objects and the background of the pasted frame are shot by the same surveillance camera: both have the same statistical properties and are therefore indistinguishable. For this reason, video copy-move forgery detection is currently the most challenging task in video forensics.

Fig. 1 Sample clips of video copy-move forgery with additive/occlusive manipulations. a, b show the additive (a red car) and the occlusive (the background floors) samples of inter-frame forgeries, respectively; c, d show the additive (a cell) and the occlusive (the background wall) samples of intra-frame forgeries, respectively. The 1st and 2nd rows of a–d show the forgery frame clips and the corresponding ground-truth clips, respectively. Black indicates the background and white indicates the forgery region

In the literature, only a few works achieve satisfactory detection results for video copy-move forgery, and they do so at a high computational cost. Moreover, most existing work is designed for a single type of video copy-move forgery (either inter-frame or intra-frame) but not both. Subramanyam et al. (Subramanyam and Emmanuel 2012) combined Histogram of Oriented Gradients (HOG) feature matching with video compression properties to address only intra-frame forgery in the MPEG-4 format. However, this work incurs an unacceptably high computational cost and is hence unsuitable for long video clips. In (Bestagini et al. 2013), a detection algorithm is proposed that allows a forensic analyst to reveal and locate inter-frame forgeries, but it fails to resist geometrical manipulations such as rotation. Su et al. (Su et al. 2018) extracted Exponential-Fourier Moments (EFMs) features in each frame to find potential matching pairs of intra-frame forgery. However, this work can only detect inter-frame forgery and lacks robustness against compression. Deep neural network (DNN) schemes, e.g., Motion Residual and Parasitic Layers (MRPL) (Saddique et al. 2020), have been proposed to address video copy-move forgery. However, MRPL only detects the differences between the start and end of the forged frames and their adjacent frames. Given the richness of forgery objects, DNNs for detecting inter/intra-frame forgeries are still in their infancy.

To summarize, the existing detection methods for video copy-move forgery suffer from three defects:

  1. (i)

Most state-of-the-art methods cannot make a good trade-off between accuracy and efficiency. A video of medium length may contain hundreds of frames, which already incurs a prohibitive computational cost.

  2. (ii)

Based on statistical properties alone, it is almost impossible to distinguish true video copy-move forgery regions from the similar backgrounds of adjacent frames.

  3. (iii)

Most state-of-the-art methods are only suitable for a single type of video copy-move forgery: either inter-frame or intra-frame. Furthermore, most methods cannot achieve satisfactory results when detecting forgery regions under post-processing and geometrical transformations.

To address these defects, a fast forgery frame detection method is proposed for both inter- and intra-frame video copy-move forgery identification. The contributions of the proposed method are as follows:

  1. (i)

The sparse feature extraction and matching (in Sect. 3-A) speed up the algorithm and greatly reduce the time cost (Defect (i)).

  2. (ii)

A new adaptive two-pass filtering algorithm (in Sect. 3-B) is proposed to remove outlier-pairs, locate true forgery frame-pairs (FFPs) effectively, and address the similarity problem (Defect (ii)) in both inter- and intra-frame forgery.

  3. (iii)

Based on the results of these frame-pairs, the type of video copy-move forgery can be identified (Defect (iii)). Furthermore, the copy-move frame-pair matching algorithm (in Sect. 3-C) locates the true FFPs, further reducing the computational cost and false alarms when detecting inter/intra-frame forgery efficiently and effectively (Defect (i)).

Experimental results demonstrate that our proposed method achieves better performance (in accuracy and time) than the existing state-of-the-art methods, even under post-processing manipulations and geometric attacks.

The rest of the paper is organized as follows. Section 2 briefly reviews the related work. Section 3 presents the proposed video copy-move forgery detection method. The experiments and the conclusion are presented in Sects. 4 and 5, respectively.

2 Related work

Only a few state-of-the-art methods (Bestagini et al. 2013; Lowe 2004; Saddique et al. 2020; Su et al. 2018; Subramanyam and Emmanuel 2012; Zhang et al. 2015) address video copy-move forgery detection. Subramanyam et al. (Subramanyam and Emmanuel 2012) proposed Histogram of Oriented Gradients (HOG) feature matching combined with video compression properties to address only intra-frame forgery in the MPEG-4 format. However, this work is not sufficiently robust to rotation manipulation and also incurs an unacceptably high computational cost, making it unsuitable for long video clips. Su et al. (Su et al. 2018) presented the extraction of Exponential-Fourier Moments (EFMs) features in each frame to find potential matching pairs of intra-frame forgery. If any suspicious forgery object is found, an adaptive parameter-based fast compression tracking is applied to track it in the subsequent frames. However, this work can only detect inter-frame forgery and lacks robustness against compression. Even worse, the EFMs are block-based features and, like other block-based methods, fail to detect scaled forgeries.

Recently, local descriptors, with their geometrical invariance and high efficiency, have offered good solutions to the above defects in video copy-move forgery identification. Therefore, our proposed method uses geometrically invariant local descriptors instead of block features to extract useful keypoints.

Popular and effective local descriptors include the Scale Invariant Feature Transform (SIFT) (Lowe 2004) and Speeded-Up Robust Features (SURF) (Bay et al. 2006). Each local descriptor for keypoint extraction has its own characteristics, e.g., the simple binary features and sparse keypoints of the ORB (Oriented FAST and Rotated BRIEF) descriptor for fast matching, or the rich features and dense keypoints of SIFT and SURF for accurate matching. For localizing copy-move forgery frames, the relatively sparse ORB keypoints with binary bits (0/1) greatly speed up matching for frame localization, whereas the relatively dense SURF keypoints with rich features (0–255) find more keypoint matches for fine pixel-level indication. Different local descriptors thus suit different stages and can strike a balance between efficiency and effectiveness. Since our proposed method aims at near real-time processing speed, we prefer sparse ORB feature extraction and matching over other local descriptors to speed up the algorithm.

In the matching stage, there are many effective matching and filtering algorithms, such as FLANN (Muja and Lowe 2014), KNN matching (Abeywickrama et al. 2016), and Random Sample Consensus (RANSAC) (Fischler and Bolles 1981). However, some of them are not well designed for keypoint matching, especially when dealing with a huge number of keypoints with high-dimensional descriptors, and they generate a large number of false-positive matches. In the literature, the Nearest-Neighbor (2NN) test (Amerini et al. 2011) has been demonstrated to be an effective technique for keypoint matching, and Grid-based Motion Statistics (GMS) a good solution for suppressing large numbers of false-positive matches.

3 Our proposed method

The pre-processing stage of the proposed method transforms the RGB video into a gray-scale composite image. Sparse features are extracted from the composite image and matched to find the best matching keypoint-pairs (Sect. 3-A). If any best matching keypoint-pair is found, a new adaptive two-pass filtering algorithm is applied to remove the outlier-pairs (Sect. 3-B). The statistical information of the remaining best matching keypoint-pairs (namely, the inter/intra-frame keypoint-pairs) over all frame-pairs is used to locate the best matching frame-pairs (Step 1 in Sect. 3-C). Then, the successive best matching frame-pairs are preserved as the true FFPs, which determine whether the video is original or contains inter/intra-frame forgery (Step 2 in Sect. 3-C). A minimal pre-processing sketch is given below.
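For illustration only, the following minimal sketch builds the gray-scale composite image with OpenCV and NumPy. The horizontal frame layout and the helper name `video_to_composite` are our assumptions; the paper does not specify the layout of the composite image.

```python
import cv2
import numpy as np

def video_to_composite(path, max_frames=None):
    """Read a video and concatenate its gray-scale frames horizontally
    into one composite image (layout assumed, not specified by the paper)."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok or (max_frames is not None and len(frames) >= max_frames):
            break
        # OpenCV decodes frames as BGR; convert each to gray-scale
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    cap.release()
    h, w = frames[0].shape
    return np.hstack(frames), len(frames), h, w  # H x (N*W) image
```

Under this layout, a keypoint at horizontal coordinate x belongs to frame x // W, which makes the spatial-distance thresholds of Sect. 3-B directly applicable.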

The proposed method consists of three subsections:

  1. (A)

    sparse feature extraction and matching for finding the best matching keypoint-pairs;

  2. (B)

    an adaptive two-pass filter for removing the outlier-pairs from the best matching keypoint-pairs to obtain inter/intra-frame keypoint-pairs;

  3. (C)

the copy-move frame-pairs matching algorithm, which locates the best matching frame-pairs (Step 1) and preserves the successive best matching frame-pairs as the true FFPs (Step 2) (Fig. 2).

Fig. 2 The framework of the proposed fast forgery frame detection method for video copy-move inter/intra-frame identification

  • A. Sparse feature extraction and matching.

ORB is a combination of the FAST keypoint detector and the BRIEF descriptor generation algorithm. With its inherent geometrical invariances, ORB can effectively resist post-processing and geometrical manipulation. Arguably, ORB performs nearly as well as SIFT and SURF in geometrical invariance but is almost two orders of magnitude faster. However, it is well known that feature matching takes much more computation than feature extraction. For this reason, only 128 binary bits (0/1) are used in the extracted ORB descriptor in order to speed up descriptor matching and lower the matching cost. Compared to SIFT and SURF, this relatively low descriptor dimension also improves matching efficiency. While ensuring efficiency, ORB provides a sufficient number of keypoints for fast frame-pair matching.

Then, the Nearest-Neighbor (2NN) test with Euclidean distance (Amerini et al. 2011) is used to match keypoints with similar local descriptors into keypoint-pairs. Given a keypoint kpi, the vector d = {d1, d2, d3, …, dn−1} records the Euclidean distances between the local descriptor of kpi and those of the remaining (n−1) keypoints, where n is the number of keypoints. The vector d is then sorted in increasing order to obtain ds = {ds1, ds2, ds3, …, dsn−1}. The 2NN matching procedure evaluates the ratio of the 1st closest distance ds1 to the 2nd closest distance ds2, and a keypoint-pair is accepted when the ratio satisfies the following:

    $$ds_{1} /ds_{2} < t$$
    (1)

where the threshold t = 0.6 has been demonstrated to be an effective hyperparameter for keypoint matching in copy-move forgery detection (Li and Zhou 2019). All other candidates are filtered out as false matches. A minimal sketch of this stage follows.
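The sketch below illustrates ORB extraction on the composite image and the 2NN ratio test, assuming OpenCV. Note that OpenCV's ORB produces 256-bit descriptors compared under the Hamming norm, whereas the paper uses 128 binary bits with Euclidean distances, so this is an approximation of the described stage rather than the authors' implementation; the helper name `best_matching_pairs` is ours.

```python
import cv2

def best_matching_pairs(composite, n_features=20000, t=0.6):
    """ORB extraction on the composite image plus the 2NN ratio test (Eq. (1))."""
    orb = cv2.ORB_create(nfeatures=n_features)
    keypoints, desc = orb.detectAndCompute(composite, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    pairs = []
    # Matching the descriptor set against itself: the closest neighbour is
    # the keypoint itself (distance 0), so ask for three and drop it.
    for nbrs in matcher.knnMatch(desc, desc, k=3):
        nbrs = [m for m in nbrs if m.trainIdx != m.queryIdx]
        if len(nbrs) < 2:
            continue
        ds1, ds2 = nbrs[0].distance, nbrs[1].distance
        if ds2 > 0 and ds1 / ds2 < t:   # 2NN test, Eq. (1)
            pairs.append((keypoints[nbrs[0].queryIdx].pt,
                          keypoints[nbrs[0].trainIdx].pt))
    return pairs
```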

  • B. Two-pass filtering in inter/intra-frame forgery.

After the 2NN test, each keypoint is matched to its best matching keypoint, forming the best matching keypoint-pairs. However, these pairs contain some disturbed keypoint-pairs. In particular, in intra-frame forgery, many disturbed keypoint-pairs belong to the same object and have a small spatial distance. Besides, the similar backgrounds of adjacent frames in inter-frame forgery may also generate many disturbed keypoint-pairs. Figure 3c, d shows the sparse feature extraction and matching result (best matching keypoint-pairs), which contains disturbed keypoint-pairs, for the adjacent frames of Fig. 3a, b. Therefore, an adaptive two-pass filter consisting of a low-pass and a high-pass filter is proposed.

Fig. 3 Sparse keypoint filtering results for different kinds of forgeries. For better indication, yellow and white respectively indicate the background and the forgery region of the ground truths; the red points indicate the keypoints

    1. 1.

      Low-pass filtering in intra-frame forgery.

The low-pass filter uses a relatively small spatial distance to remove the outlier-pairs and obtain the intra-frame keypoint-pairs. In fact, every frame of an intra-frame forgery can be regarded as a copy-move image forgery. Therefore, we adopt the filtering distance L1 from copy-move image forgery detection (Zhong and Pun 2019), as shown in Eq. (2). In intra-frame forgery, the copied and pasted regions lie in the same frame, so the distance kd1 of a best matching keypoint-pair must be smaller than the frame width W:

      $${L_1} \leqslant {k_{d1}}<W$$
      (2)

where \({L_1}=\frac{H+W}{\sqrt{\min (H,W)}}\), and H and W are respectively the height and the width of a video frame.

    2. 2.

      High-pass filtering in inter-frame forgery.

First, the forged frames must span a certain length in inter-frame forgery. Based on the persistence of vision, the human eye needs about 0.4 s, i.e., roughly 10 frames at a typical 25 frames per second (fps), to discern continuous video content. This means that the copied clip and the pasted clip of an inter-frame video forgery each contain no fewer than 10 frames. Second, the backgrounds of adjacent frames taken by the same surveillance camera are so similar that the 2NN test generates many disturbed keypoint-pairs. Therefore, a high-pass filter is designed to remove disturbed keypoint-pairs using a relatively long spatial distance. In inter-frame forgery, the high-pass filtering distance kd1 of a best matching keypoint-pair is given in Eq. (3):

      $$k_{{d1}} \ge L_{2} \cdot W$$
      (3)

where L2 is the number of filtering frames; the smaller L2 is, the more disturbed keypoint-pairs are preserved. Since the persistence of vision dictates that the number of forged frames is no fewer than 10, the filtering number L2 is chosen from 1 to 9 frames.

To determine the best value of L2, we conducted an extensive test on our available dataset. Figure 4 shows the percentage of remaining keypoint-pairs for filtering distances of 1 to 9 frames. Note that the best matching keypoint-pairs of an inter-frame forgery contain both inter-frame keypoint-pairs and disturbed keypoint-pairs. As the filtering number L2 increases, more disturbed keypoint-pairs are removed, and the number of remaining keypoint-pairs decreases rapidly. When L2 increases from 1 to 7, the remaining keypoint-pairs decrease from 90.41% to 70.12%. When L2 exceeds 7, the total number of keypoint-pairs remains essentially unchanged, meaning that the disturbed keypoint-pairs have almost all been filtered out and the remaining ones are inter-frame keypoint-pairs. Therefore, L2 is set to 7 based on the analysis of Fig. 4.

Combining the two-pass filtering analysis of inter/intra-frame forgery, we finally set the distance kd1 of the best matching keypoint-pairs as follows for filtering disturbed keypoint-pairs: if \({L_1} \leqslant {k_{d1}}<W\), the remaining best matching keypoint-pairs are intra-frame keypoint-pairs; if \({k_{d1}} \geqslant {L_2} \cdot W\), they are inter-frame keypoint-pairs. To summarize,

$$\text{Remaining best matching keypoint-pairs} \in \begin{cases} \text{intra-frame keypoint-pairs}, & L_{1} \leqslant k_{d1} < W \\ \text{inter-frame keypoint-pairs}, & k_{d1} \geqslant L_{2} \cdot W \end{cases}$$
      (4)

Figure 3e and f show that after two-pass filtering, the inter/intra-frame keypoint-pairs marked in red mainly lie in the forgery region. This greatly facilitates finding the copy-move frame-pairs in the next step. A minimal code sketch of the two-pass filter is given after Fig. 4.

Fig. 4 The percentage of remaining keypoint-pairs at different frame distances
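As announced above, here is a minimal sketch of the two-pass filter under the horizontal composite layout assumed in the pre-processing sketch; the explicit same-frame check and the helper name `two_pass_filter` are our assumptions, not part of the paper.

```python
import math

def two_pass_filter(pairs, h, w, l2=7):
    """Adaptive two-pass filtering, Eqs. (2)-(4), on composite coordinates."""
    l1 = (h + w) / math.sqrt(min(h, w))      # low-pass threshold of Eq. (2)
    intra, inter = [], []
    for (x1, y1), (x2, y2) in pairs:
        kd1 = math.hypot(x2 - x1, y2 - y1)   # spatial distance of the pair
        same_frame = int(x1 // w) == int(x2 // w)
        if same_frame and l1 <= kd1 < w:     # Eq. (2): intra-frame pair
            intra.append(((x1, y1), (x2, y2)))
        elif kd1 >= l2 * w:                  # Eq. (3): inter-frame pair
            inter.append(((x1, y1), (x2, y2)))
    return intra, inter
```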

  • C. Copy-move frame-pairs matching.

After two-pass filtering, the preserved keypoints are the inter/intra-frame (remaining best matching) keypoint-pairs within the forgery frame-pairs. The frame-pairs with the maximum number of remaining best matching keypoint-pairs are regarded as the potential best matching frame-pairs, so we use this count as the index for finding them. However, a best matching frame-pair only represents the frame-pair with the strongest correlation; it is not necessarily a true FFP. By the persistence of vision, a forgery confined to an isolated frame-pair is meaningless. In other words, true FFPs must be successive best matching frame-pairs. For this reason, our goal is to find the successive best matching frame-pairs as the true FFPs.

Fig. 5 An example of true forgery frame-pairs and their keypoint-pairs

Given the total number N of video frames, the combined set of all candidate best matching frame-pairs of a video is U = {u1,1, u1,2, …, u1,N, u2,1, …, ui,j, …, uN,N}, where i, j \(\in\) {1, 2, 3, …, N}. The total number of keypoint-pairs in each frame-pair ui,j is denoted si,j. The steps and the pseudocode of the proposed Algorithm 1 are given in the following; a minimal code sketch appears after the Fig. 5 example.

Step 1

Find the best matching frame-pairs. For any frame i, search all frames except frame i to find the frame j with the maximum number si,j of best matching keypoint-pairs. Then, return the best matching frame-pair ui,j.

Step 2

Find the true forgery frame-pairs (FFPs). Set a threshold τ on the number of successive frame-pairs to filter out isolated best matching frame-pairs and obtain the true FFPs.

Figure 5 shows an example of true FFPs ui,j formed by successive frame-pairs; i.e., when the run of successive best matching frame-pairs ui,j has length τ > 5, these frame-pairs are preserved as true FFPs.
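As a minimal sketch of Steps 1–2 (again assuming the horizontal composite layout; the run-length bookkeeping and the helper name `find_true_ffp` are our assumptions, not the authors' pseudocode):

```python
from collections import Counter

def find_true_ffp(pairs, w, tau=5):
    """Sketch of Algorithm 1: Step 1 picks, for each frame i, the frame j
    maximizing the keypoint-pair count s[i, j]; Step 2 keeps only runs of
    successive best matching frame-pairs longer than tau."""
    s = Counter()
    for (x1, _), (x2, _) in pairs:
        i, j = sorted((int(x1 // w), int(x2 // w)))  # frame indices
        s[(i, j)] += 1

    # Step 1: best matching frame-pair u_{i,j} per frame i
    best = {}
    for (i, j), count in s.items():
        if i not in best or count > best[i][1]:
            best[i] = (j, count)

    # Step 2: preserve runs (i, j), (i+1, j+1), ... longer than tau
    ffp, run = [], []
    for i in sorted(best):
        j = best[i][0]
        if run and (i, j) == (run[-1][0] + 1, run[-1][1] + 1):
            run.append((i, j))
        else:
            if len(run) > tau:
                ffp.extend(run)
            run = [(i, j)]
    if len(run) > tau:
        ffp.extend(run)
    return ffp  # true forgery frame-pairs; i == j indicates intra-frame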

Algorithm 1 Copy-move frame-pairs matching

Fig. 6 The copied and the pasted regions in the different/same frames of the video clip

In fact, inter/intra-frame forgeries differ in the properties of their true FFPs. For inter-frame forgery, the copied and pasted regions lie in different frames, as shown in Fig. 6a; its true FFPs therefore consist of two different frames. For intra-frame forgery, the copied and pasted regions lie in the same frame, as shown in Fig. 6b; its true FFPs therefore consist of the same frame. Based on the true FFPs, the video can be classified as forged or original: if there is no true FFP, the video is considered original; otherwise, the forgery is inter-frame (the frames of a true FFP differ) or intra-frame (they are the same). In this way, the video forgery frames can be identified accurately through the true FFPs.

4 Experiments

The proposed method is compared with several state-of-the-art methods through extensive experiments under various adverse conditions. This section presents the datasets, the performance metrics, and finally the experimental comparisons and analysis.

  1. A.

    Datasets for video copy-move forgery detection.

The GRIP benchmark dataset is employed to evaluate the proposed method and the state-of-the-art methods. It comprises 15 short base videos and 93 derived inter/intra-frame forgery videos, which leave very few or even no traces to raise suspicion. All 15 base videos underwent JPEG compression (quality factor QF = 10, 15, 20), 8 of them underwent rotation (5°, 25°, 45°), and 9 of them underwent flipping, which makes the forgeries more difficult to detect. Table 1 summarizes the statistics of the GRIP dataset: the left column (Original Video) lists the properties of the original base videos, and the right column (Copy-Move Video) lists the properties of the forgery videos, including additive (Add.) or occlusive (Occ.) manipulation, inter- or intra-frame forgery, JPEG compression (Com.), rotation (Rot.), and flipping (Flip.).

    Table 1 Statistics of the GRIP dataset
  2. B.

    Unified performance metrics.

We report detection accuracy (det.), false alarm (f.a.), the F1 performance metric, and processing time (time). If a forgery video is correctly detected by a method, it is marked with “✓” in the “det.” column of Table 2. If an original video is falsely detected as a copy-move one, it is marked with “✖” in the “f.a.” column of Table 2. To measure detection performance accurately, the evaluation criteria TP (True Positive), FP (False Positive), and FN (False Negative) are used, where TP counts detected forgery frames that are truly forged, FP counts original frames falsely detected as forged, and FN counts forgery frames missed by the detector. TP, FP, and FN combine into the F1 indicator as in Eq. (5):

    $${F_1}=\frac{{2TP}}{{2TP+FP+FN}}$$
    (5)

A higher F1 score denotes better performance. Finally, the experiments are run on a computer with an Intel Core i7-8700 @ 3.20 GHz CPU, 16 GB RAM, and an RTX 2070 GPU. Efficiency is measured in terms of normalized CPU time in s/Mpixel. A minimal sketch of the F1 computation follows.
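For concreteness, a minimal sketch of Eq. (5) with illustrative counts of our own (the numbers are not results from the paper):

```python
def f1_score(tp, fp, fn):
    """Frame-level F1 indicator of Eq. (5)."""
    return 2 * tp / (2 * tp + fp + fn)

# Illustrative counts (not from the paper): 90 forgery frames detected,
# 10 original frames flagged, 5 forgery frames missed.
print(round(f1_score(90, 10, 5), 3))  # -> 0.923
```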

  3. C.

    Experimental results.

Several state-of-the-art methods are compared, including the Dense Moment Feature Index and Best Match algorithm with radial-harmonic-Fourier moments (DMFIBM) (Zhong et al. 2020), the Bestagini method (Bestagini et al. 2013), and the MRPL method (Saddique et al. 2020). The method of (Subramanyam and Emmanuel 2012) cannot be applied to real datasets like GRIP because of its very restrictive assumptions on forgery videos. The Bestagini method (Bestagini et al. 2013) can only detect inter/intra-frame forgery at the frame level. The DMFIBM method can detect both inter-frame and intra-frame forgeries; however, it is based on block feature extraction, and the subsequent block feature matching leads to expensive computational costs. MRPL, based on the residual signal between adjacent frames, is suitable for identifying the start and end of forged frames but fails on static forgeries.

The results for plain copy-move forgeries on GRIP are shown in Table 2, where \(\sum\) denotes the total count of the det. or f.a. entries and \(\mu\) denotes the average F1 or time. Note that plain manipulations involve only translations of forgery objects, without other geometrical attacks or post-processing transformations.

Table 2 Detection and efficiency performance for plain copy-moves on the GRIP dataset

Table 2 shows that our proposed method detects all the plain forgery videos and achieves the best performance of F1 = 0.93. The DMFIBM method also detects all the forgery videos, with an F1 score of 0.92. The Bestagini method misses six videos (#videos 6, 11, 12, 13, 14, and 15) and obtains a poor F1 score of 0.49. The DNN-based MRPL model misses five videos (#videos 11, 12, 13, 14, and 15) and gives the weakest performance, with an F1 score of 0.28. This is because MRPL is only competent at searching for differences and coherence between adjacent frames, whereas the copied object and its source frame come from genuine content and exhibit no forgery traces; MRPL therefore misses half of the copy-move frames and performs worst.

In terms of false alarms, our proposed method identifies all the original videos accurately (f.a. = 0), while DMFIBM, Bestagini, and MRPL respectively get f.a. = 1, 5, and 7. In terms of average time cost, our proposed method takes 1.23 s/Mpixel, which is much faster than 16.37 s/Mpixel for DMFIBM, 8.8 s/Mpixel for Bestagini, and 1.70 s/Mpixel for MRPL. In summary, Table 2 shows that the proposed method achieves the best detection accuracy (det.), false alarm (f.a.), and comprehensive performance (F1), with the lowest computational cost, for plain copy-moves on the GRIP dataset.

Subsequently, comparisons under the challenging forgery attacks of the whole GRIP dataset are presented, including JPEG compression, rotation, and flipping. For simplicity, the experimental results are summarized in Table 3. In these comparisons, the DMFIBM method achieves the best detection performance at the video level (det. = 90/93), followed by our proposed method with det. = 88/93, only slightly lower. Nevertheless, the proposed method and the DMFIBM method obtain the same best performance at the frame level (an identical F1 = 0.90), and the proposed method achieves the best score in identifying original videos (f.a. = 2/93). As in Table 2, the Bestagini method ranks third, and MRPL ranks last. In Table 3, the overall mean µF1 is a weighted mean of the per-case F1 scores, as defined in Eq. (6):

$$\mu_{F_{1}}=\frac{\sum\limits_{i=1}^{8} w_{i} F_{1,i}}{93}$$
(6)

where i = 1, 2, …, 8 indexes the cases Plain, QF = 10, QF = 15, QF = 20, θ = 5°, θ = 25°, θ = 45°, and flipping; wi is the number of videos in the corresponding case, namely wi = 15, 15, 15, 15, 8, 8, 8, and 9; and F1,i is the F1 score of the corresponding case. A quick numeric check of Eq. (6) is sketched below.
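The following sketch verifies that the case weights sum to the 93 videos and evaluates Eq. (6) for illustrative F1 values of our own (not the values reported in Table 3):

```python
# Per-case video counts from the text: Plain and the three QF levels have
# 15 videos each, the three rotation angles 8 each, flipping 9; total 93.
w = [15, 15, 15, 15, 8, 8, 8, 9]
assert sum(w) == 93

def mean_f1(f1_per_case):
    """Weighted mean F1 of Eq. (6)."""
    return sum(wi * fi for wi, fi in zip(w, f1_per_case)) / 93

# Illustrative per-case F1 values (not taken from Table 3):
print(round(mean_f1([0.93, 0.90, 0.91, 0.92, 0.88, 0.87, 0.85, 0.89]), 3))
```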

Table 3 Detection results on the whole GRIP dataset

5 Conclusion

This paper proposes a fast forgery frame detection method for video copy-move inter/intra-frame identification. It consists of sparse feature extraction and matching, two-pass filtering, and copy-move frame-pairs matching, which together address three issues:

  1. (i)

    A video of medium length containing hundreds of frames incurs a prohibitive computational cost;

  2. (ii)

    Similar backgrounds in contiguous frames are easily mistakenly detected as copy-move forgery regions, resulting in a large number of false alarms;

  3. (iii)

Most state-of-the-art methods cannot detect both inter-frame and intra-frame video copy-move forgeries simultaneously.

The proposed method makes a good trade-off between efficiency and effectiveness. It achieves the best false alarm rate of 2/93 and the best performance of F1 = 0.90 on the whole GRIP dataset (Table 3), with the lowest computational cost of 1.23 s/Mpixel. In future work, we plan to develop novel and more efficient techniques, e.g., CNN-based models, for video copy-move forgery detection.