1 Introduction

In recent years, powerful and easy-to-use image editing tools such as Photoshop have become widely available to ordinary users. These tools enrich everyday life, but malicious users also abuse them to tamper with images without the authors’ permission, and such forged images place a growing burden on society [26]. One of the most active subtopics in forgery detection is copy-move forgery [9, 26], in which one or more regions of an image are copied and pasted elsewhere in the same image to obscure important content. In general, copy-move forgery involves not only copying and moving (i.e., translation) but also geometrical distortions and post-processing operations such as scaling, rotation and compression. Detecting the forged regions, which are often imperceptible, is difficult after such operations. Because the copied and pasted regions come from the same image, their inner features are similar and compatible. Many proposed copy-move forgery detection (CMFD) methods therefore detect forgeries by correlating the inner features of candidate regions.

Based on previous research, CMFD methods can generally be divided into two categories [9, 35]: block-based [5, 11, 13, 14, 16, 17, 19,20,21,22, 24, 25, 30,31,32] or keypoint-based methods [1, 2, 4, 6,7,8, 10, 15, 18, 23, 27,28,29, 33]. The existing block-based and keypoint-based methods both employ similar processing procedures [9].

The main difference between block-based and keypoint-based methods lies in feature extraction. Block-based methods [9] divide the image into overlapping blocks that are typically rectangular, although some improved methods use circular blocks [14, 35]. A feature extraction algorithm is then applied to compute a feature vector for each block, and candidate regions are matched based on the feature vector coefficients. Over the last fifteen years, numerous block-based methods have been proposed for CMFD. Fridrich et al. [13] first proposed using quantized Discrete Cosine Transform (DCT) coefficients as features of overlapping rectangular blocks. Popescu et al. [22] applied principal component analysis (PCA) to reduce the dimension of the DCT features. The 8 × 8 DCT coefficient blocks [19], the DCT transform domain [20], the sum of the pixel intensities [21] and the histogram of oriented gradients [16] have also been proposed as block features. However, the features extracted by these methods lack spatial invariance, so these methods do not work well in copy-move forgery detection. In recent years, some invariant moment methods have been proposed for CMFD to analyze intrinsic image features. Ryu et al. [24] employed the magnitude of Zernike moments to resist rotation and constructed an algebraic rotation moment invariant. Ryu et al. [25] further proposed constructing copy-rotate-move (CRM) detectors for the overlapping blocks. Ustubıoglu et al. [30] proposed calculating RGB color moments and entropy from the overlapping blocks. Yap et al. [31] proposed Polar Harmonic Transforms (PHTs), which offer orthogonality and invariance; moreover, the kernels of PHTs are much simpler than those of Zernike moments. Bi et al. [5] proposed a multi-level mask and then used the Polar Complex Exponential Transform (PCET), which belongs to the PHTs, to extract features of the multi-level masks. This method has good detection precision but low efficiency.
Gan et al. [14] and Li et al. [17] proposed the Discrete PCET (DPCET) and the rotationally invariant Polar Cosine Transform (PCT), respectively, to extract block features. Emam et al. [11] also employed DPCET to extract features of the segmented blocks and then used Locality Sensitive Hashing (LSH) to identify similar blocks. DPCET works well in detecting translation and rotation distortions and can precisely indicate the contour of copied regions. The LSH search also performs well in terms of accuracy but is not as fast as desirable. Block-based methods have two serious problems: they cannot resist scaling distortions, and they are inefficient. Zandi et al. [32] proposed an iterative procedure to adjust the density of keypoints; however, as with other block-based methods, this method also lacks scaling invariance.

To raise efficiency and handle scaling distortion, keypoint-based methods have been proposed that extract features from the entire image. The most popular keypoint-based methods use the Scale Invariant Feature Transform (SIFT) [1, 4, 6, 8, 15, 18, 23, 29, 33] and Speeded Up Robust Features (SURF) [7, 10, 27]. SIFT and SURF are image feature extraction algorithms. However, both only output the extracted local maxima and minima as keypoints located in the suspicious regions and fail to describe the contours of those regions. Therefore, on their own they do not produce satisfactory detection results. To indicate the detected regions, and to allow the performance of different methods to be compared, post-processing procedures such as filtering, classifying and matting must be employed. Pun et al. [23] proposed using Simple Linear Iterative Clustering (SLIC) to segment the image into superpixels, which provided an important clue for better image matting. However, the fixed threshold of SLIC does not adaptively segment the superpixels accurately.

In this paper, we present a new copy-move forgery detection scheme using adaptive keypoint filtering and iterative region merging. The main contributions of the proposed method are as follows:

  1) The adaptive keypoint filtering procedure is the first to measure the classification errors of the extracted keypoints. It corrects the misclassified keypoints and thereby sharply reduces the keypoint classification error. It obtains as many inliers as possible, which yields accurate affine matrices for evaluating the geometrical transformation of the forged regions.

  2) Iterative region merging is proposed to iteratively generate more keypoint (inlier) pairs and their suspected regions based on the invariant features and accurate affine matrices, and then to merge the adaptive superpixels to perform copy-move region matting. The iteration and merging algorithm indicates the forged regions more precisely, for both single and multiple forgeries.

The remainder of this paper is organized as follows. The related work is described in Section 2. The proposed method using adaptive keypoint filtering and iterative region merging is presented in Section 3. Section 4 presents the experiments and their discussions for CMFD, and Section 5 provides concluding remarks and directions for future work.

2 Related work

In this section, we introduce some notable methods that are relevant to the proposed method. Pun et al. [23] used SIFT to extract image features (keypoints) effectively. Each SIFT keypoint has a 128-dimensional descriptor that contains localization information, gradient amplitude, dominant orientation, and scale. Because SIFT extraction is inefficient, the improved local descriptor SURF [7, 10, 27], with only a 64-dimensional descriptor, uses the Fast Hessian detector to extract keypoints. Christlein et al. [9] demonstrated that SURF descriptors have better feature extraction and detection performance than SIFT descriptors in copy-move forgery detection, and SURF is also more efficient than SIFT. Local descriptors, whether SIFT or SURF, yield only sparse keypoints that cannot cover the whole forged regions. Zandi et al. [32] proposed iterating the interest points to obtain the suspected regions, limiting the number of iterations to 4. The main defect of the method in [32] is similar to that of the block-based method [11]: it is unable to address scaling distortions, especially large-scale scaling. Pun et al. [23] proposed a similar iteration algorithm around the keypoints that can resist the scaling operation. However, the iterated keypoints, which are based on RGB features, lack rotation invariance [5] and therefore fail to resist rotation attacks. In addition, the iteration is a random operation, and it is not clear how many iterations are needed to reach a satisfactory result. Worse, not all the keypoints are clustered into the correct classes, which leads to iteration errors. Li et al. [18] employed Simple Linear Iterative Clustering (SLIC) to segment the host image into meaningful blocks (superpixels). Then, when the number of inliers in a block satisfies the threshold, the corresponding superpixel pairs are filled.
However, this region-filling method has some defects. First, an initial segmentation threshold fixed to an empirical value cannot yield a satisfactory segmentation. Second, SLIC is only an approximate texture segmentation method and cannot find precise locations for some forged regions. Therefore, region-filling operations that rely only on SLIC segmentation with a fixed threshold do not obtain satisfactory matting results.

3 The proposed method

To overcome the defects of the methods mentioned above, we propose a new copy-move forgery detection scheme that combines the advantages of block-based and keypoint-based methods, namely the efficiency of keypoint extraction and the accuracy of block/region filling. The proposed method contains three main stages: 1) keypoint extraction and matching (sub-section 3.1); 2) adaptive keypoint filtering (sub-section 3.2); 3) iterative region merging (sub-section 3.3).

The framework of the proposed method is depicted in Fig. 1. First, we pre-process the image with median filtering and color-to-gray conversion. Then, we carry out three main processing stages as follows: 1) In the keypoint extraction and matching stage, we use Speeded Up Robust Features (SURF) to extract the candidate keypoints and employ Best-Bin-First search and a nearest-neighbor test to match the candidate keypoint pairs. 2) Subsequently, the adaptive keypoint filtering stage, which comprises adaptive nearest neighbor pair filtering and outlier filtering sub-stages, removes the outliers and obtains the inlier (authentic keypoint) pairs. The first sub-stage adaptively removes the nearest-neighbor pairs by employing the Euclidean distance, while the second sub-stage evaluates the remaining keypoints and corrects misclassified keypoints to obtain as many inlier pairs as possible. Adaptive keypoint filtering can identify both single and multiple forgeries by adaptively employing inliers to cover them. 3) Finally, an iterative region-merging stage, which comprises adaptive region iteration and region merging sub-stages, obtains the forged regions. The first sub-stage iteratively applies random sample consensus (RANSAC) and the affine transform to generate the neighboring keypoints (NKs) for each corresponding inlier. The Discrete Polar Complex Exponential Transform (DPCET) is employed to extract the circular block features corresponding to these keypoints, and the matching circular blocks generate the suspected regions. The superpixels produced by Simple Linear Iterative Clustering (SLIC) are then merged with the iterated suspected regions to determine the copy-move regions.

Fig. 1 The framework of the proposed CMFD scheme

3.1 Keypoint extraction and matching

This section describes and discusses the detailed steps for the feature (keypoint) extraction and matching algorithm. Detailed descriptions of the adaptive keypoint filtering and iterative region-merging algorithm, which are the two main contributions of this paper, are provided in Sub-sections 3.2 and 3.3, respectively.

Image feature extraction is an important task in CMFD. There are many types of feature extraction methods, such as SIFT [6], SURF [27], PCET [11] and so on. Based on the analysis given in the introduction, we employ SURF [27], which is more effective than SIFT, to extract features from the entire image. A matching algorithm is then employed to match the feature descriptors. The extracted SURF features are expressed in the form of keypoints. Suppose there is a set of candidate keypoints P = {p1, p2, ⋯, pn} with their 64-dimensional SURF descriptors {Sdi,1, Sdi,2, ⋯, Sdi,64} in high-dimensional feature space; each descriptor encodes rotation, scaling, orientation features, and so on. To distinguish different clusters and match candidate keypoints into pairs, various clustering methods are presented in [9]. The kd-tree algorithm [15, 18] and Best-Bin-First (BBF) [3] are commonly used methods for obtaining the approximate nearest neighbors. In our proposed method, we employ the Euclidean distance to measure the correlation between two candidate keypoints. To compare the correlation between two candidate keypoints intuitively, their distance ratio is used as the criterion. A 2NN test [1] is normally employed as a ratio threshold when matching two candidate keypoints as a pair. The list of Euclidean distances between the ith keypoint and the other keypoints is sorted in ascending order to identify similar feature vectors. A library such as OpenCV, which provides FlannBasedMatcher, makes SURF feature matching easy to implement. The sorted distance list is given in (1), and the 2NN test is given in (2).

$$ C{D}_i=\left(c{d}_{i,1},\kern0.5em c{d}_{i,2},\kern0.5em ...,\kern0.5em c{d}_{i,n}\right) $$
(1)

where cdi,1 is the correlation distance between the ith candidate keypoint and its closest neighbor, namely, the minimum Euclidean distance, while cdi,2 is the distance to the second-closest neighbor. To perform more effective matching, Amerini et al. [1] suggested using the ratio of cdi,1 to cdi,2 to match the candidate keypoint pairs. Each candidate pair is accepted when it satisfies Eq. (2):

$$ c{d}_{i,1}/c{d}_{i,2}<{T}_1\kern0.3em ,\kern0.4em where\kern0.5em {T}_1\in \left(0,1\right) $$
(2)

The number and distribution of the keypoints are also important. When T1 is close to 1, the 2NN test obtains more candidate keypoint pairs. Conversely, when T1 is small, the number of extracted keypoint pairs may be too small to detect the forged regions. A good matching algorithm should find as many keypoints as possible that concentrate on and cover the copy-move regions, while mismatched point pairs should be kept to a minimum. Therefore, we suggest fixing T1 to 0.5 based on empirical findings. Figure 2 shows the process of candidate keypoint extraction and matching. After keypoint matching is complete, the matched keypoint pairs should ideally be concentrated in the copy-move regions. However, it can be observed from Fig. 2-(d) that there are some false keypoint pairs. Therefore, adaptive keypoint filtering is used to obtain the inliers, as described in detail in Sub-section 3.2.
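The 2NN test of Eqs. (1) and (2) can be illustrated with a small, dependency-free sketch. It uses a brute-force search rather than the kd-tree or BBF search mentioned above, and the function names and the toy two-dimensional "descriptors" are hypothetical; real SURF descriptors would be 64-dimensional.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def two_nn_matches(descriptors, t1=0.5):
    """Return (i, j) index pairs that pass the ratio test cd_i1 / cd_i2 < t1."""
    matches = []
    for i, d_i in enumerate(descriptors):
        # Ascending distances from keypoint i to every other keypoint (Eq. 1).
        cd = sorted(
            (euclidean(d_i, d_j), j)
            for j, d_j in enumerate(descriptors) if j != i
        )
        if len(cd) < 2:
            continue
        (cd1, j1), (cd2, _) = cd[0], cd[1]
        # Guard against a zero second distance (exact duplicate descriptors).
        if cd2 > 0 and cd1 / cd2 < t1:
            matches.append((i, j1))
    return matches

# Toy data: descriptors 0 and 1 are near-duplicates, 2 and 3 are isolated.
toy = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (9.0, 0.0)]
pairs = two_nn_matches(toy)
```

With T1 = 0.5, only the near-duplicate descriptors are paired; the isolated ones have a closest and a second-closest neighbor at comparable distances and are rejected.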

Fig. 2 The process of candidate keypoint extraction and matching: (a) shows the host image, (b) is the ground-truth region, (c) illustrates the extracted maxima and minima, and (d) illustrates the candidate keypoint pairs of the image

3.2 Adaptive keypoint filtering

Various filtering algorithms have been proposed to remove weak and false keypoint pairs (matches). Euclidean distance and correlation coefficients are the most common filtering criteria [9]. However, the fixed thresholds involved in these filtering methods limit their ability to obtain ideal filtering results. In this section, an adaptive filtering algorithm, which fuses state-of-the-art algorithms, is proposed to enable better keypoint filtering. Figure 3 shows a flowchart of the adaptive keypoint filtering method, which is divided into two sub-stages: adaptive nearest neighbor pair filtering and adaptive outlier filtering. The purpose of the first sub-stage is to remove the nearest neighbor pairs while retaining as many correct keypoint pairs as possible. First, Euclidean distance with a fixed threshold is employed to filter out the keypoint pairs whose two members are close to each other. Then, a new filtering threshold based on the average distance of the remaining keypoint pairs and the low-frequency distribution PLF of the image is employed to adaptively remove nearest neighbor keypoint pairs once again. After the first sub-stage, most of the unwanted outliers have been removed. The second sub-stage then adaptively corrects misclassified keypoints and handles multiple forged regions. Random sample consensus (RANSAC) is applied repeatedly to evaluate the keypoint clusters, obtaining the authentic inliers of each cluster and removing the outliers. After the second sub-stage, the inliers are preserved and the outliers are removed.

Fig. 3 A flowchart of the adaptive keypoint filtering process

3.2.1 Adaptive nearest neighbor pair filtering

Euclidean distance [9] is a commonly used criterion for filtering out mismatched keypoint pairs, especially nearest neighbor pairs. The Euclidean distance of the ith keypoint pair is defined as follows.

$$ E{d}_i={\left\Vert {m}_{i,1},{m}_{i,2}\right\Vert}_2>{T}_b, $$
(3)

where || ||2 is the L2 norm, Edi is the Euclidean distance of the ith candidate keypoint pair, and mi,1 and mi,2 are the two candidate keypoints of that pair. Here, Tb = (H + W)/200 is the filtering distance, where H and W are the height and width of the detected image, respectively. This first, fixed Euclidean distance threshold is employed to filter out the nearest neighbor matches. Then, an adaptive Euclidean distance threshold is applied in a second filtering step. The weight of the adaptive filtering distance is a sensitive choice, and the initial thresholds of some image segmentation methods provide important cues. Pun et al. [23] proposed an adaptive over-segmentation algorithm to segment the image into non-overlapping blocks. The segmentation algorithm is based on a four-level Discrete Wavelet Transform (DWT) using the ‘Haar’ wavelet. Zheng et al. [34] employed the Haar wavelet to set the initial size of the segmentation. The candidate keypoint pairs in the same block are considered mismatched pairs, and the initial size is taken as the threshold for the Euclidean distance [34]. The Haar wavelet energies are defined as follows:

$$ {E}_{LF}=\sum \mid C{A}_4\mid $$
(4)
$$ {E}_{HF}=\sum \limits_i\left(\sum |C{D}_i|+\sum |C{H}_i|+\sum |C{V}_i|\right) $$
(5)
$$ {P}_{LF}=\frac{E_{LF}}{E_{LF}+{E}_{HF}}\times 100\%, $$
(6)

where i = 1, 2, 3, 4 and CA4 denotes the approximation coefficients at the 4th level of the DWT, while CDi, CHi, and CVi denote the detail coefficients at the ith level of the DWT. PLF reflects the low-frequency distribution of the image. When PLF is close to 1, the energy of the detected image is strongly concentrated in the low-frequency band. When the image content changes slowly and smoothly, the keypoint pairs are likely to be more widely spread and sparser. Therefore, when PLF is near 1, the filtering threshold is set to a larger distance. Pun et al. [23] and Zheng et al. [34] recommended this setting. These methods provide a good scheme for adaptive segmentation but consider only the frequency distribution when segmenting the image, and the threshold is still set to a fixed value based on empirical findings. In Fig. 4-(a3), the candidate keypoint pairs are all filtered out by the threshold setting used in [23]; no keypoint pairs remain in the detected image, so the detection algorithm fails. A fixed threshold that is either too large or too small does not achieve satisfactory results. Therefore, we propose an adaptive threshold to obtain improved filtering results, in which the average distance is considered as another factor. The average Euclidean distance of the remaining keypoint pairs is defined as follows.

$$ Edp=\frac{1}{m}\sum \limits_{i=1}^m{\left\Vert {p}_{i,1},{p}_{i,2}\right\Vert}_2, $$
(7)

where Edp is the average distance of the m remaining matched pairs, and pi,1 and pi,2 are the candidate keypoints of the ith pair.
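As an illustration of the low-frequency ratio PLF in Eqs. (4)-(6), the following sketch computes a four-level Haar decomposition by hand so that it needs no external library. It assumes an orthonormal Haar filter, a grayscale image stored as a list of rows, and a height and width divisible by 16; the function names are hypothetical, and the labelling of the horizontal and vertical sub-bands is a convention that does not affect the summed energy.

```python
def haar_step(a):
    """One level of the 2-D orthonormal Haar DWT on a list-of-rows image
    with even height and width. Returns (cA, cH, cV, cD)."""
    rows, cols = len(a) // 2, len(a[0]) // 2
    cA = [[0.0] * cols for _ in range(rows)]
    cH = [[0.0] * cols for _ in range(rows)]
    cV = [[0.0] * cols for _ in range(rows)]
    cD = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            p, q = a[2 * r][2 * c], a[2 * r][2 * c + 1]
            s, t = a[2 * r + 1][2 * c], a[2 * r + 1][2 * c + 1]
            cA[r][c] = (p + q + s + t) / 2.0
            cH[r][c] = (p + q - s - t) / 2.0
            cV[r][c] = (p - q + s - t) / 2.0
            cD[r][c] = (p - q - s + t) / 2.0
    return cA, cH, cV, cD

def abs_sum(m):
    """Sum of absolute values of all coefficients in a sub-band."""
    return sum(abs(v) for row in m for v in row)

def low_frequency_ratio(gray, levels=4):
    """P_LF of Eq. (6), returned as a fraction in [0, 1]."""
    approx, e_hf = gray, 0.0
    for _ in range(levels):
        approx, cH, cV, cD = haar_step(approx)
        e_hf += abs_sum(cH) + abs_sum(cV) + abs_sum(cD)  # Eq. (5)
    e_lf = abs_sum(approx)                               # Eq. (4)
    total = e_lf + e_hf
    return e_lf / total if total > 0 else 0.0            # Eq. (6)

# A flat image has no detail energy; a checkerboard is dominated by it.
flat = [[128.0] * 16 for _ in range(16)]
checker = [[255.0 * ((r + c) % 2) for c in range(16)] for r in range(16)]
p_flat = low_frequency_ratio(flat)
p_checker = low_frequency_ratio(checker)
```

The flat image gives a ratio of 1, while the checkerboard gives a much smaller value, which matches the interpretation of PLF as a measure of how smooth the image is.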

Fig. 4 The comparison results for the filtering algorithm. The top row shows the filtering results of the fixed filtering algorithm [23, 34], while the bottom row shows the filtering results of the adaptive nearest neighbor filtering algorithm

The adaptive algorithm contains two steps. Figure 4 shows an example of the filtering algorithm. First, the nearest neighbor pairs are filtered using the low threshold. Then, the average distances of the remaining keypoint pairs are calculated. The average distance Edp with PLF as the weight-coefficient is used to adaptively filter the candidate keypoint pairs. The adaptive filtering threshold is defined as (8):

$$ {T}_p={P}_{LF}^{4/3}\times Edp, $$
(8)
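The two filtering steps can be sketched as follows, taking PLF as a precomputed input. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and it assumes that in the second step pairs no longer than Tp are discarded, mirroring the first step. The image size and PLF value in the example are those quoted for Fig. 4-(a3); the keypoint pairs are invented.

```python
import math

def pair_length(pair):
    """Euclidean distance between the two keypoints of a matched pair."""
    (x1, y1), (x2, y2) = pair
    return math.hypot(x1 - x2, y1 - y2)

def adaptive_nn_filter(pairs, height, width, p_lf):
    """Two-step nearest-neighbor filtering of matched keypoint pairs."""
    # Step 1: fixed threshold T_b = (H + W) / 200 from Eq. (3).
    t_b = (height + width) / 200.0
    remaining = [p for p in pairs if pair_length(p) > t_b]
    if not remaining:
        return [], t_b, 0.0
    # Step 2: mean distance Edp of the survivors (Eq. 7), weighted by
    # P_LF^(4/3) to give the adaptive threshold T_p (Eq. 8).
    edp = sum(pair_length(p) for p in remaining) / len(remaining)
    t_p = p_lf ** (4.0 / 3.0) * edp
    # Assumption: pairs not longer than T_p are discarded.
    kept = [p for p in remaining if pair_length(p) > t_p]
    return kept, t_b, t_p

# Invented pairs with lengths 10, 100, 600 and 800 pixels.
pairs = [((0, 0), (10, 0)), ((0, 0), (100, 0)),
         ((0, 0), (600, 0)), ((0, 0), (800, 0))]
kept, t_b, t_p = adaptive_nn_filter(pairs, height=2592, width=3888, p_lf=0.4075)
```

Here Tb = 32.4 removes the 10-pixel pair, the mean of the remaining lengths is 500, and Tp is roughly 151, so only the two longest pairs survive.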

The images in Fig. 4-(a1)~(a5) show the filtering results of the fixed filtering algorithm [23, 34]. Figure 4-(b1)~(b5) show the filtering results of the adaptive nearest neighbor filtering algorithm. Note that Fig. 4-(a1), (a2), (a4) and (a5) have similar filtering performances compared to Fig. 4-(b1), (b2), (b4) and (b5), respectively. In Fig. 4-(a3), the fixed threshold will filter out all the keypoint pairs. The size of the image in Fig. 4-(a3) is 3888 × 2592 and PLF = 0.4075. In Fig. 4-(b3), it can be observed that the adaptive filtering algorithm removes the nearest neighbor pairs effectively while preserving other correctly matched pairs. Compared to other fixed algorithms, filtering the candidate keypoints using the proposed filtering algorithm can cover the copy-move regions more precisely.

3.2.2 Adaptive outliers filtering

The first filtering stage is simply a preliminary filtering step in the scheme. After filtering, some unwanted outliers remain, as shown in the top left of Fig. 4-(b2). To eliminate the effect of these unwanted outliers, random sample consensus (RANSAC) [9, 12], the state-of-the-art method, is introduced to find an affine matrix H to estimate the best correlation coefficient among a certain number of trials. The goal is to filter out the falsely matched pairs (outliers) to obtain the correctly matched pairs (inliers). RANSAC has been shown to perform well in filtering operations for single forgeries, such as in Fig. 4-(b1) and (b2). However, RANSAC is not suitable for correctly filtering multiple copy-move forged regions. As shown in Fig. 4-(b1)~(b5) and Fig. 5-(a), another serious problem is that the keypoint classifications obtained by employing SURF and 2NN test are not entirely correct. Some candidate keypoints rightly belong to the A class but are assigned to the B class of the same cluster. The candidate keypoints belonging to the A class are denoted in green, while the candidate keypoints of the B class are denoted in red in Fig. 5-(a). Therefore, the classification and filtering must be re-estimated using RANSAC and the affine matrix to correct the misclassifications of the keypoint pairs. First, we employ the 1st RANSAC to filter out the candidate outliers and obtain the estimated candidate inliers used to obtain the affine matrix H1. Then, the candidate outliers are re-evaluated by employing H1 to measure whether these outliers truly belong to the inliers. When these outliers have been misclassified, they must be corrected to become inliers. This step eliminates misclassifications and obtains the A1 class and B1 class of the 1st correct cluster (inliers). The A1 class and B1 class of the 1st correct cluster are shown in Fig. 5-(b). 
When multiple copy-move forged regions exist, the rest of the outliers are re-estimated as the sources of the 2nd keypoint cluster. When the number of non-collinear outlier pairs falls below 3, the re-estimation algorithm is terminated. Otherwise, the cluster filtering is continually updated using the iteration steps described above. When the image contains only one forged region, the number of iterations is 1. The adaptive filtering method fuses the affine transform and RANSAC to obtain superior performance. Figure 5 shows the steps of the keypoint filtering algorithm. It can be observed that a portion of the candidate keypoint pairs in Fig. 5-(a) are misclassified. Figure 5-(b) shows the A1 class and the B1 class from the adjusted 1st clusters, namely, the 1st correct inliers. Figure 5-(c) shows the adjusted 2nd clusters, namely, the 2nd correct inliers. Figure 5-(d) shows the two correct clusters in the same image.

Fig. 5 The image output of the steps of the adaptive outliers filtering algorithm

To conduct the second-stage filtering, RANSAC is employed to estimate the clusters. The number of iterations k for each RANSAC run is defined as follows:

$$ k=\mathit{\log}\left(1-p\right)/\mathit{\log}\left(1-{w}^m\right). $$
(9)

where k ≤ 200, the confidence p is set to 0.995, w is the inlier ratio among all estimated pairs, and m is the number of samples drawn in each trial, with m > 3. Two problems arise when using RANSAC. The first is that multiple copy-move regions may exist: RANSAC obtains the 1st inliers but abandons the rest as outliers, meaning that the 2nd and any further inlier sets are not obtained. The second is that some candidate keypoints are incorrectly classified after SURF extraction; these are regarded as outliers and abandoned, so smaller regions with a limited number of candidate keypoint pairs cannot be easily detected. To obtain more inliers for the estimation and matting, the results of RANSAC should be analyzed further. The affine matrix H1, defined in (10), can be easily obtained from the inliers.
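As a numerical illustration of Eq. (9), the sketch below evaluates k for a given inlier ratio. The choice m = 4 (the smallest integer satisfying m > 3) and rounding k up to an integer are assumptions made for the example; the function name is hypothetical.

```python
import math

def ransac_iterations(w, m=4, p=0.995, k_max=200):
    """Number of RANSAC trials from Eq. (9), capped at k_max."""
    if not 0.0 < w <= 1.0:
        raise ValueError("inlier ratio w must be in (0, 1]")
    good_sample = w ** m  # probability that one sample contains only inliers
    if good_sample >= 1.0:
        return 1          # every sample is clean, one trial is enough
    k = math.log(1.0 - p) / math.log(1.0 - good_sample)
    # Assumption: round up, then apply the cap k <= 200 from the text.
    return min(k_max, math.ceil(k))

k_half = ransac_iterations(0.5)  # half of the pairs are inliers
k_low = ransac_iterations(0.2)   # few inliers, so the cap applies
```

With half of the pairs being inliers, Eq. (9) gives about 82.1, which rounds up to 83 trials; with an inlier ratio of 0.2 the formula asks for several thousand trials and the cap of 200 takes over.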

$$ \left[\begin{array}{c}{X}_1^{\hbox{'}}\\ {}{Y}_1^{\hbox{'}}\\ {}1\end{array}\right]=\left[\begin{array}{ccc}{h}_{11}& {h}_{12}& {h}_{13}\\ {}{h}_{21}& {h}_{22}& {h}_{23}\\ {}{h}_{31}& {h}_{32}& {h}_{33}\end{array}\right]\left[\begin{array}{c}{X}_1\\ {}{Y}_1\\ {}1\end{array}\right]={H}_1\left[\begin{array}{c}{X}_1\\ {}{Y}_1\\ {}1\end{array}\right], $$
(10)

where X1 = [x1,1, x1,2, …, x1,m1] and Y1 = [y1,1, y1,2, …, y1,m1] are the coordinates of the class A1 of the inliers obtained by the 1st RANSAC, m1 is the number of candidate inlier pairs, and \( {X}_1^{'}=\left[{x}_{1,1}^{\hbox{'}},\kern0.4em {x}_{1,2}^{\hbox{'}},...\kern0.3em ,\kern0.4em {x}_{1,m1}^{\hbox{'}}\right] \) and \( {Y}_1^{'}=\left[{y}_{1,1}^{\hbox{'}},{y}_{1,2}^{\hbox{'}},...\kern0.3em ,\kern0.4em {y}_{1,m1}^{\hbox{'}}\right] \) are the coordinates of class B1. According to the least-squares method, the affine matrix H1 can be obtained as follows.

$$ {H}_1={\left[{\left[{\left[{X}_1\kern0.4em {Y}_1\kern0.4em 1\right]}^T\left[{X}_1\kern0.4em {Y}_1\kern0.4em 1\right]\right]}^{-1}{\left[{X}_1\kern0.4em {Y}_1\kern0.4em 1\right]}^T\left[{X}_1^{\hbox{'}}\kern0.4em {Y}_1^{\hbox{'}}\kern0.3em 1\right]\right]}^T. $$
(11)
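The least-squares solution of Eq. (11) can be sketched in plain Python by forming the normal equations explicitly. The helper names are hypothetical, and the 3 × 3 inverse uses the adjugate formula, which is adequate for an illustration but is not a numerically robust choice for ill-conditioned point sets.

```python
def transpose(m):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def inverse3(m):
    """Inverse of a 3x3 matrix via the adjugate; raises if singular."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("degenerate (collinear) point set")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]

def fit_affine(src, dst):
    """Least-squares H1 with [x', y', 1]^T ~ H1 [x, y, 1]^T, as in Eq. (11)."""
    a = [[x, y, 1.0] for x, y in src]  # m x 3 matrix [X1 Y1 1]
    b = [[x, y, 1.0] for x, y in dst]  # m x 3 matrix [X1' Y1' 1]
    at = transpose(a)
    p = matmul(inverse3(matmul(at, a)), matmul(at, b))
    return transpose(p)                # outer transpose in Eq. (11)

# Invented class-A points and their images under x' = -y + 5, y' = x + 2.
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 3.0)]
dst = [(5.0, 2.0), (5.0, 3.0), (4.0, 2.0), (2.0, 4.0)]
h1 = fit_affine(src, dst)
```

For this noise-free example the recovered matrix is the rotation by 90 degrees plus translation used to generate the points, with last row (0, 0, 1). A collinear point set makes the normal matrix singular, which is why at least three non-collinear pairs are needed.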

Then, the 1st candidate outliers are evaluated as to whether they belong to the misclassified keypoints. The coordinates of the keypoints of the outliers in a pair are exchanged to adjust possible misclassifications. The inverse transform of the affine matrix H1 can be employed to estimate the relationship between the inliers and the adjusted outliers. The exchanged coordinates of outliers are measured as follows.

$$ {\displaystyle \begin{array}{l}\mu =\mid \left[\begin{array}{c}{X}_2\\ {}{Y}_2\\ {}1\end{array}\right]-{H}_1^{-1}\left[\begin{array}{c}{X}_2^{\hbox{'}}\\ {}{Y}_2^{\hbox{'}}\\ {}1\end{array}\right]\kern0.4em \mid =\mid \left[\begin{array}{c}{X}_2\\ {}{Y}_2\\ {}1\end{array}\right]-{H}_1^{\hbox{'}}\left[\begin{array}{c}{X}_2^{\hbox{'}}\\ {}{Y}_2^{\hbox{'}}\\ {}1\end{array}\right]\kern0.4em \mid \\ {}=\mid \left[\begin{array}{c}{X}_2\\ {}{Y}_2\\ {}1\end{array}\right]-\left[\begin{array}{ccc}{h}_{11}^{\hbox{'}}& {h}_{12}^{\hbox{'}}& {h}_{13}^{\hbox{'}}\\ {}{h}_{21}^{\hbox{'}}& {h}_{22}^{\hbox{'}}& {h}_{23}^{\hbox{'}}\\ {}{h}_{31}^{\hbox{'}}& {h}_{32}^{\hbox{'}}& {h}_{33}^{\hbox{'}}\end{array}\right]\left[\begin{array}{c}{X}_2^{\hbox{'}}\\ {}{Y}_2^{\hbox{'}}\\ {}1\end{array}\right]\kern0.4em \mid =\left[\begin{array}{ccc}{\mu}_{1,1}& ...& {\mu}_{1,m2}\\ {}{\mu}_{2,1}& ...& {\mu}_{2,m2}\\ {}{\mu}_{3,1}& ...& {\mu}_{3,m2}\end{array}\right],\end{array}} $$
(12)

where X2 = [x2,1, …, x2,m2] and Y2 = [y2,1, …, y2,m2] are the coordinates of the class \( {A}_1^{\hbox{'}} \) of the 1st candidate outliers, and \( {X}_2^{'}=\left[{x}_{2,1}^{\hbox{'}},\kern0.5em ...\kern0.3em ,\kern0.4em {x}_{2,m2}^{\hbox{'}}\right] \), \( {Y}_2^{'}=\left[{y}_{2,1}^{\hbox{'}},...\kern0.3em ,\kern0.4em {y}_{2,m2}^{\hbox{'}}\right] \) are the coordinates of the class \( {B}_1^{\hbox{'}} \); the coordinates of the outliers of the \( {A}_1^{\hbox{'}} \) class and the \( {B}_1^{\hbox{'}} \) class are exchanged. Here, m2 is the number of candidate outlier pairs, μ1,j and μ2,j respectively represent the errors of the x-coordinate and y-coordinate, μi,j represents the distance between the exchanged coordinate of the jth outlier and its unchanged coordinate, and \( {H}_1^{-1}={H}_1^{\hbox{'}} \). The jth exchanged outlier is accepted when η2,j = (μ1,j + μ2,j) ≤ ε, with ε = 8. This can be expressed in another form:

$$ {\eta}_{2,j}=\mid {x}_{2,j}-\frac{h_{11}^{\hbox{'}}{x}_{2,j}^{\hbox{'}}+{h}_{12}^{\hbox{'}}{y}_{2,j}^{\hbox{'}}+{h}_{13}^{\hbox{'}}}{h_{31}^{\hbox{'}}{x}_{2,j}^{\hbox{'}}+{h}_{32}^{\hbox{'}}{y}_{2,j}^{\hbox{'}}+{h}_{33}^{\hbox{'}}}\mid +\mid {y}_{2,j}-\frac{h_{21}^{\hbox{'}}{x}_{2,j}^{\hbox{'}}+{h}_{22}^{\hbox{'}}{y}_{2,j}^{\hbox{'}}+{h}_{23}^{\hbox{'}}}{h_{31}^{\hbox{'}}{x}_{2,j}^{\hbox{'}}+{h}_{32}^{\hbox{'}}{y}_{2,j}^{\hbox{'}}+{h}_{33}^{\hbox{'}}}\mid \le \varepsilon $$
(13)

where j indexes the exchanged outlier. If an outlier pair satisfies the threshold, that is, it satisfies the \( {H}_1^{-1} \) affine transform, its two candidate keypoints keep the exchanged coordinates: point \( {A}_1^{\hbox{'}}\left({x}_{2,j},{y}_{2,j}\right) \) is exchanged to point \( {B}_1^{\hbox{'}}\left({x}_{2,j}^{\hbox{'}},{y}_{2,j}^{\hbox{'}}\right) \) and classified to class B1, and point \( {B}_1^{\hbox{'}}\left({x}_{2,j}^{\hbox{'}},{y}_{2,j}^{\hbox{'}}\right) \) is exchanged to point \( {A}_1^{\hbox{'}}\left({x}_{2,j},{y}_{2,j}\right) \) and reclassified as class A1. Other pairs that do not satisfy the threshold keep their original coordinates. Figure 5 shows the exchange process. In Fig. 5-(b), the A1 and B1 classes are denoted as green and red points, respectively; the others are regarded as the 1st outliers. All the outliers are re-estimated and redistributed to the correct classes. Then, the 1st redistributed inliers are used to estimate a modified affine matrix \( \overleftrightarrow{H_1} \).
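The exchange test can be sketched as follows. This is one possible reading of Eqs. (12) and (13), assuming that the error is evaluated on the exchanged coordinates and that a pair whose error does not exceed ε = 8 is relabelled; the function names and the example points are hypothetical.

```python
def project(h, pt):
    """Apply a 3x3 homogeneous transform to a 2-D point."""
    x, y = pt
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (u / w, v / w)

def eta(h_inv, a_pt, b_pt):
    """L1 error of Eq. (13): a_pt against H1^{-1} applied to b_pt."""
    px, py = project(h_inv, b_pt)
    return abs(a_pt[0] - px) + abs(a_pt[1] - py)

def correct_outliers(h_inv, outlier_pairs, eps=8.0):
    """Swap the two points of each outlier pair when the exchanged pair
    is consistent with H1^{-1}; otherwise leave the pair unchanged."""
    recovered, still_out = [], []
    for a_pt, b_pt in outlier_pairs:
        # Exchange the class labels, then test the exchanged pair.
        if eta(h_inv, b_pt, a_pt) <= eps:
            recovered.append((b_pt, a_pt))  # now ordered as (A1, B1)
        else:
            still_out.append((a_pt, b_pt))  # keeps its original coordinates
    return recovered, still_out

# Assume H1 is a translation by (10, 0), so H1^{-1} translates by (-10, 0).
h_inv = [[1.0, 0.0, -10.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
outliers = [((15.0, 5.0), (5.0, 5.0)),     # labels swapped by the matcher
            ((0.0, 0.0), (100.0, 100.0))]  # a genuine mismatch
recovered, still_out = correct_outliers(h_inv, outliers)
```

In the example, the first pair fits the transform exactly once its two points are exchanged, so it is recovered as an inlier; the second pair has an error of 210 after the exchange and remains an outlier, available as a seed for the next cluster.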

When a copy-move image contains multiple forged regions, the iterative loop continues until it meets the termination condition defined earlier. The filtering process for multiple forgeries is shown in Fig. 5. The adaptive filtering algorithm accurately distinguishes the inliers.

3.3 Adaptive region iteration and merging

Fig. 5-(d) shows the copy-move regions covered by the filtered keypoints (inliers). However, most of the inliers only roughly describe the suspected regions without exact region matting. Therefore, a region-filling algorithm must be performed to indicate the copy-move regions more clearly. To this end, we propose an iterative region-merging algorithm that comprises adaptive region iteration and region merging. Figure 6 shows a flowchart of the iterative region-merging algorithm.

Fig. 6 The framework of the iterative region-merging algorithm

To precisely describe the contours and contents of the forged regions, a high density of matched inlier pairs is needed to cover the forged regions. First, the inlier pairs are loaded as labeled keypoints (LKs). The 8 neighboring keypoints (NKs) of each LK belonging to the A1 class are generated. The affine matrix obtained from sub-section 3.3.2 is employed to calculate the 8 NKs of each matched LK of the B1 class. The LKs of the A1 and B1 classes are denoted by green and red points, respectively, in Fig. 5-(b). The labeled keypoints of the A1 and B1 classes represent the 1st cluster (LK1). The circular block of each LK and NK is calculated with DPCET to determine whether the NK pairs match. The circular block of each matched NK is filled to generate the 1st suspected regions. Then the 1st matched NKs are loaded as the new 2nd labeled keypoints. The preceding steps are repeated iteratively until the termination condition is reached. Second, to accurately display the copy-move forged regions, a superpixel segmentation algorithm, SLIC with an adaptive initial threshold, is employed to segment the host image into superpixels. Then, the percentage of each superpixel's pixels that fall in the suspected regions is calculated and checked against the criterion. Finally, the superpixels are used both to fill whole superpixels and to merge with the suspected regions so that the copy-move regions are located accurately.

3.3.1 Adaptive region iteration

Assume that \( {\mathrm{LK}}_i=\left\{\kern0.1em \left(L{K}_{i,1},L{K}_{i,1}^{\hbox{'}}\right),...,\kern0.3em \left(L{K}_{i,{m}_i},L{K}_{i,{m}_i}^{\hbox{'}}\right)\right\} \), where LKi represents the labeled keypoint pairs of the ith cluster, and LKi, j and \( L{K}_{i,j}^{\hbox{'}} \) belong to the Ai class and Bi class of the ith cluster, respectively. An illustration is shown in the upper part of Fig. 6. Here i = 1, 2, …, n, where n is the number of clusters, and j = 1, 2, …, mi, where mi is the number of keypoint pairs of the ith cluster. The 1st neighboring keypoints (NK1, j) of the labeled keypoints (LK1, j) are defined as shown in (14).

$$ N{K}_{1,j}=\cup N{K}_{1,j,\theta }, $$
(14)

where θ ∈ {0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°}, and the distance between LK1, j and NK1, j, 90° is r. Equation (10) is employed to calculate the NK coordinates of the Bi class that correspond to the NKs of the Ai class. The radius r of the circular block is a multiple of 10. The definition of r is given as follows:

$$ r=\begin{cases}\mathrm{INT}\left(M\times N/{10}^6\right)\times 10, & \text{otherwise}\\ 10, & M\times N<{10}^6\end{cases}, $$
(15)

where M and N are the dimensions of the host image. The circular block is shown in Fig. 7. The center pixel of the circular block represents the corresponding labeled keypoint (LK) or neighboring keypoint (NK). When the calculated values of an NK pair match each other, the circular block is filled to generate a suspected region. Emam et al. [11] and Gan et al. [14] proposed using the Discrete Polar Complex Exponential Transform (DPCET) to extract the rotation features of the image. DPCET is effective against rotation distortions, but on its own it fails to detect scaling operations. To detect scaling operations, DPCET is employed to calculate the features of a circular block with a variable radius [14, 35]. DPCET is defined as follows [11]:

$$ {M}_{kl}=\frac{4}{\pi {r}^2}\sum \limits_{y=-r}^r\sum \limits_{x=-r}^r{\left[\exp \left(i\times \left(2\pi k{r}^2+ l\theta \right)\right)\right]}^Tf\left(x,y\right),\kern1em {\left\Vert x,y\right\Vert}_2\le r, $$
(16)
$$ \mid {M}_{kl}\mid =\mid {M}_{kl}^{ROT}\mid, $$
(17)

where Mkl is the DPCET of kth order and lth repetition, θ = arctan(y/x), ‖x, y‖2 ≤ r, r is the radius of the circular block and \( {M}_{kl}^{ROT} \) represents the DPCET coefficients after a rotation operation. Equations (16) and (17) give the rotation invariant used to extract the rotation features of the circular block. Equation (15) defines the circular block radius of the Ai class. To better calculate the circular block feature of the corresponding Bi class, the scaling dimension is defined as λ, where λ is equal to the mean value of h11 and h22 in (10). When λ is not greater than 0.7, λ is treated as scale-invariant to prevent calculation errors caused by a circular block that is too small. When λ is greater than 0.7, the initial radius r of the circular block is defined as shown in (18).

$$ r=\mathrm{INT}\left(10\times \lambda \right), $$
(18)
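The radius rules of Eqs. (15) and (18) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the radius is assumed to stay unchanged when λ ≤ 0.7, and the scaled radius is written as INT(r × λ), which reduces to Eq. (18) for the base radius r = 10.

```python
def base_radius(m: int, n: int) -> int:
    """Eq. (15): radius for the Ai class, a multiple of 10 set by the image size."""
    pixels = m * n
    if pixels < 10**6:
        return 10
    return (pixels // 10**6) * 10


def scaled_radius(r: int, lam: float, min_scale: float = 0.7) -> int:
    """Radius for the Bi class under scaling factor lam.

    Assumption: for lam <= min_scale the radius is left at r; otherwise
    it becomes int(r * lam), which equals Eq. (18) when r == 10.
    """
    if lam <= min_scale:
        return r
    return int(r * lam)
```

For example, an 800 × 600 image gives a base radius of 10, a 3000 × 2000 image gives 60, and a scaling factor of λ = 1.5 enlarges a radius of 10 to 15.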
Fig. 7 The circular block example
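The rotation invariance stated in Eq. (17) can be checked numerically with a small sketch. The following code is an illustration rather than the implementation used in the experiments. It assumes that the radial term of the kernel in Eq. (16) is the squared distance normalised by the block radius, that the superscript on the kernel denotes complex conjugation, and that the centre pixel is skipped for l ≠ 0 because its angle is undefined. The names `pcet_moment` and `rotate90` are hypothetical.

```python
import cmath
import math
from typing import List


def pcet_moment(block: List[List[float]], k: int, l: int) -> complex:
    """Discrete PCET-style moment of order k and repetition l.

    block is a (2r+1) x (2r+1) grid with block[y + r][x + r] = f(x, y);
    only pixels inside the inscribed disk contribute.
    """
    r = len(block) // 2
    total = 0j
    for y in range(-r, r + 1):
        for x in range(-r, r + 1):
            d2 = x * x + y * y
            if d2 > r * r:
                continue  # outside the circular block
            if d2 == 0 and l != 0:
                continue  # angle undefined at the centre (assumption: skip it)
            rho2 = d2 / (r * r)  # squared normalised radius in [0, 1]
            theta = math.atan2(y, x)
            # conjugated kernel exp(-i (2*pi*k*rho^2 + l*theta))
            kernel = cmath.exp(-1j * (2.0 * math.pi * k * rho2 + l * theta))
            total += kernel * block[y + r][x + r]
    return 4.0 / (math.pi * r * r) * total


def rotate90(block: List[List[float]]) -> List[List[float]]:
    """Rotate the grid by 90 degrees, which is exact on a square grid."""
    return [list(row) for row in zip(*block[::-1])]
```

A 90° rotation maps the sampling grid onto itself, so the moment only picks up a constant phase factor and `abs(pcet_moment(block, k, l))` is unchanged after `rotate90`. For arbitrary angles the invariance holds only approximately because of resampling.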

A circular block example and an illustration of the neighboring keypoints (NKs) are shown in Fig. 7 and Fig. 8, respectively. Equation (14) yields the 1st NKs of each keypoint. Equations (15)~(18) provide the extracted geometrical features of the circular block. The local color feature of the corresponding circular block is calculated using (19) and (20).

$$ {M}_{-}N{K}_{1,j,\theta }=0.3R\left(N{K}_{1,j,\theta}\right)+0.59G\left(N{K}_{1,j,\theta}\right)+0.11B\left(N{K}_{1,j,\theta}\right), $$
(19)
$$ {M}_{-}N{K}_{1,j,\theta}^{\hbox{'}}=0.3R\left(N{K}_{1,j,\theta}^{\hbox{'}}\right)+0.59G\left(N{K}_{1,j,\theta}^{\hbox{'}}\right)+0.11B\left(N{K}_{1,j,\theta}^{\hbox{'}}\right), $$
(20)

where R(), G(), and B() respectively denote the red, green and blue components of the corresponding circular block, M_NK1, j, θ and \( {M}_{-}N{K}_{1,j,\theta}^{\hbox{'}} \) are the RGB features of the neighboring keypoint (NK) in the A1 and B1 classes, respectively, and the subscript 1 denotes the 1st NK. The circular block of each NK pair is filled when the pair meets the criterion defined in (21):

$$ \mid {M}_{-}N{K}_{1,j,\theta }-{M}_{-}N{K}_{1,j,\theta}^{\hbox{'}}\mid <{T}_{NK}\times \mid {M}_{-}N{K}_{1,j,\theta}\mid, $$
(21)

where TNK is the threshold that measures the similarity between the compared NK pair. Based on our experiments, TNK is set to 0.04.
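Equations (19) to (21) amount to a weighted gray-level feature followed by a relative-difference test. The sketch below assumes that R(), G() and B() return the mean channel values of the circular block, which the text does not state explicitly; the function names are hypothetical.

```python
def color_feature(r_mean: float, g_mean: float, b_mean: float) -> float:
    """Eqs. (19)/(20): weighted combination of the block's channel values.

    Assumption: the inputs are mean R, G and B values over the circular block.
    """
    return 0.3 * r_mean + 0.59 * g_mean + 0.11 * b_mean


def nk_pair_matches(feat_a: float, feat_b: float, t_nk: float = 0.04) -> bool:
    """Eq. (21): the pair matches if the features differ by less than t_nk * |feat_a|."""
    return abs(feat_a - feat_b) < t_nk * abs(feat_a)
```

With TNK = 0.04, a feature of 100 in the A1 class matches a B1 feature of 103 (a 3% difference) but not one of 105.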

Fig. 8 Illustration of a labeled keypoint and its 8 neighboring keypoints
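The generation of the 8 neighboring keypoints in Eq. (14) and their transfer to the B class can be sketched as below. Two assumptions are made: the neighbors lie on a square grid with spacing r, so that the 90° neighbor is at distance r and the diagonal ones at r√2, and a generic 2 × 3 affine matrix stands in for Eq. (10), which is not reproduced here. All names are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]

# Unit grid offsets indexed by the angle theta in degrees (assumed layout).
OFFSETS: Dict[int, Tuple[int, int]] = {
    0: (1, 0), 45: (1, 1), 90: (0, 1), 135: (-1, 1),
    180: (-1, 0), 225: (-1, -1), 270: (0, -1), 315: (1, -1),
}


def neighbouring_keypoints(lk: Point, r: float) -> Dict[int, Point]:
    """Eq. (14): the union of the 8 neighbours of a labelled keypoint."""
    x, y = lk
    return {theta: (x + dx * r, y + dy * r) for theta, (dx, dy) in OFFSETS.items()}


def map_affine(h: Sequence[Sequence[float]], p: Point) -> Point:
    """Map a point with a 2x3 affine matrix [[h11, h12, h13], [h21, h22, h23]]."""
    x, y = p
    return (h[0][0] * x + h[0][1] * y + h[0][2],
            h[1][0] * x + h[1][1] * y + h[1][2])
```

The B-class neighbors then follow as `{t: map_affine(h, p) for t, p in neighbouring_keypoints(lk, r).items()}`, so each A-class NK keeps its angle label when paired with its B-class counterpart.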

Assume that the 1st suspected region denotes the A1 and B1 classes of the 1st effective neighboring keypoints (NK1), that the features of the circular blocks computed by (19) and (20) satisfy the criterion (21), and that i = 1, 2, …, n1, where n1 is the number of NK1. The above steps are then repeated to achieve the optimal region matting. The NKs are iterated until the termination condition of Eq. (22) is satisfied. Note that neighboring keypoints located in the already filled blocks of other keypoints do not need to be recalculated.

$$ Termination=\frac{no.{i}^{th}\kern0.4em NK}{no.{\left(i+1\right)}^{th}\kern0.2em NK}<1, $$
(22)

It can be observed from Fig. 6 that the suspected regions cover the ground-truth regions precisely.
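The adaptive region iteration can be viewed as a frontier-growing loop. The sketch below is a simplified illustration under several assumptions: keypoints live on an integer grid with spacing r, the feature comparison is abstracted into a hypothetical predicate `matches`, and the loop stops once a round yields no newly matched NK, which stands in for the termination condition of Eq. (22) rather than reproducing it.

```python
from typing import Callable, Iterable, List, Set, Tuple

Point = Tuple[int, int]


def grow_regions(seeds: Iterable[Point], r: int,
                 matches: Callable[[Point], bool],
                 max_rounds: int = 100) -> Set[Point]:
    """Iteratively expand labelled keypoints through their 8 grid neighbours."""
    offsets = [(dx * r, dy * r)
               for dx in (-1, 0, 1) for dy in (-1, 0, 1)
               if (dx, dy) != (0, 0)]
    accepted: Set[Point] = set(seeds)
    frontier: List[Point] = list(accepted)
    for _ in range(max_rounds):
        new_points: List[Point] = []
        for x, y in frontier:
            for dx, dy in offsets:
                nk = (x + dx, y + dy)
                if nk in accepted:
                    continue  # already covered, so it is not recomputed
                if matches(nk):
                    accepted.add(nk)
                    new_points.append(nk)
        if not new_points:
            break  # no new matched NK in this round
        frontier = new_points  # matched NKs become the next labelled keypoints
    return accepted
```

Starting from a single seed at (10, 10) with r = 10 and a predicate that accepts points inside the square [0, 20] × [0, 20], the loop stops after two rounds with the nine grid points of that square.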

3.3.2 Region merging

To visually display the copy-move forged regions and restrict the suspected regions of the neighboring keypoints (NKs), a superpixel segmentation algorithm, Simple Linear Iterative Clustering (SLIC), is employed to segment the image into superpixels. The Tp in (8) is an adaptive coefficient based on the distribution of keypoints and is used as the initial segmentation coefficient of SLIC. Then, the percentage of each superpixel's pixels that fall in the suspected region is calculated and checked against the criterion. Finally, the superpixels and suspected regions are merged to fill the regions in one of three modes: the superpixel is completely filled, the superpixel is completely abandoned, or only the suspected-region pixels inside the superpixel are preserved. These three filling modes are employed to indicate the detected region more accurately. The filling criterion is shown in (23).

$$ \mathrm{Percentage}\ \begin{cases}>70\%, & \text{fill the whole superpixel}\\ <20\%, & \text{abandon the superpixel}\\ \text{otherwise}, & \text{keep neighboring keypoints in superpixel}\end{cases} $$
(23)

When the pixel percentage of the suspected region to the corresponding superpixel is over 70%, the entire superpixel is filled. When the pixel percentage of the suspected region to the corresponding superpixel is below 20%, the superpixel is abandoned. In the other case, the pixels of the suspected region will be preserved in its superpixel. Some small holes and isolated pixels are also eliminated by employing mathematical morphological operations. The superpixels are merged with the suspected regions to implement the copy-move region-filling operation as shown in Fig. 10-(a1)~(a5).
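The three filling modes of Eq. (23) can be sketched on a label map. The code assumes that the superpixel segmentation is already available as a 2-D grid of integer labels (for instance from a SLIC implementation) and that the suspected region is a binary mask of the same size; the function name and parameter names are hypothetical, and the morphological clean-up step is omitted.

```python
from collections import defaultdict
from typing import List


def merge_with_superpixels(labels: List[List[int]],
                           suspected: List[List[int]],
                           fill_above: float = 0.70,
                           drop_below: float = 0.20) -> List[List[bool]]:
    """Apply the three filling modes of Eq. (23) to every superpixel."""
    total = defaultdict(int)  # pixels per superpixel
    hit = defaultdict(int)    # suspected pixels per superpixel
    for row_l, row_s in zip(labels, suspected):
        for lab, s in zip(row_l, row_s):
            total[lab] += 1
            if s:
                hit[lab] += 1
    merged: List[List[bool]] = []
    for row_l, row_s in zip(labels, suspected):
        out_row = []
        for lab, s in zip(row_l, row_s):
            ratio = hit[lab] / total[lab]
            if ratio > fill_above:
                out_row.append(True)     # fill the whole superpixel
            elif ratio < drop_below:
                out_row.append(False)    # abandon the superpixel
            else:
                out_row.append(bool(s))  # keep only the suspected pixels
        merged.append(out_row)
    return merged
```

A superpixel that is 80% covered is filled completely, one that is 10% covered is discarded, and one that is 40% covered keeps exactly its suspected pixels.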

4 Experiments and analysis

In this section, a wide variety of experiments are conducted to evaluate the performance of the proposed method and the state-of-the-art methods under geometric transforms and multiple-region forgeries.

4.1 Evaluation criteria

In our experiments, we use two main measures, precision and recall [11, 18, 23, 27], as the criteria for evaluating the compared CMFD methods and analyzing the experimental results. Precision and recall are defined in Eqs. (24) and (25), respectively.

$$ Precision= Tp/\left( Tp+ Fp\right), $$
(24)
$$ Recall= Tp/\left( Tp+ Fn\right), $$
(25)

Using (24) and (25), precision and recall are employed to test the CMFD methods at both the image and pixel levels. The image-level evaluation measures how well a method detects whether an image as a whole is forged, while the pixel-level evaluation measures how accurately the forged regions are localized. Tp denotes True Positives. At the image level, Tp is the number of forged images that are correctly identified. At the pixel level, Tp is the number of forged pixels correctly detected as forged. Fp denotes False Positives. At the image level, Fp counts the detection errors in which an authentic image or region is incorrectly detected as a forgery. At the pixel level, Fp is the number of authentic pixels erroneously detected as forged. Fn denotes False Negatives. At the image level, Fn counts the forged images or regions that go undetected. At the pixel level, Fn is the number of forged pixels that go undetected. To comprehensively measure the performance of the CMFD methods, the F1 score combines both precision and recall:

$$ {F}_1=2\times \frac{Precision\times Recall}{Precision+ Recall} $$
(26)

The closer F1 is to 1, the better the performance obtained by the CMFD method is.
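Equations (24) to (26) translate directly into code. The sketch below uses hypothetical function names and returns 0 when a denominator is zero, which is a convention chosen here rather than one stated in the text.

```python
from typing import Tuple


def f1_score(precision: float, recall: float) -> float:
    """Eq. (26): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Eqs. (24)-(26) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)
```

As a consistency check, a precision of 96.9% and a recall of 93.8% give `f1_score(0.969, 0.938)` ≈ 0.953, i.e. an F1 score of about 95.3%.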

4.2 Benchmark database for CMFD evaluation

Standard benchmark databases are used as uniform assessment criteria to compare the performance of different CMFD methods. The benchmark databases used here were compiled by the Department of Computer Science at Friedrich-Alexander University [9]. The basic dataset is composed of 48 high-resolution base images, together with snippets copied and pasted from these images to create copy-move forged images. The benchmark dataset contains rotated copies, scaled copies, down-sampled copies, splices with JPEG image compression, and so on. In our experiments, the existing state-of-the-art block-based method [11], the keypoint-based methods [1, 18, 23, 27] and the iterative interest-point method [32] were all tested to evaluate their performance. Figure 9 depicts the process used in the proposed method, and Fig. 10 shows the detection results for the proposed method and the compared methods [11, 18]. The copy-move images contain several types of objects such as plants, animals, man-made objects and combinations of these. Figure 9-(a1)~(a5) shows the copy-move host images. Figure 9-(a1) shows the red tower image, where the copied portion is rotated by 10°. Figure 9-(a2) shows the wood carvings image, where the copy is scaled up by 20%. Figure 9-(a3) shows the fisherman image, which contains multiple copy-move forged regions scaled down by 20%. Figure 9-(a4) shows the jellyfish image with multiple copy-move regions, each rotated by 20°. Figure 9-(a5) shows the Christmas hedge image with multiple copy-move regions, each scaled down by 20%. Figures 9-(a1) and (a2) each contain a single forged region. Figure 9-(a3) contains two separate copy regions corresponding to two different forged regions. Figure 9-(a4) contains three separate copy regions corresponding to three different forged regions. Figure 9-(a5) contains one copy region corresponding to three forged regions.
Figure 9-(b1)~(b5) shows the candidate keypoint pairs obtained with a matching threshold of 0.5. Figure 9-(c1)~(c5) shows the results of the adaptive keypoint filtering. Figure 9-(d1)~(d5), (e1)~(e5) and (f1)~(f5) show the 1st, 3rd and ultimate iteration results of the suspected regions, respectively. Figure 9-(g1)~(g5) shows the relationship between the ultimate suspected regions and the superpixels. Figure 10-(a1)~(a5), (b1)~(b5) and (c1)~(c5) show the detected forgery results of the proposed method, the method from [11] and the method from [18], respectively. Figure 10-(d1)~(d5) shows the ground-truth regions corresponding to the images in Fig. 9-(a1)~(a5). From the results shown in Fig. 10, it can be observed that our proposed method (shown in (a1) to (a5)) achieves much better results. Figure 10-(a1)~(a3) and (a5) show that the match between the iterative region-merging areas and the ground-truth areas can reach 90%. Figure 10-(a4) shows correct region filling for two of the forged regions, but the method missed the third forged region. Figure 10-(b2), (b3) and (b5) show that the block-based method of [11], which extracts features from blocks of a unified size, is unable to detect large scaling transforms. Figure 10-(c1)~(c5) show that the method in [18], which uses a coarse segmentation, easily misses small-region forgeries. Therefore, the method from [18] can detect scaling transform forgeries, but its detection performance for small-region forgeries is weak.

Fig. 9 The process of the proposed method for CMFD. The first row shows the copy-move forged images. The second row shows the matching results of the candidate keypoint pairs. The third row shows the results of the adaptive keypoint filtering. The fourth through the sixth rows show the 1st, 3rd and ultimate iteration results of suspected regions, respectively. The seventh row shows the relation between the ultimate suspected regions and superpixels

Fig. 10 The CMFD results of the proposed method. The first to third rows show the detected forgery results of the proposed method, the method from [11] and the method from [18], respectively. The fourth row shows the ground-truth regions

4.3 Detection results under plain copy-move and authentic images

In this sub-section, the experimental results present a comparison of the performance of the proposed method with those of state-of-the-art methods at both the image level and the pixel level. The precision, recall and F1 scores are employed to evaluate the plain copy-move forgeries and the authentic image. These experiments were based on the orig and nul1 sub-datasets. The orig sub-dataset contains authentic images with no copy-move operations. The nul1 sub-dataset contains copied regions attacked by translation operations. The PCET method [11], SIFT [1, 18, 23] and SURF [27] method results are also provided to evaluate their performances quantitatively. Tables 1 and 2 show the detection results of precision, recall and F1 for the CMFD at the image and pixel levels, respectively.

Table 1 Detection results of the plain copy-move and authentic image in image level
Table 2 Pixel-level detection results of the plain copy-move forgery

Table 1 shows the detection results for the authentic and plain copy-move images at the image level. As listed in Table 1, the proposed method achieved a precision of 96.9%, a recall of 93.8% and an F1 score of 95.3% at the image level. The precision of the proposed method was the best among the compared methods, while the methods from [23, 27] tied for second place. The precision of all the other methods exceeded 90%. The methods in [5, 23] achieved the best recall, and the recall of the proposed method was slightly lower. The F1 score of the proposed method was only slightly below that of the method from [23]. This is because the proposed method may abandon matched pairs that are isolated or that form groups of fewer than 3 pairs; it may therefore miss some small forged regions, which lowers the recall score. Table 2 shows the pixel-level detection results on the same datasets as in Table 1, where the proposed method achieved the best recall score. This is due to the adaptive keypoint filtering procedure, which corrects misclassified keypoints and thereby sharply reduces the keypoint classification error. It obtains as many inliers as possible, which yields accurate affine matrices for accurate region matting. The proposed method achieved a precision of 93.8% and the best F1 score of 90.5% at the pixel level. The F1 score reflects the overall quality and performance of a CMFD method. The method from [23] achieved the highest precision and the second-best recall and F1. In summary, across the image-level and pixel-level analyses, the proposed method achieved the best performance at the pixel level and high quality at the image level.

4.4 Detection results under various post-processing conditions

Image-level detection is conducted to automatically detect copy-move forged images, while pixel-level detection is employed to measure the quality that a CMFD method achieves when locating the copy-move regions. Therefore, performance at the pixel level is mainly used to evaluate CMFD methods. To quantitatively evaluate the proposed method and the state-of-the-art methods, precision, recall and F1 were measured at the pixel level on down-sampled images and under rotation transforms, scaling transforms and JPEG compression.

1) Detection results of down-sampling

These experiments were based on the nul_sd, scale_sd, and rot_sd sub-datasets. The copied regions were attacked only by translation (plain), rotation or scaling distortions. The scaling factors employed in the scale_sd sub-dataset are 91%, 95%, 99%, 101%, 105% and 109%. The rotation angles employed in the rot_sd sub-dataset are 2°, 4°, 6°, 8° and 10°. The host images in the sub-datasets were down-sampled to 50% of the size of the original images. With 1 plain, 6 scaled and 5 rotated variants per base image, there were 48 × 12 = 576 tested images. Table 3 shows the down-sampling detection results of precision, recall and F1 for the CMFD methods at the pixel level.

Table 3 Detection results of down-sampling

2) Detection results of rotation transform

These experiments were based on the rot, rotExtra and rotExtra2 sub-datasets. The copied regions are attacked by rotation distortions, with rotation angles of 2°, 4°, 6°, 8°, 10°, 20°, 60°, and 180°. There were 48 × 8 = 384 tested images in total. Figure 11 shows the detection results of the CMFD methods against rotation transforms.

Fig. 11 Detection results of the compared CMFD methods against rotation transform at the pixel level. (a) Precision; (b) Recall; (c) F1

3) Detection results of scaling transform

These experiments were based on the scale and scaleExtra sub-datasets. The copied regions are attacked by scaling distortions, with scaling factors of 80%, 91%, 93%, 95%, 97%, 99%, 101%, 103%, 105%, 107%, 109% and 120%. There were 48 × 12 = 576 tested images in total. Figure 12 shows the detection results of the CMFD methods against scaling transforms.

Fig. 12 Detection results of the compared CMFD methods against scaling transform at the pixel level. (a) Precision; (b) Recall; (c) F1

4) Detection results of JPEG compression

These experiments were based on the jpeg_sd sub-dataset. The copied regions are attacked by translation distortions, and the forged images are subjected to JPEG compression. The quality factor of the forged images reflects compression levels between 20% and 100% with a step size of 10%. There were 48 × 9 = 432 tested images in total. Figure 13 shows the detection results of the CMFD methods against JPEG compression.

Fig. 13 Detection results of the compared CMFD methods against JPEG compression at the pixel level. (a) Precision; (b) Recall; (c) F1

Table 3 shows the detection results on down-sampled images at the pixel level. The ‘ / ’ means the result is not available. Compared to the other methods, our proposed method achieved the best precision of 84.4% and took second place in recall and F1 scores. The multi-level dense descriptor method [5] achieved the best recall and F1, but at the expense of running time. However, in the following experiments under various geometric distortions and post-processing operations, the method from [5] performed worse than the proposed method at the pixel level. The SURF method [27] gave the weakest performance because it identifies too few keypoints to indicate the ground-truth region. Compared to Table 2, the performance of our proposed method decreases only slightly on down-sampled images. This is because the high matching threshold described in (3) yields sufficient keypoints to match the images.

In Figs. 11, 12, and 13, the purple curve marked ‘Bi [5]’ represents the block-based dense multi-level descriptor method. The light blue curve marked ‘Emam [11]’ represents the block-based PCET method. The dark blue curve marked ‘Zandi [32]’ represents the iterative interest-point method. The khaki and black curves, marked ‘Li [18]’ and ‘Pun [23]’, respectively, represent the SIFT methods. The pink curve marked ‘Shivakumar [27]’ represents the SURF method. Finally, the red curve marked ‘Proposed’ represents the proposed method. The X-axis represents the rotation angle, scaling factor, and quality factor, respectively. As shown in Fig. 11-(a), the proposed method achieved the best precision in some cases, such as rotations of 2°, 4°, 6°, and 10°. In the other cases, the method from [32] achieved the best precision. Overall, the precision of the proposed method and that of the method from [32] were approximately equal and were the best in most cases. The method from [11] was slightly weaker than the proposed method and took second place. The proposed method and the method from [11] both performed well because they use rotation-invariant features. In Fig. 11-(b), the recall of the proposed method and that of the method from [32] were much better than those of the other compared methods. The SIFT-based method [18] took third place. In Fig. 11-(c), the F1 scores (which combine precision and recall) of both the proposed method and the method from [32] were superior to those of the other state-of-the-art methods. This is because these methods fuse the invariance of SURF and the rotation-invariant extraction of PCET. Figure 12 shows the scaling detection ability of the compared methods.
It can be observed that the precision, recall, and F1 score of the proposed method were the best in most cases, especially for large-factor scaling. Its recall greatly exceeded that of the block-based method from [11], the SIFT method from [23] and the SURF method from [27]. Figure 12 also shows that the proposed method was not sensitive to scaling attacks. This is because the proposed method calculates the affine matrix and then adaptively adjusts the radius of the circular block to resist scaling transforms. The method from [11] performed poorly on the scaling transforms, as shown in Fig. 12-(a), (b) and (c); it was unable to handle scaling, especially large-factor scaling. Figure 13 shows the detection results under JPEG compression attacks as the image quality factor varies. In Fig. 13-(a), the precision of the proposed method was the best when the quality factor was high. When the quality factor was below 70, the recall of the proposed method behaved similarly to its precision. However, even when the quality factor was low, our method was second only to the method from [32]. As shown in Fig. 13-(c), the F1 scores of the proposed method followed the same trend as its precision and recall. The F1, precision, and recall of the proposed method all exceeded 70% under JPEG compression attacks, and the method achieved the best performance in most cases. Even when the quality factor was below 60, our method achieved second place. The adaptive keypoint filtering algorithm of the proposed method filters out most of the outliers and retains as many inliers as possible. The inliers can accurately locate the copy-move forgery regions.
The affine matrices, which are obtained from a large number of inliers, accurately indicate the geometrical transformations of the forged regions or the geometrical correlations between the pixel pairs. The iterative region-merging algorithm uses the superpixels to complement the region matting, which makes it a precise region-filling algorithm. Therefore, under various post-processing conditions, our method achieves superior performance compared with the other state-of-the-art methods.

4.5 The experiments under CMH dataset and other evaluation criteria

Some other evaluation criteria have been proposed in the state-of-the-art methods, namely True Positive Rate (TPR), False Positive Rate (FPR) and Accuracy (ACC) [28]. TPR is the same as recall. FPR is the ratio of the authentic pixels incorrectly detected as forged to all authentic pixels. ACC is the ratio of the correctly detected pixels to all pixels. ACC is defined in Eq. (27).

$$ ACC=\frac{TPR+\left(1- FPR\right)}{2}=\frac{T_P+{T}_N}{T_P+{F}_P+{T}_N+{F}_N} $$
(27)
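The criteria TPR, FPR and ACC can be computed from the four pixel counts as sketched below. The function name is hypothetical. The sketch returns both the count-based accuracy described in the text (correctly detected pixels over all pixels) and the averaged form (TPR + (1 − FPR))/2; the two values coincide when the numbers of forged and authentic pixels are equal and can differ otherwise.

```python
from typing import Tuple


def tpr_fpr_acc(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float, float, float]:
    """Return (TPR, FPR, count-based ACC, averaged ACC) from pixel counts."""
    tpr = tp / (tp + fn)                    # same as recall
    fpr = fp / (fp + tn)                    # authentic pixels flagged as forged
    acc = (tp + tn) / (tp + fp + tn + fn)   # correctly detected pixels / all pixels
    averaged = (tpr + (1.0 - fpr)) / 2.0    # mean of TPR and true-negative rate
    return tpr, fpr, acc, averaged
```

For 100 forged and 100 authentic pixels with tp = 80 and fp = 10, both accuracy values equal 0.85. For 10 forged and 190 authentic pixels with tp = 8 and fp = 10, the count-based value is 0.94 while the averaged value is about 0.874.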

The Copy-Move Hard (CMH) dataset proposed in [28] has four sub-datasets with a total of 108 (23 + 25 + 26 + 34) forged images. TPR, FPR and ACC were used to evaluate the performance of the compared methods on the CMH dataset. The results of the proposed method and five other methods are listed in Table 4.

Table 4 Detection results of CMH dataset

Table 4 shows the detection results on the CMH dataset. Compared to the other methods, the FPR of the proposed method was slightly weaker, but the proposed method achieved the best ACC of 90.8% and the best TPR of 83.3%. The method from [28] achieved the second-best ACC and TPR. Overall, the proposed method outperformed the state-of-the-art methods in terms of TPR and ACC.

4.6 Comparison of running times

From the above analysis, the proposed method achieved the best overall performance under various distortions on different datasets. We now turn to an analysis of the computational cost of the proposed method and the state-of-the-art methods. To improve efficiency and reduce computation time, the proposed method was implemented by mixed-language programming based on MATLAB and C++, using MATLAB 2016b, VS2015 and OpenCV 3.2. The experiments were performed on a computer with one Intel(R) Xeon(R) E5-2650 @2.20 GHz CPU with 12 cores and 64 GB RAM. We chose the datasets of Friedrich-Alexander University [9], which were larger than those in the CMH dataset. We divided the proposed method into three stages: feature extraction, adaptive keypoint filtering, and iterative region merging. Table 5 shows the average running times of the three stages and the total time of the proposed method.

Table 5 Running time of the proposed method using dataset [9]

Two problems make it difficult to compare the complexity of the proposed method with that of each analyzed method. First, most of the state-of-the-art methods do not report running times. Second, the compared methods were not run on a unified platform, so it is hard to compare their times exactly. We have tried to provide the running times of the compared methods for which code is available. We have also cited the running times reported in [28] and normalized the running time of the proposed method based on the approximate criteria of the method in [28]. Table 6 shows the comparison results.

Table 6 Running times of the proposed method and the state-of-the-art methods

Table 6 clearly shows that the running times of the keypoint-based methods ([7, 23, 28] and the proposed method) were generally shorter than those of the block-based methods ([5, 11, 22, 25]). The block-based methods spend considerable time on feature extraction and matching for each block (or each pixel). The keypoint-based methods extract local image features as keypoints and only filter the extracted keypoints as matches. For the proposed method, the feature extraction and keypoint filtering stages occupied a small portion of the total time, owing to the mixed-language programming with various efficient tools. The iterative region-merging stage needs to calculate the feature of each neighboring keypoint iteratively, so it took a relatively long time.

5 Conclusions

In this paper, the proposed method fuses the advantages of the keypoint-based and block-based methods to perform CMFD. The method mainly involves a local maxima and minima extraction and matching algorithm, an adaptive keypoint filtering algorithm, and an iterative region-merging algorithm. First, the local maxima and minima are extracted as candidate keypoints by Speeded Up Robust Features (SURF). The Best-Bin-First (BBF) search is employed to obtain the correlation between each pair of candidate keypoints. The candidate keypoints are then matched as pairs by employing the 2 Nearest Neighbors (2NN) test. The adaptive keypoint filtering algorithm based on random sample consensus (RANSAC) is proposed to filter out the outliers and obtain the inlier (keypoint) pairs. The first sub-stage of the filtering algorithm removes nearest neighbor keypoint pairs. The second sub-stage evaluates the inliers and corrects keypoint misclassifications. The adaptive keypoint filtering can identify both single forgeries and multiple forgeries. Finally, the iterative region-merging algorithm is proposed to obtain the forged regions. Neighboring keypoints (NKs) are generated to obtain additional inliers to cover the forged regions, and DPCET is employed to extract the circular block features corresponding to the NKs and filter out any mismatched NKs. The suspected regions are generated from the circular blocks of the matched NKs. Simple Linear Iterative Clustering (SLIC) is employed to segment the host image into superpixels. The superpixels and suspected regions are merged to fill the detected regions more precisely. A series of experiments demonstrated that, compared with the state-of-the-art methods, our proposed method achieves the best performance for CMFD under various post-processing operations.
In future work, the adaptive keypoint filtering and iterative region-merging of our proposed scheme will be applied to wider fields, focusing not only on CMFD but also other types of forgery detection such as splicing and image morphing.