1 Introduction

Image inpainting is the process of recovering an image from a degraded version according to the residual image information. Inpainting technology emerged in the early twentieth century and was initially used to restore ancient artworks, cultural relics, and paintings. With the explosive growth of digital information, image inpainting is now flexibly applied on various occasions, including image retouching, image denoising [1], watermark removal [2], film special effects, and background beautification.

In recent years, digital image inpainting technology has become a research hotspot [3,4,5,6,7,8,9,10]. It can be broadly classified into traditional image inpainting technology and deep learning-based image inpainting technology. Traditional digital image inpainting techniques [3,4,5,6] include diffusion-based methods and exemplar-based methods; they are limited by the use of local diffusion information and cannot create complex textures or semantic components [11]. To learn semantic features of the image, deep learning-based image inpainting technologies [7,8,9,10] train inpainting models on large-scale datasets and achieve impressive results across a wide range of application scenarios [12]. Benefiting from the use of adversarial networks, these methods can adapt to different resolutions and generate missing objects automatically.

Fig. 1

Examples of image inpainting. Left is the original image, and the right is the inpainted image

However, image inpainting acts as a double-edged sword. One can either use it to restore the integrity of an image or tamper with an image by filling in the gaps left by object removal, as shown in Fig. 1. With the development of image editing equipment, the tampering traces hidden in inpainted images have become more difficult to identify with the naked eye. If image inpainting technologies are maliciously used in news media or judicial forensics, or to mislead the public, they will cause great harm to people’s daily life and bring about social and political instability. To distinguish the authenticity of images, in-depth research on digital image inpainting detection has attracted the attention of many researchers.

Early image inpainting detection methods mainly identify inpainted regions by calculating similarity features between image blocks [13,14,15,16,17]. As convolutional neural networks have become more prominent in computer vision, more researchers have turned to inpainting detection models incorporating convolutional neural networks [18,19,20,21,22,23,24]. These deep learning-based image inpainting detection models show apparent advantages in extracting inpainting traces and reducing false alarm rates.

Although deep learning-based approaches have yielded positive results in image inpainting detection, they often suffer from a lack of generalizability and a reliance on specific features and data distributions [25]. Current inpainting forensic technologies generally use a CNN as the feature extractor, which tends to learn content features from the image and easily overlooks the tiny artifacts introduced during the inpainting process. At the same time, the convolution and downsampling operations in the forensic network may cause information loss in the feature representation, and bilinear upsampling can also blur the precise mask predicted by the network.

In view of the above shortcomings, this paper constructs a hybrid CNN-Transformer network, termed CTNet, for image inpainting detection and localization; it consists of a hybrid CNN-Transformer encoder, a feature enhancement module, and a decoder module. The key idea of the hybrid CNN-Transformer encoder is to capture long-range dependencies of images through the self-attention mechanism of the Transformer. The feature enhancement module utilizes hierarchically combined layers to extract deep frequency features, which the decoder incorporates into the upsampling process of the predicted mask as extra supervision to improve the generalization capability. We optimize the CTNet framework using a hybrid loss function consisting of pixel-level and image-level losses, reducing the impact of class imbalance in the training dataset. Lastly, we provide a comparative analysis to investigate both the performance and the generalization of our model.

The main contributions of our work include the following:

1) We introduce an encoder with a hybrid CNN-Transformer architecture for image inpainting detection, which compensates for the Transformer's exclusive focus on modeling the global context. The hybrid encoder takes full advantage of the complementary characteristics of the CNN and the Transformer to extract both local and global inpainting features.

2) High-frequency features are applied to supervise the upsampling of the feature maps output by the hybrid encoder, which results in better accuracy in detecting inpainted regions.

3) We tackle the challenge of constructing an image inpainting detection method that attains good generalizability for a variety of unseen inpainting operations and, thus, is not limited to a specific inpainting method.

Experimental results show that our model achieves state-of-the-art image inpainting detection performance on the test set generated by ten commonly used inpainting methods. The comparative experiments demonstrate the effectiveness and generalization ability of our approach.

The rest of the paper is organized as follows. Section 2 briefly reviews works related to image inpainting approaches, inpainting detection, and the Transformer. Section 3 presents the details of our method. Section 4 discusses the experimental results, including the results of our model and comparisons with state-of-the-art methods. Section 5 summarizes our work.

2 Related work

2.1 Image inpainting

Many methods have been proposed for image inpainting, including traditional methods based on diffusion or patches [26,27,28,29,30,31,32,33,34] and methods based on deep learning [35,36,37,38,39,40,41,42,43]. Bertalmio et al. [26] proposed the first diffusion-based approach in 2000, which smoothly propagates image information along the isophotes to fill the missing region. Telea et al. [31] proposed an inpainting algorithm, simple to implement, that propagates an image smoothness estimator along the image gradient. Bertalmio et al. [32] adapted the Navier–Stokes equations for incompressible fluids, benefiting from well-developed numerical results. Herling et al. [33] presented a high-quality image inpainting approach that enables Diminished Reality applications. Huang et al. [34] proposed an algorithm that automatically guides patch-based image completion by leveraging mid-level structural cues.

Traditional image inpainting methods use the internal statistical information of the image, such as the edges of damaged parts or similar image patches; they lack global semantic information and cannot synthesize content that is not a continuation of the existing image regions. To overcome these shortcomings, image inpainting based on deep learning has gradually become a research hotspot. Deep learning-based techniques combine global semantic and texture information, significantly improving inpainting effects. Yu et al. [39] presented a novel generative network with a contextual attention layer, which can not only synthesize image structures but also use features from neighboring image regions. Yan et al. [40] introduced a shift-connection into U-Net, which exhibits fast speed and fine details through deep feature rearrangement. Nazeri et al. [41] developed a two-stage adversarial model comprising an edge generator and an image completion network, which can be used as an interactive image editing tool. To improve the visual plausibility of the results produced by deep learning-based methods, Wu et al. [42] suggested an end-to-end generative model that incorporates local binary patterns. Considering the impact of the corrupted regions of the image on normalization, Yu et al. [43] proposed a spatial region-wise normalization that divides pixels into small regions and computes the mean and variance for normalization.

2.2 Inpainting detection

Many inpainting detection approaches have been developed to prevent the malicious use of inpainting operations. Wu et al. [13] proposed a blind detection method based on zero-connectivity features and fuzzy membership. However, this semi-automatic detection method requires manual selection of suspicious areas and a large amount of computation. Bacchuwar et al. [14] improved the algorithm of [13] by converting the image to the Y-Cb-Cr format and using only the luminance component Y for detection, which speeds up the computation. Aiming at the problem that uniform regions in the image background interfere with the detection process, Chang et al. [15] applied a multi-region relation technique to identify suspicious image patches within homogeneous regions.

Zhu et al. [16] built a convolutional neural network to detect patch-based inpainting operations, constructing a class label matrix for each pixel of the image when training the encoder–decoder network. Li et al. [18] employed a fully convolutional network based on high-pass-filtered image residuals, which enhances the difference between inpainted and untouched regions. Considering the spatial and channel correlations of feature maps, Xiao et al. [44] introduced a squeeze-and-excitation block, applied in the feature extraction and upsampling stages to pay more attention to spatial location and channel dependence. Zhang et al. [19] used a modified U-shaped feature pyramid network (FPN) to extract multi-scale inpainting features. Wang et al. [20] combined Mask R-CNN [45] with an FPN to extract features, detecting both images tampered with by conventional inpainting methods and images modified by deep learning-based methods. Li et al. [21] proposed a method for generating a universal training dataset, which imitates the noise pattern discrepancies between real and synthesized contents to train universal detectors. Chen et al. [22] proposed a multi-view feature learning network with multi-view supervision that contains novel elements designed for learning semantic-agnostic features. Wu et al. [23] introduced MT-Net, a unified deep neural architecture that can handle images of arbitrary sizes and many known forgery types; MT-Net detects forged pixels by identifying local anomalous features and thus also performs well in image inpainting detection. The forensic model proposed by Wu et al. [24] consists of three parts, namely a feature enhancement block, a feature extraction block, and a decision block; it is designed with the assistance of a NAS algorithm and embedded attention modules to optimize the latent high-level features. Yang et al. [46] provide a near-original image augmentation strategy that pushes the inpainted images closer to the original images, adding hard samples to the training set to improve the accuracy of their model.

However, many state-of-the-art image inpainting detection methods suffer from heavy computation, time-consuming pre-processing, and, in most cases, poor generalization ability. To address these issues, we propose a pixel-level inpainting detection method based on a hybrid CNN-Transformer encoder, which reduces computation and enhances the generalization ability to unseen inpainting methods.

2.3 Transformer

The Transformer [47] is a classic NLP model proposed by the Google team in 2017 and has been widely used in NLP and machine translation. Based on the encoder-decoder architecture, the Transformer relies entirely on the self-attention mechanism, which replaces the recurrent structure of recurrent neural networks. Generally, Transformer-based models are trained on large text corpora and fine-tuned for specific tasks to achieve better computational efficiency and accuracy. Due to its remarkable ability to support parallel processing of sequences, the Transformer architecture has proven its potential in reducing training time significantly. Consequently, this technique has become the state-of-the-art approach in many NLP tasks.

Inspired by the Transformer, the Vision Transformer (ViT) proposed by Dosovitskiy et al. [48] breaks the isolation between NLP and computer vision (CV) and successfully applies the Transformer to sequences of image patches for image classification tasks. The pioneering ViT model uses the self-attention mechanism to capture global features even in shallow layers, which addresses the difficulty CNNs have in capturing and storing long-range dependency information. When pre-trained on large-scale datasets and transferred to multiple mid-sized or small image recognition datasets (ImageNet [49], CIFAR-100 [50], VTAB [51], etc.), ViT demonstrates superior transferability on downstream tasks.

There are many follow-up studies [52,53,54,55,56] extending ViT. For example, Touvron et al. [52] proposed a teacher–student distillation training strategy for ViT, adding a distillation token as supplementary information to the classification token. Wang et al. [53] introduced a pyramid structure into the Transformer, which performs better on dense prediction tasks. The above studies reveal the effectiveness of the Transformer in computer vision tasks; thus, we propose an image inpainting detection method that combines a Transformer and a CNN encoder.

3 Proposed method

Fig. 2

The architecture of the proposed CTNet

In this study, we propose an image inpainting detection method based on a hybrid CNN-Transformer structure (abbreviated as CTNet). On the one hand, it effectively resolves the issue that a CNN must pass information through successive layers and cannot capture long-range dependencies. On the other hand, the decoder upsamples the features encoded by the Transformer and incorporates residual noise as auxiliary information into the upsampling process to improve the localization accuracy. An overview of the CTNet architecture is presented in Fig. 2. We elaborate on the three main blocks of the CTNet, i.e., the hybrid CNN-Transformer encoder, the feature enhancement module, and the decoder module.

3.1 Hybrid CNN-Transformer encoder module

3.1.1 Convolution block

Fig. 3

The CNN part of the hybrid encoder. (C, H, W) represents the input dimension, C and C1 are the numbers of convolution kernels, and S is the stride. In stage 1, \(C = C1\); in stages 2, 3, and 4, \(C = 2 \times C1\)

When used for classification tasks, the original ViT encoder aggregates global features and ignores image details at low resolutions. This property is not conducive to reconstructing the full-resolution image during the upsampling process, resulting in rough edges in the decoder output. Consequently, we first encode the image into high-level feature representations using a CNN, which are then reshaped into a 2D sequence as the input of the ViT encoder.

The CNN part of the hybrid encoder is shown in Fig. 3. Given an RGB image \(x \in {\mathbb {R}}^{H \times \textrm{W} \times C}\) with 3 channels and its corresponding binary mask y, we first feed image x into the ResNet50 [57] network pre-trained on the ImageNet dataset to extract the output feature representations of the fourth layer.

Stage 0 in Fig. 3 is the first layer of the ResNet50 network, which applies convolution, normalization, an activation function, and max pooling to the input image. Stages 1–3 in Fig. 3 correspond to the second–fourth layers of the ResNet50 network, each of which contains two types of residual blocks with skip connections, namely the Conv Block and the Identity Block. The Conv Block is applied to change the dimension of the feature maps, while the Identity Block corresponds to the case where the input has the same dimension as the output. The residual modules alleviate the problem of gradient vanishing or exploding when training deep neural networks, and they also preserve data integrity to a certain extent.

The output feature representations of the first–fourth layers of ResNet50 are: \(F_0 \in {\mathbb {R}}^{\frac{H}{4} \times \frac{W}{4} \times 64}, F_1 \in {\mathbb {R}}^{\frac{H}{4} \times \frac{W}{4} \times 256}, \quad F_2 \in {\mathbb {R}}^{\frac{H}{8} \times \frac{W}{8} \times 512}, \quad F_3 \in {\mathbb {R}}^{\frac{H}{16} \times \frac{W}{16} \times 1024}\). We employ \(F_3\) as the input of the Transformer encoder.
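For concreteness, the following PyTorch sketch shows one way to obtain the feature maps \(F_0\)–\(F_3\) described above; it assumes the torchvision ResNet50 layer naming (conv1/bn1/relu/maxpool, layer1–layer3) and is an illustration rather than the authors' exact implementation.

```python
import torch
from torchvision.models import resnet50

# Illustrative sketch: extract F0..F3 from a ResNet50 backbone
# (in practice the ImageNet-pretrained weights are loaded, as in the paper).
backbone = resnet50()

def cnn_features(x):
    """x: (B, 3, H, W) RGB tensor; returns F0..F3 as in Section 3.1.1."""
    f0 = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))  # (B,   64, H/4,  W/4)
    f1 = backbone.layer1(f0)                                               # (B,  256, H/4,  W/4)
    f2 = backbone.layer2(f1)                                               # (B,  512, H/8,  W/8)
    f3 = backbone.layer3(f2)                                               # (B, 1024, H/16, W/16)
    return f0, f1, f2, f3

x = torch.randn(1, 3, 256, 256)
_, _, _, f3 = cnn_features(x)
print(f3.shape)  # torch.Size([1, 1024, 16, 16]) -> fed to the Transformer encoder
```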

3.1.2 Transformer encoder block

Fig. 4

One layer of the Transformer encoder. Each layer consists of MSA and MLP blocks, with LN applied before every block and a residual connection applied after every block

The Transformer encoder employed in ViT first splits the input image \(x \in {\mathbb {R}}^{H \times W \times C}\) into N non-overlapping patches of size P\(\times\)P, obtaining a flattened 2D sequence \(\left\{ x_P^i \in {\mathbb {R}}^{P^2 \cdot C} \mid i=1,2, \ldots, N\right\}\), where the sequence length is \(N=\frac{H \times W}{P^2}\). Then, the encoder uses a trainable linear projection to map the serialized patches \(x_P\) into a D-dimensional embedding space. To obtain the patch sequence \({\textbf{z}}_0\), each patch is embedded through a randomly initialized convolution layer with a kernel size of \(P \times P\):

$$\begin{aligned} {\textbf{z}}_0=\left[ {x}_{cls} ;\,\,{x}_P^1 {\textbf{E}} ;\,\,{x}_P^2 {\textbf{E}} ;\,\,\cdots ;\,\,{x}_P^N {\textbf{E}}\right] +{\textbf{E}}_{p o s}, \end{aligned}$$
(1)

where \({\textbf{z}}_0\) is the patch sequence input to the L-layer Transformer encoder after linear mapping and position embedding, \({x}_{cls}\) is a class token that gathers information from all the patches, \({\textbf{E}} \in {\mathbb {R}}^{\left( P^2 \cdot C\right) \times D}\) denotes the trainable linear projection, and \({\textbf{E}}_{pos} \in {\mathbb {R}}^{(N+1) \times D}\) denotes the position embedding.

The Transformer encoder consists of L layers of Multi-head Self-Attention (MSA) and Multi-Layer Perceptron (MLP) modules, as shown in Fig. 4. For the l-th layer of the encoder, we denote its input as \({\textbf{z}}_{\ell -1}\) and the output as \({\textbf{z}}_{\ell }\), and apply Layer Normalization (LN) before each MSA and MLP block. The residual connection is applied after the MSA and MLP blocks as follows:

$$\begin{aligned} \begin{aligned} {\textbf{z}}_{\ell }^{\prime }&={\text {MSA}}\left( {\text {LN}} \left( {\textbf{z}}_{\ell -1}\right) \right) +{\textbf{z}}_{\ell -1}, \\ {\textbf{z}}_{\ell }&={\text {MLP}}\left( {\text {LN}}\left( {\textbf{z}}_{\ell }^{\prime }\right) \right) +{\textbf{z}}_{\ell }^{\prime }. \end{aligned} \end{aligned}$$
(2)

In this paper, we utilize the output feature map \(F_3 \in {\mathbb {R}}^{\frac{H}{16} \times \frac{W}{16} \times 1024}\) of the convolution module as the input of the Transformer encoder. We first apply a linear projection to \(F_3\) to convert it into a D-dimensional (D = 768) sequence. Next, patch embedding with \(1 \times 1\) patches is applied, and the resulting sequence is fed into the L-layer (L = 12) Transformer encoder for feature extraction. The final output feature map of the hybrid encoder is \({\textbf{z}}_L \in (N, D)\).
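A minimal PyTorch sketch of this hybrid-encoder path is given below. It implements one pre-norm layer of (2) and tokenizes \(F_3\) with a \(1 \times 1\) projection; the class token, position embedding, and dropout are omitted for brevity, and the MLP width follows a ViT-Base-style configuration as an assumption.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One pre-LN Transformer layer implementing Eq. (2)."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]  # z'_l = MSA(LN(z_{l-1})) + z_{l-1}
        z = z + self.mlp(self.ln2(z))                      # z_l  = MLP(LN(z'_l)) + z'_l
        return z

# Tokenize F3 (B, 1024, H/16, W/16) with a 1x1 projection and run L = 12 layers.
proj = nn.Conv2d(1024, 768, kernel_size=1)       # linear projection E as a 1x1 convolution
encoder = nn.Sequential(*[EncoderLayer() for _ in range(12)])
f3 = torch.randn(1, 1024, 16, 16)
tokens = proj(f3).flatten(2).transpose(1, 2)     # (B, N, D) with N = (H/16) * (W/16)
z_L = encoder(tokens)                            # (B, N, D), reshaped later by the decoder
```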

3.2 Feature enhancement module

The key to detecting inpainted regions is to capture the tampering traces. Because inpainting traces mainly exist in the high-frequency components of digital images, the first step in inpainting detection is to suppress the content of input images and highlight inpainting traces. A common practice is to use high-pass filters to obtain high-frequency residuals, such as the PF (pre-filtering) filter [18], the Bayar filter [58], and the IRHP (Improved Random High-Pass) filter [59].

As shown in Fig. 5, the PF filter proposed by Li et al. [18] employs a learnable pre-filtering module to extract the residual noise of the image. The PF filter convolves each channel of the three-channel image with a \(3 \times 3\) high-pass filter kernel with a stride of 1. Then, the obtained noise residuals are concatenated as the input of the subsequent network.

Fig. 5

The initialized convolution kernels for the PF filter

Bayar et al. [58] proposed a constrained convolution structure that suppresses image content and can adaptively learn low-level forensic features. The authors impose additional constraints on the first convolutional layer of the deep neural network to capture the changes in the dependencies among neighboring pixels caused by inpainting operations. The constraints on the first convolution layer are as follows:

$$\begin{aligned} \left\{ \begin{array}{l} {\varvec{w}}_k^{(1)}(0,0)=-1 \\ \sum _{(m, n) \ne (0,0)} {\varvec{w}}_k^{(1)}(m, n)=1, \end{array}\right. \end{aligned}$$
(3)

where \({\varvec{w}}_k^{(1)}(m, n)\) represents the weight of the k-th convolution kernel of the first convolutional layer at position (m, n), and \({\varvec{w}}_k^{(1)}(0,0)\) is the center position of the convolution kernel. The weight at the center of the convolution kernel is set to −1, and the weights at the surrounding positions are normalized so that their sum equals 1. During the training phase, the constrained convolutional layer updates the weights through backpropagation after each iteration, enabling the network to extract image manipulation features adaptively.
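As an illustration of the constraint in (3), the sketch below re-projects the first-layer kernels after every update so that the centre weight is \(-1\) and the remaining weights sum to one; the projection-after-each-iteration scheme and the per-channel normalization are our assumptions, not necessarily the exact implementation of [58].

```python
import torch
import torch.nn as nn

class ConstrainedConv2d(nn.Conv2d):
    """3x3 constrained convolution: centre weight -1, remaining weights sum to 1."""
    def __init__(self, in_ch=3, out_ch=3):
        super().__init__(in_ch, out_ch, kernel_size=3, padding=1, bias=False)

    @torch.no_grad()
    def constrain(self):
        w = self.weight                       # (out_ch, in_ch, 3, 3)
        w[:, :, 1, 1] = 0.0                   # temporarily zero the centre position
        w /= w.sum(dim=(2, 3), keepdim=True)  # normalize the surrounding weights to sum to 1
        w[:, :, 1, 1] = -1.0                  # fix the centre weight to -1

layer = ConstrainedConv2d()
layer.constrain()   # call after initialization and after every optimizer step
```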

IRHP [59] is an improved random high-pass filter initialization method that reduces computational costs while maintaining stable data flow at the kernel's input and output. During the weight initialization of the CNN's first layer, a group of random high-pass filters is generated. Specifically, a simple high-pass filter template of size \(5 \times 5\) is depicted in Fig. 6:

Fig. 6

Left is the initialization weight of the IRHP convolution kernel, and the right is the input feature matrix of the CNN


where \(w_i, i=1,2, \ldots, N\), are scalar random variables following a uniform distribution and C is an unknown constant. To initialize the convolution kernel as a high-pass filter, the mathematical expectation of \(w_i\) should be equal to \(-\frac{C}{N}\), which compensates for the unknown constant C. Accordingly, the interval of the uniform distribution for sampling \(w_i\) is \(U\left( -\frac{2 C}{N}, 0\right)\) or \(U\left( 0,-\frac{2 C}{N}\right)\), depending on the sign of C. From this distribution, we can calculate the variance of \(w_i\) as:

$$\begin{aligned} {\text {Var}}\left( w_i\right) =\frac{C^2}{3 N^2}. \end{aligned}$$
(4)

Under the assumption that \(x_i, i=1,2, \ldots, N\), are mutually correlated random variables and \(w_i, i=1,2, \ldots, N\), are random variables with a non-zero expectation, the authors proved that the variance of \(w_i\) should be equal to \(\frac{1}{N}\). Combining the above two equations, we can derive the value of C as:

$$\begin{aligned} {\text {Var}}\left( w_i\right) =\frac{C^2}{3 N^2}=\frac{1}{N} \Rightarrow C=\pm \sqrt{3 N}. \end{aligned}$$
(5)

In the case of a 5\(\times\)5 filter, \(3N = 3 \times (5 \times 5-1) = 72\) and \(C=\pm \sqrt{72} \approx \pm 8.4853\). Because the distribution of \(w_i\) is \(U\left( -\frac{2 C}{N}, 0\right)\) or \(U\left( 0,-\frac{2 C}{N}\right)\), the uniform distribution interval of \(w_i\) is \(U(-0.7071, 0)\) or \(U(0, 0.7071)\). Figure 7 shows a randomly initialized IRHP convolution kernel:

Fig. 7

Example of a 5\(\times\)5 IRHP convolution kernel. The center value is C = 8.4853, and \(w_i, i=1,2, \ldots, N\), are pseudo-random initialization parameters following the uniform distribution \(U(-0.7071, 0)\)
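Following the derivation above, an IRHP kernel can be initialized with a few lines; the sketch below assumes a \(5 \times 5\) kernel (N = 24, \(C = \sqrt{3N} \approx 8.4853\)) and samples the off-centre weights from \(U(-2C/N, 0)\).

```python
import math
import torch

def irhp_kernel(size=5):
    """Randomly initialize one IRHP high-pass kernel, following the derivation above."""
    n = size * size - 1                                     # off-centre weights, N = 24 for 5x5
    c = math.sqrt(3 * n)                                    # centre constant, about 8.4853
    w = torch.empty(size, size).uniform_(-2 * c / n, 0.0)   # E[w_i] = -C/N, Var(w_i) = 1/N
    w[size // 2, size // 2] = c                             # centre compensates the expectation
    return w

print(irhp_kernel().sum())  # close to zero in expectation, i.e. a high-pass response
```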

Extensive experiments are conducted to find a proper combination of different high-pass filters. We find that the combination of the PF, Bayar, and IRHP filters performs best on the test dataset. Therefore, we hierarchically combine the PF, Bayar, and IRHP filters as three sets of filters to extract image noise residuals. We first feed the three-channel image \({x} \in {\mathbb {R}}^{H \times W \times C}\) into the PF, Bayar, and IRHP filters, respectively. Then, the enhanced features are concatenated along the channel dimension to obtain \({\textbf{y}}_c \in (H, W, 15)\). Next, \({\textbf{y}}_c\) is passed through three standard convolutional layers, each consisting of convolution, batch normalization, and an activation function. The outputs of the three layers, \({\textbf{y}}_1 \in \left( \frac{H}{2}, \frac{W}{2}, C_3\right)\), \({\textbf{y}}_2 \in \left( \frac{H}{4}, \frac{W}{4}, C_2\right)\), and \({\textbf{y}}_3 \in \left( \frac{H}{8}, \frac{W}{8}, C_1\right)\), are used to supervise the upsampling process of the subsequent decoder network.
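A sketch of the feature enhancement module is given below. Only the total of 15 concatenated channels and the output resolutions are stated in the text, so the 3 + 3 + 9 channel split across the PF, Bayar, and IRHP branches and the channel widths C1, C2, C3 are illustrative assumptions; the branch weights would be initialized with the corresponding filter schemes described above.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    """Standard layer: stride-2 convolution + batch normalization + activation."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class FeatureEnhancement(nn.Module):
    def __init__(self, c3=32, c2=64, c1=128):
        super().__init__()
        self.pf    = nn.Conv2d(3, 3, 3, padding=1, groups=3, bias=False)  # PF branch (per-channel 3x3)
        self.bayar = nn.Conv2d(3, 3, 3, padding=1, bias=False)            # Bayar constrained branch
        self.irhp  = nn.Conv2d(3, 9, 5, padding=2, bias=False)            # IRHP random high-pass branch
        self.block1 = conv_bn_relu(15, c3)   # y1: (H/2, W/2, C3)
        self.block2 = conv_bn_relu(c3, c2)   # y2: (H/4, W/4, C2)
        self.block3 = conv_bn_relu(c2, c1)   # y3: (H/8, W/8, C1)

    def forward(self, x):
        yc = torch.cat([self.pf(x), self.bayar(x), self.irhp(x)], dim=1)  # (B, 15, H, W)
        y1 = self.block1(yc)
        y2 = self.block2(y1)
        y3 = self.block3(y2)
        return y1, y2, y3
```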

3.3 Decoder module

For the feature representation \({\textbf{z}}_L\) output by the encoder, the proposed decoder first reshapes it using two convolution layers, converting the hidden feature \({\textbf{z}}_L \in (N, D)\) into \({\textbf{x}}_0 \in \left( \frac{H}{16}, \frac{W}{16}, C_0\right)\). Then, we adopt a simple progressive upsampling approach to produce the predicted mask with the same resolution as the original image. The upsampling path consists of three stages. In each stage, the height and width of the feature map are doubled by bilinear interpolation, the result is concatenated along the channel dimension with the corresponding feature output by the feature enhancement module, and two standard convolution layers are then used to reduce the number of channels.

Specifically, the feature maps \({\textbf{x}}_1 \in \left( \frac{H}{8}, \frac{W}{8}, C_1\right)\), \({\textbf{x}}_2 \in \left( \frac{H}{4}, \frac{W}{4}, C_2\right)\), and \({\textbf{x}}_3 \in \left( \frac{H}{2}, \frac{W}{2}, C_3\right)\) are concatenated with the features \({\textbf{y}}_3\), \({\textbf{y}}_2\), and \({\textbf{y}}_1\) of matching resolution output by the feature enhancement module to regularize the upsampling of the predicted mask. Finally, the feature map obtained by concatenating \({\textbf{x}}_3\) and \({\textbf{y}}_1\) goes through a convolution and softmax layer, outputting the predicted mask with the same size as the input image.
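One decoder stage can be sketched as follows: bilinear upsampling by a factor of two, concatenation with the matching-resolution feature from the enhancement module, and two convolution layers to reduce the channels. The channel widths and the exact skip pairing are assumptions consistent with the dimensions given above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """One decoder stage: x2 bilinear upsampling, skip concatenation, two conv layers."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = torch.cat([x, skip], dim=1)   # concatenate along the channel dimension
        return self.conv(x)

# x0: (B, C0, H/16, W/16) reshaped from z_L; y3, y2, y1 come from the enhancement module.
# x1 = UpBlock(C0, C1, C1)(x0, y3)   # (B, C1, H/8, W/8)
# x2 = UpBlock(C1, C2, C2)(x1, y2)   # (B, C2, H/4, W/4)
# x3 = UpBlock(C2, C3, C3)(x2, y1)   # (B, C3, H/2, W/2)
```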

3.4 Loss function

We consider a pixel-scale loss to enhance the model's sensitivity to pixel-level inpainting and an image-scale loss to reduce the false alarm rate. Typically, the missing regions of inpainted images tend to be much smaller than the untouched regions, and using cross-entropy loss alone results in a low true-positive rate when identifying tampered regions. Focal loss [60], a variant of cross-entropy loss, is applied to compensate for this defect:

$$\begin{aligned} \begin{aligned} L_{\text{ Focal } }(y, {\hat{y}})&=-\sum \alpha (1-{\hat{y}})^\gamma * y \log ({\hat{y}}) \\&\quad -\sum (1-\alpha ) {\hat{y}}^\gamma *(1-y) \log (1-{\hat{y}}), \end{aligned} \end{aligned}$$
(6)

where y is the ground-truth mask of the digital image, \({\hat{y}}\) is the predicted mask output by the proposed CTNet, \(\alpha , \gamma \in (0,1)\) are hyperparameters. In addition to the pixel-scale loss, we also employ an image-level binary cross-entropy loss [22] \(L_{C l s}\) to reduce the false alarm rate:

$$\begin{aligned} L_{\textrm{Cls}}(y, {\hat{y}})=-y * \log G({\hat{y}})-(1-y) * \log (1-G({\hat{y}})), \end{aligned}$$
(7)

where \(G({\hat{y}})\) is the output value of the predicted image mask \({\hat{y}}\) after global max pooling; \(G({\hat{y}})\) should tend to zero when the input image is untampered. In summary, we use a combination of the pixel-level loss and the image-level loss:

$$\begin{aligned} L(y, {\hat{y}})=\lambda _1 L_{\text{ Focal } }(y, {\hat{y}})+\lambda _2 L_{C l s}(y, {\hat{y}}), \end{aligned}$$
(8)

where \(\lambda _1, \lambda _2 \in (0,1)\) are weighting hyperparameters.
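A minimal sketch of the hybrid loss in (6)–(8) follows; the hyperparameter values \(\alpha\), \(\gamma\), \(\lambda_1\), and \(\lambda_2\) are placeholders, and deriving the image-level label by max-pooling the ground-truth mask is our assumption.

```python
import torch

def focal_loss(y, y_hat, alpha=0.5, gamma=0.5, eps=1e-6):
    """Pixel-level focal loss, Eq. (6); y and y_hat are (B, H, W) masks in [0, 1]."""
    y_hat = y_hat.clamp(eps, 1 - eps)
    pos = -alpha * (1 - y_hat) ** gamma * y * torch.log(y_hat)
    neg = -(1 - alpha) * y_hat ** gamma * (1 - y) * torch.log(1 - y_hat)
    return (pos + neg).sum()

def image_level_bce(y, y_hat, eps=1e-6):
    """Image-level loss, Eq. (7); G(.) is global max pooling over the predicted mask."""
    g = y_hat.flatten(1).max(dim=1).values.clamp(eps, 1 - eps)
    label = (y.flatten(1).max(dim=1).values > 0).float()   # 1 if any pixel is inpainted
    return -(label * torch.log(g) + (1 - label) * torch.log(1 - g)).mean()

def hybrid_loss(y, y_hat, lam1=0.5, lam2=0.5):
    """Combined objective, Eq. (8)."""
    return lam1 * focal_loss(y, y_hat) + lam2 * image_level_bce(y, y_hat)
```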

4 Experiments and results

4.1 Datasets

Following the training and test set configuration of [24], our training set contains a total of 48k inpainted images sampled from the Dresden [61] and Places [62] datasets. The test set contains 10k images sampled from the Dresden, Places, CelebA, and ImageNet datasets for the various experiments. The training images are modified by the inpainting method proposed in [7]. The test images are inpainted by ten representative inpainting approaches, among which four are conventional methods, namely TE [31], NS [32], PM [33], and SG [34], and the remaining six are based on deep learning, namely GC [7], CA [39], SH [40], EC [41], LB [42], and RN [43]. Each of the above inpainting methods modifies 1k original images, yielding a total of 10k inpainted images and corresponding masks in the test set. The inpainting shapes include basic rectangles, circles, ellipses, etc.

To enhance the model robustness, we adopt the following data augmentation methods for the training images (a transform sketch follows the list):

  • Resize. The resolution of all images in the training set is adjusted to 256\(\times\)256.

  • Flipping. Half of the training samples are flipped horizontally, and the others are flipped vertically.

  • Normalization. We perform normalization on three channels of the input RGB image, mapping pixel values to a range between zero and one.
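The augmentation pipeline above can be expressed, for example, with torchvision functional transforms; applying the same geometric transform to the image and its mask is an implementation detail we assume rather than one stated in the text.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def augment(image, mask):
    """Resize to 256x256, flip (half horizontally, half vertically), scale pixels to [0, 1]."""
    image = TF.resize(image, [256, 256])
    mask = TF.resize(mask, [256, 256], interpolation=InterpolationMode.NEAREST)
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    else:
        image, mask = TF.vflip(image), TF.vflip(mask)
    return TF.to_tensor(image), TF.to_tensor(mask)   # PIL images -> float tensors in [0, 1]
```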

4.2 Implementation details

The experimental platform in this research is shown in Table 1.

Table 1 Experimental environment

We deploy stochastic gradient descent (SGD) as the optimizer and set the learning rate, momentum, and weight decay to 0.001, 0.9, and 1e−6, respectively, to update the CTNet. During the training process, the CTNet uses ResNet-50 and ViT, both pre-trained on the ImageNet dataset, as the backbone networks of the hybrid encoder. The average F1 score and AUC (Area Under the Receiver Operating Characteristic curve) are selected as indicators to evaluate our model's performance.
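The optimizer configuration amounts to the following lines, a direct transcription of the stated hyperparameters (`model` is assumed to be an instance of the proposed CTNet):

```python
import torch

# `model` is assumed to be an instance of the proposed CTNet.
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.001,           # learning rate
                            momentum=0.9,       # momentum
                            weight_decay=1e-6)  # weight decay
```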

4.3 Benchmark experiments

A series of experiments is conducted to comprehensively evaluate the detection performance of the CTNet. In the benchmark experiments, we compare CTNet with four state-of-the-art image inpainting detection methods on the test set: MVSS-Net [22], ManTra-Net [23], HP-FCN [18], and IID-Net [24]. The benchmark experiments include both testing directly with the pre-trained models officially released by the original papers and retraining on our training dataset [24] before testing on the 10 unseen inpainting methods. The training strategies and parameter settings are consistent with the original papers.

Table 2 Benchmark experimental results

Table 2 shows the comparison results using AUC and pixel-level F1 scores as metrics. In the “Retrain” column, “GC” indicates that the models are retrained on the training set synthesized by GC [7], while “−” means the pre-trained models are directly applied to evaluate the detection performance on the test set.

We can observe that our proposed CTNet achieves the best detection results on the test set synthesized by the 10 inpainting methods. The average AUC value and F1 score reach 99.11\(\%\) and 89.71\(\%\), leading the second-placed IID-Net model by 0.7\(\%\) and 2.14\(\%\), respectively. The advantage of CTNet is more prominent in terms of the F1 score, indicating that its performance on heterogeneous images is very stable.

In the benchmark experiments, it is not difficult to find that the pre-trained HP-FCN and MVSS-Net models perform poorly when used directly for testing. Although their average AUCs are greater than 50\(\%\), i.e., their binary classification results are better than random prediction, their average F1 scores are lower than 1\(\%\), indicating that these methods are designed only for specific image inpainting processes, which leads to an inability to detect unexpected image inpainting methods and poor generalization ability.

Fig. 8

Examples of the localization maps for inpainting detection

We also give some examples of the detection results in Fig. 8. It can be seen that the proposed algorithm obtains the most accurate results compared with the other deep learning-based digital image inpainting detection algorithms. Because our CTNet adopts a hybrid structural encoder that simultaneously considers the global information and the high-resolution representation of the image, it has a stronger inpainting feature extraction ability than the CNN encoders used by our competitors. Meanwhile, CTNet also shows great adaptability to unseen inpainting algorithms.

4.4 Ablation study

Table 3 Ablation experiment results

To verify the contribution of individual components to the model performance, ablation experiments are carried out in this paper. We modify the network structure in varied settings with the modules added progressively. First, we conduct experiments utilizing only one high-pass filter of the feature enhancement module. Second, we use only the CNN or the Transformer module of the hybrid encoder. Third, we drop the image-level loss from the hybrid loss function. The results of the ablation experiments are shown in Table 3.

Table 3 reports the average AUC values and F1 scores. Regarding the feature enhancement module, the model utilizing a single high-pass filter cannot achieve good detection results. The experiments show that using multiple high-pass filters can effectively enhance the traces introduced by inpainting operations, significantly improving both the mean AUC and the F1 score.

Regarding the hybrid encoder, employing only the CNN or the Transformer as the encoder causes the average AUC and F1 score to drop. In contrast, the hybrid encoder that combines the two improves the F1 score by 5\(\%\), proving that the hybrid encoder can effectively take both high-resolution features and global context information into account. Regarding the loss function, adding the image-level BCE loss on top of the focal loss effectively reduces the false alarm rate.

The qualitative results of pixel-level inpainting detection in the varied setups are shown in Fig. 9. In each row, the images from left to right are the inpainted image, the ground truth, and the results obtained without the high-pass filter module, without the Transformer, without the CNN, without the focal loss, and with our full model. From the third to the seventh columns of Fig. 9, we can observe that the combination of the hybrid structure encoder, the feature enhancement module, and the decoder can more precisely detect and locate inpainted regions. In summary, our model achieves a good balance between detection sensitivity and specificity.

Fig. 9

Exemplar visual comparisons on the test dataset, where the first two columns are inpainted images and the corresponding ground truth, and the third to seventh columns are results based on different settings

4.5 Robustness analysis

Image inpainting operations in real life are often accompanied by post-processing operations, which may diminish model performance in real-world detection scenarios. To further verify the robustness of CTNet against image post-processing operations, we perform the following three perturbations on the images in the test set, using a unified perturbation factor p for the different cases (see the sketch after the list):

1) JPEG compression. We use the perturbation factor p as the JPEG quality factor to compress the test images.

2) Gaussian noise. To simulate low-light conditions, we add Gaussian noise with a mean of zero and a standard deviation of \((100-p)/10\).

3) Image resampling. We change the resolution of the digital image to p\(\%\) of the original size.

Fig. 10

Inpainting detection performance of our model under various disturbances in terms of average AUC values

Figure 10 shows the robustness analysis results of our model. The abscissa p represents the perturbation factor, and the values of p in the robustness experiments are 95, 90, and 85. As p decreases, pixel-level inpainting detection becomes more difficult, and the detection performance of the models drops slightly. When the value of p is reduced from 95 to 90, we can observe that our model is robust to JPEG compression and Gaussian noise with small perturbations; the average AUC scores are all greater than 90 because we introduce frequency-domain features rich in high-frequency information.

As the perturbation strength increases, the detection performance of CTNet gradually decreases, but the average AUC score remains above 80. We speculate that the robustness of our model to image resampling perturbations would be significantly improved if image samples of different sizes were used for training. The robustness test results show that CTNet is robust to the minor disturbances caused by image post-processing.

5 Conclusion

In this paper, we have proposed CTNet, an image inpainting detection method based on a hybrid CNN-Transformer mechanism. Our CTNet exploits the inherent global self-attention mechanism of the Transformer encoder to capture long-range dependencies. In addition, the proposed decoder uses the frequency-domain features output by the high-pass filters to supervise the upsampling process of the predicted mask, which improves the accuracy of locating inpainted regions. The experimental results show that CTNet can generalize to ten commonly used inpainting methods after training on a single inpainting method. The comparative experimental results demonstrate the superior detection performance of the proposed model compared with current state-of-the-art inpainting detection methods.