1 Introduction

Nowadays, images and videos are the most common carriers of information across many domains of life [1, 2], and processing their content is a challenging task because of the rich information they may contain [3, 4]. The extraction of such information depends on the purpose of the analysis [5,6,7]. It is also a crucial tool for monitoring the security of people and objects [8,9,10,11,12]. However, editing applications that can alter an image without leaving any visible traces pose a problem for public trust and confidence. Thus, there is an urgent demand for automatic systems that can detect tampering and recover the original content of an available image. Since recovering the original image depends heavily on the extraction mechanism applied to the given image, object removal from images is a major research concern and a hot topic in information security [13, 14].

Images shared on social networks can contain many objects added to them, including signatures, rectangles, or emoticons. The addition of these objects can change the semantics of an image. Removing such objects from images is a widely recognized problem and a current track in computer vision research; object removal is also considered a remedy for image forgery. Object removal techniques in the literature can be divided into two categories: image inpainting and copy-move methods. Copy-move-based methods remove an undesired object by extracting a part of another image, or another region of the same image, and pasting it over the region to be removed. This technique is widely used for object removal owing to its simplicity; however, it is not suitable for some cases such as face images or complicated scenes. Image inpainting, on the other hand, was originally applied to old photographs to remove scratches and enhance damaged images. Now it is used to remove artifact objects added to images by filling the target region with estimated values. Image inpainting can also remove many other types of distortion, including text, blocks, noise, scratches, lines, and various kinds of masks [15,16,17]. Figure 1 illustrates the different existing types of distortion. Using recently developed algorithms, image inpainting can coherently restore both the texture and structure components of an image, and the obtained results demonstrate that these methods can remove undesirable objects without leaving traces such as ghosting artifacts. Until now, few methods have been proposed for blind image inpainting relative to the massive number of published works using techniques such as sequential-based, CNN-based, or GAN-based methods.

Removing objects from images using image inpainting may reach better performance in the future, but when image editors hide traces using sophisticated techniques, detecting the forgery and inpainting the image become difficult. For that reason, almost all detection approaches attempt to handle this by detecting abnormalities in the similarity between image blocks that can be affected during post-processing. This work summarizes different methods for image inpainting across different techniques, including sequential-based, CNN-based, and GAN-based methods.

The remainder of the paper is organized as follows: the literature overview, covering sequential-based, CNN-based, and GAN-based methods, is presented in Sect. 2. Employed datasets are presented in Sect. 3. Evaluations and the metrics used are discussed in Sect. 4. The conclusion is provided in Sect. 5.

Fig. 1 Types of distortion

Fig. 2 Image inpainting applications and the purposes of each category

2 Literature Review

Image inpainting is the process of completing or recovering missing regions in an image, or of removing objects added to it. The inpainting operation depends on the type of damage in the image and on the application that caused the distortion. For example, in image restoration we speak of removing scratches or text found in images, whereas in photo-editing applications we are interested in object removal. In image coding and transmission applications, the operation related to image inpainting is recovering missing blocks. Finally, for virtual restoration of paintings, the related operation is scratch removal. Figure 2 illustrates each kind of application and the corresponding image inpainting operation.

To handle this, many methods have been proposed, based either on sequential algorithms or on deep learning techniques. Accordingly, we categorize the existing image inpainting methods into three categories: sequential-based, CNN-based, and GAN-based approaches. Sequential-based approaches are methods that do not rely on deep learning with neural networks, whereas CNN-based approaches are algorithms that use convolutional neural networks with automatic feature learning. GAN-based approaches are methods that use generative adversarial networks (GANs) for training their inpainting models.

In the following, the image inpainting works related to each category of methods are presented.

2.1 Sequential-Based Methods

Sequential approaches to image inpainting can be classified into two categories: patch-based and diffusion-based methods.

Patch-based methods fill in the missing region patch by patch by searching for well-matching replacement patches (i.e., candidate patches) in the undamaged part of the image and copying them to the corresponding locations. Many such methods have been proposed. Ružić and Pižurica [15] proposed a patch-based method that searches for the best-matched patch in the texture component using a Markov random field (MRF). Jin and Ye [16] proposed a patch-based approach based on an annihilation property filter and a low-rank structured matrix. In order to remove an object from an image, Kawai et al. [17] proposed an approach based on selecting the target object and limiting the search around the target using the background. Using two-stage low-rank approximation (TSLRA) [18] and gradient-based low-rank approximation [19], the authors proposed patch-based methods for recovering corrupted blocks in an image. For RGB-D images corrupted by noise and text, Xue et al. [20] proposed a depth image inpainting method based on low gradient regularization. Liu et al. [21] used statistical regularization and similarity between regions to extract the dominant linear structures of target regions, then repaired the missing regions using an MRF model. Ding et al. [22] proposed a patch-based method for image inpainting using nonlocal texture matching and nonlinear filtering (an alpha-trimmed mean filter). Duan et al. [23] proposed an image inpainting approach based on the non-local Mumford-Shah model (NL-MS). Fan and Zhang [24] proposed another image inpainting method based on measuring the similarity between patches using the sum of squared differences (SSD). In order to remove blocks from an image, Jiang [25] proposed a method for image compression. Using singular value decomposition and an approximation matrix, Alilou and Yaghmaee [26] proposed an approach to reconstruct missing regions. Other notable research includes using texture analysis on Thangka images to recover missing blocks [27] and using the structure information of images [28, 29]. In the same context, Zeng et al. [30] proposed the use of a saliency map and gray entropy, while Zhang et al. [31] proposed an image inpainting method using a joint probability density matrix (JPDM) for object removal from images.
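Most of the patch-based approaches above share the same core step: scanning the undamaged part of the image for the candidate patch that minimizes a similarity cost, such as the SSD used by Fan and Zhang [24]. The following Python sketch illustrates this matching step; the brute-force search, the function name, and the mask convention (non-zero = damaged) are illustrative assumptions, not the procedure of any specific cited paper.

```python
import numpy as np

def find_best_patch(image, mask, target_top_left, patch_size=9):
    """Brute-force SSD search for the best-matching source patch.

    image: H x W (grayscale) or H x W x 3 array.
    mask:  H x W array, non-zero where pixels are damaged (assumed convention).
    """
    h, w = image.shape[:2]
    ty, tx = target_top_left
    target = image[ty:ty + patch_size, tx:tx + patch_size].astype(np.float64)
    known = mask[ty:ty + patch_size, tx:tx + patch_size] == 0  # valid pixels only

    best_score, best_pos = np.inf, None
    for y in range(h - patch_size + 1):
        for x in range(w - patch_size + 1):
            # candidate patches must lie entirely in the undamaged region
            if mask[y:y + patch_size, x:x + patch_size].any():
                continue
            cand = image[y:y + patch_size, x:x + patch_size].astype(np.float64)
            ssd = ((cand - target)[known] ** 2).sum()  # SSD on known pixels only
            if ssd < best_score:
                best_score, best_pos = ssd, (y, x)
    return best_pos, best_score
```

The located patch is then copied into the hole, and the process repeats patch by patch, typically prioritizing hole pixels surrounded by the most known neighbors.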

Wali et al. [32] proposed a denoising and inpainting method using total generalized variation (TGV); the authors analyzed three types of distortion: text, noise, and masks. In the same context, Zhang et al. [33] proposed an exemplar-based image inpainting approach based on color distribution, restoring the missing regions from the neighboring regions; this work analyzed several types of distortion, including objects, text, and scratches. A multiscale graph cuts technique is used for inpainting in [34], again over different types of distortion. In [35], the authors proposed a novel joint data-hiding and compression scheme for digital images using side-match vector quantization (SMVQ) and image inpainting; the approach was tested on six well-known grayscale images: Lena, airplane, peppers, sailboat, lake, and Tiffany. In order to preserve texture consistency and structure coherence, the authors in [36] remove objects added to images using a multiple-pyramids method, local patch statistics, and geometric-feature-based sparse representation. For 3D stacked image sensors, the authors in [37] proposed an image inpainting method using the discrete wavelet transform (DWT). In order to fill missing regions, the authors in [38] proposed a patch-based method that searches for and fills in these regions with the best-matching surrounding information. To reconstruct borehole images, the authors in [39] proposed a method that analyzes the texture and structure components of the images. The Helmholtz equation is used for inpainting in [40]; after inpainting the missing region, the authors propose a method for enhancing image quality.

Diffusion-based methods fill in the missing region (i.e., the hole) by smoothly propagating image content from the boundary to the interior of the region. Li et al. [41] proposed a diffusion-based method that localizes the diffusion of inpainted regions and then constructs a feature set, based on the intra-channel and inter-channel local variances of the changes, to identify the inpainted regions. Another diffusion-based method, proposed by the same authors in later research [42], exploits diffusion coefficients computed from the distance and direction between a damaged pixel and its neighboring pixels. Sridevi et al. [43] proposed a further diffusion-based image inpainting method based on fractional-order derivatives and the Fourier transform. Table 1 summarizes the patch-based and diffusion-based sequential methods for image inpainting.
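To make the diffusion principle concrete, the sketch below fills a hole by repeatedly averaging each missing pixel with its four neighbors, propagating boundary values inward. This is a minimal isotropic-diffusion illustration under assumed conventions (grayscale image, binary mask with 1 = hole, border wrap-around ignored for brevity); the cited methods [41,42,43] use more elaborate, edge-aware diffusion models.

```python
import numpy as np

def diffusion_inpaint(image, mask, iterations=500):
    """Fill hole pixels by iterated 4-neighbour averaging (isotropic diffusion)."""
    out = image.astype(np.float64).copy()
    hole = mask.astype(bool)
    out[hole] = 0.0  # initialize the hole
    for _ in range(iterations):
        # 4-neighbour average computed with shifted copies of the image
        avg = (np.roll(out, 1, axis=0) + np.roll(out, -1, axis=0) +
               np.roll(out, 1, axis=1) + np.roll(out, -1, axis=1)) / 4.0
        out[hole] = avg[hole]  # update only the missing pixels
    return out
```

Because the update only smooths, such schemes work well for thin scratches and text but blur large holes, which is precisely the weakness that motivated patch-based and learning-based alternatives.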

Jin et al. [44] proposed an approach called sparsity-based image inpainting detection based on canonical correlation analysis (CCA). Mo and Zhou [45] presented a dictionary-learning approach using sparse representation. These methods are robust for simple images, but when an image is complex, for example containing a lot of texture and objects, or when the object covers a large region, searching for similar patches becomes difficult.

Table 1 Sequential-based methods for image inpainting

2.2 Convolutional-Neural-Network-Based Methods

Recently, the strong potential of deep convolutional neural networks (CNNs) has been exhibited across all computer vision tasks, notably in image inpainting, where they are used to improve results by exploiting large-scale training data. The sequential-based methods succeed in some parts of the task, such as filling in texture details with promising results, yet capturing the global structure remains challenging [46]. Several inpainting methods have been proposed using CNNs or CNN-based encoder-decoder networks. Shift-Net, based on the U-Net architecture, is one such method and recovers missing blocks with good accuracy in terms of structure and fine-detailed texture [46]. In the same context, Weerasekera et al. [47] use the depth map of the image as input to a CNN architecture, whereas Zhao et al. [48] apply their architecture to inpainting X-ray medical images. VORNet [49] is a CNN-based approach for video inpainting aimed at object removal. Most image inpainting methods assume the locations of the damaged pixels or blocks are known; Cai et al. [50] instead proposed a blind image inpainting method named BICNN. Many further works build on CNN encoder-decoder network structures. Zhu et al. [51] proposed a patch-based inpainting method for forensic images. Using the same encoder-decoder technique, Sidorov and Hardeberg [52] proposed an architecture for denoising, inpainting, and super-resolution of noisy, inpainted, and low-resolution images, respectively. Zeng et al. [53] built a pyramidal-context architecture called PEN-Net for high-quality image inpainting. Liu et al. [54] added a coherent semantic attention (CSA) layer to the encoder-decoder network; this architecture is presented in Fig. 3. Further, Pathak et al. [55] proposed an encoder-decoder model for image inpainting. In order to fill gaps between drawn lines in an image, Sasaki et al. [56] used an encoder-decoder-based model, which can be helpful for scanned data with missing parts. For UAV data that may suffer from low resolution or contain blind spots, Hsu et al. [57] proposed a solution using the VGG architecture. For removing text from images, Nakamura et al. [58] proposed a CNN-based text erasing method. In order to enhance images of damaged artwork, Xiang et al. [59] also proposed a CNN-based method. In the same context as [59], and using a GRNN neural network, Alilou and Yaghmaee [60] proposed a non-texture image inpainting method. Unlike the previous methods, Liao et al. [61] proposed a method called Artist-Net for image inpainting. A similar goal is pursued by Cai et al. [62], who proposed a semantic object removal approach using a CNN architecture. In order to remove motifs from single images, Hertz et al. [63] proposed a CNN-based approach. Table 2 summarizes the CNN-based methods with a description of the type of data used for image inpainting.
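As a concrete illustration of the encoder-decoder designs used by many of these works (e.g., the context encoder of Pathak et al. [55]), the PyTorch sketch below downsamples a masked image to a compact representation and decodes it back to a full RGB prediction. The layer sizes and the choice to concatenate the mask as a fourth input channel are illustrative assumptions, not the architecture of any specific cited paper.

```python
import torch
import torch.nn as nn

class InpaintAutoencoder(nn.Module):
    """Minimal encoder-decoder for inpainting: masked RGB + mask in, RGB out."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 64, 4, stride=2, padding=1),     # 4 channels: RGB + mask
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 4, stride=2, padding=1),  # compact bottleneck
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
            nn.Sigmoid(),                                 # RGB in [0, 1]
        )

    def forward(self, damaged_rgb, mask):
        x = torch.cat([damaged_rgb, mask], dim=1)
        return self.decoder(self.encoder(x))

# Training typically minimizes a reconstruction loss on the hole region,
# e.g. nn.L1Loss() between the prediction and the ground-truth image.
```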

Fig. 3 Encoder-decoder network model in [54]

Table 2 CNN-based methods for image inpainting

For the related purpose of replacing a region of an image with a region from another image, the authors in [64] trained their own model based on the VGG model. In order to mitigate the effect of gradient vanishing, the authors in [65] introduced a dense block into the U-Net architecture used for inpainting. For medical purposes, the authors in [67] denoised medical images using the principle of image inpainting with a residual U-Net architecture. To address the blurring and color-discrepancy problems in image inpainting, the authors in [66] proposed a method for reconstructing missing regions using region-wise convolutions. Similarly, the authors in [68] added layers named interleaved zooming blocks to the encoder-decoder architecture for inpainting, and the authors in [69] proposed a full-resolution residual block (FRRB) within an encoder-decoder model for the same purpose.

2.3 GAN-Based Methods

This technique, now widely used, was introduced for image generation in 2014 [70]. Generative adversarial networks (GANs) are a framework comprising two feed-forward networks, a generator G and a discriminator D. The generator G is trained to create new images that are indistinguishable from real ones, whereas the discriminator D is trained to differentiate between real and generated images. This relation can be considered a two-player min-max game in which G and D compete: G tries to minimize, and D to maximize, the loss function, i.e., the adversarial loss, as follows:

$$\begin{aligned} \min _{G}\, \max _{D}\; E_{x\sim P_{data}(x)}\left[ \log D(x)\right] + E_{z\sim P_{z}(z)}\left[ \log \left( 1-D(G(z))\right) \right] \end{aligned}$$
(1)

where z and x denote a random noise vector sampled from the noise distribution \(P_z(z)\) and a real image sampled from the real data distribution \(P_{data}(x)\), respectively. Recently, GANs have been applied in several semantic inpainting techniques in order to complete the hole region naturally.
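The min-max objective in Eq. (1) translates into alternating gradient steps on D and G. The PyTorch sketch below shows one such step using the standard non-saturating formulation; it assumes a discriminator ending in a sigmoid, and it is a generic GAN training step, not the procedure of any specific inpainting paper cited here.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, real, z_dim=100):
    """One alternating optimization step of the min-max game in Eq. (1)."""
    z = torch.randn(real.size(0), z_dim, device=real.device)
    fake = G(z)

    # Discriminator: maximize log D(x) + log(1 - D(G(z)))
    d_real, d_fake = D(real), D(fake.detach())
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: the common non-saturating surrogate maximizes log D(G(z))
    d_fake = D(fake)
    g_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```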

Figure 4 shows this framework: the generator takes random noise z as input and generates fake samples similar to real ones, while the discriminator learns to determine whether samples are real or fake. At present, GANs are among the most used techniques across computer vision applications. GAN-based approaches using a coarse-to-fine network and a contextual attention module give good performance and have proven helpful for inpainting [71,72,73,74,75]. Existing GAN-based image inpainting methods are still relatively few. Among them, Chen and Hu [71] proposed a GAN-based semantic image inpainting method, named progressive inpainting, in which a pyramid strategy from a low-resolution image to a higher-resolution one is applied to repair the image. For handwritten images, Li et al. [72] proposed a method for inpainting and recognition of occluded characters, using an improved GoogLeNet and a deep convolutional generative adversarial network (DCGAN). In the image inpainting method named PEPSI [76], the authors unify the two-stage coarse-to-fine cascade into a single-stage encoder-decoder network; PEPSI++ [73] is its extended version. In [74], the authors used an encoder-decoder network and a multi-scale GAN for image inpainting; the same combination is used in [75] for image inpainting and image-to-image transformation. On RGB-D images, Dhamo et al. [77] used a CNN and a GAN model to generate the background of a scene by removing the foreground objects, as is done by many motion detection methods based on background subtraction [78, 79]. In order to complete missing regions, Vitoria et al. [80] proposed an improved version of the Wasserstein GAN incorporating dedicated discriminator and generator architectures. In the same context, but on sea surface temperature (SST) images, Dong et al. [81] proposed a DCGAN for filling the missing parts of the images. Lou et al. [82] exploit a modified GAN architecture for image inpainting, whereas Salem et al. [83] proposed a semantic image inpainting method using an adversarial loss and a self-learning encoder-decoder model. A good image restoration method must preserve structural consistency and texture clarity; for this reason, Liu et al. [84] proposed a GAN-based inpainting method for face images. FiNet [85] is another approach found in the literature, designed for fashion image inpainting, i.e., completing the missing parts of fashion images.

Recently, several approaches have combined additional techniques (GAN, CNN, ...) for inpainting images. Jiao et al. [86] combined an encoder-decoder, multi-layer convolutions, and a GAN for restoring images. The authors in [87] proposed a two-stage adversarial model named EdgeConnect, consisting of an edge generator followed by an image inpainting model: the first stage provides edge completion, and the second inpaints the RGB image. Observing that GAN-based inpainting models do not attend to the consistency of structural and textural values between the inpainted region and its neighborhood, the authors in [88] attempt to handle this limitation with a GAN model that learns the alignment between the blocks around the restored region and the original region. For the same reason as [88], taking into consideration the semantic consistency between restored and original images, Li et al. [89] provided a boosted GAN model comprising an inpainting network and a discriminative network: the inpainting network discovers the segmentation information of the input images, while the discriminative network enforces regularization of overall realness and segmentation consistency with the original images. In the same context of GAN-based inpainting, each work applies some prior processing on GAN networks to obtain the best inpainting results for different types of images, including medical images [90], face images [91], and scene images [92].

Fig. 4 Framework of GANs

GAN-based methods add considerably to the performance of image inpainting algorithms, but their training is slower and requires high-performance machines, owing to computational resource requirements including network parameters and convolution operations.

3 Image Inpainting Datasets

Image inpainting methods use well-known, large datasets to evaluate their algorithms and compare performance. The categories of images determine the effectiveness of each proposed method; these categories include natural images, artificial images, face images, and many others. In this work, we attempt to collect the most used datasets for image inpainting, including Paris StreetView [93], Places [94], a depth image dataset [20], Foreground-aware [95], Berkeley segmentation [96], ImageNet [97], and others. We also cite the types of data used, such as RGB images, RGB-D images, and SST images. Figure 5 shows example frames from the cited datasets, and Table 3 describes the various datasets used by image inpainting approaches.

Fig. 5 Examples from image inpainting datasets

Paris StreetView [93] is collected from Google StreetView and represents a large-scale dataset containing street images from several cities around the world. It comprises 15,000 images with a resolution of \(936\times 537\) pixels.

The Places dataset [94] is built for human visual cognition and visual understanding purposes. It contains many scene categories, such as bedrooms, streets, synagogues, and canyons. The dataset is composed of 10 million images, with 400+ images per scene category, allowing deep learning methods to train their architectures on large-scale data.

A depth image dataset is introduced by Xue et al. [20] for evaluating depth image inpainting methods. It is composed of two types of images, RGB-D images and grayscale depth images, and covers 14 scene categories such as Adirondack, Jade plant, Motorcycle, Piano, and Playtable. Masks for damaging the images are provided, including textual masks (text in the images) and random missing-region masks.

Table 3 Datasets description

The Foreground-aware dataset [95] differs from the other datasets: it contains masks that can be added to any image to damage it, and is known as an irregular hole mask dataset for image inpainting. It contains 100,000 masks with irregular holes for training and 10,000 masks for testing. Each mask is a \(256 \times 256\) grayscale image in which 255 indicates hole pixels and 0 indicates valid pixels. Because the masks can be applied to any image, they can be used to create a large dataset of damaged images.
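Since the masks follow a simple convention (255 = hole, 0 = valid), damaging an image with them reduces to a few array operations, as the sketch below shows; the function name and the zero-fill initialization are illustrative choices, not part of the dataset's specification.

```python
import numpy as np

def apply_mask(image, mask):
    """Damage an image with an irregular-hole mask (255 = hole, 0 = valid)."""
    hole = mask == 255
    damaged = image.copy()
    damaged[hole] = 0                 # zero out the hole pixels
    valid = (~hole).astype(np.uint8)  # 1 = known pixel, given to the network
    return damaged, valid
```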

The Berkeley segmentation database [96] is composed of 12,000 manually segmented images. The segmentations of images collected from other datasets were produced by 30 human subjects. The database combines RGB and grayscale images.

ImageNet [97] is a large-scale dataset with roughly 1000 images per synset. The current version contains more than 14,197,122 images, of which 1,034,908 are annotated with bounding boxes.

The USC-SIPI image database contains several volumes representing different types of images, with resolutions of \(256 \times 256\), \(512 \times 512\), or \(1024 \times 1024\) pixels. In total, the database contains about 300 images across four volumes: textures, aerials, miscellaneous, and sequences.

The CelebFaces Attributes dataset (CelebA) [98] is a well-known public dataset for face recognition. It contains more than 200,000 celebrity images representing 10,000 identities with large pose variations.

Indian Pines [99] consists of images of three scene types, agriculture, forest, and natural perennial vegetation, with a resolution of \(145 \times 145\) pixels.

The Microsoft COCO val2014 dataset [100] is an image recognition, segmentation, and captioning dataset. Microsoft COCO contains a total of 2.5 million labeled instances in 328,000 images.

The ICDAR 2013 dataset [101] is a handwriting dataset covering two languages, Arabic and English. Handwritten page images from 475 writers have been scanned, and the dataset contains 27 GB of data.

The SceneNet dataset [102] targets scene understanding tasks including semantic segmentation, object detection, and 3D reconstruction. It contains RGB images and the corresponding depth images (RGB-D), forming 5 million images in total.

The Stanford Cars dataset [103] is a set of car images representing 196 categories of cars of different sizes, containing 16,200 images in total.

The Cityscapes dataset [104] is a large-scale dataset of stereo videos of street scenes from 50 cities. The images cover about 30 object classes, and the dataset includes about 20,000 frames with coarse annotations.

The Middlebury Stereo datasets exist in several versions; we present the two most recent ones, [105] and [106]. Middlebury 2006 [105] is a grayscale depth dataset containing images captured from seven viewpoints under different illuminations and exposures. The images come in three resolutions: full size at \(1240 \times 1110\) pixels, half size at \(690 \times 555\) pixels, and a third size at \(413 \times 370\) pixels. Middlebury 2014 [106], unlike the earlier version, is an RGB-D dataset.

4 Evaluation and Discussion

Due to the unavailability of a large dataset of damaged images and the novelty of the image inpainting topic, researchers find it difficult to obtain datasets for training their methods [107]. Consequently, most researchers take existing datasets such as USC-SIPI, Paris StreetView, Places, and ImageNet and damage a set of their images for training their models and algorithms. The methods in the literature generate their own image inpainting data by adding artificial distortions, including noise [20], text [24], scratches [30], objects (shapes) [93], and masks [95, 97].

Table 4 Summary of sequential-based method evaluations

The evaluation metrics for image inpainting algorithms differ according to the technique used. In order to evaluate the efficiency of the proposed methods, researchers use measures including the mean squared error (MSE), the peak signal-to-noise ratio (PSNR), and the structural similarity index (SSIM) [108]. For example, Zeng et al. [30] used these metrics to demonstrate their results for repairing scratches and text in images, Mo et al. [45] used the same metrics for experiments on text and noise, and Duan et al. [23] used them to evaluate the removal of objects added to images. Beyond the metrics, the category of images in the evaluation dataset can also differ from one method to another: some methods use RGB images, while others evaluate on RGB-D or historical images. For that reason, we summarize the obtained results according to the category of images used and the type of damage in the images. Table 4 gives this summary for the sequential-based methods, which use the common evaluation metrics. From the table we can observe that most of these methods are evaluated on grayscale images, as in [16, 18, 19, 24, 25, 43, 45], with distortions including text, Gaussian noise (called random noise in the papers), and various objects. Some methods analyze all three types of distortion, as in [19], whereas others handle only two types (text and noise), as in [18, 43, 45]. The table also shows that the proposed methods use well-recognized computer vision images, such as Lena and Barbara, to test their effectiveness. Methods proposed for image inpainting on RGB images use the same distortion categories of text, noise, and objects [13, 15, 22, 23, 30]. In addition, some researchers proposed methods for scratch analysis, the process of restoring old images or images damaged by lines, as in [15] and [29]. As the tables show, state-of-the-art methods use different images from the internet or from various datasets, owing to the lack of dedicated image inpainting datasets.
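For reference, both metrics are available off the shelf; the sketch below computes them with scikit-image, assuming 8-bit grayscale arrays of identical shape (the helper name is illustrative).

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_inpainting(original, restored):
    """PSNR and SSIM between a ground-truth image and its inpainted version."""
    psnr = peak_signal_noise_ratio(original, restored, data_range=255)
    ssim = structural_similarity(original, restored, data_range=255)
    return psnr, ssim
```

Higher is better for both: PSNR measures pixel-level fidelity on a logarithmic scale, while SSIM compares local structure, which is why the two can disagree on visually plausible but pixel-inexact inpaintings.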

We summarize the results according to the evaluation metrics used in each paper. In some works, SNR or SSIM is used, as in [32], while others report no evaluation metrics at all, as in [34]. For that reason, in this paper we report the PSNR metric, which is used in the majority of related works [37, 38].

Table 5 Performance of CNN-based methods

With deep learning techniques, computer vision tasks can be performed with automatic learning of features, unlike with sequential-based methods. The learning is performed using convolutional neural networks (CNNs), which has made several computer vision tasks more robust and simplified the choice of features suitable for each task. For the CNN-based image inpainting methods described in the previous section, the effectiveness of each approach is related to the size and type of the data used and to the architecture implemented. The evaluation of these methods is the same as for sequential-based methods: PSNR (a pixel-level distance) and SSIM (a similarity between two images) are used to evaluate the robustness of repairing damaged images under different categories of distortion, including scratches, text, noise, and random regions (blocks) added to the image. Table 5 presents the CNN-based methods for image inpainting together with their performance evaluation, the datasets used, the type of distortion, the evaluation metrics, and the resolution of the training images. It is clear that the performance of such methods is related to the type of distortion: for example, images damaged by blocks are restored less accurately in terms of PSNR. The algorithms in [50, 52, 60, 63] can handle added visual motifs such as text or lines with good performance. The performance is also influenced by the percentage of noise added to the images. For the newer datasets used for image inpainting, including Paris StreetView, Places, and ImageNet, which contain large-scale data of diverse image types, the reported accuracy can be lower than that of approaches using other datasets [46, 55, 61, 62]; this difference in accuracy is related to the diversity of the images in these datasets.

Some proposed methods present their results with a description of the different parameters used in the training phase, which facilitates comparison. For example, the architectures in [66, 68, 69] use the same masks for damaging the images before training their models, and the obtained results depend on the area of the image damaged by the mask. All these methods succeed in inpainting the images with good quality when 10–20% of the image is masked; when the mask covers more than 40%, the performance decreases. For example, in [66] the PSNR value drops to 22.04 for 50–60% coverage, from 29.52 for 10–20%.

Each method performs either a visual (qualitative) evaluation or a metric-based (quantitative) evaluation. Quantitative evaluation using the PSNR and SSIM metrics is also performed for GAN-based image inpainting methods. In some cases these metrics do not imply that the qualitative results are better, because they assume the ground truth is unique [71]. Also, some image inpainting methods work better for certain categories of images and types of distortion. Table 6 lists a number of GAN-based methods with a description of the datasets and evaluation metrics used for each. In [75] the evaluation is performed with many metrics depending on the position of the damaged region (block): center, left, right, up, and down; here we present only the PSNR and SSIM of the inpainting results on images where the block is located in the center. In [73], two datasets are used with two types of distortion: blocks and free-form masks, a category of scratch painted with bold lines. For this example, we can see that the inpainting of scratches is more accurate than the repair of blocks. This is explained by the fact that a block occupies one contiguous region of the image, whereas scratches occupy small regions distributed across the image.

Also from Table 6, we can observe that all the cited methods can recover missing regions, with some differences in accuracy in terms of the PSNR and SSIM metrics. For example, the methods presented in [91] and [92] are very close in their PSNR values. Likewise, for the methods [73, 76] and [88], the PSNR values on the CelebA dataset are 25.6 and 25.56, respectively. This convergence of results arises from the use of the same technique (GAN) with only some differences in the models.

As mentioned above, the unavailability of dedicated image inpainting datasets makes comparison between these methods difficult; in addition, each author uses different masks and types of distortion.

Table 6 GAN-based performance results

4.1 Computational Time

Computational time represents a challenge for many computer vision tasks, especially real-time applications. With the rapid development of deep learning methods (i.e., from CNNs to GANs), training time, training speed, and inference time have become a concern for image/video processing methods. For image inpainting, which presents a new challenge in computer vision, computational time and related measures are not analyzed much by the state-of-the-art methods, with a few exceptions. The existing methods describe either the training time, the inference time, or the training speed for inpainting an image. In the following, each of the methods that considers timing is presented:

  • Running time In [51] the authors report the average running time per tested image at a resolution of 256 \(\times \) 256, which is 2 s per image. In [54] the proposed architecture takes 0.82 s per image, because the use of the CSA layer increases the computational time.

  • Training speed The training speed is the evaluation metric presented in [49] to describe the computational cost of training the proposed architecture, which was 7 fps (frames per second).

  • Inference time The inference time is presented in [47] for inpainting RGB-D images from different depth sensors, including ORB-SLAM and Kinect depth maps and LIDAR depth maps. For ORB-SLAM and Kinect depth maps, the inference time is about 30 ms at an image resolution of 147 \(\times \) 109, and about 200 ms at the full resolution of 640 \(\times \) 480. For the LIDAR depth map, the inference time is about 100 ms at a 608 \(\times \) 160 resolution. In the same context, inference takes 38 s to complete the image inpainting in [71].

  • Training time In some works, the authors state the time needed to train their model. For example, in [71] the training process takes 169 min on the CelebA dataset and 66 min on the Stanford Cars dataset when the architecture runs on an NVIDIA GTX 1080Ti GPU; the same architecture trained on a CPU (i5-7400, 3.00 GHz) takes about 42 h.

5 Conclusions

Image inpainting is an important task for computer vision applications due to the large amount of data modified using image editing tools. Its applications include wireless image coding and transmission, image quality enhancement, image restoration, and others. In this paper, a brief review of image inpainting has been presented. Different categories of approaches were covered, including sequential-based approaches (without learning), CNN-based approaches, and GAN-based approaches. We also collected the approaches that handle different types of distortion in images, such as text, added objects, scratches, and noise, as well as several categories of data, such as RGB, RGB-D, and historical images. A good alternative to conventional hand-crafted features is learned ones, e.g., deep learning, which generalizes better in more complicated scenarios; to be effective, however, these models need to be trained on large amounts of data. We therefore summarized the datasets used for training these models, and we tabulated, for each category of methods, the types of data, the datasets, and the metrics used by each approach.

To conclude, no single method can inpaint all types of distortion in images, but learning techniques provide promising results for each category of the analyzed cases.