
1 Introduction

Recovering missing content in corrupted images is needed to improve visual aesthetics. Deep neural networks have advanced image inpainting by introducing semantic guidance to fill hole regions. Different from traditional methods  [2, 3, 7, 8] that propagate uncorrupted image contents to the hole regions via patch-based image matching, deep inpainting methods  [13, 25] utilize CNN features at different levels (i.e., from low-level features to high-level semantics) to produce more meaningful and globally consistent results.

Fig. 1.

Visual comparison on the Paris StreetView dataset  [6]. GT is the ground truth image. The proposed inpainting method effectively reduces the blur and artifacts within and around the hole regions that are caused by inconsistent structure and texture features.

The encoder-decoder architecture is prevalent in existing deep inpainting methods  [13, 19, 25, 38]. However, directly applying end-to-end training and prediction yields limited results. This is because the hole region is completely empty. Without sufficient image guidance, an encoder-decoder is not able to reconstruct the whole missing content. An alternative is to use two encoder-decoders to separately learn missing structures and textures in a step-by-step manner. These two-stage methods  [21, 24, 26, 27, 29, 40, 41] typically generate an intermediate image with recovered structures in the first stage (i.e., encoder-decoder), and send this image to the second stage for texture generation. Although structures and textures are produced in the output image, their appearances are not consistent. Figure 1 shows an example. The inconsistent structures and textures within the hole regions produce blur and artifacts, as shown in (b) and (c). Meanwhile, the recovered contents are not coherent with the uncorrupted contents around the hole boundaries (e.g., the leaves). This limitation arises because the CNN features representing structures and textures are learned independently. In practice, structures and textures correlate with each other to form the image content. Without considering their coherence, existing methods are not able to produce visually pleasing results.

In this work, we propose a mutual encoder-decoder to jointly learn CNN features representing structures and textures. The features from the deep layers of the encoder contain structure semantics, while the features from the shallow layers contain texture details. The hole regions of these two features are filled via two separate branches. In the CNN feature space, we use a multi-scale filling block within each branch for hole filling. Each block consists of 3 partial convolution streams with progressively increased kernel sizes. After hole filling in these two features, we propose a feature equalization method to make the structure and texture features consistent with each other. Meanwhile, the equalized features are coherent with the features of the uncorrupted image content around the hole boundaries. The proposed feature equalization consists of channel reweighing and bilateral propagation. We first concatenate the two features and perform channel reweighing via attention exploration  [12]. The attentions across the two features become consistent after channel equalization. Then, we propose a bilateral propagation activation function to equalize feature consistency over the whole feature maps. This activation function uses elements on the global feature maps to propagate channel consistency (i.e., feature coherence across the hole boundaries), while using elements within local neighboring regions to maintain channel similarities (i.e., feature consistency within the hole). In this way, we fuse the texture and structure features together to reduce inconsistency in the CNN feature maps. The equalized features then supplement the decoder features at all feature levels via encoder-decoder skip connections. The feature consistency is then reflected in the reconstructed output image, where the blur and artifacts around the hole regions are effectively removed, as shown in Fig. 1(d). Experiments on the benchmark datasets show that the proposed method performs favorably against state-of-the-art approaches.

We summarize the contributions of this work as follows:

  • We propose a mutual encoder-decoder network for image inpainting. The CNN features from shallow layers are learned to represent textures, while the features from deep layers are learned to represent structures.

  • We propose a feature equalization method to make structure and texture features consistent with each other. We first reweigh channels after feature concatenation and then propose a bilateral propagation activation function to make the whole feature map consistent.

  • Extensive experiments on the benchmark datasets show the effectiveness of the proposed inpainting method in removing blur and artifacts caused by inconsistent structure and texture features. The proposed method performs favorably against state-of-the-art inpainting approaches.

2 Related Works

Empirical Image Inpainting. The empirical image inpainting methods  [1, 3, 18] based on diffusion techniques propagate neighborhood appearances to the missing regions. However, they only consider the pixels surrounding the missing regions, so they can only handle small holes in background inpainting tasks and may fail to generate meaningful structures. In contrast, methods  [2, 4, 5, 28, 36] based on patch matching fill missing regions by transferring similar and relevant patches from the remaining image region to the hole region. Although these empirical methods perform well for small holes in background inpainting, they lack semantic guidance and are not able to generate semantically meaningful content when the hole region is large.

Deep Image Inpainting. Image inpainting based on deep learning typically involves a generative adversarial network  [9] to supplement visual perceptual guidance for hole filling. Pathak et al.  [25] first bring adversarial training  [9] to inpainting and demonstrate semantic hole filling. Iizuka et al.  [13] propose local and global discriminators, assisted by dilated convolution  [39], to improve the inpainting quality. Nazeri et al.  [24] propose EdgeConnect, which predicts salient edges for inpainting guidance. Song et al.  [29] utilize a segmentation prediction network to generate segmentation guidance for detail refinement around the hole region. Xiong et al.  [34] present foreground-aware inpainting, which involves three stages, i.e., contour detection, contour completion and image completion, for the disentanglement of structure inference and content hallucination. Ren et al.  [26] introduce a structure-aware network, which splits the inpainting task into two parts, structure reconstruction and texture generation, and uses appearance flow to sample features from contextual regions. Yan et al.  [37] model the relationship between the contextual regions in the encoder layer and the associated hole region in the decoder layer for better predictions. Yu et al.  [40] and Song et al.  [27] search for a collection of background patches with the highest similarity to the contents generated in the first-stage prediction. Liu et al.  [20] address this inpainting task by exploiting a partial convolutional layer and a mask-update operation. Following  [20], Yu et al.  [41] present gated convolution, which learns a dynamic mask-updating mechanism and is combined with an SN-PatchGAN discriminator to achieve better predictions. Liu et al.  [21] propose coherent semantic attention, which considers the feature coherency of hole regions to guarantee pixel continuity at the image level. Wang et al.  [32] propose a generative multi-column convolutional neural network (GMCNN) that uses varying receptive fields in different branches. Different from existing deep inpainting methods, our method produces CNN features that consistently represent structures and textures to reduce blur and artifacts around the hole region.

Fig. 2.

The overview of the proposed pipeline. We use a mutual encoder-decoder to jointly recover structures and textures during hole filling. The deep layer features of the encoder are reorganized as structure features, while the shallow layer features are reorganized as texture features. We fill holes at multiple scales within the CNN feature space and equalize the output features in both channel and spatial domains. The equalized features contain consistent structure and texture information at different CNN feature levels, and supplement the decoder via skip connections for output image generation.

3 Proposed Algorithm

Figure 2 shows the pipeline of the proposed method. We use one mutual encoder-decoder to jointly learn structure and texture features and equalize them for consistent representation. The details are presented in the following subsections.

3.1 Mutual Encoder-Decoder

We use an encoder-decoder for end-to-end image generation to fill holes. The structure of this encoder-decoder is a simplified generative network  [14], with 6 convolutional layers in the encoder and 5 convolutional layers in the decoder. Meanwhile, 4 residual blocks  [10] with dilated convolutions are placed between the encoder and decoder. The dilated convolutions  [13, 24] enlarge the receptive field over the encoder features.
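For illustration, one of these dilated residual blocks can be sketched in PyTorch as follows; the channel width, kernel size, and dilation rate are assumed values rather than the exact configuration of our network.

```python
import torch
import torch.nn as nn

class DilatedResBlock(nn.Module):
    """Residual block with dilated convolutions, as placed between the
    encoder and decoder (channel width and dilation rate are assumptions)."""
    def __init__(self, channels=256, dilation=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation))

    def forward(self, x):
        return x + self.body(x)

# four such blocks bridge the deepest encoder features and the decoder
bridge = nn.Sequential(*[DilatedResBlock() for _ in range(4)])
out = bridge(torch.randn(1, 256, 64, 64))  # spatial size is preserved
```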

In the encoder, we reorganize the CNN features from deep layers as structure features where the semantics reside. Meanwhile, we reorganize the CNN features from shallow layers as texture features to represent image details. We denote the structure features as \(F_{st}\) and the texture features as \(F_{te}\), as shown in Fig. 2. The reorganization process resizes and transforms the CNN feature maps from different convolutional layers to the same spatial size and concatenates them accordingly.
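As a rough illustration, the reorganization step can be sketched as follows; bilinear resizing and the layer shapes in the toy example are assumptions for illustration rather than our exact configuration.

```python
import torch
import torch.nn.functional as F

def reorganize(features, target_size):
    """Resize CNN feature maps from several layers to a common spatial size
    and concatenate them along the channel dimension."""
    resized = [F.interpolate(f, size=target_size, mode='bilinear',
                             align_corners=False) for f in features]
    return torch.cat(resized, dim=1)

# toy usage: three shallow-layer features reorganized into a texture feature F_te
shallow = [torch.randn(1, 64, 128, 128),
           torch.randn(1, 128, 64, 64),
           torch.randn(1, 256, 32, 32)]
F_te = reorganize(shallow, target_size=(32, 32))  # shape (1, 448, 32, 32)
```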

After CNN feature reorganization, we design two branches (i.e., the texture branch and the structure branch) to separately perform hole filling on \(F_{te}\) and \(F_{st}\). The architectures of these two branches are the same. In each branch, there are 3 parallel streams to fill holes at multiple scales. Each stream consists of 5 partial convolutions  [20] with the same kernel size, while the kernel size differs across streams. By using different kernel sizes, we perform multi-scale filling in each branch for the input CNN features. The filled features from the 3 streams (i.e., 3 scales) are concatenated and mapped to the same size as the input feature map via a \(1\times 1\) convolution. We denote the output of the structure branch as \(F_{fst}\) and the output of the texture branch as \(F_{fte}\). To ensure that the hole filling focuses on the textures and structures, we add supervision on \(F_{fst}\) and \(F_{fte}\). We use a \(1\times 1\) convolution to separately map \(F_{fst}\) and \(F_{fte}\) to color images \(I_{ost}\) and \(I_{ote}\), respectively. The pixel-wise \(L_1\) loss can be written as follows:

$$\begin{aligned} \begin{aligned} L_{rst} = \Vert I_{ost}-I_{st} \Vert _1\\ L_{rte} = \Vert I_{ote}-I_{gt} \Vert _1 \end{aligned} \end{aligned}$$
(1)

where \(I_{gt}\) is the ground truth image and \(I_{st}\) is the structure image of \(I_{gt}\). We use the edge-preserving smoothing method RTV  [35] to generate \(I_{st}\), following  [26].

The hole regions in \(F_{te}\) and \(F_{st}\) are filled via the texture and structure branches, respectively. However, the feature representations in \(F_{fte}\) and \(F_{fst}\) are not consistent with each other in reflecting the recovered textures and structures. This inconsistency leads to blur and artifacts within and around the hole regions, as shown in Fig. 1. To mitigate these effects, we first concatenate \(F_{fte}\) and \(F_{fst}\) and perform a simple fusion via a \(1\times 1\) convolutional layer to generate \(F_{sf}\). The texture and structure representations in \(F_{sf}\) are corrected via feature equalization at different CNN feature levels (i.e., from shallow to deep CNN layers).
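A minimal sketch of one filling branch and the subsequent fusion into \(F_{sf}\) is given below; plain convolutions stand in for the partial convolutions  [20], and the kernel sizes (3, 5, 7) and channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleFillingBranch(nn.Module):
    """Three parallel streams of five convolutions each, one kernel size per
    stream, fused back to the input channel count by a 1x1 convolution.
    Plain Conv2d is used here as a stand-in for partial convolution."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7), depth=5):
        super().__init__()
        self.streams = nn.ModuleList()
        for k in kernel_sizes:
            layers = []
            for _ in range(depth):
                layers += [nn.Conv2d(channels, channels, k, padding=k // 2),
                           nn.ReLU(inplace=True)]
            self.streams.append(nn.Sequential(*layers))
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([s(x) for s in self.streams], dim=1))

# toy usage: fill F_te and F_st separately, then fuse the outputs into F_sf
branch_te, branch_st = MultiScaleFillingBranch(64), MultiScaleFillingBranch(64)
fuse_sf = nn.Conv2d(128, 64, 1)
F_te, F_st = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
F_sf = fuse_sf(torch.cat([branch_te(F_te), branch_st(F_st)], dim=1))
```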

3.2 Feature Equalizations

We equalize the fused CNN features \(F_{sf}\) in both channel and spatial domains. The channel equalization follows the squeeze and excitation operation  [12] to ensure that the attentions within each channel of \(F_{sf}\) are the same. As the reweighed channels are influenced by both structure and texture representations in \(F_{sf}\), the consistent attentions indicate that these representations are set to be consistent as well. We propagate channel equalization to the spatial domain via the proposed bilateral propagation activation function (BPA).
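For reference, the channel reweighing step alone can be sketched with a standard squeeze-and-excitation block  [12]; the reduction ratio r = 16 is an assumed default rather than a value reported here.

```python
import torch
import torch.nn as nn

class ChannelEqualization(nn.Module):
    """Squeeze-and-excitation style channel reweighing applied to the fused
    feature F_sf (the reduction ratio r is an assumption)."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: global average pooling
        return x * w.view(b, c, 1, 1)     # excite: per-channel reweighing

# toy usage on the fused feature F_sf
F_sf = torch.randn(1, 64, 32, 32)
F_sf_eq = ChannelEqualization(64)(F_sf)
```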

Formulation. BPA is inspired by edge-preserving image smoothing  [30] and generates response values based on spatial and range distances. It can be written as follows:

$$\begin{aligned} y^s_i&=\frac{1}{C(x)} \sum _{j\in s} g_{\alpha _s} (\Vert j-i \Vert ) x_j \end{aligned}$$
(2)
$$\begin{aligned} y^r_i&=\frac{1}{C(x)} \sum _{j\in v} f(x_i,x_j) x_j \end{aligned}$$
(3)
$$\begin{aligned} y_i&=q (y^s_i, y^r_i) \end{aligned}$$
(4)

where \(x_i\) is the feature channel at position i of input feature x, \(x_j\) is a neighboring feature channel around i at position j, \(y^s_i\) and \(y^r_i\) are the feature channels after spatial and range similarity measurements. We set the normalization factor as \(C(x) = N\), where N is the number of positions in x. We use q to denote the concatenation and channel reduction of \(y^s_i\) and \(y^r_i\) via a \(1\times 1\) convolutional layer.

The bilateral propagation utilizes the distances of feature channels in both the spatial and range domains. We explore j within a neighboring region s, which is set to the same spatial size as the input feature for global propagation. The spatial contributions from neighboring feature channels are adjusted via a Gaussian function \(g_{\alpha _s}\). When computing \(y_i^r\), we measure the similarities between feature channels \(x_i\) and \(x_j\) via f(.) within a neighboring region v around i. The size of v is \(3\times 3\). As a result, the bilateral propagation considers both global continuity via \(y^s_i\) and local consistency via \(y^r_i\).

Fig. 3.

The pipeline of the bilateral propagation activation function. We denote the broadcast dot product operation as \(\otimes \), element-wise addition in the selected channel as \(\oplus \), and concatenation as \(\bigtriangleup \). For two matrices with different dimensions, the broadcast operations first expand the features along each dimension so that the dimensions of the two matrices match.

During the range similarity computation step, we define the pairwise function f(.) as a dot product operation, which can be written as follows:

$$\begin{aligned} f(x_i,x_j)= (x_i)^T(x_j). \end{aligned}$$
(5)

The proposed bilateral propagation is similar to the non-local block  [31] in that, for each i, \(\frac{1}{C(x)}f(x_i,x_j)\) becomes a softmax computation along dimension j. The difference resides in the region design of the propagation. The non-local block uses feature channels from all positions to generate \(y_i\), and the similarity is only measured between \(x_i\) and \(x_j\). In contrast, BPA considers both the feature channel similarity and the spatial distance between \(x_i\) and \(x_j\) during bilateral weight computation. In addition, we use a global region s to compute the spatial distance while using a local region v to compute the range distance. The advantage of this global and local region selection is that we ensure both long-term continuity over the whole spatial region and local consistency around the current feature channel. As a result, the boundaries of hole regions are unified with the neighboring image content, and the contents within the hole regions become consistent.

Implementations. Figure 3 shows how bilateral propagation operates in the network. The range step corresponds to the computation of \(y_i^r\) in Eq. 3, and the spatial step corresponds to \(y_i^s\) in Eq. 2. During range computation, the operations up to the element-wise multiplication P\(_1\) represent Eq. 5 at all spatial locations. We use the unfold function in PyTorch to reshape the features to vectors (i.e., \(HW\times 3\times 3\times C\)) to obtain all the neighboring \(x_j\) for each \(x_i\), so that we can perform efficient element-wise matrix multiplications. Similarly, the operations up to P\(_2\) represent the term \(\sum _j f(x_i,x_j)\cdot x_j\) in Eq. 3. During spatial computation, the operations up to P\(_3\) represent the term \(\sum _{j} g_{\alpha _s} (\Vert j-i \Vert ) x_j\). As a result, the bilateral propagation operation can be efficiently executed via the element-wise matrix multiplications and additions shown in Fig. 3.
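A minimal PyTorch sketch of BPA is given below. It follows Eqs. 2–5 directly rather than the broadcast layout of Fig. 3, and the Gaussian bandwidth (sigma in the code) is an assumed value.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BPA(nn.Module):
    """Bilateral propagation activation (sketch). The range step weights a
    3x3 neighborhood by dot-product similarity (Eqs. 3 and 5); the spatial
    step aggregates all positions with a Gaussian of the grid distance
    (Eq. 2); both are normalized by N = H*W and fused by a 1x1 convolution
    (Eq. 4). The bandwidth sigma is an assumed value."""
    def __init__(self, channels, sigma=4.0):
        super().__init__()
        self.sigma = sigma
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        # range step: gather 3x3 neighbors via unfold -> (B, C, 9, HW)
        nb = F.unfold(x, kernel_size=3, padding=1).view(b, c, 9, n)
        xi = x.view(b, c, 1, n)
        sim = (xi * nb).sum(dim=1, keepdim=True)           # f(x_i, x_j)
        y_r = (sim * nb).sum(dim=2).view(b, c, h, w) / n
        # spatial step: Gaussian of pairwise grid distances, (HW, HW)
        ys, xs = torch.meshgrid(torch.arange(h, device=x.device),
                                torch.arange(w, device=x.device), indexing='ij')
        pos = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()
        g = torch.exp(-torch.cdist(pos, pos) ** 2 / (2 * self.sigma ** 2)).to(x.dtype)
        y_s = torch.einsum('bcn,mn->bcm', x.view(b, c, n), g).view(b, c, h, w) / n
        return self.fuse(torch.cat([y_s, y_r], dim=1))

# toy usage
out = BPA(channels=64)(torch.randn(1, 64, 32, 32))  # shape (1, 64, 32, 32)
```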

3.3 Loss Functions

We introduce several loss functions to measure structure and texture differences during training, including a pixel reconstruction loss, a perceptual loss, a style loss, and a relativistic average LS adversarial loss  [16]. We also employ local and global discriminators to ensure local-global content consistency, and apply spectral normalization  [23] to both discriminators for stable training.

Pixel Reconstruction Loss. We measure the pixel-wise difference from two aspects. The first comprises the loss terms in Eq. 1, where we add supervision on the texture and structure branches. The second measures the similarity between the network output and the ground truth, which can be written as follows:

$$\begin{aligned} L_{re} = \Vert I_{out}-I_{gt} \Vert _1 \end{aligned}$$
(6)

where \(I_{out}\) is the final image predicted by the network.

Perceptual Loss. To capture high-level semantics and simulate the human perception of image quality, we utilize the perceptual loss  [15] \(L_{perc}\) defined on the ImageNet-pretrained VGG-16 feature backbone:

$$\begin{aligned} \begin{aligned} L_{perc}=\mathbb {E}\Big [\sum _i \frac{1}{N_i} \Vert \varPhi _{i}(I_{out})-\varPhi _{i} (I_{gt}) \Vert _1 \Big ]\end{aligned} \end{aligned}$$
(7)

where \( \varPhi _{i}\) is the activation map of the i-th selected layer of the VGG-16 backbone. In our work, \(\varPhi _{i}\) corresponds to the activation maps from layers ReLU1_1, ReLU2_1, ReLU3_1, ReLU4_1, and ReLU5_1.
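A sketch of this loss using torchvision is given below; the feature indices 1, 6, 11, 18, and 25 correspond to relu1_1 through relu5_1 in torchvision's VGG-16, and the inputs are assumed to be already ImageNet-normalized.

```python
import torch
import torch.nn as nn
from torchvision import models

class VGG16Features(nn.Module):
    """Extracts relu1_1 ... relu5_1 activations from an ImageNet-pretrained
    VGG-16 (indices 1, 6, 11, 18, 25 of torchvision's feature stack)."""
    def __init__(self):
        super().__init__()
        # older torchvision releases use models.vgg16(pretrained=True) instead
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.features = vgg.features.eval()
        for p in self.features.parameters():
            p.requires_grad = False
        self.ids = {1, 6, 11, 18, 25}

    def forward(self, x):
        feats = []
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.ids:
                feats.append(x)
        return feats

def perceptual_loss(extractor, i_out, i_gt):
    # Eq. 7: mean absolute difference of the selected activation maps
    return sum(torch.mean(torch.abs(a - b))
               for a, b in zip(extractor(i_out), extractor(i_gt)))
```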

Style Loss. The transposed convolutional layers in the decoder tend to introduce checkerboard artifacts. To mitigate this effect, we introduce the style loss. Given feature maps of size \(C_j \times H_j \times W_j\), we compute the style loss as follows:

$$\begin{aligned} \begin{aligned} L_{style}=\mathbb {E}_j \Big [\Vert G_j^\varPhi (I_{out})- G_j^\varPhi (I_{gt}) \Vert _1 \Big ]\end{aligned} \end{aligned}$$
(8)

where \( G_j^\varPhi \) is a \(C_j \times C_j\) Gram matrix constructed from the selected activation maps. These activation maps are the same as those used in the perceptual loss.
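A corresponding sketch of the style loss is shown below; the \(1/(C_j H_j W_j)\) normalization of the Gram matrix is an assumed convention, and the feature lists can be obtained with the extractor sketched for the perceptual loss.

```python
import torch

def gram_matrix(feat):
    """Gram matrix of a (B, C, H, W) activation map, returned as (B, C, C)."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_loss(feats_out, feats_gt):
    """Eq. 8 over lists of activation maps of the output and ground truth."""
    return sum(torch.mean(torch.abs(gram_matrix(a) - gram_matrix(b)))
               for a, b in zip(feats_out, feats_gt))
```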

Fig. 4.

Visualization of the feature map response. The input and output images are shown in (a) and (e), respectively. We use a \(1\times 1\) convolutional layer to map high dimensional feature maps to the color images as shown in (b)–(d) and (f)–(h).

Relativistic Average LS Adversarial Loss. We follow  [40] to utilize global and local discriminators for perception enhancement. The relativistic average LS adversarial loss is adopted for our discriminators. For the generator, the adversarial loss is defined as:

$$\begin{aligned} \begin{aligned} L_{adv}=-\mathbb {E}_{x_{r}}[{\text {log}}(1-D_{ra}(x_r,x_f))]-\mathbb {E}_{x_{f}}[{\text {log}}(D_{ra}(x_f,x_r))] \end{aligned} \end{aligned}$$
(9)

where \(D_{ra}(x_r,x_f)={\text {sigmoid}}(C(x_r)-\mathbb {E}_{x_{f}}[C(x_f)])\) and C(.) denotes the local or global discriminator without the final sigmoid function. Here, the real and fake data pairs \((x_r,x_f)\) are sampled from the ground-truth and output images, respectively.
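A sketch of the generator-side loss in Eq. 9, written in terms of the raw critic outputs \(C(x_r)\) and \(C(x_f)\), is given below; the small constant added inside the logarithms is our assumption for numerical stability.

```python
import torch

def relativistic_avg_g_loss(critic_real, critic_fake, eps=1e-8):
    """Generator adversarial loss of Eq. 9. critic_real / critic_fake are the
    raw outputs C(x_r) and C(x_f) of the (local or global) discriminator."""
    d_real = torch.sigmoid(critic_real - critic_fake.mean())  # D_ra(x_r, x_f)
    d_fake = torch.sigmoid(critic_fake - critic_real.mean())  # D_ra(x_f, x_r)
    return -(torch.log(1 - d_real + eps).mean() + torch.log(d_fake + eps).mean())
```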

Total Losses. The whole objective function of the proposed network can be written as:

$$\begin{aligned} \begin{gathered} L_{total} = \lambda _r L_{re}+ \lambda _pL_{perc}+\lambda _sL_{style}+\lambda _{adv}L_{adv}+\lambda _{st}L_{rst}+\lambda _{te}L_{rte} \end{gathered} \end{aligned}$$
(10)

where \(\lambda _r\), \(\lambda _p\), \(\lambda _s\), \(\lambda _{adv}\), \(\lambda _{st}\) and \(\lambda _{te}\) are the tradeoff parameters. In our implementation, we empirically set \(\lambda _r=1\), \(\lambda _p=0.1\), \(\lambda _s=250\), \(\lambda _{adv}=0.2\), \(\lambda _{st}=1\), \(\lambda _{te}=1\).
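For completeness, the weighted combination in Eq. 10 with these trade-off values can be sketched as follows.

```python
def total_loss(L_re, L_perc, L_style, L_adv, L_rst, L_rte):
    # Eq. 10 with the trade-off parameters reported above
    return (1.0 * L_re + 0.1 * L_perc + 250.0 * L_style
            + 0.2 * L_adv + 1.0 * L_rst + 1.0 * L_rte)
```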

3.4 Visualizations

We use a structure branch and a texture branch to separately fill holes in CNN feature space. Then, we perform feature equalization to enable consistent feature representations in different feature levels for output image reconstruction. In this section, we visualize the feature maps during different steps to show whether they correspond to our objectives. We use a \(1\times 1\) convolutional layer to map CNN feature maps to color images for a clear display.

Figure 4 shows the visualization results. The input image is shown in (a) with a mask in the center. The visualized \(F_{te}\) and \(F_{st}\) are shown in (b) and (f), respectively. We observe that textures are preserved in (b) while structures are preserved in (f). Through multi-scale hole filling, the hole regions in \(F_{fte}\) and \(F_{fst}\) are effectively reduced, as shown in (c) and (g). After equalization, the hole regions in (h) are effectively filled, and the equalized features contribute to the decoder to generate the output image shown in (e).

Fig. 5.

Visual evaluations for filling center holes. Our method performs favorably against existing approaches in retaining both structures and textures.

4 Experiments

We evaluate our method on three datasets: Paris StreetView  [6], Places2  [43] and CelebA  [22]. We follow the training, testing, and validation splits of these three datasets. Data augmentation such as flipping is also adopted during training. Our model is optimized with the Adam optimizer  [17] and a learning rate of \(2\times 10^{-4}\) on a single NVIDIA 2080 Ti GPU. The training processes of the CelebA, Paris StreetView, and Places2 models are stopped after 6, 30, and 60 epochs, respectively. All the masks and images for training and testing have a size of \(256\times 256\).

We compare our method with six state-of-the-art methods: CE  [25], CA  [40], SH  [37], CSA  [21], SF  [26] and GC  [41]. For a fair evaluation of model generalization abilities, we conduct experiments on filling center holes and irregular holes in the input images. The center hole is produced by a \(128\times 128\) mask covering the image center. We obtain irregular masks from PConv  [20]. These masks fall into different categories according to the ratio of the hole region to the entire image size (i.e., below 10%, from 10% to 20%, etc.). For holes in the image center, we compare with CA  [40], SH  [37] and CE  [25] on the CelebA  [22] validation set. We choose these three methods because they are more effective at filling holes in the image center than at filling irregular holes. When handling irregular holes in the input images, we compare with CSA  [21], SF  [26] and GC  [41] on the Paris StreetView  [6] and Places2  [43] validation datasets.

Fig. 6.

Visual evaluations for filling irregular holes. Our method performs favorably against existing approaches in retaining both structures and textures.

4.1 Visual Evaluations

The visual comparisons for filling center holes are shown in Fig. 5, and those for filling irregular holes are shown in Fig. 6. We also display the ground truth images in (f) to show the actual image content. In Fig. 5, the input images are shown in (a). The results produced by CE and CA contain distorted structures and blurry textures, as shown in (b) and (c). Although more visually pleasing contents are generated in (d), the semantics remain unreasonable. By utilizing consistent structure and texture features, our method effectively generates results with realistic textures.

Figure 6 shows the comparison for filling irregular holes, which is more challenging than filling center holes. The results from GC contain noisy patterns, as shown in (b). The details are missing and the structures are distorted in (c) and (d). These methods cannot recover image contents without introducing obvious artifacts (e.g., the door regions in the second row). In contrast, our method learns to represent structures and textures in a consistent form. The results shown in (e) indicate the effectiveness of our method in producing visually pleasing contents. The evaluations on filling both center and irregular holes indicate that our method performs favorably against existing hole filling approaches.

Table 1. Numerical evaluations on the CelebA dataset where the inputs contain center hole regions. \(\downarrow \) indicates lower is better while \(\uparrow \) indicates higher is better.
Table 2. Numerical comparisons on the Places2 dataset. \({\downarrow }\) indicates lower is better while \({\uparrow }\) indicates higher is better.

4.2 Numerical Evaluations

We conduct numerical evaluations on the Places2 dataset with different mask ratios. In addition, we evaluate numerically on the CelebA dataset with center holes in the input images. For Places2, 100 validation images from the “valley” scene category are chosen for evaluation. For CelebA, we randomly choose 500 images for evaluation. For the evaluation metrics, we follow  [26] to use SSIM  [33] and PSNR. Moreover, we report the FID metric  [11], as it indicates the perceptual quality of the results. The evaluation results are shown in Tables 1 and 2. Our method outperforms existing methods when filling center holes. Meanwhile, our method achieves favorable performance when filling irregular holes under various hole-to-image ratios.

Human Subject Evaluation. Following  [42], we involve over 35 volunteers to evaluate the results on the CelebA, Places2 and Paris StreetView datasets. The volunteers are all image experts with an image processing background. There are 20 questions for each subject. In each question, the subject needs to select the most realistic result from 4 results generated by different methods, without knowing the hole region in advance. We tally the votes and show the statistics in Table 3. Our method performs favorably against existing methods.

Table 3. Human Subject Evaluation results. Each subject selects the most realistic result without knowing hole regions in advance.
Table 4. Ablation study on the Paris StreetView dataset. Our performance is improved by using structure and texture branches.
Table 5. Ablation study on the Places2 dataset. Non-local aggregation improves our baseline while feature equalization makes further improvement.

5 Ablation Study

Structure and Texture Branches. To evaluate the effects of the structure and texture branches, we use each of these branches separately for network training. For fair comparisons, we expand the channel number of the texture or structure branch output via additional convolutions, so that the single-branch output has the same size as \(F_{sf}\). As shown in Fig. 7, the output of our method without a texture branch contains rich structure information (i.e., the window in the red and green boxes) while the textures are missing. In comparison, the output of our method without a structure branch does not contain meaningful structure (i.e., the window in the red and green boxes). By utilizing both branches, our method achieves favorable results on both structures and textures. Table 4 shows similar numerical results on the Paris StreetView dataset, where using both branches improves our method significantly.

Fig. 7.

Ablation studies on structure and texture branches. A joint utilization of these two branches improves the content quality.

Fig. 8.

Ablation studies on feature equalizations. More realistic and visually pleasing contents are generated via feature equalizations.

Feature Equalizations. We show the contributions of feature equalizations by removing them from the pipeline and observing the performance degradation. Moreover, we show that the bilateral propagation activation function (BPA) is more effective at filling hole regions than non-local attention  [31]. As shown in Fig. 8, without feature equalization our method generates visually unpleasant contents and visible artifacts. In comparison, the contents generated by  [31] are more natural. However, the recovered contents are still blurry and inconsistent because the non-local block ignores the local coherency and the global spatial distance of features. This limitation is effectively addressed by our feature equalizations. Similar results are shown numerically in Table 5, where our method achieves favorable performance.

6 Concluding Remarks

We propose a mutual encoder-decoder with feature equalizations to correlate filled structures with textures during image inpainting. The shallow and deep layer features are reorganized as texture and structure features, respectively. In the CNN feature space, we introduce a texture branch and a structure branch to fill holes at multiple scales and fuse the outputs via feature equalizations. During equalization, we first ensure consistent attentions among individual channels and then propagate them to the whole spatial feature map region via the proposed bilateral propagation activation function. Experiments on the benchmark datasets show the effectiveness of the proposed method compared to state-of-the-art approaches in filling both regular and irregular hole regions.