1 Introduction

Image inpainting aims to fill in the missing areas of damaged images as realistically as possible. These algorithms are usually used for image editing tasks, such as removing unwanted objects [2, 16] or repairing old photographs [22]. When the missing area of an image is large, less information is available; thus, increasing the usable information of the image is critical to the restoration task. Moreover, the generated portion of the image may not have the same style as the real image; consequently, restoration tasks require both local continuity and overall similarity.

Fig. 1.

Some inpainting results using the proposed framework on different datasets. (Top) Input images with missing areas, shown in white. (Bottom) Images repaired using MOPR-GAN (ours).

Most traditional image inpainting methods diffuse background data into the missing area using a differential operator [4]. Subsequently, with the explosive growth in data volume, patch-based inpainting methods were introduced, in which the algorithm searches several source images for the most similar patches to fill in the missing areas [10]. However, these methods perform poorly when restoring complex detailed textures in a missing area.

Recently, with the rapid development of deep learning technologies, deep convolutional networks have begun to demonstrate extraordinary capabilities in the field of image inpainting. These methods replace the missing content through continuous learning from existing data, thereby generating a coherent structure in the missing area, which is difficult to achieve using traditional methods. However, images generated by these methods are often blurred or contain artifacts, which is unsatisfactory in terms of visual quality.

To solve this challenging problem, in 2014, Goodfellow et al. proposed a generative adversarial network (GAN) [9], which is a network model composed of two deep convolutional networks called generator and discriminator. The generator uses data to generate images to deceive the discriminator, whereas the discriminator learns the difference between the real and generated images to efficiently identify the generated image, and returns an adversarial loss to improve the generator. Subsequently, several GAN-based models [7, 17] have been developed, many of which use encoder-decoder architectures as their generator.

Researchers have since made bolder attempts. Observing that one-shot global inpainting does not directly conform to the way the human brain works, they developed several frameworks for multi-stage inpainting [14, 20]. For example, the EdgeConnect framework proposed by Nazeri et al. [14] divides the task into two stages: edge prediction and edge-based repair. Although using edges as prior information for repairing the image works well, obtaining a good edge map is itself a difficult task. Almost all of these algorithms share similar shortcomings.

In this paper, we propose an image inpainting network called Multi-Stage Optimized Progressive Restoration with GAN (MOPR-GAN). The framework follows the basic principles of GANs and is divided into a generator and a discriminator. The generator consists of two parts: (1) a Progressive Inpainting Module (PIM) and (2) an Image Optimization Module (IOM). The PIM is responsible for progressively repairing the missing area and generating an attention map. The IOM refers to the attention map to optimize the details of the initially repaired image, so that high-quality images can be generated. The discriminator is a hybrid of a local discriminator (patchGAN) and a global discriminator (globalGAN). In contrast to the discriminator model proposed by Iizuka et al. [17], we add the attention map generated by the PIM and the IOM; consequently, the discriminator can identify the repaired area more efficiently.

We verified the performance of our model using three standard datasets, CelebA [21], Paris StreetView [5], and Places365 [3] (some results are shown in Fig. 1), and compared our method with several of the most advanced frameworks. The primary contributions of this paper are as follows:

  • We propose an Image Optimization Module (IOM), which uses a form of competition within each region to extract the distribution that best matches the real image.

  • An attention mechanism, called Global Adaptive Attention (GAA), is developed, which acts on the entire network. Under the joint constraints of structure and texture, it gradually updates the attention score following the progress of the network to obtain finer details, thereby enhancing the potential and efficiency of the network.

  • We propose a Multi-Stage Optimized Progressive Restoration GAN (MOPR-GAN) framework, which can generate images with better results in terms of details and overall performance; the framework has achieved good performance in both qualitative and quantitative analysis.

Moreover, we performed several ablation experiments to verify the effectiveness of these contributions.

2 Method

In this section, we first introduce the structure of each part of the proposed network framework. We then introduce the GAA scheme based on the multi-stage network. Finally, we explain the loss functions. The pipeline of our network model is shown in Fig. 2.

2.1 Generator

The generator comprises two modules: (1) the PIM, which is used for preliminary image restoration, and (2) the IOM, which is used for refining image details. Across the two training phases, the generator of the entire network takes two forms: (1) it contains only the PIM; (2) it contains both the PIM and the IOM. We introduce the specific training strategy in Sect. 3.1. Here, we explain the two modules in detail.

Fig. 2.

The overall architecture of the proposed network model is divided into two phases; in the diagram, the red line represents the first phase and the green line the second (the specific process is described in Sect. 3.1). The network inputs are images and masks; the Progressive Inpainting Module (PIM) generates pre-inpainting images recursively, the Image Optimization Module (IOM) optimizes details, and the local and global discriminators (top: local, bottom: global) enhance the network's repair potential. (Color figure online)

PIM. The design idea of the PIM is derived from RFR [13], which uses recursive reasoning to achieve a gradual inpainting process from the edge to the center of the missing area. The difference is that we modified the reasoning network structure and used our own attention mechanism. The details of this part are as follows:

Referring to the network architecture proposed by Johnson et al. [11], we built an encoder-decoder network as the reasoning network. The encoder is composed of three convolutions. The decoder is composed of three deconvolutions with strides of 1/2, 1/2, and 1. Three residual blocks, a GAA, and a convolution were added in the middle. The residual blocks are used to avoid vanishing gradients, and the GAA is used to calculate the current attention score and update the feature map. The GAA calculation process is described in Sect. 2.3. Furthermore, the role of the convolutional layer is to concatenate the reconstructed feature map with the input one to obtain the final feature.
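
For concreteness, the following is a minimal PyTorch-style sketch of such a reasoning network; the channel widths, kernel sizes, and GAA interface shown here are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Simple residual block used in the middle of the reasoning network."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class ReasoningNet(nn.Module):
    """Encoder-decoder reasoning network of the PIM (illustrative sketch)."""
    def __init__(self, in_ch=64, mid_ch=256, gaa=None):
        super().__init__()
        # Encoder: three convolutions (two of them downsample by 2).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, mid_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
        )
        # Middle: three residual blocks, GAA, and a fusion convolution.
        self.res_blocks = nn.Sequential(*[ResBlock(mid_ch) for _ in range(3)])
        self.gaa = gaa  # Global Adaptive Attention module (Sect. 2.3)
        self.fuse = nn.Conv2d(mid_ch * 2, mid_ch, 1)  # merges reconstructed and input features
        # Decoder: deconvolutions with strides 1/2, 1/2, and 1 (two 2x upsamplings).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(mid_ch, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, in_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, in_ch, 3, stride=1, padding=1),
        )

    def forward(self, feat, prev_score=None):
        x = self.res_blocks(self.encoder(feat))
        score = prev_score
        if self.gaa is not None:
            recon, score = self.gaa(x, prev_score)       # attention-reconstructed features + score
            x = self.fuse(torch.cat([recon, x], dim=1))  # concatenate with the input feature map
        return self.decoder(x), score
```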

In general, the PIM iterates multiple times until the feature map is completely filled. Thus, a good pre-inpainting image can be generated.
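
Continuing the sketch above, the recursive filling loop might look as follows; `update_known_region` is a hypothetical helper standing in for the mask-update step, not a function from any library.

```python
def pim_forward(reasoning_net, feat, mask, max_iters=6):
    """Repeatedly apply the reasoning network until the hole is filled (illustrative).
    mask == 1 marks known pixels and mask == 0 the missing area (an assumed convention)."""
    score = None
    for _ in range(max_iters):
        if mask.min() >= 1:                       # feature map completely filled
            break
        feat, score = reasoning_net(feat, prev_score=score)
        mask = update_known_region(mask)          # hypothetical helper: grow the known region
    return feat, score
```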

IOM. Because of the limitations of the partial progressive repair method, the preliminary repaired image inevitably exhibits a certain degree of local artifacts or chromatic aberration, especially at the center of the missing area, even though we adaptively mix the previous attention score when computing the new one. Therefore, we introduce the IOM to resolve this problem effectively and generate images that perform well both in detail and overall.

Fig. 3.

The main part of the IOM. It can improve the image details in the feature space.

Inspired by the multi-scale convolutional fusion block proposed by Yu et al. [19], the IOM of this paper is divided into three parts.

The first stage is the initial feature extraction of the image output by the PIM. Through three convolutional layers, it learns the basic feature information. We retain this feature map for later use.

The second stage comprises four MIE blocks. The MIE block is shown in Fig. 3; its main body is a multi-scale competitive convolution with a scale of 3. Multi-scale convolution gives the convolution calculations a wider field of view and captures longer dependencies, thereby improving the network's capability. Meanwhile, using Maxout prevents overfitting, provides a lightweight constraint, and selects the best feature value at the current position among multiple feature regions. Subsequently, we add a GAA and concatenate the resulting feature map with the input one through a convolution. The role of the GAA is the same as in the PIM: it calculates the attention score and updates the feature map. The GAA calculation process is described in Sect. 2.3.

Finally, we use a skip connection to concatenate the feature maps obtained after the multi-scale convolutions with the feature map retained in the first stage, which strengthens the consistency of the image structure and prevents the gradient from vanishing after optimization.
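
A compact sketch of one MIE block is given below, assuming three parallel convolutions with kernel sizes 3, 5, and 7 as the competing scales; these sizes and the channel widths are illustrative choices rather than the exact configuration.

```python
import torch
import torch.nn as nn

class MIEBlock(nn.Module):
    """Multi-scale competitive convolution block of the IOM (illustrative sketch)."""
    def __init__(self, ch, gaa=None):
        super().__init__()
        # Three competing branches with increasingly large receptive fields.
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, k, padding=k // 2) for k in (3, 5, 7)
        )
        self.gaa = gaa                           # Global Adaptive Attention (Sect. 2.3)
        self.fuse = nn.Conv2d(ch * 2, ch, 1)     # concatenates the result with the block input

    def forward(self, x, prev_score=None):
        # Maxout-style competition: keep the strongest response at each position.
        y = torch.stack([branch(x) for branch in self.branches]).max(dim=0).values
        score = prev_score
        if self.gaa is not None:
            y, score = self.gaa(y, prev_score)
        return self.fuse(torch.cat([y, x], dim=1)), score
```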

2.2 Discriminator

Iizuka et al. [17] proposed a multi-scale discriminator architecture including a local discriminator and global discriminator, which can enhance the detailed performance of the repaired area and improve the global consistency of the image. Therefore, the discriminator of our network also uses this scheme. The local discriminator uses four convolutional layers and one fully connected layer, whereas the global discriminator uses five convolutional layers and one fully connected layer. Spectral normalization and LeakyReLU [1] with a slope of 0.2 were added to each layer of the two discriminators. In addition, we also added the attention mechanism to the discriminator, which is mainly manifested in the loss function as shown in Sect. 2.4.
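
The two discriminators can be sketched as follows; the channel widths and the global average pooling before the fully connected layer are our assumptions, not stated details.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(in_ch, out_ch):
    """Spectral-normalized convolution followed by LeakyReLU with slope 0.2."""
    return nn.Sequential(
        spectral_norm(nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1)),
        nn.LeakyReLU(0.2, inplace=True),
    )

class LocalDiscriminator(nn.Module):
    """Four spectral-normalized convolutions and one fully connected layer."""
    def __init__(self, in_ch=3):
        super().__init__()
        widths = [in_ch, 64, 128, 256, 512]
        self.features = nn.Sequential(*[sn_conv(widths[i], widths[i + 1]) for i in range(4)])
        self.fc = nn.Linear(512, 1)

    def forward(self, x):
        return self.fc(self.features(x).mean(dim=(2, 3)))  # pool spatially, then classify

class GlobalDiscriminator(nn.Module):
    """Five spectral-normalized convolutions and one fully connected layer."""
    def __init__(self, in_ch=3):
        super().__init__()
        widths = [in_ch, 64, 128, 256, 512, 512]
        self.features = nn.Sequential(*[sn_conv(widths[i], widths[i + 1]) for i in range(5)])
        self.fc = nn.Linear(512, 1)

    def forward(self, x):
        return self.fc(self.features(x).mean(dim=(2, 3)))
```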

2.3 GAA

The attention mechanism can fill the missing area by exploiting similarity with the background texture. However, the traditional attention mechanism only calculates the attention score of the feature map and lacks direct supervision of the attention, so the learned information is unreliable. Conversely, the self-attention method proposed by Peng et al. [15] addresses this, but ignores the rich texture information. To solve this problem, we propose the GAA module, which first calculates the structural and texture attention scores, lets them constrain each other, and then performs adaptive accumulation.

Our attention mechanism is divided into two parts: attention calculation and attention transfer. First, in the attention calculation step, we compute the truncated distance similarity [18] \((\bar{d}^s)\) of the structure attention over 3 \({\times }\) 3 patches of the input structure feature, and the cosine similarity \((\bar{d}^i)\) of the texture attention.

$$\begin{aligned} \bar{d}^s_{(x,y,x',y')} = \tanh (-(\frac{d^s_{(x,y,x',y')} - v}{\sigma })), \end{aligned}$$
(1)

where \(d^s_{(x,y,x',y')}\) is the Euclidean distance between the patches at \((x, y)\) and \((x', y')\); \(v\) and \(\sigma \) are the mean value and standard deviation of \(d^s_{(x,y,x',y')}\), respectively.

$$\begin{aligned} \bar{d}^i_{(n,m,n',m')} = \frac{\sum _{i,j\in (-k,...,k)} d^i_{(n+i,m+j,n',m')}}{k^2} \end{aligned}$$
(2)

where \(d^i_{(n,m,n',m')}\) is the cosine similarity between the feature pixels at \((n, m)\) and \((n', m')\).

Then, we used the softmax function to generate structure and texture attention score maps respectively, which are referred to as \(score^s\) and \(score^i\).

$$\begin{aligned} score^s_{(x,y,x',y')} = softmax(\lambda \bar{d}^s_{(x,y,x',y')}) \end{aligned}$$
(3)
$$\begin{aligned} score^i_{(n,m,n',m')} = softmax(\lambda \bar{d}^i_{(n,m,n',m')}) \end{aligned}$$
(4)

where \(\lambda \) is set to 50. After that, we used \(score^s\) as a constraint to adjust the value of \(score^i\). The calculation process is as follows:

$$\begin{aligned} score'_{(n,m,n',m')} = softmax(2score^i_{(n,m,n',m')}score^s_{(x,y,x',y')}) \end{aligned}$$
(5)

where \(pixel(n,m) \in patch(x,y)\) and \(pixel(n',m') \in patch(x',y')\). At this point, we can calculate the final attention map. Let \(\bar{score}^{i-1}_{(n,m,n',m')}\) denote the attention score computed at the previous iteration, and let \(\lambda \) here be a learnable blending parameter (distinct from the constant \(\lambda \) in Eqs. 3 and 4); the final attention map \(score_{(n,m,n',m')}\) is then:

$$\begin{aligned} score_{(n,m,n',m')} = \left\{ \begin{array}{ll} \lambda \, score'_{(n,m,n',m')} + (1-\lambda )\,\bar{score}^{i-1}_{(n,m,n',m')}, &{} \text {if }\ \exists \,\bar{score}^{i-1}\\ score'_{(n,m,n',m')}, &{} \text {otherwise} \end{array} \right. \end{aligned}$$
(6)

In particular, \(\bar{score}^{i-1}_{(n,m,n',m')}\) in the first MIE block is equal to the attention score generated by the last iteration of the PIM.

The next step is attention transfer. We use the score map to reconstruct the feature map. If \(f\) represents the input feature and \(f'\) represents the reconstructed feature, the formula is

$$\begin{aligned} f'_{(n,m)} = \sum _{n'\in \{1,...,W\},\, m'\in \{1,...,H\}} score_{(n,m,n',m')}f_{(n',m')} \end{aligned}$$
(7)

Finally, we retain the calculated attention score for use in the next iteration.
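
A simplified sketch of the score computation and attention transfer follows. It operates on flattened feature pixels, takes a pre-computed structure score as input, and omits the patch extraction of Eq. 1 and the neighborhood averaging of Eq. 2; the parameter `alpha` stands in for the learnable blending weight of Eq. 6. These simplifications are our assumptions.

```python
import torch
import torch.nn.functional as F

def gaa_scores(texture_feat, structure_score, prev_score=None, lam=50.0, alpha=0.5):
    """Simplified GAA score computation (cf. Eqs. 2-6).
    texture_feat: (C, H, W) feature map; structure_score: (H*W, H*W) pixel-pair scores
    derived from the structure attention; prev_score: score kept from the last iteration."""
    C, H, W = texture_feat.shape
    pixels = F.normalize(texture_feat.reshape(C, H * W).t(), dim=1)   # (H*W, C)
    d_i = pixels @ pixels.t()                                         # pairwise cosine similarity
    score_i = F.softmax(lam * d_i, dim=1)                             # texture score (Eq. 4)
    score = F.softmax(2 * score_i * structure_score, dim=1)           # structure constraint (Eq. 5)
    if prev_score is not None:                                        # adaptive accumulation (Eq. 6)
        score = alpha * score + (1 - alpha) * prev_score
    return score

def gaa_transfer(texture_feat, score):
    """Attention transfer (Eq. 7): reconstruct each pixel as a weighted sum of all pixels."""
    C, H, W = texture_feat.shape
    recon = texture_feat.reshape(C, H * W) @ score.t()                # (C, H*W)
    return recon.reshape(C, H, W)
```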

2.4 Loss Function

Because our network is a GAN-based model, the loss function is divided into two parts: the discriminator loss and the generator loss. We first compute the discriminator loss. Because the attention mechanism is added to the discriminator to increase its accuracy in identifying the authenticity of an image, this loss is composed of adversarial and attention terms. If D is the discriminator, G is the generator, A is the attention map, \(I_R\) represents the real image, and \(I_G\) represents the generated image, then the discriminator loss \(L_D\) is calculated as follows:

$$\begin{aligned} L_{attention} = L_{MSE}(D(I_R), 0) + L_{MSE}(D(I_G), A); \end{aligned}$$
(8)
$$\begin{aligned} L_D = -log(D(I_R)) - log(1 - D(I_G)) + \lambda _{att} L_{attention}, \end{aligned}$$
(9)

where \(\lambda _{att} = 0.2\) in our experiments. Meanwhile, the attention map A will be explained in Sect. 3.1.
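
A sketch of Eqs. 8 and 9 in PyTorch terms is given below; it assumes the discriminator output is a probability map with the same spatial shape as the attention map, which is our reading of the text rather than a stated detail.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake, attention_map, lambda_att=0.2, eps=1e-8):
    """Adversarial + attention loss of the discriminator (cf. Eqs. 8-9)."""
    # Eq. 8: real images have no repaired region (target 0); generated images are
    # matched against the attention map of the repaired area.
    l_attention = F.mse_loss(d_real, torch.zeros_like(d_real)) + F.mse_loss(d_fake, attention_map)
    # Standard GAN terms of Eq. 9.
    l_adv = -torch.log(d_real + eps).mean() - torch.log(1.0 - d_fake + eps).mean()
    return l_adv + lambda_att * l_attention
```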

Following the idea in [14], our generator is trained using a joint loss comprising l1, adversarial, perceptual, and style losses. The l1 loss is normalized by the mask size, whereas the adversarial loss is provided by the mapping output of the generated image in the discriminator, which is part of \(L_D\):

$$\begin{aligned} L_{adv} = log(1 - D(I_G)). \end{aligned}$$
(10)

The perceptual loss \(L_{per}\) and style loss \(L_{style}\) are two loss functions that were proposed by Johnson et al. [11]. Finally, the overall loss function of our generator is

$$\begin{aligned} L_G = \lambda _{l_1} L_{l_1} + \lambda _{adv} L_{adv} + \lambda _{per} L_{per} + \lambda _{style} L_{style}. \end{aligned}$$
(11)

We finally set \(\lambda _{l_1} = 1\), \(\lambda _{adv} = 0.01\), \(\lambda _{per} = 0.2\), and \(\lambda _{style} = 200\). The generator’s loss function combination in our model is similar to [14] and has been shown to be effective.
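
The joint generator loss of Eq. 11 can be sketched as follows; `perceptual_fn` and `style_fn` stand in for VGG-based perceptual and style losses [11], and the convention that the mask marks the missing region is an assumption.

```python
import torch

def generator_loss(pred, target, mask, d_fake, perceptual_fn, style_fn,
                   w_l1=1.0, w_adv=0.01, w_per=0.2, w_style=200.0, eps=1e-8):
    """Joint generator loss (cf. Eqs. 10-11)."""
    l1 = torch.abs(pred - target).sum() / (mask.sum() + eps)     # l1 normalized by the mask size
    l_adv = torch.log(1.0 - d_fake + eps).mean()                 # Eq. 10
    return (w_l1 * l1 + w_adv * l_adv
            + w_per * perceptual_fn(pred, target)
            + w_style * style_fn(pred, target))
```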

3 Experiments

In this section, we explain the relevant settings and strategies in detail to facilitate reproduction of the network.

3.1 Training Setting and Strategy

We trained our model with a batch size of six. As this is a GAN-based architecture, both a generator and a discriminator must be updated; we therefore used the Adam optimizer for each. The overall training strategy is divided into two steps: (1) only the PIM is used as the generator until it converges; in this case, the attention map in the discriminator loss is the one generated by the last iteration of the PIM. (2) The IOM is added, and the PIM and IOM work together as the generator; the attention map then becomes the one generated by the last MIE block of the IOM. During each step, we used learning rates of \(2e^{-4}\) and \(2e^{-5}\) to train the generator and discriminator, respectively. We then used \(5e^{-5}\) to fine-tune the generator and \(5e^{-6}\) for the discriminator. During fine-tuning, we did not want to relearn all other network parameters; therefore, we froze all the batch normalization layers of our generator. It is worth noting that we do not unfreeze the PIM's batch normalization layers during the second step. All experiments were performed using Python 3.7 on an Ubuntu 20.04 system with an 11 GB NVIDIA GeForce RTX 2080 GPU and an Intel Xeon E5-1650 v4 3.60 GHz CPU.
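
The optimizer setup and batch-normalization freezing described above can be sketched as follows; `build_generator` and `build_discriminator` are hypothetical constructors used only for illustration.

```python
import torch.nn as nn
from torch import optim

def freeze_batchnorm(model):
    """Freeze every batch-normalization layer of the model for fine-tuning."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()                                  # stop updating running statistics
            for p in m.parameters():
                p.requires_grad_(False)               # stop updating affine parameters

generator, discriminator = build_generator(), build_discriminator()   # hypothetical constructors

# Step training: learning rates 2e-4 (generator) and 2e-5 (discriminator).
opt_g = optim.Adam(generator.parameters(), lr=2e-4)
opt_d = optim.Adam(discriminator.parameters(), lr=2e-5)

# Fine-tuning: lower learning rates and frozen batch-normalization layers.
freeze_batchnorm(generator)
opt_g = optim.Adam([p for p in generator.parameters() if p.requires_grad], lr=5e-5)
opt_d = optim.Adam(discriminator.parameters(), lr=5e-6)
```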

3.2 Datasets

We used the CelebA [21], Paris StreetView [5], and Places365 [3] datasets to verify our model. The irregular masks are automatically generated by scripts.
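
As an illustration of such a script (not necessarily the one used here), a simple random-stroke mask generator could look like this:

```python
import numpy as np
import cv2

def random_irregular_mask(h=256, w=256, max_strokes=8, rng=None):
    """Generate a binary irregular mask; 1 marks the missing region (assumed convention)."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((h, w), dtype=np.uint8)
    for _ in range(int(rng.integers(1, max_strokes + 1))):
        x1, y1 = int(rng.integers(0, w)), int(rng.integers(0, h))
        x2, y2 = int(rng.integers(0, w)), int(rng.integers(0, h))
        thickness = int(rng.integers(10, 40))
        cv2.line(mask, (x1, y1), (x2, y2), 1, thickness)   # draw one random stroke
    return mask
```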

3.3 Comparison Models

We compared our experimental results with those of several state-of-the-art methods, both qualitatively and quantitatively. These methods include CA [12], GLCIC [17], PIC [6], EC (EdgeConnect) [14], FE [8], and RFR [13].

4 Results

We conducted experiments on the three datasets and compared the results with the methods mentioned in the previous section, both qualitatively and quantitatively. Moreover, we conducted ablation tests to verify the necessity of the proposed modules.

Fig. 4.

Results on Places365 [3]

Fig. 5.

Results on CelebA [21]

Fig. 6.

Results on Paris StreetView [5]

4.1 Qualitative Comparison

Figures 4, 5, and 6 visualize our approach compared with four state-of-the-art methods on the three datasets. Our model shows excellent visual quality, and the advantage becomes more apparent as the missing area grows larger. This demonstrates the superiority of our network.

Table 1. Quantitative results over three standard datasets with six models: Contextual Attention (CA) [12], Globally and Locally Consistent Image Completion (GLCIC) [17], Pluralistic Image Completion (PIC) [6], EdgeConnect (EC) [14], Recurrent Feature Reasoning (RFR) [13], and MOPR-GAN (ours). The best result of each group is bolded. \(^\star \)Higher is better. \(^\dagger \)Lower is better.

4.2 Quantitative Comparisons

We also performed quantitative comparisons using three metrics: 1) structural similarity index (SSIM), 2) peak signal-to-noise ratio (PSNR), and 3) mean l1 loss, to evaluate our model against the other methods. Table 1 lists the results of the six methods under different irregular mask ratios on the three standard datasets. It can be observed from the table that our method achieves superior results under different irregular mask ratios on the Places365, CelebA, and Paris StreetView datasets in most cases, particularly for large holes. The results of CA [12] and RFR [13] in the table were taken from their papers, and the remaining results were obtained using the pre-trained models provided by their authors.
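
For reference, the three metrics can be computed per image pair with scikit-image and NumPy; the following is a sketch, assuming images scaled to [0, 1] with the channel as the last axis.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred, target):
    """Return PSNR, SSIM, and mean l1 for one predicted/ground-truth image pair."""
    psnr = peak_signal_noise_ratio(target, pred, data_range=1.0)
    ssim = structural_similarity(target, pred, data_range=1.0, channel_axis=-1)
    mean_l1 = float(np.abs(pred - target).mean())
    return psnr, ssim, mean_l1
```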

4.3 Ablation Studies

The preceding content illustrates the effectiveness of the overall architecture of our model. In this section, we verify the validity of our two proposed contributions, the IOM and the GAA.

Fig. 7.

Qualitative comparison of the two configurations on the Paris StreetView dataset: (1) without IOM; (2) with IOM.

Table 2. Quantitative comparison of the two configurations on the Paris StreetView dataset with different irregular mask ratios: (1) without IOM; (2) with IOM.

Capabilities of the IOM. To demonstrate the function of the IOM, we compared the network's repair quality with and without this module. It can be observed from Fig. 7 that, although an almost complete image can be repaired without the IOM, it lacks detail; adding the IOM significantly improves the local quality of the image. From the perspective of a quantitative comparison, as shown in Table 2, we tested the two configurations with different sizes of irregular masks and found that the IOM improves the network performance significantly. Furthermore, the larger the missing area, the more pronounced the effect.

Fig. 8.

Comparison results with different attention mechanisms: (1) Masked image; (2) Traditional attention; (3) Existing progressive attention; (4) GAA (ours).

Capabilities of the GAA. As mentioned in Sect. 2.3, our GAA module is a progressive attention mechanism that places equal emphasis on structure and texture. We compared it with other existing attention mechanisms, and the results are shown in Fig. 8. From these, we can see that the progressive accumulation of the attention scores better preserves global consistency, which makes the result more convincing to the eye, particularly for large holes. Further, allowing the structure to constrain the details (as in Eq. 5) effectively prevents local artifacts and generates a more realistic image.

5 Conclusion

In this paper, we built a new GAN-based image inpainting framework (MOPR-GAN), which first repairs the missing areas step by step from the outside to the inside and then uses the proposed IOM to correct the details of the generated pre-repaired images to obtain more accurate results. Moreover, we proposed a GAA module that acts on the entire network. Through qualitative and quantitative analyses on three standard datasets, as well as several ablation experiments, the superiority of our network was demonstrated.