
1 Introduction

As artistic entities, ancient Chinese mural paintings have rich historical and scientific value. However, owing to natural weathering and human destruction, ancient murals show considerable signs of deterioration. Digital inpainting of damaged murals avoids the irreversible defects of manual inpainting and improves efficiency. Therefore, the digital restoration of ancient murals has great practical significance for preserving cultural relics.

Digital mural inpainting methods fall into two categories: traditional methods and learning-based methods. Traditional methods are typically diffusion-based [2] or patch-based [1], and they are prone to matching errors, blurring, and structural disorder when applied to mural inpainting. Data-driven deep neural network models bring more possibilities. Ren et al. [9] proposed using generalized regression neural networks for the digital inpainting of Dunhuang murals. Cao et al. [3] proposed an enhanced-consistency generative adversarial network, which mainly addresses the inconsistency between globally and locally repaired images. Meanwhile, attention mechanisms are widely used in image inpainting. Yang et al. [10] proposed dilated multi-scale channel attention to perceive image information at different scales. He et al. [5] proposed a residual attention fusion block that enhances the utilization of useful information in the broken image and reduces the interference of redundant information. However, inpainting models based on supervised learning only focus on the damaged parts of murals and cannot address global color degradation. For this reason, we propose a mural inpainting model based on the three-domain translation method [7]. In addition, we embed an SVD block and a dense spatial attention with mask block into the model; the two branches improve the model's ability to restore murals. Specifically, the main contributions of this paper are as follows:

  • For the first time, we apply weakly supervised learning to the mural inpainting task, which can both restore the missing parts and improve the overall appearance of murals.

  • We design a dense spatial attention with mask block embedded in the mapping network, which further enhances the network's ability to capture long-distance mapping relationships among deep features. The SVD block effectively filters high-frequency information while retaining most of the structure and detail, further expanding the overlap between image feature spaces.

  • Experiments show that our model not only has an excellent ability to repair cracks, spots, and scratches but also improves the overall visual effect.

2 Proposed Approach

In this section, we first introduce the basic network architecture of the inpainting network. Then we describe the two effective blocks, i.e., the Singular Value Decomposition (SVD) and the Dense Spatial Attention with Mask (DSAWM).

2.1 Principle of Inpainting Network

The model formulates mural inpainting as an image transformation process. The training images consist of three parts: the set of real murals \(\mathcal {R}\), the synthetic set \(\mathcal {X}\), whose images suffer from artificial degradation, and the corresponding set of clean murals \(\mathcal {Y}\), which comprises images without degradation. \(r\in \mathcal {R}\), \(x\in \mathcal {X}\) and \(y\in \mathcal {Y}\) denote images from the three sets. x and y are paired by data synthesis, i.e., x is degraded from y. The artificial degradation forms include holes, blurs, scratches, and low resolution; a hedged sketch of such data synthesis is given below.
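To make the pairing concrete, the following is a minimal sketch of how a degraded x could be produced from a clean y. The specific hole sizes, blur radius, and downscale factor here are illustrative assumptions, not the paper's settings.

```python
# A hedged sketch of paired data synthesis: degrade a clean mural y into x
# with a hole, blur, and low resolution. All parameters are illustrative.
import numpy as np
from PIL import Image, ImageFilter

def degrade(clean: Image.Image, rng: np.random.Generator) -> Image.Image:
    x = clean.copy()
    w, h = x.size
    # Random rectangular hole (filled white) to mimic missing paint.
    hw, hh = int(rng.integers(w // 8, w // 4)), int(rng.integers(h // 8, h // 4))
    x0, y0 = int(rng.integers(0, w - hw)), int(rng.integers(0, h - hh))
    x.paste((255, 255, 255), (x0, y0, x0 + hw, y0 + hh))
    # Gaussian blur to mimic fading.
    x = x.filter(ImageFilter.GaussianBlur(radius=2))
    # Down- and up-sampling to mimic low resolution.
    x = x.resize((w // 2, h // 2), Image.BILINEAR).resize((w, h), Image.BILINEAR)
    return x
```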

First, \(r\in \mathcal {R}\), \(x\in \mathcal {X}\) and \(y\in \mathcal {Y}\) are mapped respectively to three deep spaces by \({ {E}}_\mathcal {R}:\mathcal {R}\rightarrow {\mathcal {Z}_\mathcal {R}}\), \({ {E}}_\mathcal {X}:\mathcal {X}\rightarrow {\mathcal {Z}_\mathcal {X}}\) and \({ {E}}_\mathcal {Y}:\mathcal {Y}\rightarrow {\mathcal {Z}_\mathcal {Y}}\). Since \(r\in \mathcal {R}\) and \(x\in \mathcal {X}\) are both corrupted, we align their spatial features into a shared latent space through an enforced alignment strategy. As shown in Fig. 1, we make the overlap of the two deep spaces (the part between the black dotted lines) as large as possible, so that \({\mathcal {Z}_\mathcal {R}}\approx {\mathcal {Z}_\mathcal {X}}\).

Then we learn the transformation from the spatial features of corrupted murals, \(\mathcal {Z}_\mathcal {X}\), to the spatial features of clean murals, \(\mathcal {Z}_\mathcal {Y}\), through the mapping \({ {T}}_\mathcal {Z}:\mathcal {Z}_\mathcal {X}\rightarrow {\mathcal {Z}_\mathcal {Y}}\), where \(\mathcal {Z}_\mathcal {Y}\) can be further reversed to y through the generator \({ {G}}_\mathcal {Y}:{\mathcal {Z}_\mathcal {Y}}\rightarrow {\mathcal {Y}}\). By learning this spatial transformation, a real ancient mural r can be restored by sequentially performing the mappings,

$$\begin{aligned} r_{\mathcal {R}\rightarrow {\mathcal {Y}}}= {G}_\mathcal {Y}\circ {T}_\mathcal {Z}\circ {E}_\mathcal {R}(r) \end{aligned}$$
(1)
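For clarity, the sequential inference of Eq. (1) can be sketched in PyTorch as follows. This is our own minimal illustration, not the authors' released code; `E_R`, `T_Z`, and `G_Y` stand for the trained encoder, mapping network, and generator.

```python
# A minimal sketch of Eq. (1): a real damaged mural r is encoded,
# translated in latent space, and decoded back to a clean mural.
import torch

@torch.no_grad()
def restore(r: torch.Tensor, E_R, T_Z, G_Y) -> torch.Tensor:
    z_r = E_R(r)     # map the corrupted mural into Z_R (aligned with Z_X)
    z_y = T_Z(z_r)   # translate corrupted-domain features into Z_Y
    return G_Y(z_y)  # decode the clean-mural features back to image space
```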
Fig. 1. Illustration of the principle of the inpainting network.

We use the network shown in Fig. 2 to implement the inpainting process. The model is trained in two stages. In the first stage, two \(\mathrm{{VAEs}}\) are trained to reconstruct their inputs via unsupervised learning, where \(\mathrm{{VAE}}_1\) takes r and x as input and \(\mathrm{{VAE}}_2\) takes y as input. In the second stage, the mapping network is trained with the weights of the \(\mathrm{{VAEs}}\) from the first stage frozen: the training input x first enters the encoder of \(\mathrm{{VAE}}_1\), then passes through the mapping network, and is finally decoded by \({G}_\mathcal {Y}\) of \(\mathrm{{VAE}}_2\). The loss function of \(\mathrm{{VAE}}_1\) is defined as

$$\begin{aligned} \min _{E_{\mathcal {R},\mathcal {X}},\,G_{\mathcal {R},\mathcal {X}}}\; \max _{D_{\mathcal {R},\mathcal {X}}}\; \mathcal {L}_{\mathrm {VAE}_1}(r) + \mathcal {L}_{\mathrm {VAE}_1}(x) + \mathcal {L}_{\mathrm {VAE}_1,\mathrm {GAN}}(r,x) \end{aligned}$$
(2)

where \(\mathcal {L}_{\mathrm {VAE}_1}(r)\) comprises the KL divergence, the \(\mathrm{{L}}_1\) distance loss, and the least-squares GAN loss (LSGAN) [6]. \(\mathcal {L}_{\mathrm {VAE}_1}(x)\) has the same form as \(\mathcal {L}_{\mathrm {VAE}_1}(r)\) and is not repeated. \(\mathcal {L}_{\mathrm {VAE}_1,\mathrm {GAN}}(r,x)\) denotes training an additional discriminator \(D_{\mathcal {R},\mathcal {X}}\) that differentiates \({\mathcal {Z}_\mathcal {R}}\) from \({\mathcal {Z}_\mathcal {X}}\). The loss function of the mapping network can be expressed as

$$\begin{aligned} \mathcal {L}_{\mathcal {T}}(x,y) = \lambda _1\mathcal {L}_{\mathcal {T},\mathrm {L}_1} + \mathcal {L}_{\mathcal {T},\mathrm {GAN}} + \lambda _2\mathcal {L}_{\mathrm {FM}} \end{aligned}$$
(3)

where \(\mathcal {L}_{\mathcal {T},\mathrm {L}_1}\), \(\mathcal {L}_{\mathcal {T},\mathrm {GAN}}\) and \(\mathcal {L}_{\mathrm {FM}}\) represent the \(\mathrm{{L}}_1\) distance loss, the least-squares GAN loss (LSGAN) [6] and the feature matching loss, respectively. A hedged sketch of this combined loss is given below.
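The following is a minimal sketch of Eq. (3) under our own assumptions (not the authors' released code): `disc_feats` is a hypothetical helper returning the discriminator's intermediate feature maps plus its final score map, and the weights \(\lambda _1, \lambda _2\) are left symbolic.

```python
# A hedged sketch of the mapping-network loss in Eq. (3).
# `disc_feats` (hypothetical) returns a list of discriminator feature
# maps whose last element is the score map used for the LSGAN term.
import torch
import torch.nn.functional as F

def mapping_loss(pred, target, disc_feats, lambda_1=1.0, lambda_2=1.0):
    l1 = F.l1_loss(pred, target)                  # L1 distance term
    fake = disc_feats(pred)
    real = disc_feats(target)
    lsgan = torch.mean((fake[-1] - 1.0) ** 2)     # least-squares GAN term
    fm = sum(F.l1_loss(f, r.detach())             # feature matching term
             for f, r in zip(fake[:-1], real[:-1]))
    return lambda_1 * l1 + lsgan + lambda_2 * fm
```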

Fig. 2. Architecture of our restoration network.

2.2 Singular Value Decomposition (SVD)

Visually, as shown in Fig. 1, the larger the overlap between the spatial features \({\mathcal {Z}_\mathcal {R}}\) and \({\mathcal {Z}_\mathcal {X}}\) (the part between the black dotted lines), the better. To achieve this goal, we add the SVD to the encoder of \(\mathrm{{VAE}}_1\). SVD generalizes eigenvalue decomposition to arbitrary matrices. Let \({A} \in {{M}_{m \times n}}\) be a matrix of rank r; the SVD of A is defined as

$$\begin{aligned} {A}_{m \times n} = {U}_{m \times m}\,{\varSigma }_{m \times n}\,{V}_{n \times n}^\mathrm{{T}} = {U}_{m \times m}\begin{pmatrix} {D}_{r \times r} &{} O\\ O &{} O \end{pmatrix}_{m \times n} {V}_{n \times n}^\mathrm{{T}} \end{aligned}$$
(4)

where the columns of \({U}_{m \times m}\) are the eigenvectors of \({A}{A}^\mathrm{{T}}\), the columns of \({V}_{n \times n}\) are the eigenvectors of \({A}^\mathrm{{T}}{A}\), \(\varSigma \) is an \({m \times n}\) matrix, and \({D}_{r \times r} = \mathrm {diag}\left( \sqrt{\lambda _1}, \sqrt{\lambda _2}, \ldots , \sqrt{\lambda _r}\right) \), where \({\lambda _1} \ge {\lambda _2} \ge \cdots \ge {\lambda _r} > 0\) are the non-zero eigenvalues of \({A}^\mathrm{{T}}{A}\). The SVD can therefore be written as

$$\begin{aligned} {A}_{m \times n} = {U}\varSigma {V}^\mathrm{{T}} = \sum _{i=1}^{r}\sqrt{\lambda _i}\,{u_i}v_i^\mathrm{{T}} = \sqrt{\lambda _1}{u_1}v_1^\mathrm{{T}} + \sqrt{\lambda _2}{u_2}v_2^\mathrm{{T}} + \cdots + \sqrt{\lambda _r}{u_r}v_r^\mathrm{{T}} \end{aligned}$$
(5)

Here, each column vector \(v_i\) of V is called a right singular vector of A, and each \(u_i\) of U a left singular vector. Since the singular values are ordered from largest to smallest, the top N singular values and their corresponding singular vectors already approximate the matrix well; a minimal sketch of this rank-N truncation is given below.
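As a concrete illustration (our own sketch using PyTorch's `torch.linalg.svd`, not an implementation from the paper), the rank-N approximation can be computed as:

```python
# Rank-N truncation via SVD: keep the top-n singular triplets, which
# preserves coarse structure while discarding high-frequency detail
# (cf. Fig. 3).
import torch

def svd_truncate(A: torch.Tensor, n: int) -> torch.Tensor:
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)
    return U[:, :n] @ torch.diag(S[:n]) @ Vh[:n, :]

img = torch.rand(256, 256)        # e.g., one channel of a mural image
approx40 = svd_truncate(img, 40)  # analogous to Fig. 3(c)
```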

Fig. 3. Effect of SVD: (a) ground truth, (b) composite image from the top 10 singular values, (c) from the top 40, (d) from the top 150.

Using SVD, the gap between the two spatial features at high frequencies is reduced to a certain extent, and their overlap is further expanded. Figure 3 shows the effect of the SVD.

2.3 Dense Spatial Attention with Mask

For the image inpainting task, ordinary spatial attention is not applicable because information in the masked region propagates from adjacent regions, which results in inaccurate attention scores after normalization. Considering this problem, we propose Spatial Attention with Mask (SAWM), as shown in Fig. 4. The difference between SAWM and ordinary spatial attention is that the mask is applied to the feature map before normalization, ensuring that damaged areas do not affect the attention scores; however, the output then lacks information in the masked region. To this end, we solve this problem through the multi-level fusion of DenseNet. The resulting mechanism is called Dense Spatial Attention with Mask (DSAWM). A hedged sketch of the masking step follows.
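The sketch below shows only the key idea, under our own assumptions: the per-position saliency (a channel mean) is illustrative, not the paper's exact score function.

```python
# A hedged sketch of SAWM's masking step: masked positions are excluded
# before normalization so damaged pixels cannot inflate attention scores.
import torch
import torch.nn.functional as F

def sawm(feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # feat: (B, C, H, W); mask: (B, 1, H, W) with 1 = valid, 0 = damaged.
    score = feat.mean(dim=1, keepdim=True)                # simple spatial saliency
    score = score.masked_fill(mask == 0, float('-inf'))   # drop damaged pixels
    b, _, h, w = score.shape
    attn = F.softmax(score.view(b, 1, -1), dim=-1).view(b, 1, h, w)
    return feat * attn                                    # reweight valid regions
```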

Fig. 4. Spatial Attention with Mask (SAWM).

Figure 5 shows the DSAWM. Given the input and its mask, three convolution kernels of sizes 1\(\times \)1, 3\(\times \)3, and 5\(\times \)5 are used to extract multi-scale features, and the input image and the multi-scale feature maps are densely connected. This process, with x as input, can be expressed as

$$\begin{aligned} y_1, y_2, y_3 =Conv_{1\times 1}(x), Conv_{3\times 3}(x), Conv_{5\times 5}(x) \end{aligned}$$
(6)
$$\begin{aligned} h_1=\textrm{SA}(x)\times {fc_3[cat(x,y_1)\times mask]} \end{aligned}$$
(7)
$$\begin{aligned} h_2=\textrm{SAWM}(h_1)\times {fc_4[cat(x,y_1, y_2)]} \end{aligned}$$
(8)
$$\begin{aligned} h_3=\textrm{SA}(h_2)\times {fc_5[cat(x,y_1,y_2,y_3)]} \end{aligned}$$
(9)
$$\begin{aligned} y=\textrm{SA}(h_3)\times x \end{aligned}$$
(10)
Fig. 5. Dense Spatial Attention with Mask (DSAWM).

where cat denotes channel concatenation, and \(fc_3\), \(fc_4\), and \(fc_5\) are \(1\times 1\) convolutions that reduce the number of channels to \(\frac{1}{2}\), \(\frac{1}{3}\) and \(\frac{1}{4}\) of the number of input channels, respectively. The mask is applied in SAWM when calculating the attention score, while SA calculates the ordinary spatial attention score. The final spatial attention score is multiplied with the input image to produce the output. In this way, the information in the unmasked region of the image is used multiple times, and the advantage of dense connectivity in preserving information is also incorporated. A minimal sketch of this data flow is given below.
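The following sketch mirrors Eqs. (6)–(10) under our own assumptions; it is an illustration, not the authors' released code. `sa` is an ordinary spatial attention module and `sawm` its masked variant (cf. the SAWM sketch above).

```python
# A minimal sketch of the DSAWM data flow in Eqs. (6)-(10).
# fc3-fc5 are the 1x1 convolutions shrinking the concatenated channels
# to 1/2, 1/3, and 1/4, as described in the text.
import torch
import torch.nn as nn

class DSAWM(nn.Module):
    def __init__(self, c: int):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, kernel_size=1)
        self.conv3 = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(c, c, kernel_size=5, padding=2)
        self.fc3 = nn.Conv2d(2 * c, c, kernel_size=1)  # 2c -> c (1/2)
        self.fc4 = nn.Conv2d(3 * c, c, kernel_size=1)  # 3c -> c (1/3)
        self.fc5 = nn.Conv2d(4 * c, c, kernel_size=1)  # 4c -> c (1/4)

    def forward(self, x, mask, sa, sawm):
        y1, y2, y3 = self.conv1(x), self.conv3(x), self.conv5(x)   # Eq. (6)
        h1 = sa(x) * self.fc3(torch.cat([x, y1], 1) * mask)        # Eq. (7)
        h2 = sawm(h1, mask) * self.fc4(torch.cat([x, y1, y2], 1))  # Eq. (8)
        h3 = sa(h2) * self.fc5(torch.cat([x, y1, y2, y3], 1))      # Eq. (9)
        return sa(h3) * x                                          # Eq. (10)
```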

3 Experiments

We select 1535 mural images to form \(\mathcal {Y}\), including 300 modern murals; \(\mathcal {R}\) includes 1632 real damaged murals. All images are resized to 256\(\times \)256 pixels. The learning rate is set to 0.0002 and the batch size to 15. We train and test the model with the PyTorch framework on an Intel Core i7-6850K 3.60 GHz CPU and an NVIDIA GeForce GTX 1080Ti GPU.

3.1 Artificial Destruction Murals Repair Comparison

Table 1. Comparison of PSNR and SSIM of inpainting results.

Eight murals are selected for inpainting experiments with artificially added masks: random masks are applied to murals 1–4, and the last four murals are masked with center holes. We compare our model with three state-of-the-art methods: Criminisi [4], RN [11], and DS-net [8]; RN and DS-net are trained in the same way as \(\mathrm{{VAE}}_2\) in our model. We use peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) to evaluate inpainting quality. Table 1 shows the quantitative results, and a sketch of the metric computation follows.
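For reference, both metrics can be computed with scikit-image's reference implementations, as in this short sketch (the paper's exact evaluation protocol is not specified here, so the input shapes are our assumption):

```python
# PSNR/SSIM evaluation sketch using scikit-image reference implementations.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(restored: np.ndarray, truth: np.ndarray):
    # Both inputs assumed to be uint8 RGB arrays of shape (256, 256, 3).
    psnr = peak_signal_noise_ratio(truth, restored, data_range=255)
    ssim = structural_similarity(truth, restored, channel_axis=-1, data_range=255)
    return psnr, ssim
```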

3.2 Experiment in Inpainting Real Damaged Murals

To further verify the effectiveness of our model in restoring real murals, eight damaged murals are selected for inpainting experiments. The experimental results are shown in Fig. 6.

Fig. 6. Qualitative comparisons on real damaged murals: (a) damaged murals, (b) input, (c) Criminisi, (d) RN, (e) DS-net, (f) ours.

The 1st image has long cracks in the background. The Criminisi algorithm suffers from line disruption and fuzzy extension of texture information, the RN result contains blurry areas, and DS-net shows almost no repair ability in the masked area. Our repair result is better: the inpainting marks are almost visually undetectable, and the structural consistency surpasses the other three comparative algorithms. The 2nd and 3rd mural images have facial cracks. The three comparative algorithms produce evident artifacts and texture blur, whereas our model achieves better coordination in line fitting and enhances the contrast of the repaired image. In the 4th mural image, mildew areas can be observed in the lower part. Both Criminisi and DS-net show varying degrees of structural disorder, and RN produces large blurry blocks; all three comparative algorithms leave obvious repair traces. By comparison, our method produces better continuity and a more reasonable restoration of the deteriorated areas. The 5th and 6th mural images contain many scratches. Criminisi, DS-net, and our model achieve visible restoration, whereas RN performs poorly. Although Criminisi and DS-net noticeably restore the scratches, they produce some inpainting errors and residual artifacts; for instance, in the 5th mural image, the Criminisi result has a matching error near the corner of the bodhisattva's eye. The 7th mural image has some color falling-off in the hair bun. Criminisi fails to repair it in this test, and RN and DS-net cannot restore areas whose color is consistent with the surrounding region. By comparison, the repaired areas of our proposed model are visually satisfactory and semantically reasonable. The 8th mural image looks somewhat blurry and has a color-inconsistent area on the bodhisattva's face. Criminisi can restore this color inconsistency, whereas RN and DS-net cannot produce satisfactory results. Our proposed model restores the mural successfully, and the overall appearance of our result is considerably clearer than that of the other three approaches.

3.3 Ablation Study

To verify the utility of the SVD and the DSAWM, the original three-domain translation method [7] is used as the baseline model, denoted “B”; “S” denotes the baseline with the SVD added; “D” denotes the baseline with the DSAWM added; and “F” denotes our full model. Figure 7 shows how the quantitative indices of “F” vary with the number of retained singular values.

Fig. 7. (a) PSNR, (b) SSIM, and (c) MSE as functions of the number of retained singular values.

Fig. 8. Qualitative comparisons of the inpainting networks: (a) ground truth, (b) input, (c)–(f) inpainting results of “B”, “S”, “D”, and “F”, respectively.

Fig. 9. Test (a) PSNR, (b) SSIM, and (c) MSE values of the inpainting networks.

In Fig. 7(a) and Fig. 7(c), the optimal PSNR and mean squared error (MSE) values are obtained when 112 singular values are retained; the suboptimal SSIM value is obtained at the same number in Fig. 7(b). Therefore, the number of retained singular values is set to 112 in all experiments.

The purpose of adding the SVD is to expand the overlap of \(\mathcal {Z}_\mathcal {R}\) and \(\mathcal {Z}_\mathcal {X}\) so that the network generalizes better. Images of building murals not included in the training dataset are selected for visual analysis. Figures 8 and 9 show the qualitative and quantitative evaluations of the ablation results, respectively. The comparison between Fig. 8(c) and Fig. 8(d) shows that the restoration result is clearer after adding the SVD. The comparison between Fig. 8(c) and Fig. 8(e) shows that adding the DSAWM module better captures color, structure, and detail information. The output of “F” combines these two advantages. From Fig. 9(a) and Fig. 9(c), adding the SVD and the DSAWM improves the PSNR value and reduces the MSE value, indicating that the two blocks improve restoration quality at both the pixel and perception levels.

4 Conclusion

This paper proposed a novel DSAWM block and added the SVD to the inpainting model. The DSAWM enhances the network's ability to capture long-distance mapping relationships among deep spatial features; the SVD decomposes and recombines images, effectively filtering high-frequency information while retaining most of the structure and detail, further expanding the overlap of image features. Experiments show that our model based on weakly supervised learning not only restores cracks, spots, and scratches well but also improves the overall visual effect. However, restoration of large damaged areas remains blurry; future work will address this by obtaining higher-quality datasets, designing reasonable image enhancement algorithms, and optimizing the network.