
1 Introduction

In this chapter, we propose a deep learning-based filtering pipeline to address the issues in 3D reconstruction of heritage sites using internet-sourced images. Over time and with climatic changes, the texture, shape and colour of monuments fade and lose information, and monuments are ruined by attacks during war and by natural calamities. To preserve our cultural heritage and pass it on to the next generation, there is a need to store all such information in digital form, and one effective way of storing information on heritage sites is through 3D models. 3D reconstruction from images collected from the internet is challenging because the images are captured under varying conditions and with different sensors. How detailed and informative the reconstructed 3D model is depends largely on the images chosen [1]. The input images to the reconstruction algorithm may contain artefacts such as occlusion and shadow; these artefacts influence the 3D reconstruction and distort the shape and texture of the reconstructed 3D models.

Most works in the literature on 3D reconstruction pre-process the input data to obtain better reconstructions [2,3,4]. The authors in [2] propose enhancing images before 3D reconstruction using a tone-mapping approach with Contrast Limited Adaptive Histogram Equalization (CLAHE), which adaptively amplifies local contrast while preventing amplification of local noise. The authors in [3] propose colour balancing, denoising and enhancement of raw images using adaptive median filters before 3D reconstruction. The authors in [4] apply Semi-Global Matching (SGM) as the image matching technique on Unmanned Aerial Vehicle (UAV) images to generate a dense point cloud. These methods face challenges when applied to images with text, blur, occlusion and shadow. We propose the following to address these challenges:

  • We propose a deep-learning pipeline for filtering internet-sourced images towards better 3D reconstruction.

  • We propose to prune internet-sourced images, eliminating those with text, blur, occlusion and shadow.

  • We propose to select a suitable set of images for a given query image by combining mean-shift and hierarchical clustering algorithms towards 3D reconstruction.

  • We demonstrate our results using internet-sourced images and compare with existing reconstruction methods.

In Sect. 2, we discuss the proposed pipeline for filtering of images towards 3D reconstruction. In Sect. 3, we demonstrate the results of the proposed pipeline and its effect on 3D reconstruction and conclude in Sect. 4.

2 Filtering of Images Towards 3D Reconstruction

In this section, we discuss the proposed learning-based pipeline for filtering of internet-sourced images towards 3D reconstruction. The proposed pipeline includes pruning, selection of images and 3D reconstruction modules as shown in Fig. 1. Internet-sourced images with blur, shadow, text and occlusion are eliminated during the pruning process. A subset of filtered images is further selected for 3D reconstruction.

Fig. 1 Pipeline for filtering of images towards 3D reconstruction
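As a high-level sketch, the stages of Fig. 1 can be composed as below; every function name is a hypothetical placeholder for a module detailed in the following subsections.

```python
# High-level sketch of the filtering pipeline in Fig. 1.
# All function names are hypothetical placeholders for the modules
# described in Sects. 2.1 and 2.2.

def filter_images(images, query_image):
    """Prune artefact images, then select a subset for 3D reconstruction."""
    images = [im for im in images if not contains_text(im)]          # Tesseract OCR
    images = [im for im in images if not is_blurred(im)]             # stacked autoencoder
    images = [im for im in images if occlusion_percent(im) <= 20.0]  # YOLO bounding boxes
    images = [im for im in images if not has_shadow(im)]             # convolutional network
    return select_for_reconstruction(images, query_image)           # clustering + CBIR
```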

2.1 Pruning of Images

Internet-sourced images with text, blur, occlusion and shadow are input to the pruning module as shown in Fig. 2. Images containing text are filtered out using a text detection algorithm (Tesseract Optical Character Recognition [OCR]) [5].
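As an illustration, text-bearing images can be flagged with the pytesseract wrapper around Tesseract; the minimum-character criterion below is our assumption, since the chapter only states that Tesseract OCR is used for text detection.

```python
import pytesseract
from PIL import Image

def contains_text(path, min_chars=3):
    """Flag an image for pruning if Tesseract OCR finds readable text.

    The minimum-character heuristic is an assumption; the chapter only
    states that images with text are filtered using Tesseract OCR [5].
    """
    text = pytesseract.image_to_string(Image.open(path)).strip()
    return len(text) >= min_chars
```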

Fig. 2 Pruning of images

Blur detection in images using traditional methods is computationally expensive [6]. To reduce this complexity, we use a binary classifier to classify the input images into blur and non-blur images. The classification comprises feature extraction using stacked autoencoders [7, 8], with the extracted features serving as input to the binary classifier. The encoder takes 64 × 64 × 3 nodes as input, passes them through two stacked intermediate layers of 1024 and 512 nodes and outputs 64 nodes, i.e. the input is encoded into a 64-node representation. The decoder mirrors this, expanding the 64 nodes through stacked layers of 512 and 1024 nodes to an output of 64 × 64 × 3 nodes, thereby reconstructing the input image. The stacked autoencoder is first trained to extract features; the decoder is then replaced with a binary classification layer that classifies images as blur or non-blur. The non-blur images are passed to the occlusion detection module.
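A minimal PyTorch sketch of this architecture is given below; the layer sizes follow the chapter, while the ReLU activations and the two-logit classification head are our assumptions.

```python
import torch
import torch.nn as nn

IN_DIM = 64 * 64 * 3  # flattened 64 x 64 x 3 input image

class StackedAutoencoder(nn.Module):
    """Encoder 12288 -> 1024 -> 512 -> 64 and mirrored decoder, matching
    the layer sizes in the chapter; ReLU activations are an assumption."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(IN_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 64),
        )
        self.decoder = nn.Sequential(
            nn.Linear(64, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, IN_DIM),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

class BlurClassifier(nn.Module):
    """After reconstruction pre-training, the decoder is replaced by a
    binary head that labels the 64-D code as blur or non-blur."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(64, 2)  # logits for blur / non-blur

    def forward(self, x):
        return self.head(self.encoder(x))
```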

Most internet-sourced images of cultural heritage contain occluding objects in front of the monuments, which can affect the 3D reconstruction. We therefore detect occluded portions in order to eliminate such images. You Only Look Once (YOLO) [9] generates a bounding box for each object in the input image. Our algorithm computes the area covered by the bounding boxes and, from the fraction of the image this area occupies, calculates the percentage of occlusion. If the percentage is greater than a threshold (heuristically set to 20%), the algorithm discards the image. If multiple bounding boxes overlap, we take the union of all the boxes, given as

$$\displaystyle \begin{aligned} \bigcup_{i=1}^{N} A_i = \{ x \in U : \exists\, i \in \{1,2,\ldots,N\},\ x \in A_i \}. \end{aligned} $$
(1)

In Fig. 3, we observe that occlusion below the 20% threshold does not noticeably affect the 3D model. In Fig. 3a and b, the occlusion percentage is small, and experimentally we observe no significant effect on the 3D models. In Fig. 3c and d, the occlusion percentage is greater than 20% and covers a major part of the monument, and experimentally we observe a significant effect: the 3D model contains a hole in the occluded area, resulting in an incomplete model (see Fig. 8).
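A simple way to realise Eq. (1) and the occlusion percentage is to rasterise the detected boxes onto a boolean mask, as sketched below; the (x1, y1, x2, y2) box format and the use of the full image area as the reference are our assumptions.

```python
import numpy as np

def occlusion_percent(boxes, img_w, img_h):
    """Percentage of the image covered by the union of YOLO bounding boxes.

    `boxes` holds (x1, y1, x2, y2) pixel corners of the detected objects.
    Rasterising onto a boolean mask computes the union of Eq. (1) without
    inclusion-exclusion bookkeeping. Using the full image as the reference
    area is our assumption; the chapter does not state the reference region.
    """
    mask = np.zeros((img_h, img_w), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        mask[int(y1):int(y2), int(x1):int(x2)] = True
    return 100.0 * mask.sum() / float(img_w * img_h)

# An image is discarded when occlusion_percent(...) > 20, the heuristic
# threshold used in the chapter.
```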

Fig. 3 (a, b) Images retained and (c, d) images discarded at the set threshold. (a) 7% occlusion. (b) 10% occlusion. (c) 28% occlusion. (d) 36% occlusion

Images with shadow usually affect the texture of reconstructed 3D models, and handling them remains an open problem. Some shadow detection algorithms in the literature are detailed in [10, 11]; however, these techniques do not yield desirable 3D models. We therefore use a convolutional autoencoder [12], as shown in Fig. 4, to eliminate shadow images by classifying images into shadow and non-shadow classes. The network takes a 64 × 64 × 3 input, which is convolved with a 5 × 5 kernel with three channels. The convolved output is max-pooled with a 2 × 2 kernel and stride 2, giving a 30 × 30 × 3 output, which is convolved with a 5 × 5 kernel with six channels and again max-pooled with a 2 × 2 kernel and stride 2. The resulting 13 × 13 × 6 output is flattened and fed to a fully connected network with three hidden layers of 512, 360 and 84 nodes, respectively. The fully connected output is fed to a binary classifier that labels the images as shadow or non-shadow.

Fig. 4 Classification of shadow and non-shadow images
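A PyTorch sketch of the classifier branch is given below; the layer sizes match the dimensions stated in the chapter, while the ReLU activations are our assumption.

```python
import torch
import torch.nn as nn

class ShadowClassifier(nn.Module):
    """Convolutional encoder plus fully connected classifier with the layer
    sizes from the chapter: 64x64x3 -> conv 5x5 (3 ch) -> pool -> 30x30x3
    -> conv 5x5 (6 ch) -> pool -> 13x13x6 -> FC 512 -> 360 -> 84 -> 2.
    ReLU activations are an assumption; the chapter does not specify them."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 3, kernel_size=5), nn.ReLU(),  # 64x64x3 -> 60x60x3
            nn.MaxPool2d(2, stride=2),                  # -> 30x30x3
            nn.Conv2d(3, 6, kernel_size=5), nn.ReLU(),  # -> 26x26x6
            nn.MaxPool2d(2, stride=2),                  # -> 13x13x6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                               # 13 * 13 * 6 = 1014
            nn.Linear(13 * 13 * 6, 512), nn.ReLU(),
            nn.Linear(512, 360), nn.ReLU(),
            nn.Linear(360, 84), nn.ReLU(),
            nn.Linear(84, 2),                           # shadow / non-shadow logits
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```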

Fig. 5 Selection of images for 3D reconstruction

2.2 Selection of Images for 3D Reconstruction

The filtered images are processed to choose appropriate images for 3D reconstruction, as shown in Fig. 5. The stacked autoencoder from Sect. 2.1 is trained to extract features, mapping each image to a point in the latent space, and these latent points are clustered. We consider two clustering algorithms: mean-shift [1] and hierarchical clustering [13]. We use a content-based image retrieval (CBIR) [14] technique with both clustering algorithms and compare the resulting clusters against an input query image, obtained from a curator or user. The intersection of the image clusters obtained from the mean-shift and hierarchical algorithms is used for 3D reconstruction, as shown in Fig. 5.
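A sketch of the selection step using scikit-learn is given below; the number of hierarchical clusters and the nearest-neighbour assignment of the query to a cluster are our assumptions, as the chapter does not report these parameters.

```python
import numpy as np
from sklearn.cluster import MeanShift, AgglomerativeClustering

def select_images(latents, query_latent, n_clusters=10):
    """Select images lying in both the mean-shift cluster and the
    hierarchical cluster of the query image.

    latents:      (N, 64) array of stacked-autoencoder codes (Sect. 2.1)
    query_latent: (64,) code of the query image
    n_clusters:   hypothetical parameter for hierarchical clustering
    """
    ms_labels = MeanShift().fit_predict(latents)
    hc_labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(latents)
    # Assign the query to the cluster of its nearest latent point (CBIR step).
    nearest = np.argmin(np.linalg.norm(latents - query_latent, axis=1))
    ms_set = set(np.where(ms_labels == ms_labels[nearest])[0])
    hc_set = set(np.where(hc_labels == hc_labels[nearest])[0])
    return sorted(ms_set & hc_set)  # indices of images kept for reconstruction
```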

Fig. 6 (a) 3D model of Pattadakal obtained from images containing text. (b) 3D model of Pattadakal obtained after discarding images with text

3 Results and Discussions

The implementation is carried out on an Intel Xeon i5 processor with 64 GB RAM and an Nvidia Quadro K5000 graphics processor. In this section, we demonstrate the results of our pipeline and compare them with existing 3D reconstruction techniques. We use the openMVG and openMVS [15] pipeline for 3D reconstruction.

Our dataset comprises 300,000 internet-sourced images covering 60 heritage sites in India, of which approximately 150,000 were discarded by our pipeline. We used 50,000 synthetically blurred and real blurred images for training the stacked autoencoder, obtaining 95.432% test accuracy and 97.213% cross-validation accuracy for blur/non-blur classification (Table 1). We used a standard shadow detection dataset [10] for training the convolutional autoencoder, obtaining 96.156% cross-validation accuracy and 94.591% test accuracy for shadow/non-shadow classification. We performed a subjective analysis of the obtained results with 100 volunteers; the ratings (between 1 and 5, 1 being the lowest and 5 the highest) are recorded in Table 2.

Table 1 Results of the stacked autoencoder and convolutional autoencoder
Table 2 Subjective quality analysis for the obtained results with 100 volunteers and the corresponding ratings (rating between 1 and 5, 1 being the least and 5 being the highest)

In Figs. 6, 7, 8 and 9, we show the results of the individual stages of the proposed pipeline using 3D reconstructions of the Pattadakal temple in Badami taluk, Karnataka, India, and the Stone Chariot at Hampi, Bellary, Karnataka, India. Figure 6a shows the 3D model reconstructed from 100 sample images of Pattadakal, of which 36 contain text; Fig. 6b shows the model reconstructed from the 64 images remaining after eliminating images with text. Figure 7a shows the 3D model reconstructed from 200 sample images of Pattadakal, of which 42 are blurred; Fig. 7b shows the model reconstructed from the remaining 158 images. Similarly, Figs. 8a and 9a show 3D models reconstructed from 200 and 150 images, of which 14 are occluded and 27 contain shadow, respectively; Figs. 8b and 9b show the models reconstructed from the remaining 186 and 123 images. Figure 10 compares 3D models reconstructed with and without the proposed filtering pipeline.

Fig. 7 (a) 3D model of Pattadakal obtained from blurred images. (b) 3D model of Pattadakal obtained after the removal of blurred images

Fig. 8 (a) 3D model of Pattadakal obtained from images containing occlusion. (b) 3D model of Pattadakal obtained after discarding images containing occlusion

Fig. 9 (a) 3D model of Hampi obtained from images containing shadow. (b) 3D model of Hampi obtained after removing images containing shadow

Fig. 10 (a, c, e) 3D models obtained without applying the filtering pipeline and (b, d, f) 3D models obtained after applying the filtering pipeline

4 Conclusions

In this chapter, we have proposed a deep learning-based filtering pipeline for processing internet-sourced images of heritage sites towards better 3D reconstruction. 3D reconstruction of heritage sites from images collected from the internet is challenging, since the images may contain blur, text, occlusion and shadow artefacts that reported pre-processing methods do not handle well. To improve results on internet-sourced images, we have proposed a pipeline with pruning and selection modules that selects a suitable set of images for 3D reconstruction, using a combination of mean-shift and hierarchical clustering algorithms for the selection. We have demonstrated the proposed pipeline by generating 3D models of various cultural heritage sites and have performed subjective quality analysis on the obtained results.