1 Introduction

Authorities tasked with preserving historical sites use the visual inspection method to evaluate the state of preservation of cultural heritage (CH) structures and provide appropriate inputs for planning repair works. Damage assessments are mostly conducted when the severity of damage is evident from visual inspection and remedial measures become mandatory. Rapid urbanization, air pollution leading to discoloration of CH assets, vibrations, fires, humidity, high solar radiation, prevailing winds, floods, storms, and vandalism are all natural and societal factors. These variables wreak havoc on CH by altering its inherent character and producing discoloration, abrasion, efflorescence, spalling, fissures, stains, and fungal development.

Traditional visual inspection employs highly skilled and experienced inspectors who document findings on-site using only their observations. Inspection staff visually evaluate the building, record each problem on paper or electronic devices, and photograph the flaws to validate the record against codal standards and templates. Due to the structure and architecture of heritage sites, it is cumbersome to regularly inspect the entire CH [1]. Moreover, manual assessment is time-consuming and expensive due to the high labor input [2].

Innovative tools for inspecting CH structures, such as laser scanning [3,4,5,6] and internet-of-things-based sensors [7], are being utilized in combination with traditional visual inspection. While equipment-based techniques benefit data operation and maintenance, almost all such equipment has substantial installation and operating costs. The equipment is fragile and difficult to install and deploy when investigating large and complex heritage structures, and its use requires prior permission from the authorities. Following data gathering, the data must be further processed and analyzed.

Data processing and analysis require highly qualified and experienced personnel and specialized equipment; both are costly and cannot provide timely data for heritage structure inspection. Thus, primary assessments of CH structures take longer and are not performed frequently [8]. Many researchers have attempted to address the limitations of manual inspection by utilizing computer vision-based technologies such as digital image correlation (DIC) [9,10,11] and digital image processing (DIP) [12,13,14,15]; however, these methods rely on manual feature extraction and are affected by lighting conditions. New inspection systems are required to aid conservation authority workers in expediting the inspection process and restoring historic structures. Developing automated digital systems, tools, and technologies is the focus of the present phase of the Industry 4.0 revolution [16, 17]. One of the newest digital technologies in heritage conservation is artificial intelligence (AI)-based automation [18,19,20]. Using deep learning (DL)-based convolutional neural networks (CNNs) to detect defects in photographs, AI can help overcome the limitations of manual and machine-based inspection [21]. With advancements in technology and computer science, new and improved techniques must be adopted for detecting damage in heritage structures.

This research addresses the problem of CH asset inspection, which is traditionally a human-intensive activity. An AI-based model can reduce the human effort required for continuous monitoring of CH and help identify defects in complex architectures that would otherwise require direct expert observation, thereby aiding the structural health monitoring (SHM) process. This study proposes a new inspection method based on modern technologies and digital data to identify damage in CH structures. In particular, damage is identified from a set of images focused on the various parts of the tombs considered. Since the model performs real-time detection, it can be used in conjunction with video captured from unmanned aerial vehicles (UAVs) and mobile cameras. The damage pathologies are identified by bounding boxes displaying probability values for the different classifications. The results obtained using the developed method are used to identify suitable approaches for the preventive conservation and SHM of CH structures.

2 Previous related works

Machine learning (ML) techniques have been utilized to evaluate the structural health condition of CH buildings and to identify structural components in them [22,23,24,25,26,27]. Yao et al. [28] deployed the YOLOv4 network model for real-time detection of concrete surface cracks. Their basic model processes 16 frames per second (FPS), which does not suffice for real-time use, but their YOLOv4-tiny network and improved YOLO achieved processing rates of 56 and 44 FPS, respectively, meeting the requirements of real-time damage detection. The current research addresses such visual-inspection systems, which can even aid manual inspection at locations not accessible to SHM inspection professionals. Furthermore, this research uses all the image data gathered from the CH building, rather than combining partial damage data with partially synthetic datasets generated from physics-based models, as done in various studies [29, 30].

Some research works address CH inspection at various levels of SHM, ranging from defect detection to detecting missing CH assets. Mansuri and Patel [31, 32] developed an automatic defect detection technique using a faster region-based CNN (R-CNN) model with the Inception v2 DL architecture. A dataset of 880 images was considered, containing three types of surface defects: exposed brickwork, spalling damage, and cracks. The dataset was annotated using the LabelImg software [33] to serve as ground truth for learning defects in images. The system can detect defects in CH structures with a highest detection accuracy (mean average precision, mAP) of 0.915. Chaiyasarn et al. [34] deployed a CNN to detect cracks in historical structures on a dataset collected by camera and UAV. Dais et al. [35] used DL techniques for detecting cracks in masonry surfaces with complex backgrounds. Wang et al. [36] used CNN-based classification with a sliding window algorithm to identify and locate different damage classes in historic masonry structures. In a structure in China that houses a historical museum, Wang et al. [37] used faster R-CNN to identify and detect several damage types from 100 roof images. Guo et al. [38] applied a rule-based Mask R-CNN model for assessing plastered and painted facade defects in CH, such as cracks, peeling, spalling, biological growth, delamination, and blistering. Monna et al. [39] combined faster R-CNN with artificial data augmentation to detect vernacular buildings from satellite imagery, with a correlation R\(^2\) of 0.88 for the best classifier settings. Wang et al. [1] developed a smartphone-based damage detection system that uses ML techniques to detect spalling and efflorescence in brick masonry walls in real-time. Mondal et al. [40] used R-CNN to classify four forms of earthquake-related damage, including exposed rebars, cracks, buckling, and spalling, in data collected from earthquake-damaged buildings; they also used finite element techniques to develop a numerical model for localizing post-earthquake damage in masonry structures. Sharma et al. [41] used a CNN to quantify the dust deposited on heritage structures and to determine the level of damage caused. Bouchama [42] used a deep CNN to detect damage caused by a variety of pathologies that can affect the surfaces of historic structures. Zou et al. [43] used a CNN as an intelligent inspection system to detect missing components in CH buildings, which are of particular interest to conservators conducting routine inspections of historic structures. Masrour et al. [44] pre-trained DL-CNN models with transfer learning to detect seven damage pathologies in old buildings in Morocco.

Automated approaches based on CNNs with different DL frameworks have been extensively used for detecting cracks in concrete [45, 46] and metal structures [47, 48]; these approaches can detect intrinsic details that cannot be observed visually. Drone-based inspection techniques can detect structural problems in inaccessible areas, such as rooftops. Zhou et al. [49] deployed an R-CNN algorithm to detect cracks in crane steel structures, with data collection conducted using a UAV, and achieved a detection accuracy of 95.4%. Most contemporary research is based on damage detection in concrete datasets, specifically for two types of images, cracked and uncracked. Kung et al. [50] also employed UAVs and DL to detect building deterioration due to surface defects, with high precision and recall. Chen et al. [51] divided a training dataset of concrete surface images (227\(\times\)227\(\times\)3) into cracked and uncracked images (20000 images of each type); the network was batch normalized after each pooling layer, and a large convolution kernel and pooling size were used [51]. Training stopped automatically after 110,000 iterations, and the model success rate was 99.71%. Cha et al. [52] used a DL technique for object detection and image classification: a rectangular box enclosure was used to detect irregularly shaped cracks, with window sliding and region splitting strategies applied simultaneously. Semantic segmentation (e.g., FCN) was used in [53] to classify each image at the pixel level into cracked and uncracked regions; the authors proposed a U-Net network structure based on FCN, which uses fewer training images and provides better results for some FCN metrics, and compared it with Cha's CNN method. Deng et al. [54] used the object detection network YOLOv2 to detect cracks in concrete with complex backgrounds (handwritten scripts), achieving a maximum mAP of 77%. Feng et al. [55] proposed the STDD network, an efficient and lightweight DL-based real-time exposed-rebar detection method for spillway tunnels; the STDD network outperformed SDDNet (a dedicated crack detection network), with 1.7 million parameters and a 14.08 ms average inference time.

The literature review revealed two critical roadblocks in detecting defects in CH structures. First, most applications focus on concrete crack identification rather than CH defects. Second, application typology, namely, the type of structure and its components, must be considered first in CH, since damage typologies in CH vary greatly. Some studies, for example, have focused only on detecting defects in brick walls, while others consider only damage due to dust deposition. Such models fail to discover defects in other components of a CH structure, such as a column or dome. It is hard to generalize one kind of defect across all structural elements because defects are specific to structural elements. Previous studies have been limited to a particular type of heritage structure and cannot be applied to other structures. Moreover, carrying out SHM must not be limited to a few expert individuals but should be feasible for anyone with basic knowledge of SHM.

3 Data collection and preprocessing

The CH image dataset is collected from the Dadi-Poti tombs in Hauz Khas Village, New Delhi (Fig. 1). The larger tomb (Dadi) belongs to the Lodi era (1451-1526 AD) and measures 15.86 m \(\times\) 15.86 m; it has Quranic inscriptions engraved on the walls and ceiling in the form of medallions. The smaller tomb (Poti), located 20 feet away, belongs to the Tughlaq era (1321-1414 AD) and measures 11.8 m \(\times\) 11.8 m.

Fig. 1 Data collected from the Dadi-Poti tombs, Hauz Khas, New Delhi [96]

As with all DL methods, assembling properly labeled data with corresponding damage classes, ensuring image quality and a sufficient number of images per damage class, and tuning the DL model to perform on the custom dataset are challenging. In many cases, researchers have leaned on generating synthetic datasets and mixing in images of real-world objects to populate their datasets; in contrast, only real-image datasets are used in this study. The collected dataset comprises 3500 photographs of the Dadi and Poti tombs, which reveal various flaws, including spalling, efflorescence, exposed bricks, cracks, discoloration, algae growth, and missing parts. In this study, four of these classes of defects are identified and analyzed, as less data is available for the other defects. The images are annotated by drawing bounding boxes using the Roboflow software (Fig. 2). The annotated images are then augmented to a total of 10291 images by flipping, cropping, gray scaling, and rotation, as illustrated in Fig. 3, and the final dataset contains a total of 14757 labels, of which 975, 5896, 7752, and 134 are crack, discoloration, exposed brick, and spalling labels, respectively, as shown in Fig. 4. The 3500 original images were resized to 416\(\times\)416 pixels to reduce the training time.
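The augmentation pipeline can be reproduced with standard image libraries. The following is a minimal, illustrative sketch using torchvision; the transform parameters and file names are assumptions, not the exact Roboflow settings:

```python
# Illustrative augmentation pipeline approximating the operations described
# above (crop, flip, rotation, gray scaling); parameters are assumptions.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(416, scale=(0.8, 1.0)),  # crop, then resize to 416x416
    transforms.RandomHorizontalFlip(p=0.5),               # flipping
    transforms.RandomRotation(degrees=90),                # rotation
    transforms.RandomGrayscale(p=0.25),                   # gray scaling
])

img = Image.open("dadi_poti_0001.jpg")  # hypothetical image from the tomb dataset
augment(img).save("dadi_poti_0001_aug.jpg")
```

Note that geometric transforms must also be applied to the bounding-box labels; Roboflow adjusts the annotations automatically, whereas a manual pipeline would need a detection-aware augmentation library.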

Fig. 2 Examples of data annotation. The data were annotated to identify four types of defects: spalling, crack, exposed brick, and discoloration

Fig. 3 Examples of data augmentation in CH images. Augmentation includes gray scaling, flipping, rotation, and cropping of images

Fig. 4 Bar graph representing the total instances of defects in each class in the dataset with 10291 images

For training the defect detection system, the defects in each image are first identified manually with expert feedback from CH inspectors. These manually identified ground truth images are what the model learns image defects from. All relevant features and information in a picture must be identified, and a rectangular box must be drawn around each CH defect typology; the bounding box around each fault is carefully placed in all photos, and a single image may contain multiple bounding boxes capturing different types of defects.
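For reference, annotation tools such as Roboflow export each image's boxes in the YOLO label convention: one text line per box containing the class index and the box center, width, and height, normalized by the image dimensions. A minimal sketch (the class ordering and pixel values here are illustrative):

```python
# YOLO-format label line: "<class> <x_center> <y_center> <width> <height>",
# all coordinates normalized to [0, 1] by the image width/height.
CLASSES = ["crack", "discoloration", "exposed_brick", "spalling"]  # assumed order

def to_yolo_line(cls_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space box to a normalized YOLO label line."""
    x_c = (x_min + x_max) / 2.0 / img_w
    y_c = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{cls_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# Two defects in one 416x416 image -> two lines in the same label file.
print(to_yolo_line(CLASSES.index("crack"), 50, 80, 120, 300, 416, 416))
print(to_yolo_line(CLASSES.index("spalling"), 200, 210, 390, 380, 416, 416))
```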

However, some limitations and problems still exist in model preparation: (1) a sufficient amount of training data is not available for every class of defect; (2) cropping, splitting, and augmenting the images is time-consuming; (3) choosing the best algorithm for the particular case study, in terms of both detection accuracy and real-time or quasi-real-time operation, is non-trivial; and (4) the image dataset contains instances of shadows and windows, and some images contain no prominent defects, so model testing gave inaccurate predictions for them. Additionally, images containing no defects were marked as null rather than discarded (which improves accuracy), and further augmentation, such as gray scaling, was applied.

4 Methodology

An advanced DL object detection model named YOLO [56, 57] is used to train a model on the collected datasets. The method followed in this research is described below, and a schematic diagram of the steps is shown in Fig. 5. YOLO and its several versions have been successfully used in many object-detection applications in civil engineering, such as structural crack detection [58,59,60], crack detection on concrete structures [28, 61,62,63], pavement maintenance [64,65,66,67], detecting spilled loads in traffic scenes [68], identifying the parts of a building [69], pothole detection in pavements [70, 71], traffic load distribution in bridge infrastructure [72], safety-helmet detection on construction sites [73, 74], counting steel pipes on construction sites [75], measuring the diameter of reinforcement bars [76], detecting loosened bolts in wind turbines [77], and recognizing structural components from 2D drawings [78]. YOLO can respond in real-time, even on devices with limited processing power, since it uses only one forward propagation through the neural network to make a prediction. An application of YOLO in CH, the automatic detection of spalling zones in limestone walls, was performed by Idjaton et al. [79] on 1000 high-resolution images of CH in France.
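This single-pass behavior is visible in the public ultralytics/yolov5 hub interface. The following minimal sketch loads custom-trained weights and runs one forward pass per image; the weight and image file names are placeholders, not files shipped with this paper:

```python
import torch

# Load a YOLOv5 model through torch.hub; "best.pt" stands in for the
# custom-trained weights (a placeholder name).
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")

# One forward pass yields all bounding boxes, class probabilities,
# and objectness scores for the image.
results = model("tomb_facade.jpg")  # hypothetical test image
results.print()                     # summary: detection counts per class
boxes = results.xyxy[0]             # tensor rows: [x1, y1, x2, y2, conf, class]
```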

4.1 Convolutional neural networks

A CNN comprises an input layer, convolutional layers, pooling layers, a flattening step, fully connected layers, and an output layer. A convolution operation involves three elements: an input image, a feature detector (kernel), and a feature map. The feature detector filters the information in the input image, retaining the integral parts and excluding the rest to produce the corresponding feature map. The reduction in image size depends on the size of the kernel and the stride taken in traversing the pixels. A CNN learns multiple feature detectors, yielding several feature maps that together form a convolutional layer. The next component is the rectified linear unit (ReLU). ReLU restores non-linearity, as natural images are inherently non-linear; the rectifier breaks the linearity imposed on an image during convolution.

4.2 Pooling layer

There are various types of pooling, including mean, max, and random pooling. In this study, max pooling is used. It enables the CNN to detect salient features by retaining the maximum value in each pooling area; extracting the maximum value accounts for distortions, i.e., it gives the CNN spatial-variance capability. Pooling also decreases the image size and the number of parameters, thus preventing overfitting. The pooled feature map is then flattened and fed into an artificial neural network as the input layer. Meanwhile, the network's building blocks, such as the weights and the feature maps, are trained and continuously updated to achieve optimal performance, enabling the network to classify images and objects with maximum accuracy. Each pooling layer helps prevent overfitting, expediting convergence and enhancing training stability.
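The convolution, ReLU, pooling, and flattening steps of Sects. 4.1 and 4.2 can be written compactly in PyTorch. The sketch below is a generic illustration of these building blocks, not the YOLOv5 architecture:

```python
import torch
import torch.nn as nn

# Generic convolution -> ReLU -> max-pool -> flatten -> fully connected
# pipeline, illustrating Sects. 4.1-4.2 (not the YOLOv5 network itself).
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),  # feature detector
    nn.ReLU(),                                             # restore non-linearity
    nn.MaxPool2d(kernel_size=2),                           # max pooling: 416 -> 208
    nn.Flatten(),                                          # pooled maps -> vector
    nn.Linear(16 * 208 * 208, 4),                          # 4 defect classes
)

x = torch.randn(1, 3, 416, 416)  # one RGB image at the training resolution
print(block(x).shape)            # torch.Size([1, 4])
```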

4.3 Adam algorithm

The Adam algorithm is used as an optimizer for model training because it performs stochastic gradient descent with an adaptive learning rate for each weight parameter [80]: a lower learning rate is applied to weights that are updated more frequently, and a higher learning rate to weights that are updated least often.
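In PyTorch, this per-weight adaptivity is handled inside the optimizer itself; a minimal sketch follows, where the learning rate and betas are illustrative defaults, not the values used for the training runs in this paper:

```python
import torch

# Adam maintains per-parameter moment estimates, so each weight
# effectively receives its own step size [80].
params = [torch.randn(10, 10, requires_grad=True)]
optimizer = torch.optim.Adam(params, lr=1e-3, betas=(0.9, 0.999))

loss = (params[0] ** 2).sum()  # dummy loss for illustration only
loss.backward()
optimizer.step()               # adaptive update per weight
optimizer.zero_grad()
```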

4.4 Implementation

The current research is focused on constructing an AI-based state-of-the-art defect detection algorithm that leverages DL. However, the development of such a model necessitates the use of advanced DL techniques for object detection. The YOLOv5 method [81] is used in this project to develop a new object detection model for detecting four distinct defects.

Validation of the model can be inferred from the accuracy with which bounding boxes are detected. Moreover, any model requires comparison with its peers for validation and to test its robustness. The model selected for comparison is faster R-CNN [82, 83], a state-of-the-art and commonly used DL method, based on the ResNet 101 architecture.

YOLOv5 is a single-stage object detector with three main parts, i.e., backbone, neck, and head. The model backbone comprises cross-stage partial (CSP) networks, which extract essential features from the input image. The neck mainly generates feature pyramids that help the model generalize across object scales, enabling it to identify the same object at different sizes. The head performs the final detection, producing output vectors that include class probabilities, objectness scores, and bounding boxes for each detection.

Thus, an object detection model based on the state-of-the-art YOLOv5 algorithm is developed, in which the CNNs of the YOLO network are configured to give the best performance on the image dataset. The C3 convolution layer is replaced by bottleneck CSP networks, further improving the model's performance.

First, the images are tagged using the Roboflow online software. A total of 3550 photos are labeled to train the model to identify four types of flaws: spalling, discoloration, exposed bricks, and cracks. This dataset is then divided into training, validation, and test subsets with an 80:10:10 split [84, 85]. To expedite learning, photos are resized to 416 \(\times\) 416 pixels during preprocessing. Post-labeling, augmentation by flipping, cropping, gray scaling, and 90\(^\circ\) rotation brings the dataset to a total of 10291 photos. Pre-trained weights obtained from training on the COCO dataset [86] are used, the batch size is set to 16, and the number of epochs is 50.
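A sketch of the corresponding split and training run, assuming the standard ultralytics/yolov5 repository layout and a Roboflow-exported data.yaml; the file names and paths are illustrative:

```python
import random
import subprocess

# 80:10:10 split of the labeled image list (illustrative file names).
images = [f"images/img_{i:04d}.jpg" for i in range(3500)]
random.shuffle(images)
n = len(images)
train = images[: int(0.8 * n)]
val = images[int(0.8 * n): int(0.9 * n)]
test = images[int(0.9 * n):]

# Training command matching the settings in the text: 416x416 input,
# batch size 16, 50 epochs, COCO-pretrained YOLOv5s weights.
subprocess.run([
    "python", "train.py",
    "--img", "416", "--batch", "16", "--epochs", "50",
    "--data", "data.yaml", "--weights", "yolov5s.pt",
], check=True)
```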

The training dataset is fed into the custom YOLO model with pre-trained weights and hyperparameters tuned to give the best performance. Here, "custom" means that the model is optimized for our particular application and the Dadi-Poti dataset. In the custom YOLOv5s network used, the C3 convolution network is replaced with a bottleneck CSP layer, which makes the residual blocks thinner to increase depth with fewer parameters, improving both computation speed and accuracy. YOLOv5 has four model architectures suited to different dataset sizes: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. YOLOv5s suits smaller datasets and requires considerably less computation time than YOLOv5l, which needs a large dataset and more computation time but gives highly accurate results when those conditions are met. The performance of the custom YOLOv5s and YOLOv5l models is evaluated, and the mAP of the custom YOLOv5s model is calculated to be 93.7%; in contrast, the YOLOv5l model does not give satisfactory results and requires three times the computation time. The YOLOv5s model is initially trained for 50 epochs, as more epochs mean more processing time and the free version of Google Colaboratory (Google Colab) gives runtime disconnect errors. The best and last weights of each run are saved and reused when running the code for the subsequent 50 epochs. Higher numbers of epochs do not substantially change the accuracy, which saturates between 90 and 100 epochs.

The dataset is trained with two custom YOLOv5 variants, namely, custom YOLOv5s and custom YOLOv5l. Compared with the YOLOv5s model, the YOLOv5l model has more parameters, requires more CUDA memory to train, and is slower. Thus, after training on the dataset a few times, the latter is discarded, and YOLOv5s, as shown in Fig. 6, is adopted. The YOLOv5s algorithm is deployed because it is fast and provides real-time detection of defects; it contains a total of 191 layers and 7.47 million parameters. After confirming the correctness of the model, it is applied to the principal dataset, as shown in Fig. 6, following the same procedure as for the custom dataset. The performance of the model is improved by providing more data, optimizing the hyperparameters, and using different weights.

Fig. 5 Overall methodology of implementing defect detection models based on DL

Fig. 6 Custom YOLOv5 network used to train, validate, and test the dataset (modified from Mishra et al. and Park et al. [63, 71])

4.5 Environment

Google Colab is a cloud-based development environment that supports research and learning related to AI. Colab provides a code environment similar to Jupyter Notebook and offers a free graphics processing unit (GPU) and tensor processing unit (TPU). It comes with popular libraries pre-installed, such as PyTorch, TensorFlow [87], Keras [88], and OpenCV. ML and DL algorithms require systems with high speed and processing power (usually based on a GPU); standard computers are not equipped with a suitable GPU, and buying one is expensive. Hence, Google Colab supplies GPUs (Tesla V100) and TPUs (TPUv2) over the cloud to assist AI researchers.

4.6 Performance metrics

The performance of the custom YOLOv5 is evaluated using precision and recall. Precision, also known as positive predictive value, and recall, also known as sensitivity, are given by Eqs. 1 and 2, respectively. The robustness of the proposed YOLOv5 algorithm is measured using the AP, the area under the precision-recall curve. The AP is a single value indicating how well the detector classifies correctly and identifies all objects belonging to each class. The mAP is calculated by averaging the AP values over all classes.

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
(1)
$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
(2)

where TP, FP, TN, and FN denote true positive, false positive, true negative, and false negative, respectively.
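Eqs. 1 and 2 translate directly into code; the counts below are illustrative, not measured values from this study:

```python
def precision(tp: int, fp: int) -> float:
    """Eq. (1): fraction of reported detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Eq. (2): fraction of ground-truth defects that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Illustrative counts for one defect class.
print(precision(tp=85, fp=14))  # ~0.859
print(recall(tp=85, fn=8))      # ~0.914
```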

4.7 Loss function

The YOLOv5 network is trained using stochastic gradient descent with momentum (SGDM), a training optimization approach that accelerates gradient vectors, resulting in faster convergence toward the minimum of the loss function. YOLOv5 predicts multiple bounding boxes per grid cell, but only one box is kept: the one with the highest intersection over union (IoU) with the ground truth. During the training phase, the predicted object probability for an image patch region is adjusted by offsetting to minimize the difference from the ground truth. To quantify the loss for bounding box refinement, YOLOv5 uses sum-squared errors between the model estimates and the actual values. Finally, the total loss function [54] given in Eq. 3 is used to align the predicted bounding box with the ground truth bounding box using the geometric coordinates and the confidence score [81].

$$\text{Total loss} = \text{classification loss (CL)} + \text{localization loss (LL)} + \text{confidence loss (CF)}$$
(3)
$$CL = \sum _{i = 0}^{S^2}\mathbb{1}_{i}^{obj}\left( P_{i}(C) - \hat{P}_{i}(C)\right) ^2,$$
(4)

where \(\mathbb{1}_{i}^{obj} = 1\) if an object is present in cell i and 0 otherwise, and \(\hat{P}_{i}(C)\) represents the conditional class probability for class C in cell i. The second term in Eq. 3, the localization loss, is defined as [54]:

$$\begin{aligned} LL = \lambda _{coord}\sum _{i = 0}^{S^2}\sum _{j = 0}^{B}\mathbb{1}_{ij}^{obj}\left[ (X_{i} - \hat{X}_{i})^2 + (Y_{i} - \hat{Y}_{i})^2\right] + \\ \lambda _{coord}\sum _{i = 0}^{S^2}\sum _{j = 0}^{B}\mathbb{1}_{ij}^{obj}\left[ (\sqrt{W_{i}} - \sqrt{\hat{W}_{i}})^2 + (\sqrt{H_{i}} - \sqrt{\hat{H}_{i}})^2\right] , \end{aligned}$$
(5)

where \(\mathbb{1}_{ij}^{obj} = 1\) if the jth bounding box detects an object in cell i, and 0 otherwise. \(\lambda _{coord}\) was set to 5 to compensate for the loss of the bounding box coordinates and place more focus on the correctness of the box. \(W_{i}\) and \(H_{i}\) are the width and height of the ground truth bounding box, while \(\hat{W}_{i}\) and \(\hat{H}_{i}\) are those of the predicted bounding box. Similarly, \(X_{i}\), \(Y_{i}\) and \(\hat{X}_{i}\), \(\hat{Y}_{i}\) are the coordinates of the centers of the ground truth and predicted bounding boxes, respectively. The third term in Eq. 3, the confidence loss, is defined as [54]:

$$\begin{aligned} CF = \sum _{i = 0}^{S^2}\sum _{j = 0}^{B}\mathbb{1}_{ij}^{obj}\left( C_{i} - \hat{C}_{i}\right) ^2 + \lambda _{noobj}\sum _{i = 0}^{S^2}\sum _{j = 0}^{B}\mathbb{1}_{ij}^{noobj}\left( C_{i} - \hat{C}_{i}\right) ^2, \end{aligned}$$
(6)

where \(\mathbb{1}_{ij}^{noobj}\) is the complement of \(\mathbb{1}_{ij}^{obj}\), \(C_{i}\) is the ground-truth confidence score, and \(\hat{C}_{i}\) denotes the predicted confidence score of box j in cell i; the model identifies objects based on a confidence threshold decided before training. \(\lambda _{noobj}\) = 0.5 weights down the non-object loss to rectify the class imbalance problem, since the majority of boxes do not contain any objects. However, because the YOLO loss function treats errors in small and large boxes equally, disproportionate box sizes in detecting defects may lead to incorrect predictions.
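For illustration, the structure of Eqs. 3-6 can be sketched in NumPy, assuming predictions and ground truth have already been arranged per grid cell and box. This is a simplified illustration of the loss structure, not the production YOLOv5 loss code:

```python
import numpy as np

def yolo_loss(pred, gt, lambda_coord=5.0, lambda_noobj=0.5):
    """Simplified total loss of Eq. (3) over S^2 cells and B boxes.

    pred/gt are dicts with keys (shapes are assumptions for this sketch):
      'box'  : (S2, B, 4) arrays of (x, y, w, h)
      'conf' : (S2, B) confidence scores
      'prob' : (S2, C) class probabilities
      'obj'  : (S2, B) indicator, 1 if box j in cell i is responsible
    """
    obj = gt["obj"]                 # 1_ij^obj
    noobj = 1.0 - obj               # 1_ij^noobj

    # Classification loss, Eq. (4): only cells containing an object.
    cell_has_obj = obj.max(axis=1)  # 1_i^obj
    cls = np.sum(cell_has_obj[:, None] * (gt["prob"] - pred["prob"]) ** 2)

    # Localization loss, Eq. (5): center offsets plus sqrt of width/height.
    d_xy = np.sum(obj[..., None] * (gt["box"][..., :2] - pred["box"][..., :2]) ** 2)
    d_wh = np.sum(obj[..., None] * (np.sqrt(gt["box"][..., 2:]) - np.sqrt(pred["box"][..., 2:])) ** 2)
    loc = lambda_coord * (d_xy + d_wh)

    # Confidence loss, Eq. (6): object boxes plus down-weighted empty boxes.
    conf_err = (gt["conf"] - pred["conf"]) ** 2
    conf = np.sum(obj * conf_err) + lambda_noobj * np.sum(noobj * conf_err)

    return cls + loc + conf
```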

4.8 Hyperparameter tuning

The custom YOLOv5s network is trained over the 10291 images with the loss function of Eq. 3 using SGDM, with a batch size of 16 over 100 epochs. The number of epochs is optimized by considering detection accuracy and computation time. The first hyperparameter, momentum, is set to 0.937 and is used when updating parameters between iterations. The learning rate controls how much the weights are updated during training: a large learning rate can cause the model to converge quickly, requiring fewer epochs, while lower learning rates need more training epochs to reach optimum weights. The learning rate used in this study is 0.01, and the weight decay is 0.0005, which shrinks the weight parameters during backpropagation by adding a penalty component to the cost function.
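These settings map directly onto a momentum optimizer; a minimal sketch with a placeholder model standing in for the custom YOLOv5s network:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 4)  # placeholder for the custom YOLOv5s network

# SGD with momentum and weight decay, configured with the values
# reported above (momentum 0.937, learning rate 0.01, weight decay 0.0005).
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # learning rate from the text
    momentum=0.937,     # momentum hyperparameter from the text
    weight_decay=5e-4,  # L2 penalty applied during backpropagation
)
```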

The faster R-CNN is trained on the image dataset to test its performance and compare it with the custom YOLOv5 model. It is a state-of-the-art object detection model that gives high accuracy on a dataset in quasi-real-time. It uses a region proposal network (RPN) to predict object regions and classes, eliminating the need for the selective search algorithm and thus drastically reducing the region proposal time. Training, in general, comprises three steps: performing a forward pass, calculating losses, and updating the weights of the network. The main aim of training is to minimize the loss until it reaches an almost constant value. The accuracy of detection for the current model is determined using the IoU metric, which measures the extent of overlap between the ground truth and predicted bounding boxes. IoU is defined as the ratio of the intersection area to the union area, and the threshold value of IoU is taken as 0.5 [89]; only predicted bounding boxes with IoU \(\ge\) 0.5 are considered correct detections.
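The IoU criterion follows directly from this definition; a minimal sketch for axis-aligned boxes given as (x1, y1, x2, y2) pixel coordinates, where the example boxes are illustrative:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)   # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A predicted box counts as a correct detection only if IoU >= 0.5.
print(iou((50, 50, 150, 150), (80, 60, 160, 150)) >= 0.5)  # True (~0.58)
```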

The mAP is used to determine the accuracy of detection. The faster R-CNN is implemented on the Google Colab cloud GPU using TensorFlow 1.4 and the object detection API [90]. The faster R-CNN and RPN are trained using a momentum optimizer value of 0.9, and the first-stage feature map stride for the sliding convolutional layer of the RPN is 16. The training image dataset is used to calculate the gradient for backpropagation, and the validation image dataset is used to obtain an optimum number of training iterations. The number of iterations affects the precision and training time of the DL model: fewer iterations reduce training time but not training loss, leading to poor detection precision, whereas additional iterations help reduce and stabilize the loss function.

However, too many iterations may lead to overfitting and waste time and computational resources. Alternative numbers of iterations, in ascending order, are therefore tried to identify the best value. The network is trained ten times, and the maximum detection accuracy is 85.04% as measured by the mAP.

5 Results and discussion

The custom YOLOv5 model is trained, validated, and tested on a CH image dataset containing 10291 images resized to 416\(\times\)416 pixels, with the batch size set to 16 and the number of epochs to 100. The present study mainly focuses on developing a DL model that can detect multiple defects in a given CH structure. The number of epochs was set to 100 by trial and error, balancing a minimal loss function against time and computational efficiency. The performance of the YOLOv5 model is reported using mAP@0.5, precision, recall, and the loss components (classification, box, and objectness) as functions of the number of epochs. The maximum mAP@0.5 obtained is 93.7% over the four types of defects, i.e., 98.9% for cracks, 85.3% for discoloration, 96.4% for exposed bricks, and 94.2% for spalling, as shown in Table 1. The faster R-CNN detects some defect instances more accurately, but it gives lower overall accuracy across multiple defect classes. The precision and recall are 85.9% and 91.8%, respectively. The confidence score threshold is set to 0.5, meaning the model reports only defects detected with at least this confidence. The graphs and model performance over test images contained in the 10% split of the original dataset are shown in Figs. 7, 8, 9.

Table 1 Performance metrics corresponding to each class of defects
Fig. 7 Defect detection results from the custom YOLOv5 defect detection model (some labels have been modified for enhanced visibility)

Fig. 8 Defect detection results from the custom YOLOv5 defect detection model (some labels have been modified for enhanced visibility)

5.1 Damage detection results and performance metrics from the custom YOLOv5 model

A detection model is trained on the final dataset of CH photos and then used to detect defects in the dataset's images of CH sites. Examples of the model's output are presented in Figs. 7 and 8. Each detected bounding box has a detection confidence level indicated in its upper-right corner. The confidence levels for the detected flaws fall within an acceptable range (56-97%). A variety of site conditions are analyzed for the detection of defects. In Fig. 7a-d, exposed bricks are detected with confidence levels in the range 75-97%. Fig. 7e illustrates two types of defects in the same image, discoloration and exposed brickwork; the discoloration detected within the area of exposed brick with a confidence level of 74% is highly accurate. Figure 7f likewise illustrates two classes of defects, discoloration and exposed brick, which are detected satisfactorily. In Fig. 8g, discoloration is detected with a confidence of 71-72%; in Fig. 8h-j, exposed brick is detected accurately. Figures 8k and l show three and two types of defects, respectively, demonstrating that the model can detect cracks on the lantern-shaped structures atop the dome.

The results from the custom YOLOv5 model are plotted in Fig. 9. The mAP of the model is 93.7% (Fig. 9c), which is achieved at approximately 50 epochs, after which the curve saturates up to 100 epochs. The maximum precision of the model is 85.9%, as shown in Fig. 9b, and the maximum recall is 91.8% (Fig. 9a). The precision in Fig. 9b dips at approximately 30 epochs, mainly because the model is still learning the new dataset and determining the number of TPs. The recall in Fig. 9a increases continuously, indicating that the number of correct predictions grows with the number of epochs, reaching its maximum value at approximately 50 epochs. The loss functions, shown in Fig. 9d-f, indicate the errors with which the model detects an object and decrease gradually with each training epoch.

Fig. 9 Performance metrics of the custom YOLOv5 defect detection model: a recall, b precision, c mAP, d classification loss, e objectness loss, f box loss

Fig. 10 Defect detection results from the faster R-CNN comparison model

5.2 Damage detection results and performance metrics from the faster R-CNN model

As shown in Fig. 10, the faster R-CNN can successfully detect multiple defects in a single test image. Fig. 10a-i shows detection of discoloration and exposed brick with confidence in the range 46-91%. Figure 10d detects exposed brick with multiple bounding boxes because the ground truth bounding boxes used for training are similar, and such boxes provide better localization results. Figure 10d also shows an instance of a crack on the lantern located at the top of the dome; the faster R-CNN was unable to detect the crack (although it detected spalling at the bottom of the dome with 89% confidence), whereas the YOLOv5 model detected it successfully. The faster R-CNN assigns a single confidence score to each bounding box that captures a defect; no bounding box carries two confidence scores. The smaller boxes in the results arise because the labeling was done the same way, for better localization of the defects. However, the object detection model can produce partially overlapping bounding boxes, since in some cases the same rectangular region contains multiple defects, such as the discoloration and exposed bricks shown in Fig. 10g.

The mAP achieved by training the faster R-CNN model, as shown in Fig. 11, is 85.04%, obtained after training the model for 8000 iterations, even though the maximum value is reached at approximately 6500 iterations. The class-wise mAP is 96.9% for exposed bricks, 89.9% for discoloration, 87.9% for spalling, and 65.76% for cracks. The model gives sufficient accuracy, but enlarging the dataset would further improve it.

Fig. 11 Mean average precision from the faster R-CNN model

5.3 Comparison of the custom YOLOv5 network and faster R-CNN

To compare the performance of the proposed YOLOv5, a ResNet 101 architecture-based faster R-CNN is trained on the same dataset. Cracks, spalling, exposed brick, and discoloration are detected successfully. The mAPs of the faster R-CNN and the proposed YOLOv5 are 85.04% and 93.7%, respectively. The results of the two DL models are comparable in terms of defect detection accuracy, but the faster R-CNN requires more than four times the training time of the YOLOv5s model, i.e., 8 h 38 min versus 1 h 53 min. Thus, the proposed YOLOv5 is superior, with fewer false detections and considerably faster training and inference. Some instances of model training with the custom YOLOv5s, custom YOLOv5l, and faster R-CNN are shown in Table 2. The YOLOv5s gives comparable results for the same dataset in approximately one-quarter of the time taken by the faster R-CNN.

Table 2 Model training time, number of defect classes considered, image dataset, epochs, and mAP for various DL models

5.4 Comparison of proposed model with other automatic visual inspection systems

This study set out to address the lack of research on defect detection systems and to show how AI-based visual inspection systems can move in this direction. Results for multi-class defects are generally more challenging and less accurate than cases where only one category of defect, such as cracks, needs to be identified. Mansuri and Patel [31] developed an automatic web-based visual inspection system based on the faster R-CNN Inception v2 architecture that can detect three classes of defects with an accuracy of 91.5%. Chen et al. [51] used a CNN to classify only cracked and uncracked concrete surfaces, with an accuracy of 99.71%. Wang et al. [1] detected two classes of defects with an accuracy of 95%, while Cha et al. [52] detected five damage types with 87.8% accuracy. Cosovic and Jankovic [91] applied a CNN to categorize CH images into 10 categories with an accuracy of up to 90%; although their work did not directly identify damage, the identification of components/objects [92] in CH can be extended to damage detection in CH buildings. The GreatWatcher platform of Wang et al. [93], based on an R-CNN DL framework, gave 78.2% accuracy on a small sample of 610 training images. Another study by the same research group [1] used faster R-CNN and reported average precisions of 0.999 and 0.900 for efflorescence and spalling damage in old masonry construction. Kwon and Yu [94] classified stone-related damage typical of CH sites into four types (crack, material loss, detachment of material, and biological colonization) based on the faster R-CNN algorithm and achieved a confidence score of 94.6%. Samhouri et al. [95] employed a CNN for detecting surface damage in architectural CH buildings in Jordan for four defects (erosion, material loss, color change of the stone, and sabotage) with an accuracy of 95-96%. In this study, we successfully developed a model that can detect four types of defects, namely, spalling, discoloration, exposed brick, and cracks, with an mAP of 93.7%. Hence, in comparison with other automatic damage detection DL techniques, the proposed model achieves good performance while retaining the potential for real-time detection of CH defects, and it can be used alongside contemporary methods of automatic visual inspection of historic buildings.

6 Conclusions, limitations, and future scope

This study proposes an automatic heritage structure defect detection model to aid in ensuring the sustainability of CH structures; it develops an automated visual inspection system that can accelerate preservation and maintenance processes. The YOLOv5 DL model is used to perform automatic defect detection. The dataset has 10291 images with four types of surface defects: spalling, exposed brick, discoloration, and cracks. The Dadi-Poti tomb dataset is annotated and used as the image dataset. With a highest detection accuracy (mAP) of 0.937, the system can successfully detect the four types of defects in CH structures. The model's performance is evaluated in various settings, including background noise and images taken from different angles. Successful application of this case study can help identify structural anomalies in need of urgent repair, thereby facilitating an improved civil infrastructure monitoring system. The YOLOv5 model can be further enhanced by adding more photographs to the database and modifying the design.

The originality and contribution of this study lie in the application of a DL model to detect damage using the Dadi-Poti tombs as a case study. The custom YOLOv5 automatic inspection system is tested and validated by comparison with a ResNet 101-based faster R-CNN model, and it can be used by conservation authorities to conduct regular inspections at lower cost, allowing more timely decisions about repair and maintenance work in the built environment. The proposed YOLO model gives better accuracy for classifying multiple defects in CH in almost a quarter of the training time of its faster R-CNN counterpart. This reduction in training time is helpful for practical purposes, saving computational costs and enabling faster model deployment. A random set of test photos is used to test the model, and a short 20-second video attached in the supplementary material could serve as a prototype for future work. If the images or video are captured using a drone, YOLO can report damage in real-time (since a video is simply a sequence of frames). The findings of this study can encourage the use of automatic systems instead of manual inspection, saving time and labor costs. The method is beneficial to inspection engineers, material scientists, and the wider heritage conservation community, and it can also spur the development of new damage detection techniques. The suggested automated inspection can help conservation agencies manage their finances. Furthermore, although the paper deals with the SHM of CH, the framework can be extended to other areas of CH preservation, such as the classification of architectural elements within heritage buildings, identifying disaster-affected CH, artwork identification, and image reconstruction in CH.

In this research work, the proposed YOLO model is confined to detecting four types of typical surface flaws in CH structures. More categories of defects, such as efflorescence, seepage, dust deposition, fungal growth, and missing components, can be considered in future studies, and a comparable model can be developed using the proposed method. Threats to the validity of the DL models should be taken into account, such as the quality of the image dataset, accurate labeling of the various defects, and sufficient images of each defect, particularly for YOLO's detection of minor defects that appear within groups of defects. Future research can focus on severity assessment as well as defect quantification. The current research utilizes image datasets, but future work includes running the YOLO model on input obtained from UAVs and computer webcams. Moreover, the model can be further developed to achieve higher performance with a smaller dataset and less computation time.

The present stage of research is based on detecting defects in the collected images. Such automatic detection, however, does not reveal the spatial location of the defects on the CH structure itself. Future studies can therefore develop a model that locates each defect and geo-tags it at its exact position on the CH structure, using a ground frame of reference to provide precise location coordinates of the defect on 3D platforms.