
1 Introduction

Recent advances in 3D point cloud representation have facilitated the development of new applications such as heritage information digitalization [1]. However, even the most powerful 3D laser scanners available today may produce physically accurate but semantically meaningless 3D representations. Machine Learning (ML) and Deep Learning (DL) techniques for extracting semantic information from 3D point clouds have therefore emerged and been applied in a variety of fields, including construction engineering [2], building modeling [3], energy estimation [4], and cultural heritage [5].

Despite substantial advances in existing methods for analyzing point clouds, applying ML and DL approaches to the semantic segmentation of architectural cultural heritage (ACH) remains challenging due to its complexity and uniqueness. More specifically, ML-based approaches require hand-crafted geometric features such as anisotropy, planarity, linearity, sphericity, and verticality [6, 7], which may raise concerns about their generalization to other ACH datasets. On the other hand, DL-based methods can automatically learn features from input point clouds and classify each point in an end-to-end manner. However, their success rests on the assumption that annotated data are always available and cover a wide range of samples. What remains unknown is how well pre-trained ML and DL methods generalize to various types of architectural heritage.

Some of these challenges have been described by Grilli et al. [8], although their focus is on evaluating the generalization of ML-based techniques. In this context, a comprehensive understanding of both approaches is critical for optimizing the use of AI techniques on ACH (e.g., for data usage, analysis, and conservation). This paper aims to compare the generalization capabilities and limitations of ML-based and DL-based approaches for the semantic segmentation of 3D ACH. Specifically, we employ the supervised Random Forest (RF) classification method described by Grilli et al. [7] as our ML-based approach. We then employ the Dynamic Graph Convolutional Neural Network (DGCNN) [9] as the DL-based method, which derives classification results automatically from input point clouds.

The ML method and the DL approach are trained and tested on i) the test scenes of the Architectural Cultural Heritage (ArCH) dataset [10] and ii) point clouds of three chapels of the “Sacromonte Calvario di Domodossola” complex [11]. To assess its generalization, the DL-based technique is trained solely on the training split of the ArCH dataset and then evaluated on both the test split of the ArCH dataset and the “Sacrimonti” dataset. The classification results of this approach are compared to those of the supervised ML approach. The advantages and limitations of the two methodologies are then compared through a systematic analysis of the results. The contributions of our research can be summarized in the following points:

  • In the ML-based method, satisfactory classification results are reached by training, for each dataset, specific RF classifiers that require only minimal manual annotations and previously computed covariance features.

  • Without manually annotating the “Sacrimonti” dataset, the DL-based classification algorithm delivers competitive performance on cross-dataset point clouds.

  • We conducted empirical experiments to gain a comprehensive understanding of how pre-trained ML and DL methods perform on distinct types of 3D ACH point clouds.

2 Related Works

Contemporary survey techniques can capture the geometric representation of ACH in the form of point cloud models. At the same time, to maximize the exploitation of the acquired measurements, the huge volume of data necessitates a semantic interpretation at a high Level of Detail (LoD). Machine learning (ML) and deep learning (DL) methods for 3D point cloud analysis are constantly being developed and enhanced; the state-of-the-art approaches for assessing the feasibility of ML and DL methods applied to ACH point clouds are discussed in this section. These approaches are reviewed from two perspectives: i) ML-based methods, and ii) DL-based methods.

2.1 ML-Based Method

Semantic categories are learned from a set of manually annotated data using supervised machine learning techniques such as support vector machines [12], naive Bayes [13], and random forests [14]. The semantic categorization is then propagated across the full dataset using the trained model. In most cases, a substantial amount of annotated data is not required to train the model. Traditional approaches, on the other hand, often use a set of hand-crafted shape descriptors as feature vectors to learn the categorization pattern. Local surface patches, spin images, intrinsic shape signatures, and heat kernel signatures are among the descriptors listed by Griffiths and Boehm [15]. To accomplish classification, 2.5D techniques use features and labels from 2D images and project them onto 3D models. Grilli et al. [16] presented a classification method that uses 2D data as input (a "texture-based" approach). For the test scenarios under investigation, optimized models, orthoimages, and UV maps were developed. They first classified the items using orthoimages or UV maps and then projected the 2D classification results onto the 3D objects.

AI techniques that work directly on the 3D CH point model have appeared only during the last few years. In supervised machine learning, the algorithms use manually annotated parts of the datasets as input, together with hand-crafted features (such as geometric and/or radiometric properties), to learn patterns that are then projected throughout the whole dataset. Grilli et al. [7] proposed a classification method that works directly on point clouds, using geometric characteristics to train a Random Forest (RF) classifier. The approach iteratively extracts the most important features based on a set of geometric properties that are tightly tied to the dimensions of the architectural elements. In [8], the same authors confirmed the ability to generalize the classification model across various architectural settings.

Teruggi et al. [5] proposed a Multi-Level Multi-Resolution (MLMR) technique based on the approach described by Grilli et al. The full-resolution dataset is subsampled, and large macro-elements are classified using a low-resolution version of the point model and a dedicated RF classifier. The output is then back-interpolated onto a higher-resolution point cloud to subdivide components that require high geometric precision. The algorithm iterates, on the full-resolution dataset, up to the classification of single, highly detailed architectural elements. Each step of this approach requires training a specific RF model, but only a small amount of labelled data is needed, and the training and classification have proven to be fast. The data are divided into sub-classes in a hierarchical manner as the level of geometric detail increases. Compared to non-hierarchical classification, this technique was shown to be more computationally efficient and allowed for higher accuracy in the case of complex datasets.

2.2 DL-Based Method

Qi et al. presented PointNet [17], a ground-breaking approach that operates directly on point clouds and employs shared Multilayer Perceptrons (MLPs) to learn high-dimensional features for each point separately. This approach has been extended in numerous ways to extract local information from a point cloud [9, 18].

For instance, the Dynamic Graph Convolutional Neural Network (DGCNN) improves segmentation performance by constructing graphs in which point correlations are encoded in the graph edges [9]. In the CH domain, Pierdicca et al. [19] proposed using DGCNN [9] for point cloud segmentation of the Architectural Cultural Heritage (ArCH) dataset [10]. DGCNN-Mod [19] improves classification performance by including radiometric (HSV) and normal information. By integrating spectral information and hand-crafted geometric features, DGCNN-Mod + 3Dfeat [20] combines the positive aspects of both ML and DL for semantic segmentation of point clouds in the CH field. These works demonstrated the promise of deep learning approaches for segmentation tasks. Unsupervised learning techniques have also been developed in this domain to address the absence of labeled data. By faithfully reconstructing the original input point cloud, an autoencoder (AE) is trained to learn a compressed representation from unlabeled data [21]. Among them, DGCNN is utilized as the backbone due to its robustness when applied to point clouds of different scales.

In general, learning to generate powerful and robust representations from inhomogeneous point clouds, particularly ACH point clouds with complex geometric patterns, remains a challenge. We compare the ML-based and DL-based methods on two different ACH datasets to gain a comprehensive understanding of the generalizability of both methods in the CH point cloud semantic segmentation task, which enables more effective use of AI techniques in the ACH domain.

3 Method

3.1 ML Method

In this work, we employ a supervised ML classifier based on the work presented by Grilli et al. [7]. The RF classifier does not need a significant amount of manually annotated data to classify the final dataset, but it requires as input meaningful geometric features able to highlight the discontinuities between elements. These features describe the structure of the point cloud within a point neighbourhood of a certain radius, which depends directly on the dimensions of the elements.

Following the reasoning behind the work presented in [5, 7 and 9], and in addition to taking the z coordinate of each point into account, the following geometric covariance features (Fig. 1) are computed for radii from 0.05 m to 0.4 m with an increment of 0.05 m: i) anisotropy, ii) planarity, iii) linearity, iv) surface variation, v) sphericity, vi) verticality.
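For illustration only, the sketch below (a minimal, unoptimized Python implementation assuming NumPy and SciPy) shows how such eigenvalue-based covariance features can be computed within a spherical neighbourhood; the feature definitions follow the standard eigenvalue formulations and may differ in detail from the implementation actually used in [7].

```python
import numpy as np
from scipy.spatial import cKDTree

def covariance_features(points, radius):
    """Per-point eigenvalue-based covariance features.

    points : (N, 3) array of x, y, z coordinates.
    radius : neighbourhood search radius in metres (e.g. 0.05 ... 0.4).
    Returns a dict of per-point features.
    """
    tree = cKDTree(points)
    n = len(points)
    feats = {k: np.zeros(n) for k in
             ("anisotropy", "planarity", "linearity",
              "surface_variation", "sphericity", "verticality")}

    for i, p in enumerate(points):
        idx = tree.query_ball_point(p, radius)
        if len(idx) < 3:          # too few neighbours for a stable covariance
            continue
        cov = np.cov(points[idx], rowvar=False)
        # Eigenvalues in ascending order: l3 <= l2 <= l1
        evals, evecs = np.linalg.eigh(cov)
        l3, l2, l1 = np.maximum(evals, 1e-12)
        normal = evecs[:, 0]      # eigenvector of the smallest eigenvalue

        feats["anisotropy"][i] = (l1 - l3) / l1
        feats["planarity"][i] = (l2 - l3) / l1
        feats["linearity"][i] = (l1 - l2) / l1
        feats["surface_variation"][i] = l3 / (l1 + l2 + l3)
        feats["sphericity"][i] = l3 / l1
        feats["verticality"][i] = 1.0 - abs(normal[2])
    return feats
```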

The classification process encompasses different steps: i) extraction of the geometric features based on the covariance matrix for the whole dataset; ii) manual segmentation of a portion of the dataset to be used as training and evaluation sets; iii) training of the RF classifier; iv) feeding the dataset to be classified, together with the computed covariance features, to the trained model to obtain the final prediction (Fig. 2).

Fig. 1. Examples of computed covariance features on Chapel 3 of the “Sacrimonti” dataset. The number inside the parentheses indicates the radius of the searching neighborhood.

Fig. 2. The architecture of the ML-based point cloud semantic segmentation method used in this work.

3.2 DGCNN-Based DL Network

We employ DGCNN [9] as our DL-based approach for ACH point cloud semantic segmentation. As shown in Fig. 3, the input to the encoder of the network consists of \(N\) points with coordinates \((x, y, z)\) and their features – RGB color and normalized coordinates \((r, g, b, n_x, n_y, n_z)\) – of the ACH point clouds. Graphs are constructed using the input points and their k nearest neighbors as nodes and the connections between nodes as edges. Local and global geometric features are extracted by a shared multilayer perceptron (MLP). The edge features are then aggregated by a local max-pooling operation on the extracted features. Additionally, by dynamically constructing graphs in each layer and stacking three EdgeConv layers, the receptive field is enlarged and information is aggregated across many receptive fields. The intermediate outputs are learned discriminative representations: a 1,024-dimensional “codeword” and three 64-dimensional edge features. To semantically segment the input ACH point clouds, we build the classification network, which uses four shared fully connected layers to transform the outputs of the encoder. The final output of the downstream semantic segmentation network is a set of per-point classification scores.
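To make the graph construction concrete, the sketch below shows a simplified PyTorch EdgeConv block in the spirit of DGCNN: for each point, the k nearest neighbours are found in feature space, the edge feature \((x_j - x_i, x_i)\) is assembled, and a shared MLP followed by max-pooling aggregates the edges. The tensor shapes and layer width are illustrative assumptions rather than the exact configuration of the cited network.

```python
import torch
import torch.nn as nn

def knn_indices(x, k):
    """x: (B, C, N) point features. Returns (B, N, k) indices of the k nearest neighbours."""
    inner = -2 * torch.matmul(x.transpose(2, 1), x)      # (B, N, N)
    sq = torch.sum(x ** 2, dim=1, keepdim=True)          # (B, 1, N)
    dist = -sq - inner - sq.transpose(2, 1)              # negative squared distances
    return dist.topk(k=k, dim=-1).indices                # (B, N, k)

def edge_features(x, k=20):
    """Build DGCNN-style edge features (x_j - x_i, x_i) of shape (B, 2C, N, k)."""
    B, C, N = x.shape
    idx = knn_indices(x, k)                                           # (B, N, k)
    idx_base = torch.arange(B, device=x.device).view(-1, 1, 1) * N
    idx = (idx + idx_base).view(-1)
    feat = x.transpose(2, 1).contiguous().view(B * N, C)[idx].view(B, N, k, C)
    x_i = x.transpose(2, 1).unsqueeze(2).expand(-1, -1, k, -1)        # (B, N, k, C)
    return torch.cat((feat - x_i, x_i), dim=3).permute(0, 3, 1, 2)

class EdgeConv(nn.Module):
    """One EdgeConv block: shared MLP over edge features followed by max-pooling."""
    def __init__(self, in_ch, out_ch, k=20):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Conv2d(2 * in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x):                       # x: (B, C, N)
        e = edge_features(x, self.k)            # (B, 2C, N, k)
        return self.mlp(e).max(dim=-1).values   # (B, out_ch, N)
```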

Fig. 3. The architecture of the DGCNN-based DL network.

4 Experiment

4.1 Datasets

ArCH Dataset.

The Architectural Cultural Heritage (ArCH) dataset [10] includes 17 annotated indoor and outdoor scenes. We employ the 15 labelled scenes as training data for the DL-based method. Moreover, two unseen scenes (“A_SMG_portico” and “B_SMV_chapel_27to35”) are used as test data in both our DL-based approach and our ML-based method to validate the generalizability across different types of ACH.

“Sacromonte Calvario di Domodossola”.

It is a Roman Catholic complex (Piedmont, Italy) that is part of Piedmont and Lombardy's nine “Sacri Monti” and has been on the UNESCO World Heritage List since 2003. The complex comprises a sanctuary and fifteen chapels, each with sculptures and murals depicting the stages of the “Via Crucis” [11]. The site was surveyed during the summer school “Laboratory of Places – ISPRS Workshop” [22] organized by the 3D Survey Group of the Department of Architecture, Built Environment and Construction Engineering of Politecnico di Milano. Chapels n. 3, 6, and 7, considered in this work, were measured using a terrestrial laser scanner (Leica RTC360/Leica C10) and UAV photogrammetry (Chapel n. 3: avg. resolution 5 mm, 14,294,406 points; Chapel n. 6: avg. resolution 1 cm, 10,352,825 points; Chapel n. 7: avg. resolution 5 mm, 17,816,051 points).

4.2 Experiment Settings

ML-Based Method.

The classifications conducted in the following experiments leverage the RF classifier available in the Scikit-learn Python library (version 1.0.2). Following the experiments performed in [5] and [7], the number of decision trees and the number of variables to be selected and tested for the best split when growing the trees have been set to 100 and “None”, respectively. This allows the forest trees to be created; a prediction is obtained from each tree, and the best solution is selected through voting among all of them. Specific RF classifiers have been trained for each chapel of the “Sacrimonti” dataset and for the two scenes extracted from the ArCH dataset (A_SMG_portico and B_SMV_chapel_27to35). Figure 4 reports an example of the training sets (manually annotated portions) used to train the models for the chapels of “Sacromonte Calvario di Domodossola”.
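A minimal sketch of this configuration with Scikit-learn is given below; the helper function and array names are illustrative placeholders, while n_estimators = 100 and max_features = None correspond to the settings reported above.

```python
from sklearn.ensemble import RandomForestClassifier

def classify_point_cloud(train_features, train_labels, full_features):
    """Train a scene-specific RF on the manually annotated portion and
    spread the classification over the whole point cloud.

    The feature arrays hold the covariance features computed at the
    different radii, plus the z coordinate of each point.
    """
    # Settings reported above: 100 trees, all variables considered at each split.
    rf = RandomForestClassifier(n_estimators=100, max_features=None, n_jobs=-1)
    rf.fit(train_features, train_labels)
    return rf.predict(full_features)
```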

Fig. 4. Training sets for (a) Chapel 3, (b) Chapel 6, and (c) Chapel 7 of the “Sacrimonti” dataset.

DL-Based Method.

We chose a block size of \(1 \times 1\) m to divide each ACH scene into blocks along the horizontal plane as training input. In addition, the points in each block are sampled to a uniform number of 2,048. The setup of the neighbor size and hidden layers in our encoder follows that of DGCNN [9]. We used Adam as our optimizer, with a learning rate of 0.01, a batch size of 4, and 200 training epochs. Dropout with a probability of \(0.5\) is used in the last fully connected layer.
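A simplified sketch of the corresponding training loop in PyTorch is given below; the segmentation model and the data loader (yielding blocks of 2,048 points with their features and per-point labels) are assumed to be provided by the caller, and only the optimizer settings reflect the values reported above.

```python
import torch
from torch import nn, optim

def train_segmentation(model: nn.Module, train_loader, num_epochs=200, lr=0.01):
    """Training loop following the reported settings: Adam with lr 0.01,
    200 epochs; the batch size of 4 is handled by the DataLoader supplied
    by the caller. The model is any per-point segmentation network that
    maps (B, C, 2048) inputs to (B, num_classes, 2048) logits."""
    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(num_epochs):
        for points, labels in train_loader:
            # points: (B, C, 2048) block features; labels: (B, 2048) class indices
            optimizer.zero_grad()
            logits = model(points)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
```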

4.3 Results

The overall performance of the two approaches is reported in Table 1; it is evaluated in terms of overall accuracy (OA), weighted precision, weighted recall, and weighted F1-score, as explained in [23]. Weighted metrics have been used to take into account the unbalanced number of samples belonging to the different classes.
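These weighted metrics can be computed directly with Scikit-learn, as in the following sketch (the label arrays are illustrative placeholders):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Overall accuracy and class-frequency-weighted precision, recall, and F1."""
    oa = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0)
    return oa, precision, recall, f1
```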

The results of the ML-based method are strictly dependent on the quality of the geometric features computed on the point cloud and fed to the RF classifier. Errors are concentrated in the parts of the point cloud that are particularly noisy (resulting in poor feature computation). Furthermore, elements of different classes that present similar geometric characteristics are easily confused.

The ML-based approach proved successful in the classification of the considered scenes. The OA reaches up to 0.97 for Chapel 6, with the lowest result obtained on the “B_SMV_chapel_27to35” scene of the ArCH dataset.

In the case of the DL-based approach, the model achieves an OA of 0.678 and 0.749 on the unseen and varied scene types (portico and chapel) of the ArCH dataset. When we cross-test our approach on the “Sacrimonti” point clouds, we obtain an OA of 0.738, 0.761, and 0.628 on Chapel 3, Chapel 6, and Chapel 7, respectively.

The qualitative results of the classification are shown in Figs. 5, 6 and 7. Both the ML-based and DL-based methods generate acceptable results, with the latter having some difficulty in recognizing classes such as columns and moldings.

Table 1. Classification metrics. “Scene_A” and “Scene_B” denote “A_SMG_portico” and “B_SMV_chapel_27to35”, respectively.
Fig. 5. Qualitative results of the ML-based and DL-based approaches for the ArCH dataset scene “B_SMV_chapel_27to35”: (a) ground truth, (b) ML-RF prediction, and (c) DL prediction.

Fig. 6. Qualitative results of the ML-based and DL-based approaches for the “Sacrimonti” dataset. Chapel 3: (a) ground truth, (b) ML-RF prediction, and (c) DL prediction. Chapel 6: (d) ground truth, (e) ML-RF prediction, and (f) DL prediction. Chapel 7: (g) ground truth, (h) ML-RF prediction, and (i) DL prediction.

Fig. 7. Qualitative results of the ML-based and DL-based approaches for the ArCH dataset scene “A_SMG_portico”: (a) ground truth, (b) ML-RF prediction, and (c) DL prediction.

5 Discussion and Conclusions

We observed a difference in the performance of the DL-based method between “B_SMV_chapel_27to35” and “A_SMG_portico” (0.749 vs. 0.678) in Table 1. Since the ArCH dataset contains five chapel scenes but only one portico scene, the portico type performs worse than the chapel type. To understand the generalization of the DL-based method to a dataset never seen during training, we also provide the results of testing on a cross dataset – the “Sacrimonti” point clouds, which consist of three chapels. The results on Chapel 3 and Chapel 6 are comparable to that on “B_SMV_chapel_27to35”, while some decline is observed on Chapel 7.

We can assess the reliability of the two methods by comparing the results of the DL method, tested directly on the unseen scenes of the ArCH dataset, against those of the ML approach, which was trained and tested on the test scenes of the ArCH dataset. Compared with the ML method, the performance of the DL method is lower (see Table 1). The results demonstrate that the performance of the DL method is highly dependent on the amount and diversity of the training data, as the performance on chapels is closer to that of the ML method. Furthermore, to validate the generalization of the DL approach, the model trained on the ArCH dataset was also tested on the “Sacrimonti” chapels. By comparing the DL-based and ML-based methods, we found that the DL method is less generalizable but more automatic, extracting features and classifying test scenes directly without the need for manual labeling during the classification phase. The ML approach, on the contrary, requires specific training for each test case, and a few manually segmented samples are necessary.

In this research, machine learning and deep learning semantic segmentation methods for architectural cultural heritage point clouds are investigated. A cross-testing procedure is presented to analyze their performance and generalizability, in order to fully understand these two methodologies. The advantages and limitations of the two methodologies are then compared.

In the case of the ML-based method, the short time needed to manually label the necessary training and evaluation sets, together with the speed of training the model and propagating the classification, confirms the success of this methodology. However, the possibility of automatically classifying a dataset without the need for an expert operator to intervene during the classification process is desirable. The results obtained with the presented DL method are promising and leave the field open for future improvements. In addition, transfer learning techniques, such as pre-training, could be incorporated to enrich the diversity of the training data and enhance generalizability.