Introduction

Ancient architecture is an important component of Chinese cultural heritage [1, 2]. Owing to weathering, fire, and rot, a large number of Chinese historical buildings with wooden structural frames have disappeared. Digital archiving is an important measure for protecting historical buildings [3, 4], and 3D point clouds, which can capture the exact spatial geometry of built heritage with complex shapes, have been widely used to document ancient Chinese architecture [5]. However, the raw captured point cloud lacks structured information, such as semantics and the hierarchy between parts, which hinders its use in other application fields [6,7,8].

Point cloud semantic segmentation divides the original point cloud into subdatasets with semantic meaning. Based on the results of semantic segmentation, the annotated point cloud can be utilized to reconstruct parametric geometries manageable in H-BIM platforms [9,10,11,12]. In practical projects, manual methods involving segmentation through visual recognition and manual labeling of semantic information on the point cloud have been widely adopted by operators. Recently, numerous studies on deep learning (DL) techniques, offering novel and more effective solutions for point cloud semantic segmentation, have been conducted with the aim of replacing manual operations [13,14,15,16].

However, existing methods still face challenges in segmenting complex structures, such as ancient Chinese architecture. 3D modeling of such intricate models is also a prominent topic in the fields of graphics and digital modeling. This article introduces a Mix Pooling Dynamic Graph Convolutional Neural Network (MP-DGCNN) designed for the segmentation of point clouds representing ancient architecture.

Related works

A variety of point cloud semantic segmentation methods, including machine learning (ML) techniques, deep learning (DL) techniques, and hybrid methods, have been proposed and have achieved good performance [17].

ML techniques involve multiple stages, including neighborhood selection, feature extraction, feature selection, and semantic segmentation [18]. Researchers have proposed strategies for each stage so that ML can perform better on the point cloud semantic segmentation of historical buildings. Grilli et al. [19] analyzed the efficacy of geometric covariance features and the impact of features calculated on spherical neighborhoods of various radii. Teruggi et al. [20] extended an ML classification method with a multi-level and multi-resolution (MLMR) approach to segment the Pomposa Abbey (Italy) and the Milan Cathedral (Italy). Dong et al. [21] fused geometric covariance features with features derived from construction regulations to classify the roof point cloud of ZHONG HE Temple into 9 categories. Well-chosen features lead to higher segmentation accuracy [22]. Because different types of historical architecture differ in appearance, suitable segmentation features must be designed for each, which limits the application of machine learning.

The hybrid approach first uses an over-segmentation or point cloud segmentation algorithm as an initial pre-segmentation. Prior knowledge or supervised methods are then employed to label segments rather than individual points. In [23], the point clouds were initially clustered into segments with the assistance of a multi-resolution supervoxel algorithm. Vosselman et al. [24] employed the Hough Transform (HT) to generate planar patches as the preliminary segmentation step within their PCSS algorithm framework. Similarly, Landrieu and Simonovsky [25] employed a superpoint structure in the pre-segmentation phase and introduced a contextual PCSS network that combines superpoint graphs with PointNet and contextual segmentation.

In the past decades, the success of image-based deep learning has prompted scholars to extend research to 3D point cloud data, and some representative methods and theories have been developed [26,27,28,29]. Deep neural networks have made significant progress in research on indoor scenes, urban streets, and remote sensing, and have also provided new ideas for component recognition in ancient architectural scenes.

DL approaches to point cloud semantic segmentation can be classified into three categories: projection-based [30, 31], voxel-based [32,33,34], and point-based. Point-based networks have now become the mainstream approach to point cloud semantic segmentation. PointNet and its later improvement PointNet++ [35, 36] are considered pioneering works. Compared with conventional supervised machine learning, DL methods do not require hand-designed features.

Inspired by PointNet, Wang et al. [37] proposed DGCNN. DGCNN uses KNN to establish a local graph structure and designs the EdgeConv operator to process the edge features of that graph in order to learn the topological relationships between points. Because the local graph constructed by DGCNN is dynamically updated, the convolutional receptive field can grow to cover the entire diameter of the point cloud, allowing information to diffuse across it. Zhang et al. [38] proposed LDGCNN (Linked Dynamic Graph CNN) based on DenseNet [39]. This method mitigates the vanishing gradient problem by densely connecting features at different levels and extracts richer semantic information from multi-level local features. To learn the important features, Sun et al. [40] designed an edge feature weighting function for DGCNN that weakens the interference of distant points and strengthens the features of nearby points according to the feature distribution of local points. Wang et al. [41] introduced the residual network idea [42] to increase the depth of the network, making training more stable and improving the feature extraction capability.

DGCNN achieved excellent performance on public datasets such as ModelNet40 [43], ShapeNetPart [44], and S3DIS [45], and scholars in the field of architectural heritage have introduced it into the point cloud semantic segmentation of historical buildings. Pierdicca et al. [46] used DGCNN to label the elements of churches, porches, and monasteries in the ArCH dataset. Given the resemblance in color among components sharing the same structural characteristics in historical buildings, this method added color features in HSV space and normal vectors to the input layer of DGCNN and extended the edge convolution layer by constructing the local graph in the feature space. To improve segmentation accuracy, Matrone et al. [47] added geometric features such as verticality and planarity to the input layer of DGCNN, building on a performance evaluation of various machine learning and deep learning models on the ArCH dataset. Massimo et al. [47] described a strategy using DGCNN for point cloud semantic segmentation of historical buildings. This method first expanded the ArCH dataset by rotation, cropping, translation, coordinate perturbation, and scaling; transfer learning was then applied to pre-train the parameters of the DGCNN model on the extended ArCH dataset; finally, a small portion of the point clouds from a new scene was selected to fine-tune the parameters for semantic segmentation of the remaining point clouds.

DGCNN uses local graph structures to describe spatial points and learns the topological relationships between points through edge convolution. Although DGCNN achieves superior performance on some semantic segmentation tasks, the semantic segmentation of Chinese historical buildings remains a challenge. On the one hand, the components of Chinese historical buildings overlap and vary greatly in scale. On the other hand, the randomness of the input points exacerbates the complexity of the point clouds. Deep neural networks must capture more discriminative geometric information from the local structures of point clouds, so their generalization ability and stability need further improvement for ancient architectural scenes. To overcome this limitation, this article focuses on the edge features and pooling functions of DGCNN, proposes an improved strategy for autonomously learning pooling rules, and puts forward a more robust mixed-pooling DGCNN network for point cloud semantic segmentation.

The main contributions of the presented MP-DGCNN are as follows:

  1. Modified edge features of the local graph are proposed by adding position, direction, and distance information. The modified edge features enable MP-DGCNN to learn richer structural information and indirectly enhance its ability to extract local features.

  2. A hybrid pooling operator is designed that integrates max pooling, average pooling, sum (aggregation) pooling, and an adaptive pooling that learns pooling rules directly from point clouds, so that the local feature vectors of points can be extracted efficiently.

The remainder of this paper is organized as follows: the second section introduces the proposed MP-DGCNN network; the third section evaluates its performance on the test data; the last section summarizes the method and outlines future work.

MP-DGCNN

The MP-DGCNN network proposed in this paper for semantic segmentation of ancient buildings is shown in Fig. 1. The core component of MP-DGCNN is the improved EdgeConv layer; three such layers are stacked to capture fine-grained local geometric structures. The output features of the three EdgeConv layers are concatenated and pooled to form a one-dimensional global feature descriptor. The descriptor is then concatenated with the local features of each point from the three EdgeConv layers, so that each point's features include both local detail and the global feature descriptor. Finally, an MLP is used to fuse deep semantic information, and dropout is used to alleviate overfitting. The segmentation score of each point is then calculated to complete the segmentation task.

Fig. 1 MP-DGCNN
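The way the global descriptor is combined with per-point features can be sketched in NumPy as follows. This is only an illustration of the data flow described above, not the authors' implementation: the channel widths (64 each), the point count, and the use of max pooling for the global descriptor are assumptions, since the text only says the concatenated features are "pooled".

```python
import numpy as np


def assemble_point_features(f1, f2, f3):
    """Concatenate three per-point feature maps and append a global descriptor.

    f1, f2, f3: arrays of shape (N, c1), (N, c2), (N, c3), standing in for the
    outputs of the three EdgeConv layers.
    """
    local = np.concatenate([f1, f2, f3], axis=1)        # (N, c1 + c2 + c3)
    # ASSUMPTION: the global descriptor is obtained by max pooling over points.
    global_desc = local.max(axis=0)                     # (c1 + c2 + c3,)
    tiled = np.tile(global_desc, (local.shape[0], 1))   # repeat for every point
    return np.concatenate([local, tiled], axis=1)       # local + global per point


rng = np.random.default_rng(0)
n_points = 1024                                          # hypothetical block size
feats = [rng.standard_normal((n_points, c)) for c in (64, 64, 64)]
per_point = assemble_point_features(*feats)
print(per_point.shape)  # (1024, 384)
```

Each row of `per_point` would then be passed to the shared MLP and dropout layers that produce the per-point segmentation scores.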

The improved edge convolution operator is the core concept of MP-DGCNN, which mainly includes defining edge features, extracting graph features with MLP, and merging mixed pooling.

Edge feature definition

The MP-DGCNN network redefines edge features by introducing squared distance and position coordinates of neighbor node, enhancing the characterization of local graph structures and the richness of neighborhood information, thereby achieving a more comprehensive representation of the topology of points. The defined edge features are:

$${L}_{i}=\left\{{l}_{{i}_{0}},{l}_{{i}_{1}},\ldots ,{l}_{{i}_{k-1}}\right\}$$
(1)

In Eq. (1), \({L}_{i}\) represents the set of edge feature vectors of the local graph and \({l}_{{i}_{j}}\) represents the edge feature between the center point \({p}_{{i}_{0}}\) and one of its neighboring points \({p}_{{i}_{j}}\), where \(j\in \left\{0,1,2,\ldots ,k-1\right\}\).

$${l}_{{i}_{j}}=\left({p}_{{i}_{0}},{p}_{{i}_{j}},{e}_{{i}_{j}},{{e}_{{i}_{j}}}^{T}{e}_{{i}_{j}}\right)$$
(2)

\({l}_{{i}_{j}}\) is a feature vector composed of four parts: \({p}_{{i}_{0}}\) gives the coordinates of the center point, \({p}_{{i}_{j}}\) gives the position of the neighboring node, \({e}_{{i}_{j}}\) describes the direction of the graph structure, and \({{e}_{{i}_{j}}}^{T}{e}_{{i}_{j}}\) explicitly describes the distance. In particular, when \(j=0\), \({e}_{{i}_{j}}\) is the zero vector. If the point \({p}_{i}\) is a row vector with \({n}_{{p}_{i}}\) elements, the vector \({l}_{{i}_{j}}\) has \(3{n}_{{p}_{i}}+1\) elements.
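A minimal NumPy sketch of Eqs. (1) and (2) is given below. It assumes that the edge vector is the offset from the center to its neighbor, \({e}_{{i}_{j}}={p}_{{i}_{j}}-{p}_{{i}_{0}}\), which is the usual DGCNN convention and is consistent with \({e}_{{i}_{0}}\) being the zero vector, but is not stated explicitly in the text. The brute-force neighbor search and the names `knn_indices` and `edge_features` are illustrative only.

```python
import numpy as np


def knn_indices(points, k):
    """Return, for every point, the indices of its k nearest points (itself first)."""
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)                   # (N, N) squared distances
    np.fill_diagonal(dist2, -1.0)  # force each point to be its own neighbor j = 0
    return np.argsort(dist2, axis=1)[:, :k]


def edge_features(points, k):
    """Build l_ij = (p_i0, p_ij, e_ij, e_ij^T e_ij) for every point and neighbor."""
    idx = knn_indices(points, k)                         # (N, k)
    center = np.repeat(points[:, None, :], k, axis=1)    # (N, k, n)
    neighbor = points[idx]                               # (N, k, n)
    e = neighbor - center          # ASSUMPTION: e_ij = p_ij - p_i0
    sq_dist = np.sum(e ** 2, axis=-1, keepdims=True)     # (N, k, 1)
    return np.concatenate([center, neighbor, e, sq_dist], axis=-1)


rng = np.random.default_rng(0)
pts = rng.random((50, 3))          # 50 hypothetical points with n = 3 coordinates
L = edge_features(pts, k=20)
print(L.shape)  # (50, 20, 10), i.e. 3n + 1 = 10 elements per edge
```

The dense distance matrix costs O(N²) memory, so a real pipeline would replace `knn_indices` with a spatial index or a batched GPU search.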

Graph feature extraction based on MLP

The neural network extracts graph features from the local topological structure of points using the feature extraction function \({f}_{e}\). The input is the edge feature set \({L}_{i}\), and the output is the local graph feature vector \({f}_{i}\).

$${f}_{i}={f}_{e}\left(G\left({V}_{i},{E}_{i}\right)\right)=MixPooling\left\{h\left({l}_{{i}_{0}}\right),h\left({l}_{{i}_{1}}\right),\ldots ,h\left({l}_{{i}_{k-1}}\right)\right\}$$
(3)
$$h\left({l}_{{i}_{j}}\right)=h\left({p}_{{i}_{0}},{p}_{{i}_{j}},{e}_{{i}_{j}},{{e}_{{i}_{j}}}^{T}{e}_{{i}_{j}}\right)$$
(4)

In Eq. (3), \(MixPooling\left\{\right\}\) is the mixed pooling function and \(h\left(\right)\) is the function that extracts the hidden-layer feature vector \(h\left({l}_{{i}_{j}}\right)\) from the edge features using a parameter-shared MLP.

$${h}_{{c}^{\prime}}\left({p}_{{i}_{0}},{p}_{{i}_{j}},{e}_{{i}_{j}},{{e}_{{i}_{j}}}^{T}{e}_{{i}_{j}}\right)=\sum\limits_{c=1}^{C}\left({w}_{{c}^{\prime}c}{p}_{{i}_{0}c}+{w}_{{c}^{\prime}\left(c+C\right)}{p}_{{i}_{j}c}+{w}_{{c}^{\prime}\left(c+2C\right)}{e}_{{i}_{j}c}\right)+{w}_{{c}^{\prime}}{{e}_{{i}_{j}}}^{T}{e}_{{i}_{j}}+{b}_{{c}^{\prime}}$$
(5)
$$h\left({p}_{{i}_{0}},{p}_{{i}_{j}},{e}_{{i}_{j}},{{e}_{{i}_{j}}}^{T}{e}_{{i}_{j}}\right)=\left({h}_{1}\left({p}_{{i}_{0}},{p}_{{i}_{j}},{e}_{{i}_{j}},{{e}_{{i}_{j}}}^{T}{e}_{{i}_{j}}\right),\ldots ,{h}_{{C}^{\prime}}\left({p}_{{i}_{0}},{p}_{{i}_{j}},{e}_{{i}_{j}},{{e}_{{i}_{j}}}^{T}{e}_{{i}_{j}}\right)\right)$$
(6)

In Eq. (5), \({p}_{{i}_{0}c}\), \({p}_{{i}_{j}c}\) and \({e}_{{i}_{j}c}\) denote the values on the \(c\)-th channel of the feature vectors of the \(i\)-th center point, its \(j\)-th neighboring point, and the \(j\)-th edge, respectively; \(C\) is the number of channels of the center point \({p}_{{i}_{0}}\), that is, \(C={n}_{{p}_{i}}\); \({C}^{\prime}\) is the number of neurons in the MLP layer, and \({c}^{\prime}\) is the index of the neuron; \({w}_{{c}^{\prime}c}\), \({w}_{{c}^{\prime}\left(c+C\right)}\), \({w}_{{c}^{\prime}\left(c+2C\right)}\), \({w}_{{c}^{\prime}}\) and \({b}_{{c}^{\prime}}\) are the trainable parameters of the MLP.
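Written in matrix form, Eq. (5) is a single linear layer applied with the same weights to every edge. The sketch below assumes the weights are stored as one matrix `W` of shape \(({C}^{\prime}, 3C+1)\) whose last column plays the role of \({w}_{{c}^{\prime}}\) (the weight on the squared distance); this layout, the sizes, and the random values are illustrative. No activation is applied because Eq. (5) does not show one.

```python
import numpy as np


def shared_linear(edge_feats, W, b):
    """Apply Eq. (5) to every edge vector: (..., 3C + 1) -> (..., C')."""
    return edge_feats @ W.T + b


C, C_out, k = 3, 64, 20                       # hypothetical sizes: C channels, C' neurons
rng = np.random.default_rng(0)
W = rng.standard_normal((C_out, 3 * C + 1))   # columns: [p_i0 | p_ij | e_ij | e^T e]
b = rng.standard_normal(C_out)
edges = rng.standard_normal((k, 3 * C + 1))   # stand-in for the k edge vectors l_ij
H = shared_linear(edges, W, b)
print(H.shape)  # (20, 64): one C'-dimensional vector h(l_ij) per edge
```

Because `W` and `b` do not depend on the edge index, the same call handles a whole batch of local graphs by broadcasting over the leading dimensions.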

Mix pooling

For a local graph structure \({G}_{i}\), each vector in the graph features \({F}_{i}\) of \({G}_{i}\) is located in the feature space with the \({a}_{n}\) dimension as is seen in Eq. (7).

$${F}_{i}=\left({f}_{0},{f}_{1},\ldots ,{f}_{k-1}\right) \subseteq {R}^{{a}_{n}}$$
(7)

The MLP with \({a}_{n}\) neurons in one layer (without bias constant and activation function) is used to adjust the graph features.

$$h\left({F}_{i}\right)={F}_{i}{W}_{adj}=\left[\begin{array}{ccc}{f}_{{0}_{1}}& \cdots & {f}_{{0}_{{a}_{n}}}\\ \vdots & \ddots & \vdots \\ {f}_{{(k-1)}_{1}}& \cdots & {f}_{{(k-1)}_{{a}_{n}}}\end{array}\right]\left[\begin{array}{ccc}{w}_{11}& \cdots & {w}_{1{a}_{n}}\\ \vdots & \ddots & \vdots \\ {w}_{{a}_{n}1}& \cdots & {w}_{{a}_{n}{a}_{n}}\end{array}\right]$$
(8)

In Eq. (8), \({F}_{i}\) represents the local graph features, \({W}_{adj}\) represents the weight matrix of the MLP, \({f}_{{i}_{j}}\) represents the \(j\)-th element of the \(i\)-th feature vector of the local graph features, and \({w}_{ij}\) represents a learnable weight. Since the single-layer MLP has \({a}_{n}\) neurons, \({W}_{adj}\) is an \({a}_{n}\times {a}_{n}\) matrix, which is shared among the different local graphs \({G}_{i}\). The activation function \(Softmax\left(\right)\) then converts the adjusted graph features into the attention weight coefficients \({W}_{att}\), which reflect the contribution of each element of each feature vector. Finally, the graph features \({F}_{i}\) are multiplied element-wise by the attention coefficients \({W}_{att}\) to generate the optimized graph features \({F}_{i}^{\prime}\).

$${W}_{att}=Softmax\left(h\left({F}_{i}\right)\right)=Softmax\left({F}_{i}{W}_{adj}\right)$$
(9)
$${F}_{i}^{\prime}={F}_{i}*{W}_{att}={F}_{i}* Softmax\left({F}_{i}{W}_{adj}\right)$$
(10)

In Eq. (10), \({F}_{i}^{\prime}\) is the graph feature after optimization, \(*\) is the Hadamard product, and \(Softmax\left(\right)\) is the activation function. The neural network can dynamically generate the attention coefficients \({W}_{att}\) for different \({F}_{i}\) and dynamically optimize the local graph features to obtain \({F}_{i}^{\prime}\).
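The feature adjustment of Eqs. (8) to (10) can be sketched in a few lines of NumPy. The text does not say along which axis the softmax is normalized; the sketch assumes normalization across the \(k\) neighbors (axis 0) and exposes the axis as a parameter so the other reading can be tried. The helper names are illustrative.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def adjust_graph_features(F, W_adj, axis=0):
    """Eqs. (8)-(10): F' = F * Softmax(F W_adj), with * the Hadamard product.

    F:     (k, a_n) graph features of one local graph
    W_adj: (a_n, a_n) weights shared by all local graphs (no bias, no activation)
    axis:  ASSUMPTION, softmax is taken across the k neighbors by default
    """
    W_att = softmax(F @ W_adj, axis=axis)     # Eq. (9), same shape as F
    return F * W_att, W_att                   # Eq. (10)


rng = np.random.default_rng(0)
k, a_n = 20, 8                                # hypothetical sizes
F = rng.standard_normal((k, a_n))
W_adj = rng.standard_normal((a_n, a_n))
F_prime, W_att = adjust_graph_features(F, W_adj)
print(F_prime.shape)  # (20, 8)
```

With an all-zero `W_adj` the attention is uniform and every feature is simply scaled by 1/k, which is a convenient sanity check for an implementation.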

Mix-pooling is composed of four pooling methods: max pooling, mean pooling, sum pooling, and adaptive pooling. It is a local structure information fusion strategy based on a dynamic feature adjustment mechanism. The strategy mainly consists of three parts:

  1. The construction of max pooling, mean pooling and sum pooling

The Max pooling vector, mean pooling vector, and Sum pooling vector are further extracted along each axis direction of all feature vectors.

$$\left\{\begin{array}{c}{f}_{max}^{\prime}=\mathit{max}\left({F}_{i}^{\prime}\right)=max\left\{{f}_{0}^{\prime},{f}_{1}^{\prime},\ldots ,{f}_{k-1}^{\prime}\right\}\subseteq {R}^{{a}_{n}}\\ {f}_{mean}^{\prime}=mean\left({F}_{i}^{\prime}\right)=mean\left\{{f}_{0}^{\prime},{f}_{1}^{\prime},\ldots ,{f}_{k-1}^{\prime}\right\}\subseteq {R}^{{a}_{n}}\\ {f}_{sum}^{\prime}=sum\left({F}_{i}^{\prime}\right)=sum\left\{{f}_{0}^{\prime},{f}_{1}^{\prime},\ldots ,{f}_{k-1}^{\prime}\right\}\subseteq {R}^{{a}_{n}}\end{array}\right.$$
(11)

In Eq. (11), \({f}_{max}^{\prime}\), \({f}_{mean}^{\prime}\) and \({f}_{sum}^{\prime}\) represent the max pooling vector, mean pooling vector and sum pooling vector, respectively. \({f}_{max}^{\prime}\) describes the main features of the graph structure, \({f}_{mean}^{\prime}\) reflects the internal properties of the graph structure, and \({f}_{sum}^{\prime}\) aggregates the feature vectors to mitigate the loss of important features caused by random sampling, so that it can effectively handle several topological structures of the same semantic point.

  2. The construction of adaptive pooling vector

Although the three pooling methods above extract graph features with different characteristics, the max pooling, mean pooling and sum pooling vectors are all computed by fixed rules. To abstract feature vectors that represent the local features of a point, an adaptive pooling vector is proposed. A 2D convolution kernel of size \(1\times k\) is applied to the local graph to reconstruct the feature vectors of points and adaptively extract a pooling vector directly from the graph features. Finally, the feature adjustment mechanism is applied to this vector to obtain the adaptive pooling vector \({f}_{adapt}^{\prime}\subseteq {R}^{{a}_{n}}\) of the point.

  3. The concatenation and fusion of the pooling vectors

The pooling vectors \({f}_{max}^{\prime}\), \({f}_{mean}^{\prime}\), \({f}_{sum}^{\prime}\) and \({f}_{adapt}^{\prime}\) are concatenated into \({f}_{concat}^{\prime}\) along the feature dimension, as shown in Eq. (12). A single-layer MLP with \({a}_{n}\) neurons then maps \({f}_{concat}^{\prime}\) back to the \({R}^{{a}_{n}}\) space, yielding the local feature vector \({f}_{Edgeconv}\) of the point extracted by the edge convolution operator, as shown in Eq. (13).

$${f}_{concat}^{\prime}=\left({f}_{max}^{\prime},{f}_{mean}^{\prime},{f}_{sum}^{\prime},{f}_{adapt}^{\prime}\right)\subseteq {R}^{4{a}_{n}}$$
(12)
$${f}_{Edgeconv}=h\left({f}_{concat}^{\prime}\right)\subseteq {R}^{{a}_{n}}$$
(13)
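Eqs. (11) to (13) can be sketched for a single local graph as follows. The sketch makes three simplifying assumptions that are not confirmed by the text: the \(1\times k\) convolution is modeled as one learned weight per neighbor shared across all channels, the extra feature adjustment applied to the adaptive vector is omitted, and the fusion MLP is a plain matrix product without bias or activation. The function and argument names are illustrative.

```python
import numpy as np


def mix_pooling(F_adj, conv_w, W_fuse):
    """Fuse four pooled vectors of one local graph into a single a_n-dim feature.

    F_adj:  (k, a_n) adjusted graph features F'
    conv_w: (k,) weights of the 1 x k kernel (ASSUMPTION: shared across channels)
    W_fuse: (4 * a_n, a_n) weights of the single-layer fusion MLP
    """
    f_max = F_adj.max(axis=0)            # Eq. (11), max pooling over neighbors
    f_mean = F_adj.mean(axis=0)          # Eq. (11), mean pooling
    f_sum = F_adj.sum(axis=0)            # Eq. (11), sum pooling
    f_adapt = conv_w @ F_adj             # learned weighted sum over the k neighbors
    f_concat = np.concatenate([f_max, f_mean, f_sum, f_adapt])   # Eq. (12)
    return f_concat @ W_fuse             # Eq. (13), back to R^{a_n}


rng = np.random.default_rng(0)
k, a_n = 20, 8                           # hypothetical sizes
F_adj = rng.standard_normal((k, a_n))
conv_w = rng.standard_normal(k)
W_fuse = rng.standard_normal((4 * a_n, a_n))
f_edgeconv = mix_pooling(F_adj, conv_w, W_fuse)
print(f_edgeconv.shape)  # (8,)
```

Setting `conv_w` to a constant 1/k reproduces mean pooling, which shows how the fixed-rule poolings are special cases that the learned kernel can move away from during training.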

Experiments and results

Experimental data

Qutan Monastery is located at Mabugoukou, 21 km south of Ledu County, Haidong City, Qinghai Province, as shown in Fig. 2. It was first built in 1392, with a construction area of 27,000 hectare, and has a history of more than 600 years. It preserves the most intact group of Ming-style buildings in northwestern China.

Fig. 2 The location and overview of Qutan Monastery

To obtain the 3D point cloud of Qutan Monastery, a FARO X330 and DJI drones were used to collect point clouds and images. A total of 1620 UAV images were collected with a DJI Phantom 4 equipped with an FC6310R camera with a 13.2 × 8.8 mm² sensor and a 2.41 µm pixel size. The flight path surrounded the building, as shown in Fig. 3. The distance from the exposure points to the building varied from 20 to 85 m. Given the 8.8 mm focal length and the photographic distance, the ground sampling distance (GSD) for all cameras ranged from 0.5 cm to 2 cm. The DIM point cloud was generated with the commercial software package Bentley, and its density was 59,737 points/m². Finally, the scene of the BaoGuang Hall was cut out for the convenience of subsequent annotation work. The integrated complete point cloud is shown in Fig. 3.
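The reported GSD range can be checked with the standard pinhole relation GSD = pixel size × distance / focal length. The short sketch below uses only the values quoted above; the function name is illustrative, and the relation assumes a view perpendicular to the photographed surface.

```python
def ground_sampling_distance(pixel_size_m, focal_length_m, distance_m):
    """Footprint of one pixel on the object, in metres (pinhole camera model)."""
    return pixel_size_m * distance_m / focal_length_m


PIXEL_SIZE = 2.41e-6     # 2.41 micrometres
FOCAL_LENGTH = 8.8e-3    # 8.8 mm

gsd_near = ground_sampling_distance(PIXEL_SIZE, FOCAL_LENGTH, 20.0)
gsd_far = ground_sampling_distance(PIXEL_SIZE, FOCAL_LENGTH, 85.0)
print(f"{gsd_near * 100:.2f} cm to {gsd_far * 100:.2f} cm")  # 0.55 cm to 2.33 cm
```

The computed values of roughly 0.55 cm at 20 m and 2.33 cm at 85 m agree with the rounded range of 0.5 cm to 2 cm given in the text.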

Fig. 3 A fully structured color point cloud of the BaoGuang building

The collected internal and external data of the BaoGuang Hall (Zhongdian) and its surrounding scenes.

Evaluation criteria

Neural networks find it challenging to effectively learn all structural features when training models with only a small portion of point clouds and inferring the rest. Considering the scarcity of benchmark datasets for semantic segmentation of Chinese ancient architecture, this paper adopts a two-fold cross-validation method as the evaluation standard for experiments.

The twofold Cross Validation method calculates the average overall accuracy (OA), mean intersection over union (mIOU) and mean accuracy (mAcc) to evaluate the segmentation performance of different network models on the ancient building dataset. Initially, the left part of the point cloud is used as training data, and the right part is used as testing data to compute the OA, mIOU and mAcc of the network model on the right part of the point cloud. Subsequently, the network model is retrained using the right part of the point cloud, and the OA, mIOU and mAcc on the left part of the point cloud are calculated. Finally, the average of the two is taken as the final result.
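The three metrics and the twofold averaging can be computed from per-fold confusion matrices as sketched below. OA, mAcc (mean per-class recall) and mIOU follow their standard definitions; the two small confusion matrices are made-up numbers for illustration only and are unrelated to the results of this paper.

```python
import numpy as np


def segmentation_metrics(conf):
    """Return (OA, mIOU, mAcc) from a confusion matrix.

    conf[i, j] is the number of points of true class i predicted as class j.
    Classes absent from both truth and prediction are ignored in the means.
    """
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    n_true = conf.sum(axis=1)
    n_pred = conf.sum(axis=0)
    oa = tp.sum() / conf.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        acc = tp / n_true                       # per-class accuracy (recall)
        iou = tp / (n_true + n_pred - tp)       # per-class intersection over union
    return oa, np.nanmean(iou), np.nanmean(acc)


def twofold_average(fold_a, fold_b):
    """Average the metric tuples of the two folds (train left/test right, then swapped)."""
    return tuple((a + b) / 2.0 for a, b in zip(fold_a, fold_b))


# Hypothetical two-class confusion matrices for the two folds.
conf_right = np.array([[3, 1], [1, 5]])   # trained on left half, tested on right
conf_left = np.array([[4, 0], [2, 4]])    # trained on right half, tested on left
oa, miou, macc = twofold_average(segmentation_metrics(conf_right),
                                 segmentation_metrics(conf_left))
print(f"OA={oa:.3f} mIOU={miou:.3f} mAcc={macc:.3f}")
```

Averaging the two folds gives each half of the building equal weight, regardless of how many points each half contains.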

Training process

Training data

The Qutan Temple dataset comprises 48,788,720 labeled points, and this article divides the ancient building point cloud into 12 categories: roof, ceiling, column, bracket, sparrow, doors and windows, wall, plinth, railing, steps, ground, and others. The Baoguang Hall point cloud was labeled manually, and the labeled results are shown in Fig. 4. After labeling, the scene was cut along the central axis into left and right point clouds for the subsequent twofold cross-validation experiments. The left half of the point cloud was used as the training set in the first fold, with the right half serving as the testing set. For the second fold, these roles were reversed, with the right half used for training and the left half for testing. This division is intended to give both subsets a consistent distribution of architectural features.

Fig. 4 Experimental data display

The number of points within each component is shown in Fig. 5. The roof category contains the most points, approximately 9 million, while categories such as sparrow and steps contain only 10,000 points. To balance the spatial distribution of the data, the training data is resampled at 0.01 m intervals to ensure a uniform distribution of points. After sampling, the left part of the point cloud contains 26,531,655 points, while the right part contains 22,257,065 points.

Fig. 5 The quantity distribution statistics of the point cloud dataset
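One common way to resample a point cloud at a fixed spatial interval is voxel-grid subsampling, which keeps one point per occupied cell. The text does not state which resampling method was used, so the sketch below is only an assumed illustration with a 0.01 m cell size; the function name and the toy points are hypothetical.

```python
import numpy as np


def voxel_downsample(points, voxel_size=0.01):
    """Keep the first point that falls into each voxel of edge length voxel_size.

    points: (N, D) array whose first three columns are x, y, z in metres;
            any extra columns (colour, label) are carried along unchanged.
    """
    keys = np.floor(points[:, :3] / voxel_size).astype(np.int64)   # voxel index per point
    _, first = np.unique(keys, axis=0, return_index=True)          # one index per voxel
    return points[np.sort(first)]                                  # preserve input order


cloud = np.array([
    [0.000, 0.000, 0.000],
    [0.001, 0.002, 0.003],    # same 1 cm voxel as the first point
    [0.505, 0.505, 0.505],    # a different voxel
])
sampled = voxel_downsample(cloud, voxel_size=0.01)
print(len(sampled))  # 2
```

Subsampling of this kind evens out the point density between surfaces that were scanned from different distances, which is the stated purpose of the resampling step.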

Experiment environment and parameter setting

This experiment was conducted on an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20 GHz and an NVIDIA GeForce RTX 2080 Ti with 11 GB of VRAM, using the PyCharm IDE, Python 3.7, and TensorFlow 2.3.

During the experimental training phase, the number of iterations was set as 150, the initial learning rate was set as 0.001, the learning decay rate was 0.5, and the Batch Size parameter was set to 4. The k-nearest neighbor parameter for MP-DGCNN was set to 20.

Experimental results and analysis

Experimental results

The point cloud semantic segmentation experiments were conducted on the Baoguang Hall dataset using PointNet, PointNet++, DGCNN, LDGCNN, GACNet and MP-DGCNN. The segmentation accuracy of the different network models under twofold cross-validation is shown in Table 1.

Table 1 Segmentation results produced by different DNN model

Table 1 shows that the graph-based DL methods, including DGCNN, LDGCNN, and MP-DGCNN, achieved better segmentation accuracy than PointNet and PointNet++. This is because the local structures of points are learned through edge convolution, which captures the topological relationships between points and better handles the unordered nature of point clouds. Among the graph-based methods, the MP-DGCNN designed in this paper achieved an overall accuracy (OA) of 90.19%, a mean intersection over union (mIOU) of 65.34% and a mean accuracy (mAcc) of 79.41% for point cloud semantic segmentation on the Baoguang Hall (Baoguangdian) dataset. Compared with DGCNN, the improved MP-DGCNN increased OA by 3.91 percentage points, mIOU by 8.38 percentage points and mAcc by 16.58 percentage points.

Figures 6 and 7 show the OA and IoU of the different components for the different networks. MP-DGCNN achieved good results: the segmentation accuracy for pedestals, floors, walls, ceilings, roofs, fences, brackets, and columns reached about 70% or more. Except for windows and doors, the segmentation IoU of all components also increased.

  a. MP-DGCNN achieved segmentation accuracies of 91.34%, 62.15%, and 98.94% for the foundation, treads, and ground (with surface features), respectively. These accuracies were 28.07%, 13.34%, and 1.24% higher than those achieved by DGCNN. This indicates that, compared with DGCNN, MP-DGCNN is better at capturing subtle differences between structures and can learn fine-grained features of point cloud structures.

  b. Because the edge features were strengthened by introducing the distance and the coordinates of neighboring points, and because the mixed pooling strategy retains more useful features and alleviates information loss, the segmentation accuracy of fence, bracket, and column also increased by 12.19%, 13.58%, and 11.79% compared with DGCNN.

  c. PointNet and PointNet++ could hardly separate sparrows from the large-scale point cloud scene, and the accuracy for sparrows with DGCNN reached only 3.16%. Compared with DGCNN, MP-DGCNN significantly improved the accuracy of sparrow segmentation, by 21.50%, but the accuracy is still not high.

Fig. 6 OA with different DNN models

Fig. 7 IoU with different DNN models

Based on the above analysis, the experimental results show that the proposed MP-DGCNN is more effective and has stronger generalization ability than the other networks on this dataset. Figure 8 shows the semantic segmentation results of MP-DGCNN.

Fig. 8 Point cloud semantic segmentation of the Baoguang building using MP-DGCNN

Experimental results analysis

  (1) The impact of edge features and the mix pooling function

To analyze the contribution of the designed edge features and the mixed pooling function to the semantic segmentation results, several comparative experiments were conducted. DGCNN was used in the first experiment; the second, third and fourth experiments used the designed edge features with max pooling, mean pooling and sum pooling, respectively; MP-DGCNN was used in the fifth experiment.

In Table 2, the symbol "√" indicates that the corresponding method is used and "–" indicates that it is not. As shown in Table 2, the OAs of the five experimental groups are 86.28%, 89.39%, 87.02%, 87.67%, and 90.19%, and the mIOUs are 56.96%, 63.08%, 58.39%, 60.11%, and 65.34%, respectively. Specifically,

  (a) The first and second experiments used the same pooling function (max pooling). Both the OA and the mIOU of the second experiment were higher than those of the first experiment. This shows that the designed edge feature, which contains the distance information of graph nodes and the neighborhood information, is more comprehensive and better characterizes the topological relationships between points.

  (b) Among the second, third and fourth experiments, the max pooling method in the second experiment obtained the best performance. These results are consistent with conclusions reported in the literature.

  (c) Because the mix pooling function was applied in MP-DGCNN, the OA and mIOU of the fifth experiment increased by 0.80 and 2.26 percentage points compared with the second experiment, in which the designed edge feature was also used. This further confirms that the proposed mixed pooling module alleviates, to a certain extent, the information loss caused by max pooling. The reason is that the mixed pooling function optimizes the distribution of graph features through self-learned pooling rules and retains more feature information.

Table 2 Segmentation influence with different selection of pooling function

  (2) Limitations

Although MP-DGCNN obtains better performance, it still has some limitations, as shown in Fig. 9. Some points belonging to the ground are misclassified as pedestals. These misclassified points form regular squares. This is because the fragmented ground produced by dividing the scene into blocks has a shape similar to that of a pedestal; the two differ mainly in absolute position. Similarly, the incense burner on the ground, which belongs to the "others" category, has a stepped shape and was misclassified as a pedestal. Moreover, the components connected to the windows and doors are also misclassified because their position, shape, and color are similar to those of the windows and doors. As a result, the segmentation boundaries were not clear and these points were misclassified as walls.

Fig. 9 Some over-segmentation using the MP-DGCNN model

It can be seen that the MP-DGCNN semantic segmentation network can learn point cloud features from data. However, when points of different categories are similar in position, shape, and color, misclassification still occurs. On the one hand, the proposed method cannot capture all the features. On the other hand, besides attributes such as geometric shape and color, other knowledge (such as the connections between components) should be considered when recognizing the components of ancient architecture. Therefore, there is still considerable room for research on semantic segmentation networks for ancient architecture.

Robustness test

To evaluate the robustness of the MP-DGCNN, we conducted experiments on the Qutan Temple dataset with varying point cloud densities. Specifically, we modified the density of the point clouds to 25%, 50%, and 75% of their original density to test how well MP-DGCNN maintains its performance under conditions of data sparsity. For each density level, the point clouds were randomly sub-sampled to the specified percentage, ensuring a representative but reduced set of data points. The model was then tested on these modified datasets to measure any impacts on the Overall Accuracy (OA) and mean Intersection over Union (mIOU). The results are shown in Table 3. The result shows a minimal decrease in performance metrics with a reduction in point cloud density, demonstrating the model’s robustness to variations in input data quality. MP-DGCNN achieves high robustness and generalization across varying point cloud densities through advanced feature extraction and pooling strategies. These methods effectively capture and maintain crucial geometric and contextual information, enabling the model to deliver consistent and accurate segmentation results even under diverse data conditions.
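The density reduction described above can be reproduced with uniform random subsampling without replacement, as sketched below. The fixed seed, the function name, and the synthetic points and labels are illustrative assumptions; the text only states that the clouds were randomly sub-sampled to 25%, 50%, and 75% of their original density.

```python
import numpy as np


def random_subsample(points, labels, fraction, seed=0):
    """Keep a random `fraction` of the points together with their labels."""
    rng = np.random.default_rng(seed)
    n_keep = int(round(fraction * len(points)))
    idx = rng.choice(len(points), size=n_keep, replace=False)   # no duplicates
    return points[idx], labels[idx]


rng = np.random.default_rng(1)
points = rng.random((1000, 3))                 # synthetic stand-in for a test cloud
labels = rng.integers(0, 12, size=1000)        # 12 categories, as in the dataset

for fraction in (0.25, 0.50, 0.75):
    sub_pts, sub_lbl = random_subsample(points, labels, fraction)
    print(fraction, sub_pts.shape[0])
```

Sampling points and labels with the same index array keeps every retained point paired with its ground-truth label, so OA and mIOU can be computed on the reduced cloud directly.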

Table 3 Segmentation results at different point cloud densities

Conclusion

The semantic segmentation of ancient building point clouds serves as the cornerstone for the 3D reconstruction of ancient structures. In this study, we introduced MP-DGCNN, a deep learning-based approach tailored for the semantic segmentation of ancient building point clouds. This method effectively labels the semantics of ancient architectural elements from point cloud data with robustness and high precision. By incorporating modified edge features and a hybrid pooling strategy into the original DGCNN framework, MP-DGCNN demonstrates improved capability in capturing nuanced features. The performance of MP-DGCNN was rigorously evaluated using the Qutan Temple dataset, yielding an overall accuracy of 90.19%, a mean Intersection over Union (mIOU) of 65.34% and a mean accuracy (mAcc) of 79.41%. Furthermore, comparative analysis with other point cloud segmentation algorithms such as PointNet++, PointNet, DGCNN, LDGCNN and GACNet revealed that MP-DGCNN consistently outperforms these methods in terms of overall accuracy and mIOU.

While our preliminary research demonstrates the enhanced accuracy of ancient building semantic segmentation achieved by MP-DGCNN, future research endeavors will focus on several fronts.

Firstly, we intend to establish a benchmark for typical components of ancient architecture to build a comprehensive and robust data foundation. This is crucial for enabling the DNNs to accurately learn the intricate details of ancient architecture. It also standardizes testing conditions, allowing researchers to conduct controlled, comparable experiments. Additionally, we aim to integrate targeted attention mechanisms to address the issue of incorrect semantic segmentation often caused by similar shapes. This approach will focus on refining our feature extraction methods to more precisely and efficiently capture the complex features of ancient wooden buildings. This will expand the application scope of DNNs in cultural heritage scenarios. Finally, we aim to investigate the possibility of incorporating prior knowledge of ancient architecture into semantic segmentation tasks. By incorporating this specialized knowledge, the DNNs can form more accurate and contextually appropriate segmentations, enhancing the reliability and precision of the models in analyzing and interpreting architectural elements from historical periods.