Introduction

As an important heritage of human civilization, ancient architecture carries rich historical and cultural information. It differs fundamentally in structure from modern architecture, being assembled in a specific order from thousands of wooden components such as columns, beams, rafters, and tiles [1]. Because of this structural complexity, two-dimensional images suffer from single viewing angles, occlusions, and lighting issues, and cannot adequately display concave-convex surfaces, three-dimensional structures, or decorative details. In contrast, three-dimensional models represent the shape, structure, and construction methods of ancient architecture more intuitively, supporting digital twin and graphic space interaction applications [2]. With the continuing development of three-dimensional laser scanning technology, its application in the conservation of ancient architectural artifacts is growing. At the same time, advances in deep learning for point cloud semantic segmentation have moved the field beyond traditional methods, which rely on manually designed features and rules: such methods struggle to adapt to the complex and variable structures of ancient architecture and are inefficient on large-scale point cloud data, while machine learning methods based on hand-crafted features depend too heavily on feature descriptors, do not scale to large and complex scenes, and generalize poorly [3]. End-to-end deep learning methods can assign a semantic label to every point in a scene while balancing accuracy and complexity, offering new possibilities for architectural cultural heritage point cloud segmentation. However, the large volume of point cloud data and its irregular, unstructured, and unordered nature make it difficult to quickly learn discriminative features of large-scale point cloud objects and to segment them accurately.

In recent years, many neural network-based methods for semantic segmentation of 3D point clouds have been proposed, falling mainly into three types: projection-based, voxel-based, and point-based methods. When processing large-scale point clouds, projection- and voxel-based methods not only increase computational overhead but also require additional operations, such as converting point clouds into other representations and reprojecting intermediate segmentation results back onto the points. Point-based methods, by contrast, process point cloud data directly and end-to-end, which makes them especially suitable for ancient architecture with complex geometric structures and shapes: they handle irregular shapes and varying resolutions flexibly while preserving the original information, such as location, color, and normals. Earlier point-based methods, however, have high computational and memory requirements on large-scale data such as ancient architecture and are unsuitable for processing such scenes in real time. Recently, RandLA-Net [4], a large-scale point cloud semantic segmentation method known for its efficient downsampling, has made it feasible to process point clouds of large scenes such as ancient architecture. However, architectural point cloud scenes have complex geometric structures and diverse materials, textures, shapes, and sizes. In traditional Chinese architecture in particular, each scene contains a large, dense point cloud; wood is typically used for doors, windows, and columns, while stone is used for footings and steps, and efficiently distinguishing such similar structures across these categories is challenging. Therefore, while reducing computational and memory costs, a method must also avoid losing important topological and semantic information in complex geometric structures and must learn discriminative features for these challenging scenes. Inspired by the successful use of attention mechanisms [5,6,7] and contextual information [8, 9] in many semantic segmentation tasks, we pose the following three questions and propose solutions:

1. How can we efficiently learn highly discriminative local feature aggregation from large-scale ancient architecture point cloud data?

2. How can we accurately understand the overall shape of, and the long-distance dependencies between, different categories of architectural cultural heritage by learning global contextual semantic information?

3. How can we ensure accurate semantic segmentation on large-scale point cloud data of different building types, structures, and complexities?

In response to these three issues, this article proposes a large-scale point cloud semantic segmentation network for ancient architecture, consisting of a symmetric encoder-decoder structure with skip connections. To effectively distinguish geometrically similar categories and comprehensively capture category characteristics, we designed an enhanced dual attention pooling module and a global contextual semantic feature module. The former focuses on similarity of geometry and appearance and is applied in each block of the encoder stage to perceive the topological and semantic differences between similar points. The latter learns the global context of each 3D point by utilizing neighborhood position and volume ratio, thereby capturing the spatial layout and interrelationships of the entire building scene. The proposed DSC-Net can be integrated into various network architectures to handle point cloud semantic segmentation tasks. Our main contributions are as follows:

1. We conducted comprehensive experiments and evaluations on a fully supervised task using our self-built ancient architecture dataset, the architectural cultural heritage dataset ArCH [10], and the public dataset S3DIS [11]. The semantic segmentation results demonstrate the robustness and superiority of our method across architectural scenes of different styles. In particular, our method comprehensively accounts for categories with complex, similar geometric structures but different appearances in various ancient architecture scenes, providing strong support for the digital analysis and protection of architectural cultural heritage.

2. We developed an enhanced dual attention pooling (EDAP) module that captures more complex and refined local feature information and distinguishes the geometric and appearance similarity of adjacent points. This module can be inserted into the feature aggregation of the encoder-decoder stages to explore new point cloud segmentation networks.

3. We introduced the Global Context Feature (GCF) module, which analyzes the global context of each 3D point in the point cloud. By integrating global context, this module focuses on learning global information from 3D points, handling large-scale spatial changes and complex ancient architecture complexes more effectively and thereby improving segmentation performance.

Related work

In this section, we will thoroughly review point cloud semantic segmentation methods based on deep learning, which can generally be categorized into three types: projection-based methods, voxel-based methods, and point-based methods.

Projection-based methods

Inspired by 2D convolutional neural networks, existing work [12] projects point clouds onto a two-dimensional plane, applies traditional 2D image segmentation algorithms, and then maps the segmentation results back to three-dimensional space to obtain a semantic segmentation of the point cloud. Along these lines, Tatarchenko et al. [13] project the local surface geometry surrounding each point onto a tangent plane, creating tangent images that can be processed with 2D convolutions. Although these multi-view methods can leverage mature 2D image processing technology, the projection step inevitably loses detail and fails to fully exploit the underlying geometric and structural information, and remapping the 2D segmentation results back to three-dimensional space incurs significant computational overhead.

Voxel-based methods

Voxel-based point cloud semantic segmentation methods primarily involve converting three-dimensional point cloud data into a voxel format. Specifically, the point cloud is organized into a 3D grid structure of small cubic units, and deep learning techniques are used to predict semantic labels for each voxel, achieving semantic segmentation of the overall point cloud. Given the sparsity of point cloud data and its substantial resource consumption, researchers have proposed various sparse convolution techniques to reduce computational costs [14,15]. Furthermore, to enhance processing performance, researchers have also introduced technologies such as octrees and hash maps [16] to improve efficiency. Due to the high computational demand and significant memory consumption involved in 3D convolutions, research based on this technology has declined in recent years.
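To make this representation concrete, the following minimal Python sketch shows the quantization step such pipelines share: assigning points to a cubic grid and collecting the occupied cells. The function name and voxel size are illustrative assumptions, not taken from any cited work.

```python
import numpy as np

def voxelize(points, voxel_size=0.05):
    """Assign each 3D point to a cubic voxel cell (illustrative sketch only)."""
    origin = points.min(axis=0)                      # shift cloud to the origin
    idx = np.floor((points - origin) / voxel_size).astype(np.int64)
    # Collapse duplicate cells into the sparse set of occupied voxels; `inverse`
    # maps every point back to its voxel, which is how per-voxel semantic
    # labels would be propagated to the points after prediction.
    occupied, inverse = np.unique(idx, axis=0, return_inverse=True)
    return occupied, inverse

points = np.random.rand(10000, 3)                    # toy cloud in a unit cube
occupied, point_to_voxel = voxelize(points)
print(occupied.shape, point_to_voxel.shape)          # (num_voxels, 3) (10000,)
```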

Point-based methods

Point-based methods operate directly on point clouds end-to-end and can be divided into the following categories: point convolution methods, multi-layer perceptron (MLP) methods, graph-based methods, RNN-based methods, and attention mechanism-based methods.

Point convolution-based methods

Inspired by the success of convolutional operators in the two-dimensional image domain, several studies have proposed convolutional methods for three-dimensional point clouds [17,18,19,20]. These methods mainly extend the traditional concept of image convolution to unordered point cloud data, effectively processing point cloud data through local neighborhood modeling and feature extraction.

Multi-layer perceptron (MLP) methods

The per-point MLP approach typically employs shared MLPs as its basic unit. The pioneering PointNet [21] handles the unordered nature of point clouds with a symmetric function: it extracts per-point features with MLPs and then aggregates global features across dimensions with a max pooling operation. However, PointNet is limited in extracting local features. To address this, PointNet++ [22] introduced a multi-level feature extraction structure that substantially improves the extraction of local and global features, but it consumes excessive computational resources on large-scale point clouds. Hu et al. [4] proposed RandLA-Net, an efficient and lightweight network for semantic segmentation of large-scale point clouds that significantly reduces memory and computational overhead through random point sampling. Building on these findings, researchers have proposed PointNet-based deep learning segmentation workflows for architectural cultural heritage. Bulent H. [23] evaluated PointNet for segmenting point clouds of heritage buildings in Gaziantep, Turkey; analyzing point cloud data from 28 buildings, the study found that PointNet achieves high accuracy on synthetic data, providing a new method for precise classification and segmentation of heritage buildings. To address the low segmentation accuracy caused by the complexity of training scenarios, the authors of [24] selected four object types (arcs, columns, walls, and windows), trained the network on annotated point cloud data from field surveys, and segmented with PointNet++ to assess the impact of training data variability on performance.

Graph-based and RNN-based methods

Graph convolution-based methods extract features by exploiting the topological structure and connectivity of point cloud data. RNN-based methods, by contrast, combine the feature extraction capabilities of CNNs with the sequential processing ability of RNNs to capture spatial and temporal correlations in point cloud data and predict a semantic label for each point. DGCNN introduced an EdgeConv module that generates edge features describing the relationships between a point and its neighbors. RSNet [25] developed a lightweight local dependency module that uses slice pooling layers to transform unordered point cloud features into ordered feature vector sequences. Liu et al. [26] proposed 3DCNN-DQN-RNN, which integrates a three-dimensional convolutional neural network (CNN), a deep Q-network (DQN), and a residual recurrent neural network (RNN). Through an "eye window" mechanism, the method effectively locates and segments points of the target class: the 3D CNN and residual RNN together extract robust, discriminative features within the eye window, improving parsing accuracy, and the pipeline maps raw data to classification results automatically, integrating target localization, segmentation, and classification. However, its computational complexity and overhead cannot be ignored. Christian et al. [27] developed and trained RadDGCNN, an improved DGCNN, on synthetic point cloud data; the model performed well on real TLS point cloud segmentation, although it still struggles to segment similar categories. The authors of [28] developed an improved dynamic graph convolutional neural network that uses edge attention convolution to reinforce the learning of local features; trained on points sampled from 3DMAX models, the network can effectively extract the roof structures of ancient architecture from real point cloud data. Pierdicca et al. [29] proposed an improved version of the dynamic graph convolutional neural network (DGCNN) that integrates key features such as normals and colors, enhancing performance on the newly collected digital cultural heritage dataset ArCH. Matrone et al. [30] compared machine learning and deep learning for large-scale cultural heritage classification, analyzed the advantages and disadvantages of the two, and developed DGCNN-Mod+3Dfeature, a semantic segmentation architecture combining their strengths. However, these methods have not been fully evaluated for diversity and applicability across dataset types, and they can incur excessive computational resource consumption and low efficiency.

Attention mechanism-based methods

Point cloud semantic segmentation methods based on attention mechanisms enhance segmentation performance by dynamically adjusting weights to focus on key information, considering the relationship between each point's local information and the global context. Yang et al. [31] developed a Point Attention Transformer to model interactions between points. The work in [32] introduced a local spatial awareness layer that learns spatial distribution weights to capture local geometric structure. The work in [33] built on the structure of 3D U-Net [34], designing modules for global feature learning and multi-scale feature fusion, and introduced a sparse tensor-based implementation to avoid unnecessary computation and suit the sparsity of 3D point clouds. By refining the weight adjustments between points, these methods significantly enhance the recognition and use of key features, greatly improving segmentation performance.

Methodology

In this section, we introduce a novel semantic segmentation network, DSC-Net, built around two core modules: an enhanced dual attention pooling module (EDAP) and a global context feature module (GCF). EDAP exploits topological and appearance semantic information, integrates multi-level features, and dynamically adjusts feature weights during pooling, effectively improving the network's sensitivity and discriminative power for local details. This design lets the network adaptively enhance key features and suppress unimportant information, segmenting the complex, fine, and similar geometric structures of ancient architecture more accurately and distinguishing adjacent points whose geometry and appearance differ because of materials and weathering. GCF captures and integrates global information in the point cloud: by analyzing the distribution and structural characteristics of the overall cloud, it helps the network grasp the overall semantic context, so the model not only performs well locally but also learns features and parses semantics at the global level, further improving its adaptability and accuracy on complex structures. In addition, ancient architectural complexes span large areas and exhibit significant scale changes; by learning global information, GCF handles large-scale spatial variation effectively, and the model maintains efficient segmentation performance across structures of different scales.

DSC module

We have designed the DSC module to learn discriminative spatial feature aggregation. This section will provide a detailed introduction to the two modules, the Enhanced Dual Attention Pooling Module (EDAP) and the Global Context Feature Module (GCF), and will specifically describe the architecture of the DSC module.

Enhanced dual attention pooling module

This section introduces a local feature aggregation method, the Enhanced Dual Attention Pooling (EDAP) module, designed to differentiate categories that have similar geometric shapes but different appearance structures; a detailed illustration is shown in Fig. 1. The input consists of N points, with three-dimensional coordinates \(p_i \in \mathbb{R}^{N \times 3}\) and corresponding appearance features \(f_i \in \mathbb{R}^{N \times d}\). For each point, we use a K-NN search based on Euclidean distance to gather its neighboring point set \(P_j = \{P_j^1, P_j^2, P_j^3, \ldots, P_j^k\}\) and the corresponding appearance features \(f_j\). The local feature aggregation used to distinguish points with similar properties is defined as follows (Eq. 3-1):

$$F = S\left( {F\left( {\left[ {p_{i} ,p_{j} ,f_{i} ,f_{j} } \right]} \right)} \right)$$
(3-1)
Fig. 1 Schematic diagram of the Enhanced Dual Attention Pooling (EDAP) module

Here, S denotes a symmetric reduction function, and F is our designed feature aggregation function, which includes per-point multilayer perceptrons (MLPs), adaptive weight adjustments, and max pooling operations. The "[ ]" denotes a series of operations on \(p_i\), \(p_j\), \(f_i\), and \(f_j\), including identity, arithmetic operations (such as addition and subtraction), and concatenation. Specifically, this enhanced dual attention local feature aggregation adopts a strategy based on adaptive weights and multi-level feature fusion, which automatically adjusts and optimizes the weights during training to better fit the data characteristics and model objectives. The method fully exploits topological and appearance features, capturing details from coarse to fine levels and enabling the network to dynamically adjust its focus across feature levels, thereby effectively extracting semantic information from structures of different types and complexities. The components of the proposed enhanced dual attention are detailed below; a numerical sketch of the whole module follows the list.

1. Coordinate Position Encoding: Position encoding plays a crucial role in networks based on Transformers and self-attention. In the 2D image domain, for example, the relative positions of 2D coordinates are often used for position encoding to enhance image features [35]. In 3D space, however, the absolute coordinates of points may not suit the extraction of high-level features, as the network tends to focus on the relative positions and centroids of points. For N input points, the coordinate encoding of each point is built from the centroid coordinates, neighboring point coordinates, relative coordinates, and relative distances. These are processed through a shared multilayer perceptron (MLP) to obtain the encoded features P, which have the same dimensions as the features \(f_i\).

$$P = MLP\left( {p_{i} \oplus p_{j} \oplus (p_{i} - p_{j} ) \oplus \left\| {p_{i} ,p_{j} } \right\|} \right)$$
(3-2)

Here \(\| \cdot \|\) denotes the Euclidean distance between two points, and ⊕ denotes the concatenation operation, which doubles the feature dimension. Position encoding significantly enhances the model's ability to recognize the positional information of point clouds. By concatenating the encoded features P with the neighboring appearance features \(f_j\), a more comprehensive feature representation [P⨁fj] is obtained.

2. Multi-level Feature Fusion: We first integrate the encoded topological information with neighboring appearance features to extract high-level semantic information. We also pay particular attention to the interaction between the local centroid features and their adjacent point features, represented as [fi⨁fj]. These features, together with the combination of encoded and neighboring point features [P⨁fj], are processed through a shared multilayer perceptron (MLP). The MLP output is then aggregated by max pooling, as shown in Eq. (3-3), mapping it to a new feature space; this comprehensively extracts the local high-level semantic features FL and further strengthens the semantic expressiveness of the model.

$$FL = {\text{Maxpooling}}\left( {{\text{MLP}}\left( {[f_{i} \oplus f_{j} ] \oplus [P \oplus f_{j} ]} \right)} \right)$$
(3-3)
3. Dual Attention: The processing here has two steps. First, for each point, we balance the topological weights against the computed appearance weights to weight the features [P⨁fj]:

    $$W_{p} = {\text{MLP}}(P)$$
    (3-4)

Equation (3-4) is crucial for understanding the topological features of the neighborhood, providing high-level topological information: after encoding, a shared multilayer perceptron (MLP) learns the topological weights \(W_p\) of local points. Coordinate features alone may not suffice to distinguish objects of the same class, since differences in texture, color, and shape can make their appearance features hard for the network to separate. Considering that objects of the same type usually have similar texture features, we concatenate the centroid features \(f_i\) with the neighboring point features \(f_j\) and use a shared MLP to perceive appearance, computing the local semantic weights \(W_f\):

$$W_{f} = {\text{MLP}}\left( {f_{i} \oplus f_{j} } \right)$$
(3-5)

Next, we merge the obtained local geometric topological weights and appearance weights: both are activated with the ReLU function and combined by addition to obtain the composite weight, with the weight coefficient denoted by η, as shown in Eq. (3-6):

$$W = {\text{ReLU}}(W_{p} ) + \eta \times {\text{ReLU}}(W_{f} )$$
(3-6)
4. Local Feature Aggregation: We compute the fused attention weights, accounting for the importance of different positions and appearance features. Using the softmax activation function, we perform bilinear weighting of the weights W against the enhanced local neighborhood features FL, then apply SUM as the reduction function to aggregate and update the point's features to \(f_{i\_new}\), as shown in Eq. (3-7):

$$\mathop f\nolimits_{i\_new} = {\text{SUM}}\left( {{\text{softmax}}(W) \odot FL} \right)$$
(3-7)
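The following numpy sketch traces Eqs. (3-2) through (3-7) for a single centroid and its k neighbors. It is a minimal illustration under stated assumptions: the shared MLPs are stand-in single linear layers with random parameters, and FL is kept per-neighbor so that the attention-weighted sum of Eq. (3-7) reduces over the k neighbors (the max pooling of Eq. (3-3) is one possible additional readout and is omitted here).

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    # Stand-in for a shared MLP: a single linear layer with random parameters.
    return x @ w + b

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

k, d = 16, 8                         # neighborhood size and feature width
p_i = rng.normal(size=3)             # centroid coordinates
p_j = rng.normal(size=(k, 3))        # k-NN neighbor coordinates
f_i = rng.normal(size=d)             # centroid appearance features
f_j = rng.normal(size=(k, d))        # neighbor appearance features
eta = 0.5                            # weight coefficient eta of Eq. (3-6)

# Hypothetical parameters standing in for the learned shared MLPs.
w_enc, b_enc = rng.normal(size=(10, d)) * 0.1, np.zeros(d)
w_fuse, b_fuse = rng.normal(size=(4 * d, d)) * 0.1, np.zeros(d)
w_wp, b_wp = rng.normal(size=(d, d)) * 0.1, np.zeros(d)
w_wf, b_wf = rng.normal(size=(2 * d, d)) * 0.1, np.zeros(d)

# Eq. (3-2): encode centroid, neighbors, offsets, and Euclidean distances.
rel = p_i - p_j                                       # (k, 3) relative coords
dist = np.linalg.norm(rel, axis=1, keepdims=True)     # (k, 1) distances
P = dense(np.concatenate([np.tile(p_i, (k, 1)), p_j, rel, dist], axis=1),
          w_enc, b_enc)                               # (k, d) encoded topology

# Eq. (3-3): fuse [f_i ⊕ f_j] with [P ⊕ f_j] through a shared MLP.
fi_tiled = np.tile(f_i, (k, 1))
FL = dense(np.concatenate([fi_tiled, f_j, P, f_j], axis=1), w_fuse, b_fuse)

# Eq. (3-4): topological weights; Eq. (3-5): appearance (semantic) weights.
W_p = dense(P, w_wp, b_wp)
W_f = dense(np.concatenate([fi_tiled, f_j], axis=1), w_wf, b_wf)

# Eq. (3-6): ReLU both weights, then combine with coefficient eta.
W = np.maximum(W_p, 0.0) + eta * np.maximum(W_f, 0.0)

# Eq. (3-7): softmax over the k neighbors, weight FL, and sum-reduce.
f_i_new = (softmax(W, axis=0) * FL).sum(axis=0)
print(f_i_new.shape)                                  # (8,) aggregated feature
```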

Global context feature aggregation

Local feature aggregation describes the contextual relationships between neighboring points, but for complex structures such as architectural cultural heritage, starting solely from local features is insufficient for global perception. To more effectively express features, we introduce a global context feature module, aimed at enhancing the model's global perception by integrating panoramic scene information, enabling it not only to recognize individual architectural structures but also to effectively handle complex scenes and extensive spatial relationships.

We assume a spherical spatial domain, as shown in Fig. 2, and use the position and volume ratio of objects to represent the global context. Note that even objects of the same class may exhibit different styles: their geometric structures are similar, but their positions and orientations vary. Since the volume ratio is insensitive to local and global boundaries, we use this property to recognize subtle geometric deformations of objects within the same class.

$$S_{i} = \frac{{V_{g} }}{{V_{i} }}$$
(3-8)
Fig. 2 Display of the Global Context Feature (GCF) aggregation module (the roof in the picture belongs to the Hall of Heavenly Gods)

Here, \(V_i\) denotes the volume of the neighborhood boundary and \(V_g\) the global boundary volume, while the geometric coordinates X, Y, Z give the position of the local neighborhood. On this basis, we define the global context feature aggregation as follows:

$$F_{g} = {\text{MLP}}\left( {(x_{i} ,y_{i} ,z_{i} ) \oplus S_{i} } \right)$$
(3-9)

Here, \((x_i, y_i, z_i)\) are the coordinates of the point, and "⊕" denotes the concatenation operation.
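As a concrete illustration, the numpy sketch below computes Eqs. (3-8) and (3-9) for a toy cloud. Two assumptions are labeled explicitly: the "boundary volume" is taken to be the axis-aligned bounding-box volume (the paper's exact boundary definition may differ), and the shared MLP is a stand-in single linear layer with ReLU.

```python
import numpy as np

rng = np.random.default_rng(1)

def bbox_volume(pts):
    # Axis-aligned bounding-box volume, used here as a simple stand-in for
    # the "boundary volume"; the paper's exact boundary may differ.
    extent = pts.max(axis=0) - pts.min(axis=0)
    return float(np.prod(extent)) + 1e-9          # epsilon avoids divide-by-zero

def knn(points, i, k):
    # Brute-force k nearest neighbors of point i (itself included).
    d = np.linalg.norm(points - points[i], axis=1)
    return points[np.argsort(d)[:k]]

points = rng.normal(size=(2048, 3))               # toy cloud
V_g = bbox_volume(points)                         # global boundary volume

k, d_out = 16, 8
w, b = rng.normal(size=(4, d_out)) * 0.1, np.zeros(d_out)   # stand-in MLP

F_g = np.empty((len(points), d_out))
for i in range(len(points)):
    V_i = bbox_volume(knn(points, i, k))          # neighborhood boundary volume
    S_i = V_g / V_i                               # Eq. (3-8): volume ratio
    # Eq. (3-9): concatenate (x_i, y_i, z_i) with S_i, map through the MLP.
    F_g[i] = np.maximum(np.concatenate([points[i], [S_i]]) @ w + b, 0.0)

print(F_g.shape)                                  # (2048, 8)
```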

DSC structure

The structure of the DSC architecture is shown in Fig. 3. It accepts two types of input: spatial information and previously learned features. Spatial information is used to learn both local and global semantic features, while previously learned features are used specifically for local feature aggregation. As the diagram shows, points are fed into the EDAP module for two-level local feature aggregation, and the aggregated features are then overlaid on the feature map to produce the final local features. Global context information is extracted from the spatial information through the GCF module. The output of the module, the learned discriminative spatial features, is the concatenation of the local and global features.

Fig. 3 DSC structure

DSC-Net structure

In this section, we describe in detail our designed network, DSC-Net, a symmetric encoder-decoder architecture. The encoder and decoder stages contain the same number of basic blocks, and the workflow is illustrated in Fig. 4. The network input consists of N points with coordinate information and features, represented as \(P \in \mathbb{R}^{N \times 3}\) and \(F \in \mathbb{R}^{N \times d}\), respectively; a point cloud can thus be viewed as a collection integrating topological attributes and appearance features. The features are first passed through a shared MLP layer that unifies their dimension to 8. The encoder, composed of five enhanced dual attention feature aggregation modules and global context feature modules, then progressively encodes the features to extract the semantics of attributes such as color (details in Sect. "Enhanced dual attention pooling module"). After each encoder block, random point sampling is used for downsampling: the number of points is gradually reduced from N to N/512 while the feature dimension increases from 8 to 512. The next five decoder blocks decode the high-level semantic features: the encoded features are upsampled through nearest-neighbor interpolation and connected to the intermediate feature maps through skip connections. Finally, three fully connected layers reduce the feature dimension to the final output categories, predicting the semantic labels with output dimensions \(N \times C_{class}\), where \(C_{class}\) is the number of categories.
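As a schematic illustration of this schedule, the short sketch below walks through one assumed encoder configuration. Only the endpoints (N points and width 8 in, N/512 points and width 512 out) come from the text; the per-block sampling ratios and widths are assumptions chosen to match those endpoints.

```python
# Schematic walk-through of the DSC-Net encoder schedule (plain Python).
N = 40960                              # points sampled per cloud during training
ratios = [4, 4, 4, 4, 2]               # assumed random-sampling ratio per block
widths = [16, 64, 128, 256, 512]       # assumed output width per encoder block

points, width = N, 8                   # after the initial shared MLP
for r, w in zip(ratios, widths):
    points //= r                       # random point sampling after the block
    width = w                          # EDAP + GCF block raises the width
    print(f"encoder block: {points:>6} points x {width:>3} channels")
# The decoder mirrors this path with nearest-neighbor upsampling and skip
# connections; three fully connected layers then produce N x C_class logits.
```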

Fig. 4 DSC-Net structure

Experiments

Experimental details

In this section, we comprehensively evaluate the proposed DSC-Net on three datasets: our self-built ancient architecture dataset, the publicly available architectural cultural heritage dataset ArCH, and the public dataset S3DIS. Our experimental setup includes virtual CPUs (Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50 GHz) and a Tesla V100-SXM2 GPU; all experiments were conducted in a virtual environment with CUDA 11.3 and cuDNN v7 on the TensorFlow 2.6.0 framework. For all three datasets we used the Adam optimizer with an initial learning rate of \(10^{-2}\). The network was trained for 100 epochs, with the learning rate decreased by 5% at the end of each epoch, and the number of neighborhood points set to 16. During training, a fixed number of points (40,960) was sampled from each point cloud; for testing, the entire original point cloud was used, with each point carrying 3D coordinates and color information.
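A minimal TensorFlow 2.x sketch of this optimization setup is shown below. The value of `steps_per_epoch` is a placeholder, not a value from the paper, and the 5%-per-epoch decay is expressed as a staircase exponential schedule with rate 0.95.

```python
import tensorflow as tf

# Hypothetical value: depends on dataset size and batch size, not given above.
steps_per_epoch = 500

# Initial learning rate 1e-2, multiplied by 0.95 once per epoch (a 5% decay).
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-2,
    decay_steps=steps_per_epoch,
    decay_rate=0.95,
    staircase=True,        # step the rate at epoch boundaries only
)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

NUM_POINTS = 40960         # points sampled from each cloud during training
K_NEIGHBORS = 16           # neighborhood size for the k-NN queries
EPOCHS = 100
```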

Datasets

1. The equipment used in this experiment is the FARO Focus3D X130 3D laser scanner, which captures 976,000 points per second at ranges of up to 130 m. It is equipped with a coaxial high-resolution camera, so color images and point clouds are matched without bias. The collected data come from the architectural heritage of the Niangniang Temple, first built between 1426 and 1435 during the Xuande period of the Ming Dynasty and thus more than 500 years old. The main buildings of the Beiding Niangniang Temple include the Hall of Heavenly Gods, the East Supporting Hall, the Niangniang Hall, the Dongyue Hall, and the Shanmen Hall. The Beiding Niangniang Temple is a typical traditional Chinese wooden architecture, composed mainly of roofs and pedestals: the roofs consist of tiles and roof figures on the ridges, and the doors and windows are all wooden structures with hollowed-out patterns. The temple was listed in the seventh batch of municipal-level cultural relics protection units in Beijing in 2003; it is one of the "Five Top Temples" in Beijing's history and a landmark building on the central axis of Beijing.

The point cloud datasets collected in this experiment cover the Beiding Niangniang Hall and its auxiliary halls (Area_1), the Hall of Heavenly Gods (Area_2), and the East Side Hall (Area_3). The Niangniang Hall (Area_1) is five rooms wide with a Xieshan round-ridge roof of green glazed tile trimmed in yellow; the auxiliary halls on its two sides have gable roofs with simple tile roofing. The Hall of Heavenly Gods (Area_2) is three rooms wide with a gable roof and simple tile roofing; its front has four five-painted wooden doors with four threshold windows on each side of the doors, and its back has four five-painted wooden doors. The East Side Hall (Area_3) has a gable roof with simple tile roofing. Figure 5 shows the appearance of the ancient architecture in the three areas. The dataset consists of ten categories: Tiebeam, Window, Door, Column, Roof, Floor, Stylobate, Step, Wall, and Clutter. Table 1 gives a visual impression of each category in the dataset. We also compiled detailed statistics on the point counts: Table 2 lists the number of points in each area and in total, and Table 3 lists the number of points per category.

2. ArCH is a large point cloud dataset, jointly released by the University of Turin and other universities and institutions, focusing on semantic segmentation of point clouds of historical architectural heritage. It contains 17 annotated point cloud collections and 10 unannotated ones. The 17 annotated scenes are meticulously labeled into 10 categories of architectural elements: Vault, Column, Floor, Door, Window, Wall, Moldings, Stair, Arch, and Roof, as shown in Fig. 6.

3. S3DIS is a 3D indoor space dataset acquired by Stanford University in 2017 through scanning technology. It covers six large indoor areas from three different buildings; each area contains 20 to 70 rooms, with 50,000 to 2.5 million points per room, and each point is labeled with one of thirteen semantic categories. We use only the 3D coordinates, color information, and corresponding labels to train the network, and we adopt a six-fold cross-validation strategy for evaluation.
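For clarity, the area-wise cross-validation protocol can be summarized with the short sketch below (a minimal illustration of six-fold evaluation on S3DIS; the reported metrics are averaged over the folds).

```python
# Area-wise six-fold cross-validation on S3DIS (sketch): train on five areas,
# test on the held-out one, repeat for every area, then average the metrics.
areas = [f"Area_{i}" for i in range(1, 7)]
for held_out in areas:
    train = [a for a in areas if a != held_out]
    print(f"fold {held_out}: train on {train}, test on [{held_out}]")
```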

Fig. 5 Display of buildings in each area of the Ancient architecture dataset

Table 1 Partial display images for each category in the Ancient architecture dataset
Table 2 Number of point clouds in each area of the Ancient architecture dataset
Table 3 Number of point clouds for each category in the Ancient architecture dataset
Fig. 6 Partial scene display of the ArCH dataset

Evaluation on Ancient architecture dataset

To effectively distinguish the categories in the ancient architecture dataset, we comprehensively evaluated DSC-Net against seven reference methods (eight methods in total) on our self-built dataset using K-fold cross-validation (K = 3): each fold corresponds to one area, and in each run one fold serves as the test set while the other two serve as the training set. Detailed results are shown in Table 4 below. We use overall accuracy (OA) and mean intersection over union (mIoU) as standard evaluation metrics. The seven reference methods are PointNet, PointNet++, DG-CNN, KPConv, RandLA-Net, BAAF-Net, and RandLA-Net + PnP-3D. PointNet processes each point directly, extracting features from the point cloud with multi-layer perceptrons (MLPs) and aggregating global features through global max pooling. PointNet++ improves on PointNet by introducing a hierarchical feature learning mechanism, using multiple PointNet modules to handle point cloud regions at different scales. DG-CNN extracts point cloud features using dynamically constructed k-NN graphs, capturing local geometric structure by computing neighbors in feature space. KPConv is a convolution-based feature extraction method that processes local point cloud neighborhoods with learnable convolution kernels. RandLA-Net processes large-scale point clouds efficiently through random sampling and local aggregation, capturing features at different scales with a multi-layer attention mechanism. BAAF-Net aggregates local features with a dual attention mechanism, distinguishing categories in point clouds with similar geometric structures but different appearances. RandLA-Net + PnP-3D integrates the PnP-3D module into RandLA-Net, enhancing feature representation with more local context and global bilinear responses. All of these methods use the default settings of their original papers, including network structure and training parameters.
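For reference, the sketch below shows one standard way to compute OA and mIoU from point-wise predictions via a confusion matrix; the toy labels are synthetic and for illustration only.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    # Count (true, predicted) label pairs over all points.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def oa_and_miou(cm):
    oa = np.trace(cm) / cm.sum()                  # overall accuracy
    tp = np.diag(cm).astype(float)                # per-class true positives
    # IoU = TP / (TP + FP + FN); mIoU is the mean over classes.
    iou = tp / (cm.sum(axis=0) + cm.sum(axis=1) - tp + 1e-9)
    return oa, iou.mean()

# Toy example with 10 classes, matching the ancient architecture dataset.
rng = np.random.default_rng(2)
y_true = rng.integers(0, 10, size=100_000)
noise = rng.integers(0, 10, size=100_000)
y_pred = np.where(rng.random(100_000) < 0.8, y_true, noise)
oa, miou = oa_and_miou(confusion_matrix(y_true, y_pred, 10))
print(f"OA = {oa:.2%}, mIoU = {miou:.2%}")
```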

Table 4 Detailed semantic segmentation results for the Ancient architecture dataset (numbers in bold indicate results higher than the corresponding baseline. In each column, the highest value is highlighted in red)

The experimental results show that our method achieves a mean intersection over union (mIoU) of 63.56% and an overall accuracy (OA) of 82.63%, higher than all seven other methods. Our method outperforms the benchmark network RandLA-Net by 20.56% in mIoU and 12.04% in OA. Compared with the two improved methods built on the same benchmark, our method is 9.58% and 10.49% higher than BAAF-Net, and 3.36% and 2.06% higher than RandLA-Net + PnP-3D, respectively. Among the individual categories, the Tiebeam (Fang) category is segmented best by DG-CNN, with an IoU of 37.61%; BAAF-Net segments doors best, with an IoU of 44.18%; and RandLA-Net + PnP-3D reaches an IoU of 71.02% on Stylobate and 86% on walls. Our method achieves IoUs of 95.01%, 79.02%, 84.52%, 48.80%, 59.92%, and 55.02% for doors, roofs, steps, windows, columns, and others, respectively. These results demonstrate the superiority of our method over the other seven in semantic segmentation of the self-built ancient architecture point clouds. They also show that the methods using RandLA-Net as the benchmark achieve higher overall accuracy than the other four methods. We further evaluated the per-class accuracy (ACC, %) of our method over the entire dataset, as shown in Fig. 7. Figure 8 presents the visualization results for the Hall of Heavenly Gods (Area_2): from top to bottom, the front, side, and rear views. Figure 9 compares the segmentation details of the front, side, and back views of the Hall of Heavenly Gods (Area_2), with blue dashed lines marking the detail comparisons. Our method segments similar geometric structures and boundaries more accurately than the benchmark RandLA-Net, especially for hard-to-distinguish categories such as doors, windows, columns, and walls.

Fig. 7 Accuracy of testing for each area of the ancient architecture dataset (ACC, %)

Fig. 8 Display of segmentation results for the Hall of Heavenly Gods (from left to right: RGB color input point cloud, RandLA-Net prediction, our prediction, and ground truth)

Fig. 9 Details of the semantic segmentation in Area_2 (left: RandLA-Net, right: our method)

This is because the point clouds of large-scale scenes such as ancient architecture are usually very large, and RandLA-Net's efficient random downsampling strategy processes such data more quickly. At the same time, to avoid accidentally discarding key features during random downsampling, a local feature aggregation module progressively enlarges the receptive field of each 3D point, effectively preserving geometric details. The high computational complexity of the other four methods leads to slower processing, larger memory usage, and lower overall accuracy; reference [4] likewise demonstrates the superiority of this benchmark in handling large-scale complex scenes. Ancient architecture, however, has more intricate detailed structures and high geometric similarity between categories; our method adapts to complex geometric shapes and fine decorations and captures intricate overall layouts and structures, coping with these more challenging scenes. Specifically:

1. First, the enhanced dual attention pooling module extracts topological features, learning edge, corner, and curvature information for categories such as tiebeams, doors and windows, and roofs, giving a deeper understanding of building structure and form. It then extracts appearance features such as color, texture, and material, for example the color of walls and the texture of doors and windows.

2. Second, the global contextual semantic feature module captures global information about the entire point cloud scene, helping the model understand how architectural elements relate to one another in space. This is particularly important for distinguishing elements that are similar in location but different in type, such as decorative columns next to windows. By analyzing the overall shape and structure of the building, this module accurately segments components such as doors, windows, columns, and roofs and clarifies their boundaries with the surrounding environment. It also helps identify and segment the various types of ancient architectural elements, for example recognizing roof shapes and the height and diameter of columns from the building's overall shape and scale.

To further validate the effectiveness of our method, Table 5 lists several sets of results for object categories with similar geometric structures that are difficult to distinguish. Geometrically, doors and windows are both planes perpendicular to the ground and are mainly distinguished by color and texture; the proposed feature aggregation strategy fully exploits the geometric and appearance information in the points. Columns and walls are hard to separate because of their similar textures and close geometric positions, which leads to poor segmentation results; our global contextual semantic feature module alleviates this by grasping spatial location and overall structure more accurately.

Table 5 Results of geometric structure similarity types between the proposed DSC-Net, RandLA-Net, and RandLA-Net + PnP-3D on the Ancient architecture dataset (evaluation metric is mIoU, %)

Evaluation on ArCH

In this experiment, we evaluated the ArCH [10] dataset, which consists of 17 annotated point clouds plus 10 unannotated ones. ArCH contains numerous scenes that are part of the UNESCO World Heritage List (WHL), spanning multiple historical periods and architectural styles. In the benchmark protocol, 15 scenes are used for training and 2 for testing. Because some scenes do not cover all categories, five scenes were selected for this experiment: 5-SMV_chapel_1, 6-SMV_chapel_2to4, 7-SMV_chapel_24, 15-OTT_church, and A-SMG_portico. We use five-fold (K = 5) cross-validation to evaluate the final results, selecting one fold as the test set and the remaining folds as the training set (each fold corresponds to one scene). Table 6 details the selected data, including point counts, experimental scenes, acquisition methods, and dataset categories. We use overall accuracy (OA) and mean intersection over union (mIoU) as evaluation metrics. Given the limited evaluations published on this dataset, we conducted comparative experiments with six methods: PointNet, PointNet++, DG-CNN, RandLA-Net, BAAF-Net, and RandLA-Net + PnP-3D. Table 7 lists the quantitative segmentation results of each method. Our method outperforms the baseline RandLA-Net by 0.5% in mIoU and 1.04% in OA; it is 0.17% and 0.15% higher than BAAF-Net, and 1.7% and 1.06% higher than RandLA-Net + PnP-3D, respectively. Figure 10 shows the segmentation of the 6_SMV_chapel_2to4 scene, where our method again surpasses the other six methods, achieving the best segmentation results.

Table 6 Key features of the selected scenes from the ArCH dataset
Table 7 Quantitative segmentation results of different methods on the ArCH dataset (numbers in bold indicate results higher than the corresponding baseline. In each column, the highest value is highlighted in red)
Fig. 10 Front view of 6_SMV_chapel_2to4 (left: original point cloud, center: predicted result, right: ground truth point cloud)

Evaluation on S3DIS

To make the effectiveness of our method more convincing, we also conducted experiments on the widely used public dataset S3DIS [11], reporting semantic segmentation results under six-fold cross-validation. We use mean intersection over union (mIoU) and overall accuracy (OA) as evaluation metrics; detailed comparisons are shown in Table 8 and Fig. 11. Compared to the baseline network, our method is more competitive on all evaluation metrics. Against RandLA-Net, which uses the same random downsampling strategy, our six-fold cross-validation results improve mIoU by 1.03% and OA by 0.4%. Our IoU on ceiling, beam, table, board, and clutter reaches 93.8%, 63.9%, 71.4%, 67.1%, and 60.9%, respectively, showing clear advantages. These results demonstrate the superiority of our method on the benchmark dataset S3DIS. In Fig. 12, we visualize the inputs and predictions for three classic S3DIS scenes; the comparison with the baseline RandLA-Net shows that our method accurately distinguishes similar geometric categories, demonstrating its robustness across scenes.

Table 8 Quantitative segmentation results of different methods on the S3DIS dataset (numbers in bold indicate results higher than the corresponding baseline. In each column, the highest value is highlighted in red)
Fig. 11 Comparison of semantic segmentation results of different methods across categories (mIoU, %)

Fig. 12 Visualization examples of the S3DIS dataset in three typical indoor scenes

Ablation experiments

Our method has been validated on the ancient architecture dataset, the ArCH dataset, and the S3DIS dataset. To gain deeper insight into the mechanism of the network, we conducted two sets of ablation experiments on the ancient architecture dataset, evaluated with standard three-fold cross-validation. Considering how widely S3DIS is used in 3D point cloud semantic segmentation research, ablations on this standardized dataset also help demonstrate the generality and robustness of our method, so we conducted two further sets of ablation experiments on Area_5 of S3DIS.

We evaluated the effectiveness of the DSC-Net modules under different configurations. Specifically, we designed four control experiments: removing the coordinate encoding operation, removing all weights, removing the fused features, and removing the global contextual semantic feature module. To further probe the components of the enhanced dual attention, we evaluated three variants of the attention form: topological weights alone, semantic weights alone, and omitting the ReLU activation before weight fusion. As shown in Tables 9 and 10, our enhanced dual attention pooling module and global context feature module significantly improve point cloud segmentation accuracy. In the enhanced dual attention module, our encoding strategy raises mIoU by 3.31% and 2.67% on the two datasets compared with the variant without coordinate encoding; the encoding computes the distance between the centroid and its adjacent points as well as their offsets in the x, y, and z directions, providing information absent from the raw coordinates that is crucial for local geometric perception. When evaluating the importance of the different weights in the dual attention module, we found that removing them hinders effective learning and aggregation over local adjacent points, and that the topological weights generally contribute more to feature learning than the semantic weights. Moreover, if ReLU is not applied before weight fusion, the two types of weights interfere with each other and mIoU drops. These experiments show that, by integrating enhanced feature information, the module adapts to complex ancient architecture data structures and performs well in other data scenarios, improving feature discrimination. Learning global contextual features extends the network's understanding of complex objects from local to global, enriching the feature representation; the experimental results confirm the effectiveness of this module and the clear performance gains it brings.

Table 9 Results of ablation experiments on the self-built ancient architecture dataset
Table 10 Ablation experiments on Area_5 of the S3DIS dataset

Conclusion

In this study, we propose DSC-Net, a network based on Enhanced Dual Attention Pooling and Global Context Feature Aggregation, aimed at accurately analyzing and understanding 3D point cloud data obtained from complex, large-scale scenes. By guiding local feature fusion at both the feature-dimension and point levels, the network improves its ability to recognize objects with similar geometric structures. The DSC module can be easily embedded into various network architectures for point cloud segmentation; we embedded it into an encoder-decoder architecture to obtain the DSC-Net presented in this work. On the ancient architecture dataset, ArCH, and S3DIS, the proposed DSC-Net not only surpasses state-of-the-art point cloud segmentation methods based on fast random sampling (such as RandLA-Net) in accuracy, but also performs well across diverse architectural environments. On the ancient architecture dataset in particular, the model achieves high spatial precision and effectively identifies and classifies key structural elements of cultural heritage, such as doors, windows, roofs, and decorative details. The proposed point cloud semantic segmentation network provides strong technical support for cultural heritage preservation, advancing the scientific and precise conservation and restoration of ancient architecture. Its broad application can also strengthen the systematic study and management of cultural heritage, offering new perspectives and methods for protecting precious historical monuments worldwide.