Introduction

Cultural heritage protection has long been a focal point for both the global academic community and conservation circles. As part of the world's cultural heritage, the Terracotta Warriors hold significant historical, artistic, and scientific value and embody the cultural memory of China's long history. However, owing to prolonged natural erosion and human-induced damage, the conservation and restoration of the Terracotta Warriors face enormous challenges. In this context, digital technology and artificial intelligence have gradually become important tools, among which the acquisition and processing of three-dimensional point cloud data show great potential for the 3D reconstruction, morphological analysis, and damage assessment of cultural relics [1,2,3]. Point cloud registration [4, 5], as an indispensable component of point cloud processing, plays a crucial role in cultural heritage protection due to its accuracy and efficiency. Its main goal is to integrate multiple datasets collected from different perspectives or time instances into a globally consistent coordinate system, thereby enabling high-precision 3D reconstruction, object recognition, and scene analysis of cultural relics.

With the continuous advancement of point cloud acquisition technologies and the widespread application of sensors and scanning devices, modern point cloud registration encounters numerous challenges. One challenge stems from the complex morphological and structural variations, sparsity, and irregularity of the data: traditional registration methods become inefficient and sensitive to local noise and low overlap rates [6,7,8] when dealing with large-scale [9], high-density cultural relic point clouds. Another challenge arises from the differences between datasets [10,11,12], which may involve variations in objects, environments, and sampling methods, resulting in insufficient generalization performance of existing algorithms.

To address these challenges, researchers have proposed numerous innovative point cloud registration methods in recent years [13,14,15]. These methods cover various aspects ranging from traditional feature-based matching approaches to end-to-end methods. However, traditional feature-based matching methods often rely on handcrafted feature descriptors, leading to unstable performance when dealing with point clouds of different densities and scales. While deep learning-based methods partially address the issue of feature extraction, their generalization performance on large-scale cultural relic data and different datasets remains limited.

Inspired by the successful application of the Transformer architecture [16, 17] in natural language processing, recent research has introduced it into computer vision to capture long-range relationships and integrate overall contextual information. Our work applies the Transformer to point cloud registration and proposes a novel method called Enhancing Point Cloud Registration with Transformer (EPCRT). Our method utilizes the Transformer architecture to capture both local and global geometric features of point cloud data, thereby not only improving the accuracy and efficiency of point cloud registration but also providing new technological means for cultural heritage protection.

This paper brings forward significant contributions across the following dimensions:

Local Geometric Perception Mechanism We introduce an innovative approach for local geometric perception and positional encoding, combining local density information and geometric angle encoding to enhance the flexibility and robustness of positional encoding. This mechanism dynamically adjusts positional encoding information based on local structures, thereby better representing the complex local morphology and structural variations of artifacts.

Convolutional-Transformer Hybrid Module We design a convolutional-Transformer hybrid module to facilitate interactive learning of point cloud features, achieving effective fusion of local and global features. This hybrid module captures the global semantic information of point cloud data while retaining local details, thereby improving registration performance and effectively handling the sparsity and irregularity of artifact point cloud data.

Experimental Validation and Performance Evaluation We conduct extensive experimental validation on multiple standard datasets, including 3DMatch, ModelNet, KITTI, and MVP-RG, and validate it on the Terracotta Warriors cultural heritage dataset. Through benchmarking against cutting-edge methods, we demonstrate the effectiveness and superiority of the proposed approach. Experimental results show that EPCRT exhibits significant performance advantages in handling complex morphological and structural variations, sparsity, irregularity, and generalization across different datasets.

Related work

Deep feature learning Methods In the field of cultural heritage protection, point cloud registration tasks are crucial for accurately reconstructing and safeguarding artifacts, and the application of deep learning to point cloud feature extraction has become increasingly prevalent. To address the inadequate registration accuracy of unsupervised point cloud registration algorithms under partial overlap, Shen et al. [18] proposed a dependable technique for evaluating inliers, enhancing the resilience of unsupervised point cloud registration. This method aims to effectively differentiate inliers and capture geometric differences between source and pseudo-target point clouds. Specifically, the method comprises a Matching Graph Optimization module and an Inlier Assessment module. In the Matching Graph Optimization module, aggregating matching scores from neighbors improves the estimation of point-to-point matching graphs. This neighborhood information aggregation helps construct discriminative matching graphs, providing high-quality correspondences for generating pseudo-target point clouds. The Inlier Assessment module calculates inlier confidences for each estimated correspondence based on structural differences between the source and pseudo-target point clouds. Li et al. [19] proposed a point cloud registration method named QGORE, aiming to achieve efficient registration while ensuring outlier removal. QGORE’s key idea lies in employing a simple yet effective voting method to estimate upper bounds through geometric consistency; this voting method yields results nearly equivalent to the tightness of traditional GORE methods. Moreover, to enhance computational efficiency, QGORE employs a single-point RANSAC algorithm that explores “rotation correspondences” to estimate lower bounds, significantly reducing the iterations required by the traditional three-point RANSAC algorithm.

To simplify the ego-motion estimation process by removing most of the complex parts and focusing on the core elements, Vizzo et al. [20] proposed a system based on the point-to-point ICP algorithm, combined with adaptive thresholding for correspondence registration, robust kernel functions, motion compensation methods, and point cloud subsampling strategies. The results indicate that this system performs well under various operating conditions and does not require tuning for specific LiDAR sensors.

To address the challenge of reconstructing 3D models of artifacts with limited samples and avoiding overfitting, Zhu et al. [21] proposed a transfer learning-based method to recover the 3D shape of artifact faces from a single old photograph. This method utilizes UV position maps to represent the 3D shape and employs a convolutional neural network to reconstruct the UV position map from the 2D image.

End-to-end Methods To enhance the robustness of point cloud registration algorithms, Zhang et al. [22] proposed an end-to-end learning approach to learn partial permutation matrices. This approach addresses the shortcomings of existing hard assignment methods in handling outliers and avoids misleading correspondences that can arise in soft matching methods. The algorithm introduces a registration framework called the Soft-to-Hard Matching Procedure (S2H matching process). This process consists of two steps: the S-step and the H-step. In the S-step, soft matching matrices, which represent the matching probabilities between corresponding points rather than hard assignments, are learned using techniques like graph signal processing. Then, in the H-step, partial permutation matrices are obtained by projecting and clipping the soft matching matrices, achieving hard assignment and avoiding misleading correspondences.

To address the challenges of partial overlap and different datasets, Tan et al. [23] proposed a framework named MCLNet that leverages multi-level consistency algorithms. MCLNet first trims points outside the overlapping region using point-level consistency. It introduces a multi-scale attention module to ensure consistency learning at various levels, thereby establishing dependable correspondences. This module captures features at different scales, improving the handling of local feature matching in point cloud registration. To further enhance accuracy, the authors propose consistency learning to alleviate the adverse effects of non-coincident points. This method helps manage non-overlapping points in point clouds, preventing them from adversely affecting the matching results and making the overall framework more robust and reliable. Wang et al. [24] proposed a registration method named Neighborhood Multi-compound Transformer (NMCT). Firstly, they introduced Neighborhood Position Encoding, which enhances the ability to extract relevant local feature information and local coordinate information by selecting spatial points using the nearest neighbor method. Secondly, they employed the Multi-compound Transformer as the interaction module for point cloud information, consisting of both spatial and temporal transformers. The combination of these two stages enables NMCT to better handle the complexity and diversity of point cloud data. The algorithm was extensively tested on multiple datasets, demonstrating excellent generalization and robustness.

Transformer Methods In the past few years, there has been notable advancement in point cloud registration techniques leveraging Transformer learning. To seamlessly integrate geometric and visual data from disparate modalities, Wang et al. [25] introduced a Geometric-Aware Visual Feature Extractor. This method gradually fuses geometric and visual information of RGB and depth data using a multi-scale local linear transformation. The depth data’s geometric attributes function akin to convolution kernels, reshaping the visual characteristics of RGB data. This process places the generated visual-geometric features in a normalized feature space, mitigating visual differences caused by geometric variations and obtaining more reliable correspondences.

To address the issue of handling the relationships between point clouds in continuous scans during 3D point cloud registration, Zaman et al. [26] proposed a method that uses a continual graph network architecture with an attention mechanism. This approach improves the registration of current point cloud pairs by leveraging the learned associations from previous point cloud pairs, thereby enhancing the expressiveness of the point clouds. The results show that this method significantly improves correspondence performance, registration performance, and generalization ability.

To enhance the performance of registration within expansive 3D environments, Han et al. [27] introduced a model based on Hough voting for rejecting outlier correspondences. This approach utilizes an overlap-based correspondence calculation method and extracts deep geometric features to enhance registration performance under low overlap ratios. Transformation parameters are represented in a 6D Hough space using triplet voting to address ambiguity during the matching process. Similarity values are employed as features for each vote to reduce ambiguity during training. The algorithm combines fully convolutional geometric feature networks and Transformer attention mechanisms to reduce noise during the voting process. Finally, a binning method is used to determine consensus on correspondences and predict the final transformation parameters. This method demonstrates superior performance on both indoor and outdoor datasets.

Inspired by the successful results achieved by feature learning-based methods, Transformer-based learning approaches, and end-to-end techniques, particularly in addressing challenges such as complex morphological and structural variations, sparsity, irregularity, generalization across different datasets, and outdoor large-scale scenes, we introduce a Transformer-based end-to-end approach for point cloud registration.

Method

In alignment with the architectures of D3feat [28] and Predator [29], the end-to-end algorithm we propose utilizes an encoder-decoder network with a hierarchical structure. Additionally, we employ RANSAC to estimate rigid transformations, as depicted in Fig. 1.

Fig. 1
figure 1

Network architecture of EPCRT. LGP Local Geometric Perception, MHCA Multi-head Cross Attention, FFN Feedforward Neural Network

Problem setting

Consider a source point cloud \( P = \{ p_i \in \mathbb {R}^3 \mid i = 1, 2, \ldots , N \} \) and a target point cloud \( Q = \{ q_i \in \mathbb {R}^3 \mid i = 1, 2, \ldots , M \} \), both residing in three-dimensional space, where N and M denote the numbers of points. Point cloud registration endeavors to align them through an unknown 3D rigid transformation \( RT = \{ R, T \} \), which comprises a rotation \( R \in SO(3) \) and a translation \( T \in \mathbb {R}^3 \). This transformation aims to minimize the disparities between corresponding points in the source and target clouds, achieving optimal alignment. The formula is as follows:

$$\begin{aligned} \mathop {\min }\limits _{R,T} \sum \nolimits _{({p_i},{q_i}) \in \vartheta } {\left\| {R\cdot {p_i} + T - {q_i}} \right\| } _2^2 \end{aligned}$$
(1)

Here, \( \vartheta \) symbolizes the ground truth correspondences between points in P and Q. The notation \( \left\| \cdot \right\| \) signifies the Euclidean distance.
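When the correspondence set \( \vartheta \) is known, Eq. (1) has a closed-form least-squares solution via the SVD-based Kabsch procedure, which is what robust estimators such as RANSAC invoke on sampled correspondences. The following NumPy sketch illustrates this step under that assumption; the function and variable names are ours and it is not the EPCRT pipeline itself.

```python
import numpy as np

def solve_rigid_transform(p, q):
    """Minimal SVD (Kabsch) solver for Eq. (1), assuming known
    correspondences: p[i] in the source matches q[i] in the target.

    p, q : (K, 3) arrays of corresponding points.
    Returns R (3, 3) and T (3,) minimizing sum ||R p_i + T - q_i||^2.
    """
    p_mean, q_mean = p.mean(axis=0), q.mean(axis=0)
    H = (p - p_mean).T @ (q - q_mean)               # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                              # enforce det(R) = +1
    T = q_mean - R @ p_mean
    return R, T
```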

Encoder-decoder

Encoder To process the dense original point clouds P and Q, in \( \mathbb {R}^{N \times 3} \) and \( \mathbb {R}^{M \times 3} \) respectively, we employ the KPConv module as our foundation. This module, comprising a sequence of residual units and strided convolutions, facilitates downsampling, thereby reducing the points to keypoint sets \( P' \) and \( Q' \), in \( \mathbb {R}^{N' \times 3} \) and \( \mathbb {R}^{M' \times 3} \) respectively. Additionally, we adopt a shared encoding mechanism to extract pertinent features, yielding \( F'_{P'} \) and \( F'_{Q'} \), in \( \mathbb {R}^{N' \times D} \) and \( \mathbb {R}^{M' \times D} \) respectively, where D denotes the feature dimension.

Decoder The decoder module follows a conventional design, featuring a 3-layer network structure. It incorporates upsampling, linear transformation operations, and skip connections as its primary components.
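As a rough illustration only, the sketch below shows one way such a 3-layer decoder with nearest-neighbor upsampling, linear layers, and skip connections could be organized in PyTorch; the layer widths and the `upsample_idx` indexing scheme are assumptions rather than the exact EPCRT configuration.

```python
import torch
import torch.nn as nn

class SimpleDecoder(nn.Module):
    """Hedged sketch of a 3-layer decoder: upsampling + linear + skip links."""
    def __init__(self, dims=(256, 128, 64, 32)):
        super().__init__()
        # each layer maps (upsampled coarse feature + skip feature) -> finer feature
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Linear(dims[i] + dims[i + 1], dims[i + 1]), nn.ReLU())
            for i in range(3)
        )

    def forward(self, feat, skips, upsample_idx):
        # skips[i]: encoder features at the matching resolution (skip connection)
        # upsample_idx[i]: for each fine point, index of its nearest coarse point
        for layer, skip, idx in zip(self.layers, skips, upsample_idx):
            feat = feat[idx]                               # nearest-neighbor upsampling
            feat = layer(torch.cat([feat, skip], dim=-1))  # linear + skip fusion
        return feat
```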

Transformer

Local Geometric Perception Mechanism (LGP): In traditional Transformer models, positional encoding is typically implemented in a fixed manner, such as sine/cosine positional encoding. However, for point cloud data, where the number of points is variable, traditional positional encoding methods are not suitable. Therefore, we introduce a Local Geometric Perception Mechanism, which dynamically adjusts positional encoding information by integrating local density information and geometric angle encoding. The local density information allows the adjustment of positional encoding parameters based on the actual density, enabling the model to better adapt to point clouds of different densities. Additionally, we incorporate geometric angle encoding into the Local Geometric Perception Mechanism to enhance the model’s performance by capturing angle information of points in the point cloud. The Local Geometric Perception Mechanism enables the model to better understand the spatial structure and achieve improved performance in point cloud processing tasks.

Local Density Information: Local density information is used to adjust the parameters of the dynamic positional encoding. It is determined by counting the points located in the immediate vicinity of each point. By defining the local neighborhood as a spherical region, the local density of each point can be obtained. This information is then used to adjust the magnitude of the dynamic positional encoding, allowing it to better adapt to the local structure. The calculation formula is as follows:

$$\begin{aligned} \psi _{i} = \sum \limits _{j = 1}^N K\left( \left\| p_{i} - p_j \right\| \right) \end{aligned}$$
(2)

Here, N denotes the aggregate number of points within the point cloud, \( p_{i} \) indicates the coordinates of the i-th point, and \( K( \cdot ) \) is the kernel function, defined as follows:

$$\begin{aligned} K(r) = {e^{ - \frac{{{r^2}}}{{2{\sigma ^2}}}}} \end{aligned}$$
(3)

where r represents the distance between points and \( \sigma \) is the standard deviation of \( K( \cdot ) \), regulating the extent of the local neighborhood.
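A minimal NumPy sketch of Eqs. (2)-(3) is given below; it computes all pairwise distances directly (a k-d tree or radius query would be preferable for large clouds), and the bandwidth value is illustrative only.

```python
import numpy as np

def local_density(points, sigma=0.05):
    """Gaussian-kernel local density psi_i for every point (Eqs. 2-3).

    points : (N, 3) array of coordinates.
    sigma  : kernel bandwidth controlling the neighborhood extent (assumed value).
    """
    diff = points[:, None, :] - points[None, :, :]              # (N, N, 3)
    dist = np.linalg.norm(diff, axis=-1)                        # r = ||p_i - p_j||
    return np.exp(-dist ** 2 / (2.0 * sigma ** 2)).sum(axis=1)  # psi_i
```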

Geometric Angle Encoding: Geometric angle information is integrated into the dynamic positional encoding, so that the encoding not only adjusts its magnitude based on local density information but is also fine-tuned according to angle information. This enables better capturing of the local structure and directional information. The calculation formula is as follows:

$$\begin{aligned} \theta _{i} = \arccos \left( f_{i} \cdot \beta \right) \end{aligned}$$
(4)
$$\begin{aligned} \eta _{i} = \left[ \sin \left( \theta _{i} \right) , \cos \left( \theta _{i} \right) \right] \end{aligned}$$
(5)

where \( f_i \) denotes the normal vector of the i-th point, \( \beta \) is the reference direction, \( \theta _{i} \) is the angle between the normal vector and the reference direction, and \( \eta _i \) represents the angle information of the i-th point.
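The angle encoding of Eqs. (4)-(5) can be sketched as follows; the point normals and the global reference direction `beta` are assumed inputs (e.g. estimated normals and the z-axis).

```python
import numpy as np

def angle_encoding(normals, beta=np.array([0.0, 0.0, 1.0])):
    """Angle encoding eta_i = [sin(theta_i), cos(theta_i)] (Eqs. 4-5).

    normals : (N, 3) unit normal vectors f_i.
    beta    : reference direction (assumed here to be the z-axis).
    """
    cos_theta = np.clip(normals @ beta, -1.0, 1.0)              # f_i . beta
    theta = np.arccos(cos_theta)                                # Eq. (4)
    return np.stack([np.sin(theta), np.cos(theta)], axis=-1)    # Eq. (5)
```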

Dynamic Fusion Position Encoding: Local density information \( \psi _{i} \) and angle information \( \eta _i \) are integrated into the positional encoding, and the magnitude of the positional encoding is dynamically adjusted in correlation with local density. Specifically, points with higher local density receive smaller positional encoding values, while points with lower local density receive larger values. At the same time, directional information within the point cloud is taken into account. In this way, the dynamic positional encoding better adapts to the local structure and enhances the performance of the model in registration tasks. The computation is expressed by the following formula:

$$\begin{aligned} \alpha _{i}^{LGP} = \sin \left( \frac{pos_{i}}{10000^{2 \times d/D \times \psi _{i}}} + \eta _{i} \right) + \cos \left( \frac{pos_{i}}{10000^{2 \times d/D \times \psi _{i}}} + \eta _{i} \right) \end{aligned}$$
(6)

where point cloud data points are represented as \( {p_i} = \left( {{x_i},{y_i},{z_i}} \right) \), \( pos_i \) denotes the position of the i-th point, \( pos_i = \sqrt{x_i^2 + y_i^2 + z_i^2} \), d denotes the positional encoding dimension, D denotes the dimension of the point cloud data, and \( {\alpha }_{i}^{LGP} \) represents the local geometric positional information.
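Combining the two components, a literal reading of Eq. (6) could be sketched as below. Since \( \eta _i \) is two-dimensional while the argument of the sine/cosine is scalar per encoding channel, we collapse it to a scalar offset by summing its components; that choice, the encoding width `enc_dim`, and the function name are our own assumptions.

```python
import numpy as np

def lgp_encoding(points, psi, eta, enc_dim=64):
    """Hedged sketch of the LGP positional encoding alpha_i^LGP (Eq. 6).

    points  : (N, 3) coordinates p_i = (x_i, y_i, z_i).
    psi     : (N,) local density values from Eqs. (2)-(3).
    eta     : (N, 2) angle encoding from Eqs. (4)-(5), collapsed to a scalar here.
    enc_dim : number of encoding channels d = 0..enc_dim-1 (assumed).
    """
    pos = np.linalg.norm(points, axis=1, keepdims=True)     # pos_i = sqrt(x^2+y^2+z^2)
    d = np.arange(enc_dim)[None, :]                         # channel index d
    D = points.shape[1]                                     # dimension of the data
    denom = 10000.0 ** (2.0 * d / D * psi[:, None])         # density-modulated base
    phase = pos / denom + eta.sum(axis=1, keepdims=True)    # argument of sin/cos
    return np.sin(phase) + np.cos(phase)                    # alpha_i^LGP, shape (N, enc_dim)
```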

Convolutional-Transformer Network: Traditional point cloud registration methods typically employ iterative local search strategies, but they lack global correlation and feature learning. To improve efficiency and accuracy, we introduce a Convolutional-Transformer network.

Firstly, we employ convolutional operations to extract features from the input data, aiming to capture local structural information. This helps reduce the dimensionality of the point cloud data and extract useful feature information. Next, we feed the features extracted by convolutional operations into a Transformer model. The Transformer model achieves global correlation and feature learning among the point cloud data through its cross-attention mechanism. With the multi-head attention mechanism, the Transformer is able to simultaneously consider different aspects of the point cloud data, thereby enhancing the accuracy and robustness.

We define \( {F'_{P'}} = (x_1^{P'},x_2^{P'}, \ldots , x_{N'}^{P'}) \) and \( {F'_{Q'}} = (x_1^{Q'},x_2^{Q'}, \ldots , x_{M'}^{Q'}) \) as the inputs to \( MHAttn({F'_{P'}},{F'_{Q'}},{F'_{Q'}}) \) in the i-th layer, and \( Z' = (z_1^{P',Q'},z_2^{P',Q'}, \ldots , z_{N'}^{P',Q'}) \) as the resulting matrix. The expression is given by:

$$\begin{aligned} z_{i}^{P',Q'} = \sum \limits _{j = 1}^{N'} \mathrm {softmax} \left( \alpha _{i,j}^{Cross} \right) x_{j}^{Q'} W^{V,Q'} \end{aligned}$$
(7)

where \( \alpha _{i,j}^{Cross} \) represents the unnormalized cross-attention weight coefficient, defined as follows:

$$\begin{aligned} \alpha _{i,j}^{Cross} = \frac{1}{\sqrt{d_{head}}} \left( \mathrm {Conv} \left( x_{i}^{P'} \right) W^{Q,P'} + \alpha _{i}^{LGP} \right) \left( x_{j}^{Q'} W^{K,Q'} \right) ^{T} \end{aligned}$$
(8)

Finally, the output defines the matching relationship between point clouds, as follows:

$$\begin{aligned} F_{i} = \mathrm {MLP} \left( \mathrm {cat} \left[ F'_{P'}, z_{i}^{P',Q'} \right] \right) \end{aligned}$$
(9)
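A compact single-head PyTorch sketch of the cross-attention defined by Eqs. (7)-(9) is given below; the 1-D convolution over the source features, the projection sizes, and the omission of multi-head splitting and layer stacking are simplifications of ours rather than the exact EPCRT module.

```python
import torch
import torch.nn as nn

class ConvCrossAttention(nn.Module):
    """Single-head sketch of Eqs. (7)-(9): convolution-augmented cross-attention."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)  # Conv(x_i^{P'})
        self.w_q = nn.Linear(dim, dim, bias=False)   # W^{Q,P'}
        self.w_k = nn.Linear(dim, dim, bias=False)   # W^{K,Q'}
        self.w_v = nn.Linear(dim, dim, bias=False)   # W^{V,Q'}
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, feat_p, feat_q, lgp_p):
        # feat_p: (N', C) source features; feat_q: (M', C) target features;
        # lgp_p : (N', C) local geometric positional term alpha_i^{LGP}
        conv_p = self.conv(feat_p.T[None]).squeeze(0).T             # local convolution
        q = self.w_q(conv_p) + lgp_p                                # left factor of Eq. (8)
        k, v = self.w_k(feat_q), self.w_v(feat_q)
        attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)  # softmax(alpha^{Cross})
        z = attn @ v                                                # Eq. (7)
        return self.mlp(torch.cat([feat_p, z], dim=-1))             # Eq. (9)
```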

Loss function

Our proposed network EPCRT is built upon end-to-end training and supervised using ground truth data. The loss function is as follows:

Feature Loss In line with the methodologies of D3Feat [28] and Predator [29], we employ a circle loss function to assess feature divergence and regulate point-wise feature descriptors in the training. It is defined as follows:

$$\begin{aligned} \mathcal {L}_{FL}^P = \frac{1}{{{N_P}}}\sum \limits _{i = 1}^{{N_P}} {\log \left[ {1 + \sum \limits _{j \in {\varepsilon _p}} {{e^{c \beta _p^j(d_i^j - \Delta p)}}} \bullet \sum \limits _{k \in {\varepsilon _n}} {{e^{\lambda \beta _p^k(\Delta n - d_i^k)}}}} \right] } \end{aligned}$$
(10)

Here, \( d_i^j \) denotes the Euclidean distance between features, \( d_i^j = {\left\| {{f_{{p_i}}} - {f_{{q_j}}}} \right\| _2} \). \( \varepsilon _p \) and \( \varepsilon _n \) represent the matching and non-matching points of \( P_{RS} \) (randomly sampled points from the source point cloud), corresponding to positive and negative regions, respectively. \(\Delta p\) and \(\Delta n\) denote the positive and negative margins, respectively, and \(\lambda \) is a predefined parameter. The feature loss \( \mathcal {L}_{FL}^Q \) for the target point cloud is calculated analogously. The total feature loss is expressed as \( {\mathcal {L}_{FL}} = \frac{1}{2}(\mathcal {L}_{FL}^P + \mathcal {L}_{FL}^Q) \).
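A hedged PyTorch sketch of such a circle-style feature loss is shown below; for brevity the per-pair weights \( \beta _p^j, \beta _p^k \) are merged into a single scale factor, and all names and default values are illustrative assumptions rather than the exact training settings.

```python
import torch

def circle_feature_loss(feat_p, feat_q, pos_pairs, neg_pairs,
                        delta_p=0.1, delta_n=1.4, scale=10.0):
    """Sketch of a circle-style feature loss in the spirit of Eq. (10).

    feat_p, feat_q : (N, D) / (M, D) descriptors of source / target points.
    pos_pairs      : list of (i, pos_idx) with matching target indices for point i.
    neg_pairs      : list of (i, neg_idx) with non-matching target indices for point i.
    delta_p/n      : positive / negative margins; scale merges the beta weights
                     (a simplification, not the paper's exact weighting).
    """
    losses = []
    for (i, pos_idx), (_, neg_idx) in zip(pos_pairs, neg_pairs):
        d_pos = torch.norm(feat_p[i] - feat_q[pos_idx], dim=-1)   # d_i^j
        d_neg = torch.norm(feat_p[i] - feat_q[neg_idx], dim=-1)   # d_i^k
        pos_term = torch.exp(scale * (d_pos - delta_p)).sum()
        neg_term = torch.exp(scale * (delta_n - d_neg)).sum()
        losses.append(torch.log1p(pos_term * neg_term))           # log(1 + . * .)
    return torch.stack(losses).mean()
```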

Overlap Loss For supervised training, we employ a binary cross-entropy loss function, expressed as:

$$\begin{aligned} \mathcal {L}_{OL}^P = - \frac{1}{N}\sum \limits _{i = 1}^{N} \left[ O_{p_i}^{label} \log \left( O_{p_i} \right) + \left( 1 - O_{p_i}^{label} \right) \log \left( 1 - O_{p_i} \right) \right] \end{aligned}$$
(11)

where \( O_{{p_i}}^{label} \) represents the ground truth overlap label at point \( p_i \), defined as follows:

$$\begin{aligned} O_{p_i}^{label} = \left\{ \begin{array}{ll} 1, & \left\| T_{P,Q}^{GT} \left( p_i \right) - NN\left( T_{P,Q}^{GT} \left( p_i \right) , Q \right) \right\| < \tau _{1} \\ 0, & \text {otherwise} \end{array} \right. \end{aligned}$$
(12)

where \(T_{P,Q}^{GT}\) signifies the ground truth rigid transformation, and NN denotes the nearest neighbor. \(\tau _1\) serves as the threshold for overlap determination. Likewise, the overlap loss \( \mathcal {L}_{OL}^Q \) for the target point cloud is computed in a similar manner. The overall overlap loss is formulated as \( {\mathcal {L}_{OL}} = \frac{1}{2}(\mathcal {L}_{OL}^P + \mathcal {L}_{OL}^Q) \).
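The overlap supervision of Eqs. (11)-(12) can be sketched as below; the threshold value and the use of a dense nearest-neighbor search via `torch.cdist` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def overlap_loss(overlap_pred_p, points_p, points_q, R_gt, t_gt, tau1=0.05):
    """Sketch of the overlap loss for the source cloud (Eqs. 11-12).

    overlap_pred_p : (N,) predicted overlap scores O_{p_i} in (0, 1).
    points_p/q     : (N, 3) / (M, 3) source and target coordinates.
    R_gt, t_gt     : ground-truth rotation (3, 3) and translation (3,).
    tau1           : overlap threshold (assumed value).
    """
    warped = points_p @ R_gt.T + t_gt                           # T_{P,Q}^{GT}(p_i)
    nn_dist = torch.cdist(warped, points_q).min(dim=1).values   # ||. - NN(., Q)||
    labels = (nn_dist < tau1).float()                           # Eq. (12)
    return F.binary_cross_entropy(overlap_pred_p.clamp(1e-6, 1 - 1e-6), labels)
```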

In summary, the overall loss function is \( \mathcal {L} = {\mathcal {L}_{FL}} + {\mathcal {L}_{OL}} \).

Experiments

Dataset and evaluation metrics

To assess the efficacy of EPCRT in handling issues such as complex structural variations, sparsity, irregularity, and large-scale scenes, we conducted extensive experiments on various datasets, including real indoor scenes from 3DMatch [30] and 3DLoMatch [29], synthetic datasets ModelNet [31] and ModelLoNet, incomplete synthetic dataset Multi-View Partial [32], and outdoor large-scale odometry KITTI [33] dataset.

3DMatch The 3DMatch dataset comprises depth images from 62 different scenes sourced from datasets such as 7-Scenes and SUN3D. 3DLoMatch is generated from the 3DMatch dataset. Notably, the overlap ratios for the 3DMatch and 3DLoMatch datasets are greater than 30% and between 10% and 30%, respectively.

ModelNet The ModelNet dataset is based on the ModelNet40 dataset, a computer-aided design (CAD) synthetic dataset containing 12,311 models. ModelLoNet is a dataset generated from the ModelNet dataset. The overlap ratios for ModelNet and ModelLoNet datasets are 73.5% and 53.6%, respectively.

MVP-RG The MVP-RG dataset is derived from the synthetic and partially incomplete Multi-View Partial (MVP) point cloud dataset [34]. It consists of 7,600 pairs of models.

Odometry KITTI The Odometry KITTI dataset comprises data captured from city, rural, and highway scenes using the Velodyne HDL-64E S3 LiDAR scanner. There are 11 large scenes.

Evaluation Metrics In line with the approaches of Predator [29], REGTR [35], and GMCNet [32], we evaluated the datasets using Relative Rotation Error (RRE) and Relative Translation Error (RTE). Additionally, Registration Recall (RR), Modified Chamfer Distance (CD), and Root Mean Square Error (RMSE) were employed for evaluating specific datasets. The definitions are outlined below:

$$\begin{aligned} RTE&= {\left\| {t - {t^{GT}}} \right\| _2} \end{aligned}$$
(13)
$$\begin{aligned} RRE = \arccos \left( \frac{trace\left( R^{T} R^{GT} \right) - 1}{2} \right) \end{aligned}$$
(14)
$$\begin{aligned} CD(P,Q)&= \frac{1}{{\left| P \right| }}\sum \limits _{p \in P} {\mathop {\min }\limits _{q \in {Q_{raw}}} } \left\| {T_{P,Q}^{GT}(p) - q} \right\| _2^2 \nonumber \\&\quad +\frac{1}{{\left| Q \right| }}\sum \limits _{q \in Q} {\mathop {\min }\limits _{p \in {P_{raw}}} } \left\| {q - T_{P,Q}^{GT}(p)} \right\| _2^2 \end{aligned}$$
(15)
$$\begin{aligned} RMSE&= \sqrt{\frac{1}{{\left| {C_{ij}^{GT}} \right| }}\sum \limits _{(p,q) \in C_{ij}^{GT}} {\left\| {T_{P,Q}^{GT}(p) - q} \right\| _2^2} } \end{aligned}$$
(16)

where \( R^{GT} \) and \( t^{GT} \) represent the ground truth rotation and translation, and \( C_{ij}^{GT} \) denotes the collection of ground truth correspondences.
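For reference, the rotation and translation errors of Eqs. (13)-(14) can be computed as in the following NumPy sketch; reporting RRE in degrees is a common convention and an assumption here.

```python
import numpy as np

def rte_rre(R_est, t_est, R_gt, t_gt):
    """Relative translation / rotation errors (Eqs. 13-14).

    R_est, R_gt : (3, 3) estimated and ground-truth rotations.
    t_est, t_gt : (3,) estimated and ground-truth translations.
    Returns RTE (same unit as the translations) and RRE (degrees).
    """
    rte = np.linalg.norm(t_est - t_gt)                        # Eq. (13)
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0        # Eq. (14)
    rre = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return rte, rre
```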

Differing from Eq. (16), we specify the RMSE for the MVP-RG dataset as follows:

$$\begin{aligned} {\mathcal {L}_{RMSE}} = \frac{1}{N}\sum \limits _{i = 1}^N {{{\left\| {{T^{GT}}({p_i}) - T({p_i})} \right\| }_2}} \end{aligned}$$
(17)

3DMatch and 3DLoMatch

To validate registration performance of EPCRT under low overlap, we adopted the training method from Predator and conducted evaluations on the 3DMatch and 3DLoMatch datasets.

Additionally, we compared EPCRT with other cutting-edge techniques, including FCGF [36], Predator [29], OMNet [37], REGTR [35], GeoTrans [38], RoReg [40], UDPReg [39], MAC [41], and RIGA [42]. Figure 2 shows the registration visualization of low overlap datasets.

Fig. 2
figure 2

Registration visualization on 3DMatch, 3DLoMatch

As depicted in Table 1, our proposed algorithm not only outperforms other algorithms in terms of the three registration metrics on these datasets, but also exhibits a lower parameter count and average processing time. In the comparison of the Registration Recall (RR) metric with the MAC, UDPReg, RoReg, and GeoTrans algorithms on 3DLoMatch, our proposed algorithm demonstrates improvements of 16.3%, 11.8%, 4.9%, and 2.1%, respectively.

Table 1 Performance on 3DMatch and 3DLoMatch datasets

ModelNet and ModelLoNet

To further validate the registration performance of EPCRT, we followed the Predator training protocol and subsequently performed assessments on both the ModelNet and ModelLoNet datasets. Additionally, we compared the EPCRT algorithm against other cutting-edge techniques, including PointNetLK [43], DCP [44], RPM-Net [45], Predator [29], OMNet [37], REGTR [35], UDPReg [39], and HECPG [46]. Figure 3 shows the registration visualization on the ModelNet and ModelLoNet datasets.

Fig. 3
figure 3

Registration visualization on ModelNet, ModelLoNet

From Table 2, it is evident that our EPCRT achieves superior registration outcomes compared to other algorithms on the ModelNet and ModelLoNet datasets. While our proposed algorithm slightly lags behind the UDPReg algorithm in terms of the Relative Translation Error (RTE) metric, overall, our proposed algorithm exhibits a clear advantage in handling registration tasks.

Table 2 Performance on ModelNet and ModelLoNet datasets

MVP-RG

To confirm the registration performance of our EPCRT algorithm on incomplete and irregular models, we trained it using the Predator method and conducted evaluations. Additionally, we compared the proposed algorithm against other cutting-edge techniques, including DCP [44], RPM-Net [45], GMCNet [32], IDAM [47], Predator [29], and DSMNet [48]. Figure 4 shows the registration visualization of the MVP-RG dataset.

From Table 3, it is apparent that our EPCRT algorithm achieves superior results on the MVP-RG dataset compared to other algorithms. Through performance comparison with other algorithms, our proposed algorithm demonstrates a clear advantage in handling point cloud registration tasks under incomplete and irregular scenarios.

Fig. 4
figure 4

Registration visualization on MVP-RG

Table 3 Evaluation results on MVP-RG dataset

Outdoor dataset: odometry KITTI

To confirm the registration performance of the EPCRT algorithm in large-scale scenes, we trained it using the Predator method and conducted evaluations on the Odometry KITTI dataset. Additionally, we compared the proposed algorithm against other cutting-edge techniques, including FCGF [36], D3Feat [28], Predator [29], SpinNet [49], HRegNet [50], GeoTrans [38], SHM\( _{DGR} \) [22], GeDi [51], MAC [41], SC\( ^{2} \)-PCR++ [52], and RIGA [42]. Figure 5 shows the registration visualization on this large-scale scene dataset.

Fig. 5
figure 5

Registration visualization on Odometry KITTI

From Table 4, we can see that our EPCRT algorithm achieves superior registration results on the Odometry KITTI dataset compared to other algorithms. Through performance comparison with other algorithms, our algorithm demonstrates a clear advantage in handling point cloud registration tasks in large-scale scenes.

Table 4 Evaluation results on Odometry KITTI dataset

Cultural heritage dataset

To evaluate the registration performance of the proposed algorithm on cultural heritage data, we first validate it using the Terracotta Warriors dataset from the Mausoleum of the First Qin Emperor, collected by Northwest University, as shown in Figs. 6 and 7. Additionally, we compared the proposed algorithm against other cutting-edge techniques, including Predator [29], as shown in Table 5.

Fig. 6
figure 6

Registration visualization of 3DMatch \( \rightarrow \) Terracotta Warriors data (a stands for head registration; b stands for arm registration; c stands for body registration; d stands for feet registration)

Fig. 7
figure 7

Registration visualization of 3DMatch \( \rightarrow \) Terracotta Warriors data

Table 5 Evaluation results on Terracotta Warriors dataset

From Figs. 6, 7 and Table 5, it can be seen that good registration results are achieved on the head, feet, legs, and arms of the Terracotta Warriors.

Ablation study

To validate the impact of individual module selection within the EPCRT model, we conducted an ablation study using the 3DMatch and 3DLoMatch datasets.

From Table 6, it is evident that on top of the original encoder-decoder (Base) architecture, all four individual modules (Local Density Information: LDI; Geometric Angle Encoding: GAE) exhibit certain performance improvements. Among these, the proposed Local Geometric Perception Mechanism (LGP) and the Convolutional-Transformer Network (C-TNet) show significant enhancements in performance. Furthermore, the combined utilization of the proposed modules yields the best overall performance.

Table 6 Ablation of different modules

Conclusion

This paper proposes a Transformer-based registration method, named Enhancing Point Cloud Registration with Transformer (EPCRT), aiming to address challenges in cultural heritage protection, such as complex structural variations, sparsity, irregularity, and generalization across different datasets. By introducing a dynamic adjustment mechanism and a convolutional-Transformer hybrid module, our approach can flexibly capture both local and global geometric features and achieve effective feature fusion and interactive learning. Through extensive experimental evaluations on multiple benchmark datasets, we showcase the effectiveness and superiority of EPCRT. Experimental results show that EPCRT exhibits significant performance advantages in handling complex structural variations, sparse and irregular scenes, and generalization to different datasets. Compared to traditional methods, EPCRT aligns point clouds more accurately and demonstrates better generalization across different datasets, which is crucial for the accuracy and reliability of cultural heritage protection. Future research directions include further optimizing the performance of the EPCRT method, particularly enhancing its effectiveness in handling the internal structure of complex cultural heritage data.