1 Introduction

With the application and popularization of high-precision 3D sensors in recent years, point cloud data have become indispensable and are widely used in engineering practice and research. Deep learning in particular has shown remarkable capabilities in many fields, including but not limited to computer vision [1, 2], medical image analysis [3], and unsupervised learning [4]. Researchers have therefore begun to apply various deep learning methods to point cloud problems such as 3D object detection [5,6,7,8], quality monitoring [9], robot path planning [10], point cloud registration [11], and other tasks.

In point cloud registration, the field we focus on, the influence of surroundings, equipment, and sampling angle makes the generated point clouds vary even for the same scene, so registering two point clouds quickly and accurately has become a complex and important research problem. Point cloud registration (PCR) estimates the mapping between a source point cloud X and a target point cloud Y, covering translation, rotation, stretch, affine, projective, polynomial, and other transformations. Here we mainly study the translation and rotation of rigid registration between same-source point clouds. Currently, there are two families of point cloud registration methods [12]: optimization-based methods and deep learning-based methods.

The optimization-based methods gradually refine the registration by iterating between correspondence searching and transformation estimation using mathematical theory. Correspondence searching finds the corresponding relations, and transformation estimation uses those correspondences to calculate the transformation matrix. These methods are mainly represented by the iterative closest point (ICP) [13], the normal distributions transform (NDT) [14], and 4-points congruent sets (4PCS) [15]. In addition, some methods design hand-crafted features to improve correspondence searching, such as point feature histograms (PFH) [16], fast point feature histograms (FPFH) [17], SHOT [18], the spin image proposed by Johnson [19], and the 3D and harmonic shape contexts proposed by Frome [20]. They all implement shape descriptors in various ways and improve robustness. However, most optimization-based registration methods do not perform satisfactorily in the presence of noise, outliers, and low overlap, which cannot be avoided when sampling a point cloud. Moreover, sensitivity to the initial position of the point cloud is another of their disadvantages [12].

Besides the optimization-based methods, the other family widely studied by scholars is the deep learning-based registration methods. Thanks to the intelligence, wide coverage, and data-driven advantages of deep neural networks, such methods provide better accuracy, robustness, and generalization for point cloud registration tasks. According to the functions and outputs of the neural network, deep learning-based methods are divided into two categories [12]: end-to-end learning-based and feature learning-based registration methods.

End-to-end learning-based registration methods estimate the mapping directly through an end-to-end framework whose input is two point clouds and whose output is a transformation matrix; that is, transformation estimation is embedded into the neural network optimization, integrating feature extraction, correspondence, and transformation. One line of work treats registration as a regression problem and fits a regression model for transformation matrix estimation [21,22,23]. Other methods combine conventional registration-related optimization theory with deep neural networks, such as the maximum likelihood estimation (MLE) and Gaussian mixture model used in DeepGMR [24] and the Minkowski convolution adopted in DGR [25]. Still others focus on feature extraction, such as the rotation-invariant (RI) features in DWC [26], the two shape tensors proposed in PR-Net (Wang L et al. [27]), and the point cloud alignment algorithms in PCRNet [28]. Generally, end-to-end learning-based registration methods can leverage the merits of both mathematical theory and deep neural networks, and their networks can be designed and optimized specifically for registration tasks. But an end-to-end framework is a black box that encapsulates correspondence searching and transformation estimation, which makes the network model sensitive to data from different environments [12].

Feature learning-based registration methods use deep neural networks for feature extraction to estimate accurate correspondences, and then apply a one-step optimization, such as singular value decomposition (SVD), to determine the final transformation matrix. The deep learned point features provide robust and accurate correspondence searching, and the one-step estimation with accurate correspondences leads to more accurate registration. Recent research has made several improvements. To resolve the low correspondence between point clouds caused by insufficient semantic information, which further degrades registration, existing methods aim to obtain as many feature maps as possible across multiple dimensions, strengthening the network's ability to determine correspondences. The deep closest point (DCP) proposed by Wang et al. [29] estimates correspondences with deep features, based on a graph structure [30] dynamically updated between layers combined with an attention module, while maintaining the permutation invariance of points; it finally computes the transformation with SVD and is robust to noise. Wang and Sun et al. proposed PRNet [31] to improve on DCP: it uses global pooling to aggregate point-wise features into global features, then predicts annealing parameters through a subnetwork to control the sharpness of the matching. RPM-Net [32] uses a differentiable Sinkhorn layer and an annealing algorithm to obtain point-pair matches by learning and integrating spatial features and local geometric information. Furthermore, IDAM [33] proposed an iterative distance-aware similarity matrix convolution network that can easily be integrated with traditional features (such as FPFH [17]) or learned features to achieve registration. When the high-dimensional semantic information of every point is treated with equal importance, the network cannot accurately distinguish irrelevant points such as noise, so it performs poorly on complex registration tasks. Many existing approaches try to overcome this with attention mechanisms, but we find that the performance of attention mechanisms is spotty when processing feature maps obtained in different ways and at different dimensions.

To address feature richness and to change the influence factors of the feature vectors, this paper proposes a dynamic learning framework integrating an attention mechanism, inspired by the ideas of dynamic feature fusion and attention. In our framework, point cloud features are extracted and fused dynamically by multiple EdgeConv layers in series, which effectively extracts features across multiple dimensions to enrich the semantic information. We also introduce a variant of self-attention to measure the importance of feature maps; it integrates contextual information and enhances the feature representation. Our framework has been tested extensively on the ShapeNet Part and ModelNet40 datasets and compared with other advanced networks. The experimental results show that our framework offers robustness and generalization along with higher registration accuracy.

This paper extends a preliminary version presented at the 2022 2nd International Conference on Consumer Electronics and Computer Engineering (ICCECE). Compared with the conference version, this paper provides several additions. First, more theory on the edge convolution and offset-attention modules is provided, making it easier for readers to understand and reproduce the ideas through mathematical modeling. Second, new contrast and ablation experiments fairly and objectively demonstrate the results achieved by our framework. In addition, an analysis of the effect of our implementation on field regularity and runtime is added.

2 Method

The registration process of our framework is divided into three parts: feature extraction, integration, and registration (as shown in Fig. 1). The source and target point clouds are fed separately into the same feature extraction network to embed them in a high-dimensional space; the feature matrices are then rescreened and the correlation information between them is obtained using the attention module [34]; finally, the transformation matrices are estimated using SVD.

Fig. 1 The framework structure of this paper's network

2.1 Feature extraction

When there is a large distribution difference between an unknown scene and the training data, registration performance drops sharply, which limits the generalization ability of the network [35]. We found that fusing the global and local spatial features of point clouds effectively improves the accuracy of the correspondences and the generalization. This is evidenced not only by the aforementioned DCP and RPM-Net, but also by the Dynamic Graph CNN (DGCNN) [30], which constructs a k-nearest neighbor graph of the points and uses the edge convolution (EdgeConv) module to capture the edges connecting pairs of points.

In the extraction stage, the point cloud data are embedded in a high-dimensional space, and local and global features are transformed by multiple stacked EdgeConv modules [30]. The EdgeConv module performs convolution-like operations on the edges connecting neighboring pairs of points using local neighborhood graphs. We use EdgeConv in feature extraction so that local feature information can be extracted without changing the number of features, preserving feature information to the maximum extent. This not only improves the network's ability to extract local features and enriches the feature information, but also takes into account the relationships between pairs of points, characterizing local features more effectively.

A directed graph is constructed dynamically before each EdgeConv module to enrich the feature representation. Our feature extraction uses a spatial transformer cascaded with a multi-layer EdgeConv network (shown in the dashed box in Fig. 1). The spatial transformer aligns all points to a unified point set space to learn rotation invariance, and the key lies in the multi-layer EdgeConv modules, which perform dynamic feature extraction.

Each EdgeConv module computes features in three steps: constructing the k-nearest neighbor (k-NN) graph, calculating the edge features, and applying a channel-wise symmetric aggregation operation (Fig. 2). The point cloud is described by a set of n points, denoted as \(P = \{ p_{1}, p_{2}, ..., p_{n} \} \subseteq \mathbb{R}^{F}\), where F is the feature dimension and \(p_{i} = (x_{i1}, x_{i2}, ..., x_{iF}), i = 1, 2, ..., n\). Before the feature matrix is fed into EdgeConv, the k-NN digraph \(G = (V, E)\) is constructed, where \(V = \{1, 2, ..., n\}\) and \(E \subseteq V \times V\) are the sets of vertices and edges, respectively. The edge features are then computed as \(e_{ij} = h_{\Theta}(p_{i}, p_{j})\), where \(h_{\Theta}: \mathbb{R}^{F} \times \mathbb{R}^{F} \to \mathbb{R}^{F'}\) is a nonlinear function with a set of learnable parameters \(\Theta\). An edge feature is a local description of the edges connecting \(p_{i}\) to its neighboring points. Finally, a channel-wise symmetric aggregation operation \(\Omega\) (e.g., \(\sum\) or \(\max\)) produces the new feature \(p'_{i}\) of \(p_{i}\):

$$ p^{\prime}_{i} = \mathop \Omega \limits_{j:(i,j) \in E} \overline{h}_{\Theta } (p_{j} - p_{i} ,p_{i} ) $$
(1)
Fig. 2 a k-NN points of \(p_{i}\); b edge feature; c aggregation operation

Here, each new feature \(p'_{i}\) corresponds to \(p_{i}\), which minimizes the loss of feature information. \(\bar{h}_{\Theta}(p_{j} - p_{i}, p_{i})\) is an asymmetric function for selecting edge features; it is the fusion operation, and its design directly affects the performance of EdgeConv.

The function \(\bar{h}_{\Theta}\) takes into account both the local feature \(p_{j} - p_{i}\), which captures the nearest neighbors, and the global feature \(p_{i}\). With M filters, \(\Theta = (\theta_{1}, \theta_{2}, ..., \theta_{M}, \phi_{1}, \phi_{2}, ..., \phi_{M})\) are the encoding weights, and the nonlinear function \(h_{\Theta}: \mathbb{R}^{F} \times \mathbb{R}^{F} \to \mathbb{R}^{F'}\) can be written as follows:

$$ e_{{ijm}}^{\prime } = {\text{ReLU}}(\theta _{m} \cdot (p_{j} - p_{i} ) + \phi _{m} \cdot p_{i} ) $$
(2)

\(\theta_{m}\) and \(\phi_{m}\) have the same dimension as \(p\), and \(\cdot\) denotes the Euclidean inner product. Finally, the aggregation is implemented with the symmetric function \(\max\):

$$ p^{\prime}_{im} = \mathop{\max}\limits_{j:(i,j) \in E} e^{\prime}_{ijm} $$
(3)

Because the max function is symmetric, the output is unaffected by the order of the input points, which guarantees permutation invariance.
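
To make the three steps concrete, the following is a minimal PyTorch sketch of one EdgeConv module, under the assumptions that the k-NN graph is rebuilt from the current features at every layer (the dynamic graph of [30]) and that \(h_{\Theta}\) is a single shared linear layer with ReLU; the layer sizes and k are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def knn_graph(x, k):
    # x: (B, N, F) features; returns (B, N, k) indices of the k nearest
    # neighbors of each point under Euclidean distance (self excluded)
    dist = torch.cdist(x, x)                                  # (B, N, N)
    return dist.topk(k + 1, largest=False).indices[..., 1:]

class EdgeConv(nn.Module):
    def __init__(self, in_dim, out_dim, k=20):
        super().__init__()
        self.k = k
        # h_Theta of Eq. (2): a shared MLP on the pair [p_j - p_i, p_i]
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, x):
        # x: (B, N, F) -> (B, N, F'); the graph is rebuilt from the current
        # features, which is the dynamic graph update of [30]
        B, N, F = x.shape
        idx = knn_graph(x, self.k)                                     # (B, N, k)
        nbr = torch.gather(x.unsqueeze(1).expand(B, N, N, F), 2,
                           idx.unsqueeze(-1).expand(B, N, self.k, F))  # (B, N, k, F)
        ctr = x.unsqueeze(2).expand_as(nbr)
        edge = torch.cat([nbr - ctr, ctr], dim=-1)   # edge features of Eq. (2)
        return self.mlp(edge).max(dim=2).values      # Eq. (3): channel-wise max
```

For example, `EdgeConv(3, 64, k=20)(torch.randn(8, 1024, 3))` lifts a batch of raw coordinates to 64-dimensional per-point features; stacking several such modules gives the cascade in the dashed box of Fig. 1.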

2.2 Integration

To prevent an excess of features from interfering with registration accuracy, an attention mechanism is added to screen the extracted point cloud features; it improves the generalization ability of the model while maintaining registration accuracy. Attention mechanisms have recently been introduced into point cloud tasks to extract feature information more effectively. Wang et al. [34] proposed the Graph Attention Network (GAT), which achieved state-of-the-art results on graph-structured tasks at the time. GAT computes a local or global attention coefficient for each point; the coefficients are adjusted by training the attention module so that more important features are highlighted and the influence of irrelevant features is reduced.

In the second step, integration, the goal is to rescreen the feature matrices and obtain the correlation information between the source and target point clouds. This paper introduces a variant of self-attention, offset-attention, to process the graph-structured data from the EdgeConv modules and to focus on obtaining high-quality feature sets for this purpose. Different features receive different attention scores according to their importance, so the feature information obtained in the extraction step can be rescreened and recombined into a similarity relation with a better fit for feature matching.

The offset-attention mechanism calculates the attention of a node in the graph relative to each adjacent node and concatenates the node's own feature with the attention feature as the node's final feature. The main purpose of the graph attention network is to learn a function \(g:{\mathbb{R}}^{F} \to {\mathbb{R}}^{K}\), where F and K denote the input and output feature dimensions. This function maps the input feature set H to a new set of vertex features \(H^{\prime} = \left\{ {h^{\prime}_{1} ,h^{\prime}_{2} \ldots h^{\prime}_{N} } \right\},h^{\prime}_{i} \in {\mathbb{R}}^{K}\), while keeping the relationships between the output features unchanged, that is, the category represented by the point cloud remains the same. Unlike the relatively fixed neighborhoods of 2D images, graph attention convolution can handle the disordered, variable-sized neighborhoods that arise from the unordered structure of point clouds and assign weights reasonably. Figure 3 illustrates the point cloud convolution operation with and without an attention mechanism. Unlike convolution without attention (Fig. 3a), the attention mechanism distinguishes the degree of importance between point pairs (reflected in the thickness of the edges in Fig. 3b). The importance between point pairs is a network weight that is initialized and then continually updated during training to obtain better values. Neighborhood points with features similar to a point are assigned high weights and defined as associated point pairs, while dissimilar neighborhood points are assigned low weights and defined as non-associated point pairs; distinguishing matching point pairs in this way is critical for efficient point cloud registration.

Fig. 3 The difference between the two types of convolution

From a given point cloud \(P = (p_{1}, p_{2}, \ldots, p_{n}), p_{i} \in {\mathbb{R}}^{3}\), a graph structure \(G = (V, E)\) is constructed according to its neighborhood information, where \(V = \{1, ..., N\}\) is the set of vertices, N is the number of vertices, and E is the set of edges between points. Define \(N(i) = \{ j:(i,j) \in E\} \cup \{ i\}\) as the neighborhood of a point \(p_{i}\). The input to our network is a matrix of node feature vectors; for the registration task it is defined as \(H = \left\{ {h_{1}, h_{2} \ldots h_{N} } \right\}, h_{i} \in {\mathbb{R}}^{F}\), the feature set of the input point cloud, where each feature \(h_{i} \in {\mathbb{R}}^{F}\) corresponds to the graph vertex \(p_{i}\) and F is the dimension of the point cloud feature. The output is the new node feature matrix \(H^{\prime} = \left\{ {h^{\prime}_{1}, h^{\prime}_{2} \ldots h^{\prime}_{N} } \right\}, h^{\prime}_{i} \in {\mathbb{R}}^{F}\).

Offset-attention, an improved version of [36] that is better suited to point cloud processing, is used to generate the outputs \(P_{S}^{\text{out}}, P_{T}^{\text{out}} \in {\mathbb{R}}^{\text{GAT}}\), so that contextual information can be exchanged both within each point cloud and between the two point clouds. The offset-attention used here is a variant of multi-head self-attention; its structure is shown in Fig. 4. The multi-head attention mechanism divides the feature space into N independent subspaces (here, N = 4) and computes attention scores in each subspace; the subspace outputs are then combined by concatenation, which improves the parallelism of the self-attention mechanism. In addition, inspired by the use of the Laplacian matrix in graph neural networks, offset-attention computes the difference between the self-attention (SA) features and the input features by element-wise subtraction [34].

Fig. 4 The structure of offset-attention
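
As a reading aid, here is a minimal sketch of the offset branch described above, assuming a standard multi-head self-attention with N = 4 heads and reading the LBR block as Linear + ReLU (normalization omitted); the exact projection sizes of the paper's network may differ.

```python
import torch
import torch.nn as nn

class OffsetAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        # multi-head self-attention over N = 4 subspaces, as in Fig. 4;
        # dim must be divisible by heads
        self.sa = nn.MultiheadAttention(dim, heads, batch_first=True)
        # LBR block applied to the offset (read here as Linear + ReLU)
        self.lbr = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, x):
        # x: (B, N, dim) point features
        sa_out, _ = self.sa(x, x, x)   # self-attention (SA) features
        offset = x - sa_out            # element-wise subtraction of SA and input
        return x + self.lbr(offset)    # residual connection back to the input
```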

The calculation process of offset-attention is shown in Fig. 5. It is divided into three stages: calculating the similarity score \(e_{ij}\), the attention probability distribution \(\alpha_{ij}\), and the final attention score. To obtain richer feature representations, a learnable, shared linear transformation matrix \(W \in {\mathbb{R}}^{F^{\prime} \times F}\) is first applied to each node feature vector to obtain a new representation \(z_{i} = Wh_{i}\). The similarity score \(e_{ij}\) between the i-th (\(h_{i}\)) and j-th (\(h_{j}\)) nodes is given by Eq. (4); it is computed from the features of the point itself and its neighbors, so the convolution kernel can adapt dynamically to the structure of the object.

$$ e_{ij} = \text{LeakyReLU}(\alpha^{T} [z_{i}, z_{j}]) $$
(4)
Fig. 5 The calculation process of the offset-attention

The linear transformations \(z_{i}\) and \(z_{j}\) of \(h_{i}\) and \(h_{j}\) are concatenated, a dot product is taken with a weight vector \(\alpha\), and \(e_{ij}\) is obtained through the LeakyReLU activation function, whose expression is given in Eq. (5).

$$ \text{LeakyReLU}(z) = \left\{ {\begin{array}{*{20}c} z & {z > 0} \\ {0.1z} & {z \le 0} \\ \end{array} } \right. $$
(5)

Next, to deal with neighborhoods that vary across vertices and spatial scales, the attention weights are normalized over all neighbor points of vertex i, as shown in Eq. (6):

$$ \begin{gathered} \overline{\alpha}_{ij} = \text{SoftMax}(e_{ij}) = \frac{\exp (e_{ij})}{\sum\nolimits_{k \in N(i)} {\exp (e_{ik})} } \hfill \\ \alpha_{ij} = \frac{\overline{\alpha}_{ij}}{\sum\nolimits_{k} {\overline{\alpha}_{ik}} } \hfill \\ \end{gathered} $$
(6)

where the self-attention weight \(\alpha_{ij}\) is obtained via the \(\text{SoftMax}\) function. Note that offset-attention uses a distinctive two-step normalization here; the experimental results show that it reduces the interference of noise and benefits downstream tasks.

Finally, a new node feature matrix is obtained using the offset-attention weights \(\alpha_{ij}\), as shown in Eq. (7), where the symbol \(\cdot\) stands for convolution and \(b_{i}\) is a bias term.

$$ {\text{Attention}}(h_{i}) = \sum\limits_{j \in N(i)} {\alpha_{ij} \cdot f(h_{j}) + b_{i} } $$
(7)
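
The three stages above can be written compactly for a single vertex i; the following is a sketch of Eqs. (4)-(7), with the neighbor features in Eq. (7) transformed by a caller-supplied map f (the names z_nbrs and h_nbrs are illustrative).

```python
import torch
import torch.nn.functional as F

def edge_attention_weights(z_i, z_nbrs, a):
    # z_i: (Fp,) transformed feature W h_i of vertex i; z_nbrs: (k, Fp) the
    # transformed features of N(i); a: (2 * Fp,) shared weight vector alpha
    pairs = torch.cat([z_i.expand_as(z_nbrs), z_nbrs], dim=-1)  # [z_i, z_j]
    e = F.leaky_relu(pairs @ a, negative_slope=0.1)  # Eq. (4), slope from Eq. (5)
    a_bar = torch.softmax(e, dim=0)                  # first step of Eq. (6)
    return a_bar / a_bar.sum()                       # L1 re-normalization of Eq. (6)

def attention_output(alpha, h_nbrs, f, b):
    # Eq. (7): weighted aggregation of the transformed neighbor features f(h_j)
    return (alpha.unsqueeze(-1) * f(h_nbrs)).sum(dim=0) + b
```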

\(P_{S}\) and \(P_{T}\) encode the contextual features of the source and target point clouds, respectively, but they contain no information about each other. To strengthen the correlation between features, some cross-talk at the superpoint level must be added to the network to learn the importance weights of neighbors [37]. Before the two feature codes are concatenated, a graph neural network (GNN) is first used to further aggregate and strengthen their contextual relations. First, k-NN is used to connect the superpoints from \(P_{\text{feature}}\) into a graph in Euclidean space. Let \(x_{i} \in {\mathbb{R}}^{F^{\prime}}\) denote the feature encoding of a superpoint in \(P_{\text{feature}}\), and let \((i,j) \in E\) be the graph edge between \(x_{i}\) and \(x_{j}\) in \(P_{\text{feature}}\). The encoded features are iteratively updated by the k-th EdgeConv block using Eq. (8):

$$ {}^{(k + 1)}x_{i} = \mathop {\max }\limits_{(i,j) \in E} h_{\theta } (\text{cat}[{}^{(k)}x_{i} ,{}^{(k)}x_{j} - {}^{(k)}x_{i} ]) $$
(8)

Here \(h_{\theta }\) denotes a linear layer followed by instance normalization and a LeakyReLU activation, \(\max\) denotes the maximum pooling layer, and \(\text{cat}\) denotes concatenation.

The update is performed twice with unshared parameters \(\theta\), and the final feature \(x_{i}^{\text{GNN}} \in {\mathbb{R}}^{G}\) is given by Eq. (9):

$$ x_{i}^{\text{GNN}} = h_{\theta } (\text{cat}[{}^{(0)}x_{i} ,{}^{(1)}x_{i} ,{}^{(2)}x_{i} ]) $$
(9)

Then, to obtain the correlation information between the two point clouds, the two superpoint sets obtained by Eq. (9) are connected to form a bipartite graph. Inspired by attention, the query vector \(s_{i} \in {\mathbb{R}}^{F^{\prime}}\) is used, based on the key \(k_{j} \in {\mathbb{R}}^{F^{\prime}}\), to retrieve the values of other superpoints \(v_{j} \in {\mathbb{R}}^{F^{\prime}}\), as expressed in Eq. (10), where \(W_{k}\), \(W_{v}\), and \(W_{s}\) are learnable weight matrices:

$$ \begin{gathered} k_{j} = W_{k} \cdot x_{j}^{\text{GNN}} \hfill \\ v_{j} = W_{v} \cdot x_{j}^{\text{GNN}} \hfill \\ s_{i} = W_{s} \cdot x_{i}^{\text{GNN}} \hfill \\ \end{gathered} $$
(10)
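
A sketch of this superpoint GNN follows, assuming \(h_{\theta}\) reduces to Linear + LeakyReLU (instance normalization omitted for brevity) and that the graph edges are precomputed k-NN pairs; the dimensions are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SuperpointGNN(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # two unshared EdgeConv-style blocks for Eq. (8)
        self.h1 = nn.Sequential(nn.Linear(2 * dim, dim), nn.LeakyReLU())
        self.h2 = nn.Sequential(nn.Linear(2 * dim, dim), nn.LeakyReLU())
        self.fuse = nn.Sequential(nn.Linear(3 * dim, dim), nn.LeakyReLU())  # Eq. (9)
        self.Wk = nn.Linear(dim, dim, bias=False)  # key projection of Eq. (10)
        self.Wv = nn.Linear(dim, dim, bias=False)  # value projection
        self.Ws = nn.Linear(dim, dim, bias=False)  # query projection

    @staticmethod
    def update(block, x, idx):
        # Eq. (8): max over neighbors j of h_theta(cat[x_i, x_j - x_i])
        nbr = x[idx]                              # (N, k, dim) neighbor features
        ctr = x.unsqueeze(1).expand_as(nbr)
        return block(torch.cat([ctr, nbr - ctr], dim=-1)).max(dim=1).values

    def forward(self, x, idx):
        # x: (N, dim) superpoint features; idx: (N, k) k-NN indices
        x1 = self.update(self.h1, x, idx)
        x2 = self.update(self.h2, x1, idx)
        x_gnn = self.fuse(torch.cat([x, x1, x2], dim=-1))      # Eq. (9)
        return self.Wk(x_gnn), self.Wv(x_gnn), self.Ws(x_gnn)  # k_j, v_j, s_i
```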

2.3 Registration based on SVD

After the encoded point cloud features are obtained, the final step is to estimate the transformation matrix. For a rigid transformation T, the point cloud registration problem can be described by Eq. (11), whose purpose is to find the rotation matrix R and the translation vector t that minimize the deviation between the transformed source point cloud \(R \cdot P_{S} + t\) and the target point cloud \(P_{T}\).

$$ (R,t) = \mathop {\arg \min }\limits_{R,t} \sum\limits_{i = 1}^{{\left| {P_{S} } \right|}} {w_{i} \left\| {P_{T}^{i} - (R \cdot P_{S}^{i} + t)} \right\|^{2} } $$
(11)

Let \(\hat{P}_{S}\) be the centroid of \(P_{S}\) and \(\hat{P}_{T}\) the centroid of \(P_{T}\). In theory, the two centroids should coincide after registration, which gives \(t = \hat{P}_{T} - R \cdot \hat{P}_{S}\). Equation (11) is therefore equivalent to Eq. (12).

$$ R = \mathop {\arg \min }\limits_{R} \sum\limits_{i = 1}^{{\left| {P_{S} } \right|}} {w_{i} \left\| {y_{i} - Rx_{i} } \right\|^{2} } $$
(12)

where \(y_{i} = P_{T}^{i} - \hat{P}_{T}\) and \(x_{i} = P_{S}^{i} - \hat{P}_{S}\). This step eliminates the influence of the translation vector t, so the rotation matrix R can be obtained first. Solving Eq. (12) for R alone is equivalent to finding a matrix R satisfying Eq. (13) via SVD. Let \(S = XWY^{T}\) with singular value decomposition \(S = U\Sigma V^{T}\); then \(tr(RXWY^{T}) = tr(RU\Sigma V^{T}) = tr(\Sigma V^{T} RU)\), which is maximized when \(V^{T}RU = I\), i.e., \(R = VU^{T}\) (with a sign correction to ensure \(\det (R) = 1\)). After the rotation matrix R is calculated by SVD, the translation vector is recovered as \(t = \hat{P}_{T} - R \cdot \hat{P}_{S}\).

$$ \begin{gathered} R = \mathop {\arg \max }\limits_{R} \sum\limits_{i = 1}^{{\left| {P_{S} } \right|}} {w_{i} y_{i}^{T} Rx_{i} } \hfill \\ \;\; = \mathop {\arg \max }\limits_{R} [tr(RXWY^{T} )] \hfill \\ \end{gathered} $$
(13)
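
The closed-form solution of Eqs. (11)-(13) is the weighted Kabsch algorithm; a minimal sketch follows, assuming the correspondences (and optional weights \(w_{i}\), uniform by default) are already given.

```python
import torch

def solve_rigid(src, tgt, w=None):
    # src, tgt: (N, 3) corresponding points; w: (N,) optional weights w_i
    w = torch.ones(src.shape[0]) if w is None else w
    w = w / w.sum()
    src_c = (w[:, None] * src).sum(0)     # weighted centroid \hat{P}_S
    tgt_c = (w[:, None] * tgt).sum(0)     # weighted centroid \hat{P}_T
    x, y = src - src_c, tgt - tgt_c       # centered points of Eq. (12)
    S = (w[:, None] * x).T @ y            # S = X W Y^T of Eq. (13)
    U, _, Vt = torch.linalg.svd(S)
    d = torch.det(Vt.T @ U.T)             # sign correction against reflections
    R = Vt.T @ torch.diag(torch.tensor([1.0, 1.0, d.item()])) @ U.T
    t = tgt_c - R @ src_c                 # t = \hat{P}_T - R \hat{P}_S
    return R, t
```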

To avoid choosing a non-differentiable hard assignment, the probabilistic approach of DCP [29] is used to generate a soft map from one point cloud to the other: each \(x_{i} \in P_{S}\) is assigned a probability vector over \(P_{T}\), as shown in Eq. (14).

$$ m(x_{i} ,P_{T} ) = \mathrm{softmax}(y^{\text{GNN}} (x_{i}^{\text{GNN}} )^{T} ) $$
(14)

Here, \(y^{\text{GNN}} \in {\mathbb{R}}^{\text{GAT}}\) is the point cloud feature matrix generated by offset-attention, \(x_{i}^{\text{GNN}}\) is the i-th row of the source feature matrix, and \(m(x_{i}, P_{T})\) can be regarded as a soft pointer from each \(x_{i}^{\text{GNN}}\) to \(y^{\text{GNN}}\). In this way, a matching average point in \(P_{T}^{\text{out}}\) is generated for each point in \(P_{S}^{\text{out}}\), as shown in Eq. (15).

$$ \hat{y}_{i} = (P_{T} )^{T} m(x_{i} ,P_{T}^{\text{out}} ) \in {\mathbb{R}}^{3} $$
(15)
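
A sketch of this soft pointer follows, assuming the similarity in Eq. (14) is the inner product of the offset-attention features; the feature and point names are illustrative.

```python
import torch

def soft_correspondences(feat_src, feat_tgt, pts_tgt):
    # feat_src: (N, D) source features x^GNN; feat_tgt: (M, D) target
    # features y^GNN; pts_tgt: (M, 3) target point coordinates
    scores = feat_src @ feat_tgt.T        # feature similarity, as in Eq. (14)
    m = torch.softmax(scores, dim=1)      # one probability vector per source point
    return m @ pts_tgt                    # Eq. (15): soft-matched points, (N, 3)
```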

2.4 Summary and analysis of our algorithm

The main idea of this model is to use EdgeConv to enrich the feature maps (Sect. 2.1) and to use offset-attention to change the influence factors of the feature vectors (Sect. 2.2).

The point cloud feature extraction algorithm introduced in Sect. 2.1 is shown in Algorithm 1. It has time complexity \(O(n_{1}^{2})\), where \(n_{1}\) depends on the number of EdgeConv modules and points, and space complexity \(S(n_{2})\), where \(n_{2}\) is the dimension of the layers. The attention mechanism algorithm introduced in Sect. 2.2 is shown in Algorithm 2. It has time complexity \(O(n_{3})\), where \(n_{3}\) depends on the number of offset-attention heads, and space complexity \(S(1)\), since neither the network nor the data dimensions change.

Algorithm 1 The point cloud feature extraction algorithm (Sect. 2.1)
Algorithm 2 The attention mechanism algorithm (Sect. 2.2)

3 Experiment

3.1 Setup of experiments

We compare the performance of our model with recent deep learning-based registration methods: RPMNet [38], DCP-V2 [29], PTRNet [39], and GeoTransformer [40], as well as the classic optimization-based registration algorithm ICP. RPMNet, DCP-V2, and GeoTransformer are among the more advanced feature learning-based deep registration methods of recent years, and PTRNet is an advanced end-to-end registration method.

The experiments are performed in PyTorch on an NVIDIA RTX 2080 Ti GPU. The initial learning rate is 0.001, the number of training epochs is 250, and the batch size is 8. The total number of parameters, measured with the third-party Python library thop, is 4,514,176.
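
For reference, the parameter count can be reproduced with thop roughly as follows; the model here is only a placeholder standing in for the registration network.

```python
import torch
import torch.nn as nn
from thop import profile

# placeholder model standing in for the registration network
model = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))
dummy = torch.randn(1, 1024, 3)          # one point cloud of 1024 points
macs, params = profile(model, inputs=(dummy,))
print(f"params: {params:,.0f}, MACs: {macs:,.0f}")
```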

3.2 Datasets

To evaluate the effectiveness of our model, it is compared with other methods on two public datasets, ModelNet40 [41] and ShapeNet Part [42]. We also test the universality and robustness of our model on a local industrial parts dataset, which is described in detail in Sect. 4.2.

The ModelNet40 dataset used in our experiments comes from the ModelNet dataset [41] and contains 12,311 mesh CAD models from 40 categories. It is a large dataset released in recent years and is widely used in point cloud processing tasks. Our experimental configuration on ModelNet40 is consistent with that of DCP [29]: training and testing are performed on the complete dataset, with 9843 models for training and 2468 models for testing.

ShapeNet Part is a subset of ShapeNet Core containing 16,881 models in 16 categories. To enable training and evaluation of learning-based methods, it is split into three parts: 12,136 training models, 1870 evaluation models, and 2874 test models. As in previous work, 1024 points are uniformly sampled from the outer surface of each model.

3.3 Evaluation metrics

The mean square error (MSE), root mean square error (RMSE), and mean absolute error (MAE) between the true and predicted values are measured experimentally and used as the evaluation metrics in this paper. Ideally, if the source and template point clouds are perfectly registered, all of these error metrics should be zero. In our results, both the rotation matrix (R) and the translation vector (t) are evaluated, denoted MSE(R), RMSE(R), MAE(R) and MSE(t), RMSE(t), MAE(t) in the tables, respectively. All angle measurements in the results are in degrees.
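
For clarity, the three metrics reduce to the following computation, assuming rotations are compared through their Euler angles in degrees (as stated above) and translations element-wise.

```python
import numpy as np

def registration_metrics(pred, true):
    # pred, true: (N, 3) predicted / ground-truth Euler angles in degrees,
    # or (N, 3) translation vectors; errors are computed element-wise
    err = np.asarray(pred) - np.asarray(true)
    mse = np.mean(err ** 2)
    return {"MSE": mse, "RMSE": np.sqrt(mse), "MAE": np.mean(np.abs(err))}
```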

4 Results

4.1 Results on public datasets

(1) Contrast experiment

In the contrast experiment, all categories in the ModelNet40 and ShapeNet Part datasets were divided randomly into training and testing sets without using their category labels. Tables 1 and 2 report the experimental results on ModelNet40 and ShapeNet Part, respectively.

Table 1 ModelNet40: Contrast experimental results
Table 2 ShapeNet Part: Contrast experimental results

Comparing Tables 1 and 2, the registration results on ShapeNet Part are generally better than those on ModelNet40, which is related to the amount of training data per category: ShapeNet Part averages 1055 models per object category, whereas ModelNet40 has only about 30% of that, and more data allows a model to be trained more fully. Hence every model registers better on ShapeNet Part than on ModelNet40.

On the ModelNet40 dataset, our model is slightly better than the advanced networks proposed in recent years, but it underperforms GeoTransformer on ShapeNet Part, whose rotation MSE, RMSE, and MAE are 0.016, 0.013, and 0.036 lower than ours. It is worth noting that the registration performance of PTRNet is slightly worse than that of the other advanced networks, precisely because its local feature extraction is insufficient and its transformation generation depends strongly on the data, which is confirmed in Table 2.

Figure 6 shows the registration results of some objects in our framework; the top row shows results on the ModelNet40 dataset and the bottom row on ShapeNet Part. The green and red points represent the source and target point clouds, respectively. The blue points show the position of the source point cloud after transformation; the higher its coincidence with the target point cloud, the darker the color.

Fig. 6 The registration performance visualization of our framework

(2) Generalizability experiment

To test the generalization ability of the networks, the ModelNet40 dataset is divided randomly by category into two groups: 30 categories for training and 10 categories for testing. Likewise, 10 categories of the ShapeNet Part dataset are used for training and 6 categories for testing.

The experimental results on the two datasets are shown in Tables 3 and 4, respectively. Compared with the contrast experiments, the registration errors of all models are larger in the generalization experiment. On the ModelNet40 dataset, GeoTransformer achieves a relatively low RMSE(t). Our framework reduces MSE(R), RMSE(R), MAE(R), and MAE(t) to 4.592, 2.143, 1.800, and 0.010, lower than all compared models. On the ShapeNet Part dataset, DCP-V2 performs best on MSE(t) and RMSE(t), while our framework achieves the lowest MSE(R), RMSE(R), MAE(R), and MAE(t). This suggests that a framework that focuses more on point cloud geometry is better suited to the generalization task of PCR.

Table 3 ModelNet40: Generalizability experimental results
Table 4 ShapeNet Part: Generalizability experimental results

(3) Robustness experiment

To test the robustness of our framework and the other models, noise sampled independently from N(0, 0.01) and clipped to [-0.05, 0.05] was added to the data during testing. This experiment uses models trained on all noise-free ModelNet40 data; the experimental results are shown in Tables 5 and 6.
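
For reproducibility, a sketch of this noise model follows; we read N(0, 0.01) as a standard deviation of 0.01 (an assumption consistent with the jitter used in DCP), clipped to [-0.05, 0.05].

```python
import numpy as np

def jitter(points, sigma=0.01, clip=0.05):
    # points: (N, 3); Gaussian noise, clipped, added independently per coordinate
    noise = np.clip(sigma * np.random.randn(*points.shape), -clip, clip)
    return points + noise
```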

Table 5 ModelNet40: Robustness experimental results
Table 6 ShapeNet Part: Robustness experimental results

Comparing Tables 1 and 5 (ModelNet40) and Tables 2 and 6 (ShapeNet Part), it is easy to see that GeoTransformer does not outperform our framework, which still comes out on top in the robustness experiments. DCP-V2 shows large error fluctuations on ShapeNet Part, caused by poor performance on individual samples. Although the MAE of RPMNet is not disturbed much, its MSE and RMSE indicate that its registration performance is unstable. In addition, the registration accuracy of ICP is not ideal because it is prone to falling into local optima. In summary, our model performs best in the robustness experiments on both datasets, indicating that our framework is comparatively robust.

(4) Ablation experiment

We added ablation experiments to verify the performance improvement of each module in the proposed framework.

Only the EdgeConv modules in the proposed framework were replaced with ordinary convolutions, keeping the dimensions of each convolutional layer unchanged; the experimental results are shown in Table 7. In theory, the EdgeConv module should perform well because it extracts semantic information from multiple dimensions and accounts for both local and global features. This is verified by the results in Table 7, which show that the error increases greatly when an ordinary convolutional network is used.

Table 7 Ablation experimental results of the EdgeConv module

Similarly, the offset-attention module was replaced with standard self-attention, with the experimental results shown in Table 8. The results show that offset-attention suppresses registration error more effectively than self-attention in this task.

Table 8 Ablation experimental results of the offset-attention module

4.2 Results on local industrial parts dataset

(1) Local industrial parts dataset

The local industrial parts dataset consists of point clouds collected from two actual parts, denoted part a and part b. After denoising, the original data of part a form a point cloud with 344,962 (173 × 1994) points, as shown in Fig. 7, and the original data of part b form a point cloud with 346,236 (172 × 2013) points, as shown in Fig. 8. To register this local dataset with the method in this paper, 1024 points are uniformly sampled from each part's point cloud during preprocessing, the two part point clouds are analyzed with the same method as PointNet [29], and the data are normalized.

Fig. 7 Schematic diagram of part a

Fig. 8 Schematic diagram of part b

(2) Registration visualization

Considering the actual industrial application scenario, we select GeoTransformer, PTRNet, and our framework, which showed high registration accuracy, generalization, and noise resistance on the public datasets, to test on the local industrial parts data; this also helps conserve the enterprise's resources. Each of the three models was run 50 times on point clouds with different initial positions, with the mean square error (MSE) as the evaluation metric. The registration results are shown in Fig. 9.

Fig. 9 Registration effect on the local industrial parts dataset

On the point cloud of part a, the mean square errors of the rotation matrix, MSE(R), and of the translation vector, MSE(t), for our framework are 0.056 and 0.000, respectively; the corresponding errors of PTRNet are 0.472 and 0.005, and those of GeoTransformer are 1.376 and 0.070. On the point cloud of part b, the MSE(R) and MSE(t) of our framework are 0.121 and 0.001, respectively; the corresponding errors of PTRNet are 0.179 and 0.004, and those of GeoTransformer are 1.740 and 0.023.

According to the experimental results, our framework performs better on industrial data. The reason is that it fully considers the geometric relationships within the point clouds, which leads to a good registration effect; its better generalization and robustness are also factors in this success.

5 Conclusion

This paper proposes a point cloud registration framework based on feature fusion. By combining the graph attention network with offset-attention, a new integration module is proposed that integrates point cloud context information and fuses point cloud information. The experiments show that our framework with this integration module can effectively improve point cloud feature learning, which improves the accuracy of PCR with better generalizability and robustness. The module is highly integrated and can be embedded into other networks. Moreover, the experiments on the local parts dataset show that our framework applies well to real data and is more universal.

We note that in actual production, the interference that complex environments cause in the point cloud data generated by 3D sensors should not be underestimated. Inspired by dehazing operations in 2D images [43, 44], we hope to effectively address this problem of excessive point cloud noise in future work.