1 Introduction

Point cloud registration is a key problem in 3D computer vision with wide applications including 3D reconstruction [1], pose estimation [2], simultaneous localization and mapping (SLAM) [3], and Point cloud registration technology is also involved in emerging applications such as autonomous driving, robotics and augmented reality. Generally, the process is divided into two steps, coarse registration and fine registration. Among them, coarse registration is the basis of fine registration, and its result directly determines whether the iterative closest point (ICP) algorithm [4] can converge to the correct solution. In order to achieve accurate and robust coarse registration, a lot of research has been carried out. Some of the algorithms use highly robust fitting or optimization techniques, such as RANSAC [5, 6] and fast global registration (FGR) [7, 8]; other algorithms achieve the goal by detecting and matching feature points. During the research, several feature points detection methods is developed, such as uniform sampling [9], Harris [10], ISS [11], Narf [12], etc. To match those feature points robustly, a lot of feature description methods are proposed by employing the local spatial or geometric metric relationships, such as SI [15], SCOV [16], ROPS [17] and SHOT [18]. In recent years, many researchers also proposed feature descriptors based on deep learning networks (DNN). For example, Li [19] proposed an end-to-end framework to learn local multi-view descriptors of 3D point cloud and get better results than traditional methods in most of the test scenarios. Zeng [20] proposed the 3DMatch, which builds training samples from registered RGB-D data and generates describers by Siamese Neural Network. Despite the above progress, the stability of the coarse registration algorithm under extreme conditions, such as low overlap rate or large noise, still needs to be further improved.

In order to overcome the above problems, this paper proposes new strategies in three aspects: feature point detection, feature descriptor construction and registration conditions. Specifically, the contributions of this paper are as follows:(1) A lightweight network for feature point detection based on the idea of non-maximum suppression (NMS) is proposed. Compared with the method of calculation directly on the point cloud data, our method uses a deep learning method to fit the local surface changes, which can filter out the influence of noise better. At the same time, the NMS strategy also prevents the phenomenon of feature point aggregation and ensures the diversity of feature points. (2) We propose a lightweight feature descriptor construction network. The integration of attention mechanism makes it be able to make full use of the information contained in the point cloud and improve the discrimination of feature descriptors, also the Siamese structure makes sample construction and network training becomes easy. (3) By introducing the concept of virtual feature points, we reduce the number of feature points required for registration from 4 to 2, which improves the success of coarse registration under the condition of low overlapping. Experiments show that the proposed algorithm can get satisfactory results in various challenging scenarios, and the overall effect is better than the latest method.

The following sections are organized as follows. Section 2 reviews relevant work. Section 3 introduces the algorithm and its implementation details. Section 4 verifies the effectiveness of the proposed algorithm through experiments and compares it with the current mainstream algorithm. Finally, the thesis is summarized in Sect. 5, and further research direction is pointed out.

2 Related work

2.1 Coarse registration

At present, point cloud coarse registration algorithms can be divided into three categories: methods based on RANSAC, methods based on feature point, and methods based on deep learning.

  1. (1)

    Methods based on RANSAC

In RANSAC-based methods, some points are randomly selected from the point clouds firstly. Then, the optimal corresponding points are sifted out by judging the spatial structure consistency. Finally, the registration parameters can be calculated from the matched points. 4PCS [21] is a typical RASAC-based method. The algorithm finds matching pairs by coplanar-four-points criterion and shows good robustness. However, it consumed a lot of time to eliminate the false matching pairs, thus limiting its application. To overcome the problem, Super 4PCS [22] established an intelligent index and eliminated invalid point pairs according to the normal angle constraint, which reduces the time complexity of 4PCS to constant. Besides, the Super Generalized 4PCS [23] algorithm adds non-coplanar optimization to the intelligent index strategy, the V4PCS [24] algorithm proposes the concept of volume consistency and MSSF-4PCS [25] uses multi-scale clustering to extract point features. Although great progress has been made, due to the complex characteristics of point cloud data, the registration effect of the above methods is still not satisfactory when the point cloud symmetry is strong or the details of the overlapping region are not obvious.

  • (2) Methods based on feature points

This kind of method usually first calculates the significance of each point in the point cloud under some measurement index, and then identifies the points whose significance is higher than a threshold as feature points; finally, the point cloud registration is realized by constructing and matching the descriptor of the feature points.

In terms of feature point detection, Harris 3D [10, 26] can extract corners in the point cloud with high efficiency. However, in practical application, it is prone to the problem of feature points gathering together. Therefore, it is not easy to build highly differentiated descriptors for these points. The subsequent ISS [11, 27] is a feature point extraction method based on eigenvalue analysis, which has obvious geometric significance. But the principal component calculation is easy to be affected by outliers. Therefore, the robustness needs to be further improved. The NARF feature proposed by steder et al. [12] can detect the points robustly and efficiently, but is more suitable for regular depth images. When applied to irregular point cloud, the repeatability of feature points is greatly reduced. Learning discriminative features for better localizing accurate and distinct keypoints across various objects is still a challenging task. Yang et al. [13] build a large-scale and diverse dataset named KeypointNet which contains 8,234 models with 103,450 keypoints and can boost the semantic understanding of 3D objects. Subsequently, Yang et al. [14] propose a self-supervised 3D keypoint detector UKPGAN based on the GAN-based sparsity control and salient information distillation modules, which is applicable to rigid/non rigid objects and real scenes.

In terms of feature description, PFH [28] parameterizes the spatial differences between query points and neighborhood points and forms a multi-dimensional histogram to geometrically describe the nearest neighbors of points. The operator has rotation and translation invariance and is robust to sampling density change or noise. However, its computational complexity reaches o (nk2), where n is the number of points in the point cloud and k is the number of neighborhood points used. To reduce the computational complexity, the FPFH describer [29] is proposed. By simplifying the calculation of feature histogram, FPFH reduces the time complexity to o(nk) successfully. At the same time, the histogram weighting of neighborhood points makes the algorithm can capture local features better. In the Next, Bi proposed the RICI descriptor [30], which enhances the tolerance of the algorithm to outliers. Tal et al. [31] proposed a self-rotation descriptor, which improves the discrimination of the descriptor using a more local computational scale. At the same time, inspired by the SIFT descriptor [32] in the image field; they also proposed the local depth SIFT (LD-SIFT) descriptor [31] with rotation and scale invariance. PCEDNet [33] proposed a new parameterization and a new lightweight neural network structure, which has greatly improved the efficiency and classification ability.

  • (3) Methods based on deep learning

The rise of deep learning has brought new ideas to point cloud registration. Compared with the manual designed methods, methods based on deep learning can automatically find the potential laws and features in the data, so as to make full use of the original information and improve the effect of point cloud registration. In order to use the convolution network to process scattered point cloud, a common practice is to sample the point cloud to three-dimensional voxels or two-dimensional grids (such as 3DShapeNets [34], PointGrid [35], LORAX [36]), and then use convolution layers to generate a description of local geometric details. Especially, the 3DMatch [20] uses the registered depth data to make training samples, and then fits the hidden input–output relationship in the samples through a Siamese neural network, which improves the discrimination ability and robustness of the feature descriptor significantly. In order to extract feature information directly from point cloud, Qi et al. proposed PointNet [37], which processes each point independently through 1 × 1 convolution kennel. At the same time, it uses a symmetry function to eliminate the output change caused by the input order of the points.

Recently, the correspondences-based methods are gaining more and more attention, which constructs the correspondences for all source points without distinguishing inliers and outliers using virtual points. DCP [38] uses DGCNN [39] and Transformer [40] to learn the task-specific features. Rpm-net [41] leverage data-driven deep neural networks to learn local features from large-scale datasets. In DeepVCP [42], these virtual corresponding points are constructed based on the assumption that accurate initial motion parameters are provided as prior. Although shown to be more robust than traditional methods, they do not work well on partially visible point clouds. DCP was later extended to Prnet [43], which is a hard matching-based method and incorporates keypoint detection to handle partial visibility. However, this strategy can only work on the identified inliers and the drawback of one-to-many matching is ineluctable. [44] designs a dedicated soft-to-hard (S2H) matching procedure within the registration pipeline, which can be easily integrated with existing registration frameworks and has been verified in representative frameworks including DCP, Rpm-net. VRNet [45] constructs a pair of consistent point clouds by adjusting virtual corresponding points (vcps) to rectified virtual corresponding points (rcps) construct a pair of consistent point clouds, which effectively breaks the distribution limitation of VCPs and improves the registration performance and efficiency. SpinNet [46] propose a new neural architecture to extract local features which is rotation invariant, representative, and its descriptor achieve good results in point cloud registration. Pointdsc [47] select consistent correspondences after the initial matching to tackle the outliers. This approach is effective but complex. Predator [48] proposes an overlap attention module to handle point-cloud pairs with low overlap, but this is operationally complex as well as time-inefficient. DIP [49] presents a PointNet-based architecture for learning 3D local deep descriptors that can be used to register point clouds without requiring an initial alignment. Our network is also based on Siamese PointNet, but we add attention mechanism to the network, which enhances the expression ability of the network and obtain a better registration effect. At the same time, we simplify the network structure, making our network lighter and more efficient. Through experimental comparison, the registration rate of our network is as high as 97.08%, which is about 10% higher than DIP [49] and Predator [48], 30% higher than 3DMatch [20] and a traditional method [27] using ISS and FPFH.

2.2 Attention mechanism

Attention plays an important role in human perception [50,51,52]. When people see a scene, they will involuntarily choose the most important part to watch. Attention mechanism comes from the simulation of human vision. It allows the computer to efficiently and accurately screen out the useful message from the massive information. With the continuous progress of deep learning technology, the attention mechanism has been widely used. For example, Wang et al. [53] proposed the residual attention network, which uses an encoder-decoder style attention module. By refining the feature map, the network not only performs well on clean data, but also is robust to noisy inputs. Hu et al. [54] proposed the SE-Net, in which a new SE module is used. The module has a simple structure and strong universality. It can be embedded into any existing network, and the consumed computing resources only need to be increased by less than 10%. CBAM [55] is another popular attention module for convolution networks, which combines channel attention and spatial attention in series. Spatial attention allows the neural network to transform the spatial information in the original picture into another space while retain the key information. Channel attention can change the proportion of weight distribution between convolution channels to give more play to the network efficiency. PointNet extracts a global feature when extracting data features, thus ignoring local features and the differences between feature channels. This may cause the network to ignore potential relationships among feature information of point clouds. To solve the above problems, this paper adds a channel attention mechanism to the network structure to fully extract the feature information contained in point clouds. At the same time, different weights are assigned to different feature information, which improves the network's ability to extract feature information.

3 Our algorithm

This paper presents a point cloud coarse registration method combining local extremum point extraction and deep learning feature descriptor. Specifically, we first extract curvature extremum points from source and target point clouds as feature points, respectively; then, we put the feature points into a lightweight attention based feature description network to generate the descriptor; finally, false matching pairs are filtered by RANSAC and the registration is completed with at least 2 matching pairs with the help of virtual feature points. The flowchart of the specific algorithm is shown in Fig. 1.

Fig. 1
figure 1

Pipeline of our algorithm. Given the source and target point cloud (a), we first extract curvature extremum points using our feature point detection network (b);then we use our deep feature description network to generate the descriptors of feature points and find matching points (c); finally, registration is completed with the help of virtual feature points (d); the result of registration (e)

In this paper, we first choose curvature extremum points as feature points in (b). As curvature extremum points locate at the places with the most drastic surface changes in the model, it contains rich surface information, which is beneficial for generating feature descriptors. We regard it as the screening criterion and train a very lightweight net to detect the feature points. We also use a Siamese network with attention mechanism as the feature descriptor generator in (c), which can automatically find the most effective feature description strategy from a large number of samples, and effectively improve the discrimination of feature descriptors. Finally, by introducing virtual feature points, we reduce the minimum number of matching points required for registration from 4 to 2 in (d), which can effectively improve the registration efficiency and the registration success rate in the case of a low overlapping rate.

3.1 Curvature extremum points detection

  • (1) Theory

Curvature is a direct measurement of the local change of the surface. Generally, the greater the curvature, the richer the surface information contained. Therefore, if the curvature extreme points are used as feature points, they can encode the most surface information and improve the discrimination of feature descriptors. Following the idea of non-maximum suppression, this paper first calculates the gauss curvature value of each point in the point cloud, then fits the curvature distribution of the local neighborhood through a quadric surface, and finally recognizes the curvature extreme point by comparing the distance between the curvature extreme point from the quadric surface and the current point. The specific process is as follows:

  1. (1)

    Create a Local Coordinate System (LCS) with the current point as the origin, its normal as the z-axis, and the maximum principal curvature direction as the x-axis. Then, translate the current point and its r neighborhoods to the LCS;

  2. (2)

    Construct the objective equation in Eq. (1) and solve the parameter \(b_{0}\) ~ \(b_{5}\), where n is the number of neighborhood points, \(x_{i}\),\(y_{i}\) are the x, y coordinate component of the neighborhood points under the LCS and \(c_{i}\) is the gauss curvature of neighborhood points;

    $$\arg \min \sum\limits_{i = 1}^{n} {\left( {b_{0} x_{{\text{i}}}^{2} + b_{1} y_{i}^{2} + b_{2} x_{i} y_{i} + b_{3} x_{i} + b_{4} y_{i} + b_{5} - c_{i} } \right)^{2} }$$
    (1)
  3. (3)

    Put y = 0 into Eq. (1) to obtain the curvature curve at the maximum principal curvature direction as \(f\left( x \right) = b_{0} x^{2} + b_{3} x + b_{5}\), then calculate the extremal coordinate of \(f\left( x \right)\) as \({{x_{{{\text{max}}}} = - b_{3} } \mathord{\left/ {\vphantom {{x_{{{\text{max}}}} = - b_{3} } {\left( {2b_{0} } \right)}}} \right. \kern-0pt} {\left( {2b_{0} } \right)}}\);

  4. (4)

    Put y = 0 into Eq. (1) to obtain the curvature curve at the maximum principal curvature direction as \(f\left( y \right) = b_{1} y^{2} + b_{4} y + b_{5}\), then calculate the extremal coordinate of \(f\left( y \right)\) as \({{y_{{{\text{max}}}} = - b_{4} } \mathord{\left/ {\vphantom {{y_{{{\text{max}}}} = - b_{4} } {\left( {2b_{1} } \right)}}} \right. \kern-0pt} {\left( {2b_{1} } \right)}}\);

  5. (5)

    Calculate the distance between the current point \(p\) and the curvature extremal point as \(l = \left\| {\left( {x_{\max } , \, y_{\max } } \right)} \right\|\). If \(l\) is less than the average sampling density \(\rho\) of the point cloud, then identify it as a feature point; otherwise, \(p\) is far away from the real feature point and identified as a general point;

  6. (6)

    Project the extremal point onto the point cloud by MLS surface [56] to obtain its accurate position.

In order to prevent the aggregation of extreme points, which will adversely affect the discrimination of feature descriptors, we further screen the initial feature points by using the idea of NMS, that is, we only retain the point with the largest curvature in the local neighborhood as the final feature point. However, it should be pointed out that the curvature is a second-order differential of a continuous surface and it is difficult to be estimated accurately in the case of data missing or noise using traditional methods. This also makes it difficult to detect the curvature extreme points on the noise model, which will greatly reduce the repeatability of feature points. Thanks to the progress of deep learning, it brings an opportunity to the development of feature detection. Different from the traditional methods, deep learning can improve the robustness of feature detection through the diversity of training data.

  • (2) Feature point detection network

The computational cost of the traditional algorithm for detecting feature points is relatively small, but the accuracy is reduced due to noise and other factors in the non-ideal case. Therefore, the main challenge is how to extract enough expressive features for each point to make the model more accurate while ensuring that the computational cost is not too large. Aiming at this challenge, we take the above theory as the screening criterion, and train a lightweight network as a feature point detector. The specific structure of the network is shown in Fig. 2. In this network, we take the position and curvature of the point cloud as input and extract the features through 3 Convolution layers and a Max-pooling layer, then use the full connection layer to map the learned features to the sample label space. The last layer of the network is the softmax layer, which is used to map the features of the last fully connected layer to a score vector.

Fig. 2
figure 2

Feature point detection network

For the implementation details of our detection task, we added curvature information to help the network detect feature points better, i.e., taking (x, y, z, curvature) as input. The output of the network is a two-dimensional score vector, representing the probabilities of two categories (feature points or not), respectively. Finally, we screen the required feature points by setting a probability threshold. In training, since we regard this task as a classification problem, we choose cross-entropy as the loss function which is often used to evaluate the performance of a classification model. We also added different levels of noise and random rotation to the samples to increase the diversity of samples. Subsequent experiments show that those strategy is effective, which makes our feature point detection network receives better repeatability and robustness than the current mainstream algorithms.

3.2 Deep feature description network

In order to process the scattered point cloud directly, our feature description network uses PointNet as the backbone, which solves the problem of point cloud disorder by using a symmetric function. Also, it uses a spatial transformation network to solve the network output changing caused by point cloud pose variation.

Nevertheless, PointNet was originally designed for point cloud classification, so its feature description is not optimal for point cloud registration, that is, the direct use of PointNet structure cannot get the optimal feature descriptor. Therefore, as shown in Fig. 3, we modify the network structure to be Siamesed, and use the Contrastive Loss to fine tune the network parameters, so as to make the generated feature descriptor meet the needs of point cloud registration. At the same time, as the output of the network becomes binarized, the sample preparation process can be greatly simplified, and the difficulty of network training can also be greatly reduced. Specifically, we take the feature points(x, y, z) mentioned above as the input, finally generate a 1 × 1024 descriptor for finding matching points. During training, we input the descriptor into the Contrastive Loss for network’s learning.

Fig. 3
figure 3

Our deep feature description network

It is noteworthy that compared with point cloud classification, the local attribute of feature description is more obvious, that is, the required neighborhood points to generate feature is much less and the solution space range required to search is much smaller. Therefore, as shown in Fig. 4, we simplified the MLP part of the original network. Specifically, our network has reduced four full connection layers, three batch normalization layers and two drop out layers. Experiments show that the above modifications do not affect the discrimination of feature description.

Fig. 4
figure 4

The modified PointNet and Channel Attention

In order to make the network be able to capture the channel differences, we insert an attention module at the last of each Siamese branch, that is, the CBAM module mentioned above. The original CBAM module contains two parts: the channel and the spatial attention. In the channel attention module, the network compresses the feature map in the spatial dimension, obtains a one-dimensional vector, so as to obtain the channel weighted vector and extract important channel information. As CBAM is a lightweight module, the computing resource consumption of this module is almost negligible. On the other hand, spatial attention pays more attention to the spatial information of data. As mentioned before, the T-Net has eliminated the impact of pose changing, therefore, spatial attention has little improvement to our network. More importantly, spatial attention will highlight the main characteristics of the data while ignore the neighborhood characteristics, which will have a negative impact on the feature description. Therefore, we dropped the spatial attention mechanism and only used the channel attention in our network.

After the above modifications, the final structure of our network is shown in Fig. 3. In the Siamese part of the network, the two branches deal with the feature points from the source point cloud and the target point cloud, respectively. Specifically, the local neighborhood data set of each feature point \(\left( {x_{1} ,x_{2} ,...x_{n} } \right)\) is mapped into a one-dimensional vector after a multi-layer representation network, as shown in Eq. (2):

$$F = f\left( {x_{1} ,x_{2} ,...x_{n} } \right) = \gamma \left( {\mathop {\max \left\{ {h\left( {x_{i} } \right)} \right\}}\limits_{i = 1,...n} } \right),$$
(2)

where \(f\) represents the mapping from a point set to the vector, which is invariance to the order of the input points, \(F\) is the result vector, \(\gamma\) and \(h_{*}\) are continuous functions, which represent multi-layer perceptron networks.

Next, we project the feature vector \(F\) into the channel attention module to generate a new feature map to emphasize the differences of channels. The process is shown in the Eq. (3):

$$\begin{gathered} M_{C} \left( F \right) = s\left( {MLP\left( {{\text{AvgPool}}\left( F \right)} \right) + MLP\left( {{\text{MaxPool}}\left( F \right)} \right)} \right) \hfill \\ \quad \quad \quad \; = s\left( {W_{1} \left( {W_{0} \left( {F_{{{\text{avg}}}}^{C} } \right)} \right) + W_{1} \left( {W_{0} \left( {F_{\max }^{C} } \right)} \right)} \right), \hfill \\ \end{gathered}$$
(3)

where \(M_{C} \left( F \right)\) is the output of the attention module, \(W_{0} \in R^{C/r \times C}\), \(W_{1} \in R^{{C \times \frac{C}{r}}}\), \(\sigma\) represents sigmoid, \(r\) is the reduction ratio.

It can be seen from the equation that the feature vector passes through the average pool layer and the maximum pool layer in parallel, then passes through a shared MLP, and finally generates a 1024-dimensional feature description vector through sigmoid processing and maximum pool.

3.3 Registration with virtual feature points

In order to ensure the stability of the calculation, it usually needs at least four pairs of matching points to get the registration parameters for traditional methods. However, when the overlapping of two point clouds is too small or the matching is disturbed by noise and outliers, no enough matching points can be found. In this paper, we propose the concept of virtual feature points. As shown in Fig. 5a, b, ▲and ●are the two real matching points. In order to complete the calculation of registration parameters, we move these matching points along their normal vector for a distance d to generate virtual matching points (represented by dotted lines). Finally, with the help of virtual corresponding points, the number of matching points meets the requirements, and then the registration parameters can be calculated.

Fig. 5
figure 5

Registration with virtual feature points. a Given two pairs of real matching points in source and target point cloud and then generate virtual corresponding feature points. b Using RANSAC for registration. c Real and virtual matching points after registration

At the same time, a fewer matching points means fewer sampling in the RANSAC algorithm. According to the sampling formula as shown in Eq. (4), if we set the sampling success confidence \(\tau\) to 0.99 and the proportion w of real matching points in the initial result to 50%, then in classical RANSAC, the minimum number of samples n is 4, that is, the required sampling times k is 71. Instead, in virtual corresponding point algorithm, n is 2, and the number of sampling required is 16, which shows that our algorithm is more efficient than traditional methods.

$$k = {{\log \left( {1 - t} \right)} \mathord{\left/ {\vphantom {{\log \left( {1 - t} \right)} {\log \left( {1 - w^{n} } \right)}}} \right. \kern-0pt} {\log \left( {1 - w^{n} } \right)}}.$$
(4)

Based on the above analysis, the calculation process of registration parameters combined with virtual corresponding points and RANSAC is as follows:

  1. (1)

    Randomly select two pairs of matching feature points in the source point set and the target point set, and then move the two points along their normal vector for a distance of 10 \(\rho\), where \(\rho\) is the average sample density.

  2. (2)

    Check the distance between the virtual corresponding points and the distance between the real corresponding points. If the difference exceeds 3 \(\rho\), it indicates that they are not real matching points, and return to step (1).

  3. (3)

    Calculate the registration parameter R|t using the corresponding points, and transform all feature points in source point cloud by R|t.

  4. (4)

    Search as many matching pairs as possible in the remaining feature points. That is, if the distance between the feature points in the source point cloud and a feature point in the target point cloud is less than 3 \(\rho\), it will be added to the matching point set.

  5. (5)

    Optimize registration parameters using all matching point pairs found.

  6. (6)

    Repeat step (1–5) according the Eq. (4), and select the one that finds the most corresponding points as the criterion to obtain the optimal transformation parameters.

It is noteworthy that even if the accuracy of normal vector may be affected by noise, our method is not expected to be so since the major focus is on the consistency of normal vector. By using a larger neighborhood to calculate the normal vector, we can ensure that the consistency of the normal vector is not affected by the noise and improve the efficiency in the meantime. Although this may lead to the increase of absolute error in numerical value compared to the general method, it will not affect the final result as long as the consistency of normal vector is maintained..

4 Experiments

This section verified the algorithm in this paper from the aspects of repeatability of feature extraction, ablation experiment for feature point detection, discrimination of descriptors, universality of the algorithm, etc. The algorithm is implemented in pytorch and tested on a PC with Intel Core i7 8700 CPU@3.2 GHz, 16GB RAM and NVDIA GTX 1080Ti GPU. When testing the alignment effect, we mainly use the data set from the Stanford 3D scan repository, in addition to data sets from [20] and [57], and some models from ModelNet40 [58] and the KITTI outdoor LiDAR dataset [59].

For training details, we train our feature point detection network/feature description network using Stochastic Gradient Descent for 50/70 epochs, with initial learning rate 0.01, momentum 0.9, and weight decay 10–4. The learning rate is exponentially decayed by 0.5/0.1 every 10 epochs. To ensure the smooth form in the loss trajectory, we use batch size 128 and 64, respectively.

4.1 Repeatability of feature points

In order to fully verify the effect of our algorithm, we selected four models (armadillo, bunny, dragon and Buddha) as the test objects and take repeatability as a metric to measure the results. Specifically, we calculate the distance between each pair of adjacent points and take 5 \(\rho\)(the average sampling density of a point cloud) as threshold. If the position draft after noise is added is less than the threshold, the feature point is recognized as repeatable. We report the percentage of repeatable in Table 1.

Table 1 Average repeatability

To simulate the real data, we add the noise along the normal vector, that is, along the surface of the object (coordinate z axis), which produces the maximum error. In the experiments, we test repeatability by adding 0.5 \(\rho\) noise to the point cloud models and comparing the feature position offset with that before adding. Also, we compared our results with three well-known feature point detectors, ISS, Harris and UKPGAN [14].

Figure 6 shows the experimental results of the armadillo model. The blue points in the figure are the feature points detected without noise, and the purple points are the feature points detected with 0.5 \(\rho\) noise added to the model. Statistic shows that the repetition rate of our algorithm is 83.41%, that of UKPGAN is 80.91%, that of Harris/ISS is 76.83%/72.87%, which means our method is robust and stable.

Fig. 6
figure 6

Feature point drift under 0.5 \(\rho\) noise for the Armadillo model

Using the same criteria, we tested the repeatability of feature detection on the other three models, and the qualitative results are shown in Figs. 7, 8 and 9, respectively. Statistics show that the average repeatability of our algorithm is 17% higher than that of ISS, 10% higher than that of Harris and as accurate as UKPGAN. The reason for this is that, both ISS and Harris methods calculate directly on the discrete point cloud. Therefore, the surface information at non-sampling points cannot be used, and the change of sampling points caused by noise is more likely to affect the stability of calculation. UKPGAN perform better than ISS and Harris thanks to estimating local reference frame (LRF). Its salient information distillation can force UKPGAN to extract irrelevant points of point cloud models so that it can achieves high repeatability on many models. However, the focus of its training was not the model with rich surface feature information, which resulted in his performance in model armadillo and Buddha being inferior to ours. We use a continuous surface to fit the local curvature distribution, which can make more comprehensive use of the potential surface information. Also, the least square criterion used in the fitting can further filter out the influence of noise. More importantly, out lightweight network is much simpler than UKPGAN, which greatly improves the efficiency of subsequent registration. Consequently, out method outperforms UKPGAN and is significantly better than Harris and ISS (Fig. 10).

Fig. 7
figure 7

Feature point drift under 0.5 \(\rho\) noise for the Bunny model

Fig. 8
figure 8

Feature point drift under 0.5 \(\rho\) noise for the Buddha model

Fig. 9
figure 9

Feature point drift under 0.5 \(\rho\) noise for the dragon model

Fig. 10
figure 10

Registration result with describers from original PointNet (b), our feature description network without (c) and with attention module (d)

4.2 Ablation experiment

4.2.1 Feature detection network

In an attempt to ensure that our detection network has the highest accuracy and the minimum network computation, we compared the detection accuracy of the network with and without transformation matrix, and when the number and dimensions of convolutional layers are different.

Accuracy and recall are common measures to determine the validity of a detection network. They are defined as:

$${\text{Precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}$$
(5)
$${\text{Recall}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
(6)

Here, TP, FP and FN are the true positives, false positives and false negatives, respectively.

PointNet set up two layers of STN network to solve the problem of invariance under transformations. They are designed to adjust the position of point cloud in space and for the alignment of features, respectively. PointNet’s original task was to classify 3D objects, while we classify points in the point cloud model, so the features of each point can be extracted well without adding the transformation matrix. We specifically selected the armadillo model and add 5 \(\rho\) noise for ablation experiments. Specifically, we take the feature points under the clean point cloud as the benchmark and enumerate seven schemes for comparison. Among them, scheme 7 is to calculate the curvature extremum points with the traditional algorithm and screen the feature points with the idea of NMS (so-called NMS).

As shown in Table 2, the first three schemes contain STN and convolutional layers, while the last three schemes only contain convolutional layers. The sizes of the convolutional layers a, b, c and d are 4 × 64 × 1,64 × 64 × 1,64 × 64 × 1,64 × 64 × 1,64 × 128 × 1, respectively. We can see from the table that deep learning schemes performed better than traditional methods, and the scheme with three convolutional layers has the best results. The precision of scheme 2 is slightly better than that of scheme 5, but the recall is not as good as the latter. Moreover, since scheme 2 contains two STN networks, its computation is much larger. In summary, we choose the scheme 5 as our final feature point detection model.

Table 2 Precision and recall

4.2.2 Channel attention

To achieve the best registration effects, we registered the Armadillo model with original PointNet, our revised feature description network without/with attention module as backbone, respectively. As shown in Fig. 7, when using original training weight of the PointNet, there is a large deviation on the model legs, and the overlap rate is only 64.15% after the registration, which shows that the PointNet is not suitable for point cloud registration before modification. After we simplified the MLP layer and turned the network into a Siamese structure, the generated descriptors become more in line with the needs of point cloud registration, and the registration effect was greatly improved, with an overlap rate of 94.04%. Finally, when the attention module is added, the network learns the differences between channels, and the registration rate is improved to 99.41%, which verified the effect of the attention module.

4.3 Discrimination of descriptors

In order to verify the discrimination of feature descriptors, we first extract 500 curvature extreme points from the registered data as feature points; then, the feature descriptors are calculated by using the source point cloud and target point cloud before registration. Finally, we establish the matching relationship of feature descriptors through KD tree, and count the proportion of correctly matched feature points to the overall feature points (we call this criterion Fetch-Ratio later). Obviously, the better the discrimination of feature descriptors, the more feature points can be correctly matched.

In Table 3, we list the Fetch-Ratio of the four models under different noise levels and compare with 3DMatch and DIP [49]. Horizontally, in each noise level, the Fetch-Ratio of our feature description network is generally greater than DIP and 3dmatch, which means that our network can make better use of the feature information contained in the point cloud. DIP is only focused on the local geometric information, which makes its descriptors more distinctive on different models. Consequently, DIP outperforms our method in model bunny and dragon. Vertically, both algorithms are inevitably affected by noise, but our method is still higher than DIP and 3dmatch at each noise level. DIP is particularly affected by noise, which means although the features extracted by DIP are rotation invariant, they are not robust and general when being applied to models with strong noise. On the whole, the descriptors we generated are more robust and discriminative than the descriptor of DIP and 3dmatch, and will also bring some improvement in the subsequent registration effect.

Table 3 Fetch-ratio

4.4 Registration effect

In this section, we tested the registration effects of ISS + FPFH, 3DMatch, DIP [49], Predator [48] and our algorithm under the conditions of noise-free, 0.1 \(\rho\), 0.5 \(\rho\) noise levels and 10% outliers. The dataset is shown in Fig. 11 and their average sampling density is 0.001mm. We recorded the average angle of the normal vector and translation in the common part of the data as the initial conditions. The angle and translation in the common part of armadillo is 111°and 0.079mm, respectively.

Fig. 11
figure 11

Original data

In all experiments, the number of match points used by 3dmatch/DIP/Predator is 500(which is the default value), and the number of key points selected by our algorithm is 128. We report all the registration rate in Table 4.

Table 4 Coarse registration rate

Figure 12 shows the registration effects of the three algorithms for the noise-free point cloud. It can be found that although ISS filters out the points with significant features in the point cloud, the position of the extracted feature points were changed greatly due to the resampling and noise, which makes registration results deviate greatly at the head and legs. In Fig. 12b, with the better descriptor generated by 3DMatch, the registration effect of the head and legs of the model is improved. The overlap-attention block in Predator can exchange early information between the latent encodings of the two point clouds which greatly improves registration performance. Both DIP and Predator perform better than the first two methods, except for some details like ears and hands. Our algorithm adopts curvature extreme points with better repeatability, and uses the channel attention mechanism to strengthen the discrimination of descriptors further. Therefore, the registration effect is the best of the five, and the model fitted well at the head, legs and hands.

Fig. 12
figure 12

Registration result for noise-free point cloud

In Fig. 13, we added Gaussian noise with amplitude of 0.1 \(\rho\) to the point cloud to verify the robustness of our algorithm. The results show that the ISS + FPFH has poor robustness, mainly because the LCS constructed by FPFH is affected by the noise, resulting in deviation of the registration. 3DMatch performs better than ISS + FPFH, but there is still a large deviation at the head of the model. DIP and Predator pay few attentions to noisy data when training, so they are affected and the registration rate decreases about 6%. Our feature network is trained with noisy data, so the algorithm has the best robustness in the case of noise, and the registration rate still reached 95.84%.

Fig. 13
figure 13

Registration result of 0.1 \(\rho\) noise level

In the experiment in Fig. 14, we increased the noise amplitude to 0.5 \(\rho\). The registration effect of ISS + FPFH is further deteriorated, and the hand and head of the model were deviation greatly. The registration effect of 3DMatch was also worse. As previously analyzed, DIP is vulnerable to noise and Predator is committed to improving registration effect in low-overlap scenario, which leads to a further deterioration of the effect. In contrast, our algorithm still performed well, the registration results in the parts with significant characteristics such as the head, hands and legs of the model are comparable to the case of noise-free, which means our algorithm still has high robustness in the case of strong noise.

Fig. 14
figure 14

Registration result of 0.5 \(\rho\) noise level

In the experiment of Fig. 15, we added 10% outliers to both models. It can be seen from the results that the ISS algorithm has poor robustness to outliers, resulting in registration failure and the point cloud model is directly flipped. 3DMatch and DIP random sample the point cloud to generate matching points, so it has a considerable chance to include outliers to the matching set, resulting in a poor registration effect. The overlap-attention block of Predator can predict salient points lying in the overlap region, which makes it achieve an unexpected good result on models with outliers. In our algorithm, when selecting curvature extreme points, a distance threshold is used to filter outliers, so the probability that an outlier point is selected is rare. At the same time, we also use a distance threshold constraint on the virtual corresponding points, even if outliers are selected, they will be filtered at this stage, which makes the final registration accuracy not be reduced too much.

Fig. 15
figure 15

Registration result of 10% outliers

As shown in Table 4, we take the coarse registration rate as our metric to evaluate the effect of our algorithm on the armadillo model. Specifically, we calculated the proportion of the number of points whose nearest neighbor distance meets the threshold (0.1 \(\rho\)) after coarse registration and fine registration, and calculated the ratio of the two. Horizontally, our algorithm performs well in all the noise cases. The registration rate reaches 97.08% for the noise-free situation, and only decreases by 1.24% and 0.24% in the case of 0.1 \(\rho\) and 0.5 \(\rho\) noise, respectively. Although the registration rate decreases by 14.23% when 10% outliers are added, it is still much higher than other methods. In other words, the registration accuracy of our algorithm will not differ much with noise and outliers. However, 3DMatch performs much worse than our algorithm. The highest registration accuracy is only 70.45%, and it drops to 60.56% rapidly when outliers are added. The registration accuracy of ISS + FPFH under 0.5 \(\rho\) noise and 10% outliers drop by 23.07% and 47.97%, respectively, which is much more obvious, and the registration accuracy is only 46.7% and 21.8%. DIP and Predator seem to have a greater decline than 3DMatch when noise is added, but their overall performance is still better than first two methods. Vertically, our algorithm also performs better than other four methods in all cases, which shows that the registration accuracy of our algorithm is greatly improved compared with both deep learning algorithms and traditional algorithm.

4.5 Algorithm universality

Aim to verify the generality of our algorithm, we select five more models including the Dancing Children, the Horse, the Bunny, the Buddha and the Dragon. The original data is shown in Fig. 16 and their average sampling density is 0.001mm. For initial conditions, the average angle and translation in the common part of five models is 7°and 0.1 mm, respectively. Among them, the Dancing Children model and the Horse model are produced manually by obtaining visible point clouds from different perspectives, while the other three models are the actual scanning data from the Stanford 3D scanning repository. We also add 0.5 \(\rho\) noise to all those models to increase the challenge.

Fig. 16
figure 16

Original data

As shown in Figs. 17, 18, 19, 20, 21, ISS + FPFH and 3DMatch perform poorly in several models. For example, in Fig. 17, they deviations a lot at the head and base of the second child, and there are also large deviations in the early faucet in Fig. 21. DIP and Predator achieve an excellent result in Figs.17, 18, 19. However, they perform not so well in models with rich surface feature information like Figs. 20 and 21. The registration effect of our algorithm has significant improvements over ISS + FPFH and 3DMatch. When facing models with smooth surface, we can do as well as DIP and Predator. And, our algorithm can still maintain a high level in Figs. 20 and 21 where other methods achieve unsatisfactory results.

Fig. 17
figure 17

Registration result of the Dancing Children model

Fig. 18
figure 18

Registration result of the Horse model

Fig. 19
figure 19

Registration result of the Bunny model

Fig. 20
figure 20

Registration result of the Buddha model

Fig. 21
figure 21

Registration result of the Dragon model

We report all the coarse registration rate on five models in this section. We can see from Table 5 that although our algorithm cannot get best results on every model, the maximum gap between us and the best method is less than 2%. Moreover, our average registration rate has reached 90%, which means the generalization ability of our algorithm.

Table 5 Coarse registration rate on dataset of generality test

4.6 Registration on more dataset

Recently, more and more papers [38, 41,42,43,44,45, 49] focus on indoor scenes, ModelNet40 [58] and KITTI outdoor LiDAR dataset [59]. In order to verify the university of our algorithm further, we select several models from 3DMatch [20], NDT [57], ModelNet40 and KITTI to compare with two state-of-the-art methods. Specifically, Fig. 22 is from 3DMatch and its angle and translation in the common part is 13 and 0.15mm, respectively; Fig. 23 is from NDT and its angle and translation in the common part is 5°and 0.037 mm, respectively; Figs. 24 and 25 is from KITTI; Figs. 26, 27, 28 is from ModelNet40.

Fig. 22
figure 22

Registration result of the indoor scene model (1)

Fig. 23
figure 23

Registration result of the indoor scene model (2)

Fig. 24
figure 24

Registration result of KITTI (1)

Fig. 25
figure 25

Registration result of KITTI (2)

Fig. 26
figure 26

Registration result of ModelNet40 (1)

Fig. 27
figure 27

Registration result of ModelNet40 (2)

Fig. 28
figure 28

Registration result of ModelNet40 (3)

Indoor scene point cloud data is usually characterized by structural occlusion and self-similarity. These factors have brought some difficulties to the registration of indoor scene registration. ModelNet40 is usually used to evaluate the performance of 3D shape classification algorithms. KITTI includes multi-view images taken by autonomous vehicles and some lidar data. It is a great challenge to perform well on these datasets. As shown in Figs. 22, 23, 24, 25, 26, 27, 28, with highly differentiated descriptor, all DIP, Predator and our algorithm achieve an excellent result on all datasets. According to statistics, the average registration rate of three methods is higher than 85%, which also shows that we are able to complete the task of registration on these challenging datasets.

4.7 Registration with two matching points

[48] points out that although the low-overlap regime is very relevant for practical applications, the registration performance of some state-of-the-art methods deteriorates rapidly when the overlap between the two point clouds is very low (< 30%).When the overlapping area of two point clouds is too small, it will bring great challenges to alignment, because it will lead to that only few feature points can be searched in the overlapping area. Also, the feature points in non-overlapping areas will bring more interference to feature point matching in that situation. Figures 29 and 30 show such scenarios. In the overlapping region of the two models in the figure, only three matching points located in the overlapping region can be found in armadillo and two in dragon. Therefore, the traditional methods cannot stably calculate the registration parameters. However, the overlap-attention block can greatly improve performance in the low-overlap scenario. We introduced the concept of virtual feature points, so more virtual matching points can be constructed by offsetting the matched points along its normal vector. Finally, enough corresponding points can still be got to calculate the registration parameters. It can be seen from the figure that our method outperforms Predator because the overlapping region is too small for Predator to extract good enough features for registration. Thanks to our proposed virtual matching points, two models coincide well after registration and our registration rate reaches 86%, which is higher comparison than Predator’s 79%.

Fig. 29
figure 29

Registration with virtual feature points (armadillo)

Fig. 30
figure 30

Registration with virtual feature points (dragon)

5 Conclusion

In this paper, a coarse point cloud registration algorithm based on local extremum feature and depth descriptor matching is proposed. The algorithm identifies the curvature extreme points as feature points through a lightweight network, which improves the repeatability and robustness of feature detection; meanwhile, it generated highly differentiated descriptors through a lightweight attention-based feature description network. We also propose a method to generate virtual corresponding points based on feature points and their normal vectors, which reduces the minimum number of feature points required for registration from 4 to 2. Experiments show that the proposed feature extraction, description and registration methods have more advantages than traditional methods, and have achieved good registration results in various challenging scenes.

In the future, we will do more research from the following aspects: first of all, when testing Fetch-Ratio, we found that although the performance of our network is still better than 3dmatch and DIP, the drop of Fetch-Ratio of our network is relatively large when the noise level reaches a very high level. How to further improve the robustness of the network in high noise environment is one of our research directions. Secondly, we find that the distance along the normal vector needs to be turned manually when generating virtual matching points. In future studies, we will further improve the algorithm.