
1 Introduction

Relocalization within an existing 3D map is a critical functionality for numerous applications in robotics [3] and autonomous driving [35, 53]. A common strategy is to split the problem into two subtasks, namely global place recognition and local 6DoF pose refinement. Much effort has focused on tackling the problem with 2D images [6, 44, 46, 51], where the 3D maps are usually defined as image feature points reconstructed in 3D using Structure from Motion (SfM). Coarse global place recognition is achieved by image retrieval, whereas accurate local 6DoF pose refinement is addressed separately by feature matching and PnP. With the progress of deep learning for image descriptor extraction [2, 19] and 2D keypoint detection/description [14, 15, 40, 60], image-based methods have significantly gained in robustness to variations in viewpoint and illumination.

Fig. 1.

Left: We propose a hierarchical network for large-scale point cloud based relocalization. The network consumes raw 3D points and performs local feature detection, description and global descriptor extraction in one forward pass. The global descriptor is used to retrieve similar scenes from the database. An accurate 6DoF pose is then obtained by matching the local features. Right: Our local descriptors trained on LiDAR points work favorably on the sparse point clouds generated by a visual SLAM method without fine-tuning. The point clouds are generated from sequences with different weather, lighting conditions and scene layouts, and thus have significantly different distributions.

Alternatively, one can tackle these variations by working on 3D point clouds, since these are inherently invariant to such issues. Moreover, there exist numerous SLAM pipelines that generate accurate large-scale point clouds using sensory input from LiDAR [13, 63, 64] or cameras [17, 55]. While there is great potential in relying on such data, research on point cloud based relocalization is significantly less mature than its image-based counterpart [15, 40]. In particular, deep learning of 3D descriptors emerged only roughly three years ago. With most of the early attempts focusing on small-scale tasks like object classification, detection and segmentation [28, 37, 39, 57, 68], only a limited number of networks have been proposed for large-scale localization [30, 31, 58]. Moreover, among these few attempts, global place recognition [1, 65] and local 6DoF pose refinement [8, 18, 23] have been addressed in isolation, despite the fact that both tasks depend on the same low-level geometric cues.

In this paper, we propose a hierarchical deep network for large-scale point cloud based relocalization – see Fig. 1. The network directly consumes unordered 3D points and performs keypoint detection and description, as well as global point cloud descriptor extraction, in a unified manner. In contrast to the conventional detect-then-describe pipeline, our local features are learned following the describe-and-detect concept. We explicitly estimate a confidence map of the discriminativeness of local features and learn to select keypoints that are well-suited for matching in an unsupervised manner. The local features are aggregated into a global descriptor for global retrieval, yielding a consistent workflow for large-scale outdoor 6DoF relocalization. Our main contributions are summarized as follows:

  • We present the first work that unifies point cloud based global place recognition and 6DoF pose refinement. Our method performs local feature detection and description, as well as global descriptor extraction, in one forward pass, running significantly faster than previous methods.

  • We propose to use FlexConv and SE blocks to integrate multi-level contextual information and channel-wise relations into the local features, thus achieving much stronger performance on feature matching and also boosting the global descriptor.

  • We introduce a describe-and-detect approach to explicitly learn a 3D keypoint detector in an unsupervised manner.

  • Both our local and global descriptors achieve state-of-the-art performance on point cloud registration and retrieval across multiple benchmarks.

  • Furthermore, our local descriptors trained on LiDAR data show competitive generalization capability when applied to the point clouds generated by a visual SLAM method, even though LiDAR and visual SLAM point clouds exhibit very different patterns and distributions.

2 Related Work

Handcrafted Local Descriptors encode local structural information as histograms over geometric properties, e.g., surface normals and curvatures. Spin Image (SI) [24] projects 3D points within a cylinder onto a 2D spin image. Unique Shape Context (USC) [52] deploys a unique local reference frame to improve the accuracy of the well-known 3D shape context descriptor. Point Feature Histogram (PFH) [43] and Fast PFH (FPFH) [42] describe the relationships between a point and its neighbors by calculating angular features and normals. While these handcrafted methods have made great progress, they generalize poorly to large-scale scenarios and struggle to handle noisy real-world data.

Learned Local Descriptors. To cope with the inherent irregularity of point cloud data, researchers have suggested converting 3D points to regular representations such as voxels and multi-view 3D images [38, 47, 49, 50, 54]. As a pioneering work, PointNet [37] proposed to apply deep networks directly to raw 3D points. Since then, new models [28, 39, 66] and flexible operators on irregular data [21, 56, 59] have been emerging. Alongside this progress, learning-based 3D local descriptors such as 3DMatch [62], PPFNet [11] and PPF-FoldNet [10, 12] have been proposed for segment matching, yet they are designed for RGB-D based indoor applications. Recently, Fully Convolutional Geometric Features (FCGF) [8] was proposed to extract geometric features from 3D points. However, none of these methods tackles feature detection; they obtain their features by sampling instead. Another class of methods utilizes deep learning to reduce the dimensionality of handcrafted descriptors, such as Compact Geometric Features (CGF) [25] and LORAX [16]. In the realm of large-scale outdoor relocalization, 3DFeatNet [23] and L3-Net [31] extract local feature embeddings using PointNet, whereas 3DSmoothNet [18] and DeepVCP [30] rely on 3D CNNs. In contrast to registration based on feature matching, DeepVCP and Deep Closest Point [58] learn to locate the correspondences in the target point clouds.

3D Keypoint Detectors. There are three representative handcrafted 3D detectors. Intrinsic Shape Signatures (ISS) [67] selects salient points with large variations along the principal axes. SIFT-3D [29] constructs a scale space of the curvature with the DoG operator. Harris-3D [48] calculates the Harris response of each 3D vertex based on first-order derivatives along two orthogonal directions on the 3D surface. Despite the increasing number of learning-based 3D descriptors, only a few methods have been proposed to learn to detect 3D keypoints. 3DFeatNet [23] and DeepVCP [30] use an attention layer to learn to weigh the local descriptors in their loss functions. Recently, USIP [27] has been proposed to specifically detect keypoints with high repeatability and accurate localization; it establishes the current state of the art for 3D keypoint detectors.

Handcrafted Global Descriptors. Most 3D global descriptors describe places with handcrafted statistical information. Rohling et al. [41] propose to describe places by histograms of point elevations. Cop et al. [9] leverage LiDAR intensities and present DELIGHT. Cao et al. [5] transform a point cloud into a bearing-angle image and extract ORB features for bag-of-words aggregation.

Learned Global Descriptors. Granström et al. [20] describe point clouds with rotation-invariant features and feed them to a learning-based classifier for matching. LocNet [61] inputs range histogram features to 2D CNNs to learn a descriptor. In these methods, deep learning essentially plays the role of post-processing the handcrafted descriptors. Kim et al. [26] transform point clouds into scan context images and feed them into CNNs for place recognition. PointNetVLAD [1] is the first to tackle place recognition in an end-to-end way. The global descriptor is computed by a NetVLAD [2] layer on top of the feature map extracted using PointNet [37]. Following this, PCAN [65] learns per-point attention to produce more discriminative descriptors. Both methods extract local features using PointNet, which projects each point independently into a higher dimension and thus does not explicitly use contextual information.

3 Hierarchical 3D Descriptors Learning

For large-scale relocalization, an intuitive approach is to tackle the problem hierarchically in a coarse-to-fine manner: local descriptors are extracted, aggregated into global descriptors for coarse place recognition, and then re-used for accurate 6DoF pose refinement. While widely adopted in the image domain [33, 44, 45], this idea has not been addressed by the deep learning community for 3D. As a result, to achieve 6DoF relocalization for point clouds, one has to perform local feature detection, description and global descriptor extraction separately, possibly running an independent network for each. To address this problem, we design a hierarchical network operating directly on a point cloud, delivering local descriptors, a keypoint score map and a global descriptor in a single forward pass. Point cloud based relocalization can thus be performed hierarchically: a coarse search using the global descriptor retrieves 3D submap candidates, which are subsequently verified by local 3D feature matching to estimate the 6DoF poses. An overview of our system is provided in Fig. 1, and a sketch of the resulting query procedure is given below.
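To make the coarse-to-fine workflow concrete, the following minimal sketch outlines how the single forward pass could be consumed at query time. It is illustrative only: the function and field names (network, the database entries, register_fn) are placeholders, not our actual implementation.

```python
import numpy as np

def relocalize(query_points, network, database, register_fn, top_k=5):
    """Hierarchical relocalization sketch (all names are illustrative).

    `network(points)` is assumed to return per-point descriptors, keypoint
    scores and one global descriptor in a single forward pass; `database`
    holds reference submaps with precomputed global descriptors and local
    features; `register_fn` performs local feature matching plus RANSAC and
    returns an estimated (pose, inlier_count) pair.
    """
    local_desc, key_scores, global_desc = network(query_points)

    # Coarse stage: retrieve the top-k reference submaps by global descriptor distance.
    db_globals = np.stack([entry["global"] for entry in database])
    candidates = np.argsort(np.linalg.norm(db_globals - global_desc, axis=1))[:top_k]

    # Fine stage: verify each candidate by local 3D feature matching and keep
    # the pose supported by the largest number of RANSAC inliers.
    best_pose, best_inliers = None, -1
    for idx in candidates:
        pose, inliers = register_fn(query_points, local_desc, key_scores, database[idx])
        if inliers > best_inliers:
            best_pose, best_inliers = pose, inliers
    return best_pose
```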

3.1 3D Local Feature Encoder and Detector

3DFeatNet [23] is a seminal work that learns both 3D local feature detection and description. Nevertheless, the following two points potentially limit its discriminative power: (1) Its detector is an attention map learned directly from the input points. During inference, descriptors are only extracted for the keypoints defined by the attention map. Such a classical detect-then-describe approach, as discussed in [15, 40], typically focuses on low-level structures of the raw input data and cannot utilize the high-level information encoded in the descriptors. (2) Its feature description is PointNet-based, whose symmetric function tends to provide only limited structural information about local clusters. To resolve these limitations, we propose to use Flex Convolution (FlexConv) [21] and Squeeze-and-Excitation (SE) blocks [22] to respectively fuse multi-level spatial contextual information and channel-wise feature correlations into the local descriptors. We adopt the describe-and-detect pipeline [15, 40], postponing our detection stage so that it can employ the higher-level information in the learned descriptors.

Fig. 2.

Left: the flow of local feature descriptor and detector learning. Right: the architecture of our Local Feature Encoder. In Conv(D, \(K\times K\)) and FlexConv(D, k, d), D: output dimension, K: filter size, k: neighborhood size, d: dilation rate.

FlexConv Layer. Consider a 3D point \(\textit{\textbf{p}}_l\) with its \(k\) neighbors \(N_k(\textit{\textbf{p}}_l) = \lbrace {\textit{\textbf{p}}_{l_1},\cdots , \textit{\textbf{p}}_{l_k}}\rbrace \) as a graph \(\mathcal {G} = (\mathcal {V}, \mathcal {E})\), where \(\mathcal {V}=\{0,..,k\}\) are the vertices and \(\mathcal {E} \subseteq \mathcal {V}\times \mathcal {V}\) the edges. A 3D operator on an edge of this graph can be formulated as \(\mathbf {e}_{ll_{k}}=f_{\mathbf {\Theta }}(\textit{\textbf{p}}_l, \textit{\textbf{p}}_{l_k})\), with \(\mathbf {\Theta }\) the set of learnable parameters. As pointed out by [59], PointNet is the special case \(f_{\mathbf {\Theta }}(\textit{\textbf{p}}_l, \textit{\textbf{p}}_{l_k}) = f_{\mathbf {\Theta }}(\textit{\textbf{p}}_l)\) and thus encodes only global shape information, ignoring the local neighborhood structure. In contrast, FlexConv can be abstracted as \(f_{\mathbf {\Theta }}(\textit{\textbf{p}}_l, \textit{\textbf{p}}_{l_k}) = f_{\mathbf {\Theta }}(\textit{\textbf{p}}_l - \textit{\textbf{p}}_{l_k}, \textit{\textbf{p}}_{l_k})\) and can therefore effectively encode local information, which we believe is crucial for learning discriminative local descriptors. Formally, FlexConv is a generalization of the conventional grid-based convolution and is defined as:

$$\begin{aligned} {f_{FlexConv}}(\textit{\textbf{p}}_l) = \sum _{\textit{\textbf{p}}_{l_i}\in N_k(\textit{\textbf{p}}_l)}\omega ( \textit{\textbf{p}}_{l_i}, \textit{\textbf{p}}_l) \cdot h(\textit{\textbf{p}}_{l_i}), \end{aligned}$$
(1)

where \(h(\textit{\textbf{p}}_{l_i}) \in \mathbb {R}^C\) is a point-wise encoding function projecting a point into a high-dimensional feature space. It is convolved with a filter-kernel \(\omega : \mathbb {R}^3 \times \mathbb {R}^3 \rightarrow \mathbb {R}^C\) computed via the standard scalar product in Euclidean space, with learnable parameters \({\theta } \in \mathbb {R}^{C\times 3}\), \({\theta _{b}} \in \mathbb {R}^{C}\): \(\omega (\textit{\textbf{p}}_l, \textit{\textbf{p}}_{l_i} \mid {\theta }, {\theta _b}) = \langle \theta , \textit{\textbf{p}}_l - \textit{\textbf{p}}_{l_i}\rangle + \theta _{b}\). This kernel can be seen as a linear approximation of the traditional filter-kernel that uses the location information explicitly. In addition, \(\omega \) is well-defined everywhere and allows for straightforward back-propagation.
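For intuition, the following is a minimal PyTorch sketch of a FlexConv-style layer implementing Eq. (1) with the linear kernel above. It is a simplified illustration under our own assumptions (precomputed neighbor indices and a linear point-wise encoding \(h\)), not the implementation of [21].

```python
import torch
import torch.nn as nn

class FlexConvSketch(nn.Module):
    """Simplified FlexConv-style layer following Eq. (1); illustration only."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.h = nn.Linear(in_dim, out_dim)                         # point-wise encoding h(.)
        self.theta = nn.Parameter(0.01 * torch.randn(out_dim, 3))   # θ ∈ R^{C×3}
        self.theta_b = nn.Parameter(torch.zeros(out_dim))           # θ_b ∈ R^C

    def forward(self, points, feats, neighbor_idx):
        """points: (N, 3), feats: (N, in_dim), neighbor_idx: (N, k) indices of N_k(p_l)."""
        nbr_pts = points[neighbor_idx]                 # (N, k, 3) neighbor coordinates
        nbr_feats = self.h(feats)[neighbor_idx]        # (N, k, C)  h(p_{l_i})
        # Linear filter kernel ω(p_{l_i}, p_l) = <θ, p_l - p_{l_i}> + θ_b, one value per channel.
        rel = points.unsqueeze(1) - nbr_pts            # (N, k, 3)  p_l - p_{l_i}
        w = torch.einsum('nkd,cd->nkc', rel, self.theta) + self.theta_b
        return (w * nbr_feats).sum(dim=1)              # Eq. (1): sum over the k neighbors
```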

Squeeze-and-Excitation (SE) Block. While FlexConv models spatial connectivity patterns, SE blocks [22] are further used to explicitly model the channel-wise inter-dependencies of the output features from the FlexConv layers. Let \(\textit{\textbf{U}} = \lbrace u_1, \cdots , u_C \rbrace \in \mathbb {R}^{N\times C}\) denote the input feature map to the SE block, where \(u_c \in \mathbb {R}^N\) represents the \(c{\text {-th}}\) channel vector of the output from the last FlexConv layer. The squeeze operation first “squeezes” \(\textit{\textbf{U}}\) into a channel-wise descriptor \(z \in \mathbb {R}^{C}\) as \(f_{sq}: \mathbb {R}^{N\times C} \rightarrow \mathbb {R}^C, \quad z = f_{sq} (\textit{\textbf{U}})\), where \(f_{sq}\) is implemented as global average pooling aggregating across the spatial dimension. The embedding \(z\) contains global information and is then processed by the excitation operation \(f_{ex}: \mathbb {R}^{C} \rightarrow \mathbb {R}^C, \quad s = f_{ex} (z)\), where \(f_{ex}\) is implemented as two fully connected layers with ReLU to fully capture channel-wise dependencies and to learn a nonlinear relationship between the channels. In the end, the learned channel activations \(s \in \mathbb {R}^{C}\) are used to recalibrate the input across channels, achieving an attention-like selection of channel-wise features: \(\tilde{u}_c = f_{scale} (u_c, s_c) = s_c \cdot u_c\), where \(\tilde{U} = \lbrace \tilde{u}_1, \cdots , \tilde{u}_C\rbrace \) is the output of the SE block.
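A compact sketch of such an SE block over per-point features follows; it uses the standard SE formulation of [22] with a sigmoid gate, and the reduction ratio is our own assumption.

```python
import torch
import torch.nn as nn

class SEBlockSketch(nn.Module):
    """Squeeze-and-Excitation over a per-point feature map U ∈ R^{N×C} (sketch)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, U):          # U: (N, C)
        z = U.mean(dim=0)          # squeeze: global average pooling over the N points
        s = self.excite(z)         # excitation: two FC layers -> channel weights s ∈ [0, 1]^C
        return U * s               # recalibration: ũ_c = s_c · u_c
```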

Encoder Architecture. The architecture of the encoder module is illustrated in Fig. 2. In comparison to 3DFeatNet [23], which relies on PointNet and only operates on one level of spatial granularity, our encoder extracts structural information from two spatial resolutions. At each resolution, the following operations are conducted: one \(1\times 1\) convolution, two consecutive FlexConv layers and an SE block. Taking a point cloud \(\textit{\textbf{P}} = \lbrace \textit{\textbf{p}}_{1},\cdots ,\textit{\textbf{p}}_{N}\rbrace \in \mathbb {R}^{N \times 3}\) as input, the encoder fuses multi-level contextual information by adding the outputs of the two resolutions and produces the feature map \(\varvec{\varPsi }\), which is then L2-normalized to give the final local descriptor map. Benefiting from these better strategies for integrating contextual information, our local features are much more robust to point densities and distributions than those of 3DFeatNet and thus generalize significantly better to point clouds generated by different sensors (more details in Sect. 4.4).

Description Loss. In feature space, descriptors of positive pairs are expected to be close, while those of negative pairs should remain sufficiently separated. Instead of using the simple triplet loss as in 3DFeatNet, we adopt the N-tuple loss [11] to learn to differentiate as many patches as possible. Formally, given two point clouds \(\textit{\textbf{P}}\) and \(\textit{\textbf{P}}'\), the following two matrices can be computed: a feature-space distance matrix \(\textit{\textbf{D}}\ \in \mathbb {R}^{N\times N} \) with \(\textit{\textbf{D}}(i,j) = \Vert \textit{\textbf{x}}_i - \textit{\textbf{x}}_j \Vert \), where \(\textit{\textbf{x}}_i\) and \(\textit{\textbf{x}}_j\) denote the descriptors of \(\textit{\textbf{p}}_{i} \in \textit{\textbf{P}}\) and \(\textit{\textbf{p}}_{j} \in \textit{\textbf{P}}'\), and a correspondence matrix \(\textit{\textbf{M}}\in \mathbb {R}^{N\times N}\) with \(\textit{\textbf{M}}_{i,j}\in \{0,1\}\) indicating whether the two point patches form a positive pair, i.e., whether their distance is within a pre-defined threshold. The N-tuple loss is formulated as:

$$\begin{aligned} L_{desc} = \frac{\sum ^*\left( \textit{\textbf{M}} \circ \textit{\textbf{D}} \right) }{\Vert \textit{\textbf{M}}\Vert _F^2} + \eta \cdot \frac{\sum ^*\left( \max \left( \mu - (\mathbf {1}-\textit{\textbf{M}}) \circ \textit{\textbf{D}},\ 0\right) \right) }{\Vert \mathbf {1}-\textit{\textbf{M}}\Vert _F^2}, \end{aligned}$$
(2)

where \(\sum ^*(\cdot )\) denotes element-wise summation, \(\circ \) element-wise multiplication, \(\Vert \cdot \Vert _F\) the Frobenius norm, \(\mu \) the margin for non-matching pairs, and \(\eta \) a hyper-parameter balancing matching and non-matching pairs. The two terms are divided by the number of true and false matches, respectively, to remove the bias introduced by the larger number of negatives.
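The following PyTorch sketch illustrates such an N-tuple loss on precomputed descriptor pairs. It is a simplified illustration: the margin hinge is applied only to non-matching entries (equivalent to the formulation above up to a constant offset), and the function name is ours.

```python
import torch

def n_tuple_loss(desc_a, desc_b, M, eta=1.0, margin=0.5):
    """Sketch of an N-tuple description loss in the spirit of Eq. (2).

    desc_a, desc_b: (N, D) L2-normalized local descriptors of P and P'.
    M:              (N, N) binary (float) correspondence matrix.
    """
    D = torch.cdist(desc_a, desc_b)                   # feature-space distance matrix
    num_pos = M.sum().clamp(min=1.0)                  # number of true matches
    num_neg = (1.0 - M).sum().clamp(min=1.0)          # number of false matches
    pos = (M * D).sum() / num_pos                     # pull matching pairs together
    neg = ((1.0 - M) * torch.clamp(margin - D, min=0.0)).sum() / num_neg  # hinge on negatives
    return pos + eta * neg
```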

3D Local Feature Detection. In contrast to classical detect-then-describe approaches, we postpone the detection to a later stage. To this end, we produce a keypoint saliency map \(\textit{\textbf{S}} \in \mathbb {R}^{N\times 1}\) from the extracted point-wise descriptors instead of from the raw input points. Our saliency map is thus estimated based on the learned local structure encoding, which makes it less fragile to low-level artifacts in the raw data and provides significantly better generalization. Another benefit of the describe-and-detect pipeline is that feature description and detection can be performed in one forward pass, unlike detect-then-describe approaches, which usually need two stages. Our detector processes the local feature map \(\varvec{\varPsi }\) with a series of four \(1\times 1\) convolution layers, terminated by a sigmoid activation function (more details in the supplementary document).

As there is no standard definition of a keypoint’s discriminativeness for outdoor point clouds, keypoint detection cannot be addressed by supervised learning. In our case, since the learned descriptors are to be used for point cloud registration, we propose to optimize keypoint confidences by leveraging the quality of descriptor matching. Local descriptor matching essentially boils down to nearest neighbor search in feature space. Assuming the descriptor is informative enough and the existence of a correspondence is guaranteed, a reliable keypoint, i.e., a keypoint with a high score \(s\), is expected to find the correct match with high probability. We can therefore measure the quality of the learned detector using \(\eta _i = (1-s_i)\cdot (1-\textit{\textbf{M}}_{i,j}) + s_i\cdot \textit{\textbf{M}}_{i,j}\), where \(s_i \in [0,1]\) is an element of \(\textit{\textbf{S}}\) and \(j\) refers to the nearest neighbor in \(\varvec{\varPsi }'\). A simple loss function can be formulated as \(L_{det} = \frac{1}{N}\sum _{i=1}^{N} 1 - \eta _i\). However, we find that using only the nearest neighbor to define \(\eta \) places too strict a requirement on the learned feature quality and can make training unstable. We thus propose a new metric called average successful rate (\(\text {AR}\)): given a point \(\textit{\textbf{p}}_i \in \textit{\textbf{P}}\) and its feature \(\varvec{\psi }_i \in \varvec{\varPsi }\), we find the \(k\) nearest neighbors in \(\varvec{\varPsi }'\). The \(\text {AR}\) of \(\textit{\textbf{p}}_i\) is computed as \(\text {AR}_i =\frac{1}{k} \sum _{j=1}^{k} c_{ij}\), where \(c_{ij} = 1\) if at least one correct correspondence can be found among the first \(j\) candidates, and 0 otherwise. We can now measure \(\eta \) with \(\text {AR}\), which is a real number in the range \([0,1]\) instead of a binary value, and the loss above can be rewritten as:

$$\begin{aligned} L_{det} = \frac{1}{N}\sum _{i=1}^{N} 1 - [ \kappa (1-s_i) + s_i \cdot \text {AR}_i ], \end{aligned}$$
(3)

where \(\kappa \in [0,1]\) is a hyperparameter indicating the minimum expected \(\text {AR}\) per keypoint. To minimize this loss, the network should predict \(s_i\) to be close to 0 if \(\text {AR}_i < \kappa \) and close to 1 otherwise.
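For clarity, a small PyTorch sketch of the AR computation and the unsupervised detector loss of Eq. (3) is given below; the function names and the way correspondences are looked up in \(\textit{\textbf{M}}\) are our own illustrative choices.

```python
import torch

def average_successful_rate(desc_a, desc_b, M, k=5):
    """Sketch of AR_i: fraction of j = 1..k for which the first j nearest
    neighbors of descriptor i in Ψ' contain at least one correct match."""
    D = torch.cdist(desc_a, desc_b)                          # (N, N) feature distances
    knn_idx = D.topk(k, dim=1, largest=False).indices        # (N, k) nearest neighbors in Ψ'
    correct = (torch.gather(M, 1, knn_idx) > 0).float()      # (N, k) correctness flags
    c = (correct.cumsum(dim=1) > 0).float()                  # c_ij as defined in the text
    return c.mean(dim=1)                                     # (N,) AR_i

def detector_loss(scores, ar, kappa=0.6):
    """Sketch of Eq. (3): encourage high scores only where AR_i exceeds kappa."""
    return (1.0 - (kappa * (1.0 - scores) + scores * ar)).mean()
```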

3.2 Global Descriptor Learning

As a key concept of this work, we propose to re-use the local descriptors for global retrieval. This early sharing of computation is natural, as both local and global descriptors are based on the same low-level geometric cues. The next question is how the local descriptors can be aggregated into a global one. While there exist many ways to do so, e.g., pooling, PointNet++ [39], FlexConv, PointCNN [28], or Dynamic Graph CNN [59], we find that a PCAN-style aggregation [65] (PointNetVLAD extended with attention) gives the best performance among them and provide an ablation study in the supplementary material.

Our global aggregation network is depicted in Fig. 3. Before the NetVLAD module, two FlexConv layers are added to project the local features to a higher dimension for a more retrieval-relevant encoding. The attention predictor takes these features and outputs a per-point attention map, which is followed by a NetVLAD layer to generate a compact global representation. As the output of a NetVLAD layer usually has a very high dimension, which makes nearest neighbor search expensive, a fully connected (FC) layer is used to compress it to a lower dimension. The global descriptor assembler is trained using the same lazy quadruplet loss as used in [1, 65], to better verify our idea of re-using the learned local descriptors for global descriptor aggregation.

Fig. 3.

The architecture of the global descriptor assembler.

Attention Map Prediction. As has been observed for image retrieval, visual cues relevant to place recognition are generally not uniformly distributed across an image; focusing on important regions is therefore key to improving performance [7, 34]. However, such attention mechanisms have only recently been explored for point cloud retrieval. Inspired by PCAN [65], we integrate an attention module that weighs each local descriptor before aggregation. As a key difference to PCAN, the input to our attention predictor consists of the learned descriptors, which already encapsulate fairly good contextual information. Our predictor therefore does not need to aggregate neighborhood information itself, and its design reflects this benefit: we construct the attention predictor by chaining up only three \(1 \times 1\) Conv layers followed by a softmax to ensure the attention weights sum to 1, as sketched below. We will show that, although our attention predictor has a much simpler structure than PCAN's, it is still effective: when combined with our descriptors, it offers better global retrieval performance. More details on the network structure are provided in the supplementary document.
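A minimal PyTorch sketch of such an attention predictor is shown below; the hidden width and the exact placement of activations are assumptions, since the precise layer configuration is given in the supplementary document.

```python
import torch
import torch.nn as nn

class AttentionPredictorSketch(nn.Module):
    """Per-point attention from already-contextual local descriptors (sketch)."""

    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(                     # three 1x1 convolutions
            nn.Conv1d(in_dim, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv1d(hidden, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv1d(hidden, 1, 1),
        )

    def forward(self, feats):                         # feats: (B, C, N)
        attn = torch.softmax(self.net(feats), dim=-1) # weights sum to 1 over the N points
        return feats * attn                           # re-weight descriptors before NetVLAD
```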

4 Experiments

The LiDAR point clouds from the Oxford RobotCar dataset [32] are used to train our network. Additionally, the ETH dataset [36] and the point clouds generated by Stereo DSO [55], a direct visual SLAM method, are used to test the generalization ability of the evaluated methods. The margin used in \(L_{desc}\) is set to \(\mu = 0.5\), the minimum expected \(\text {AR}\) in Eq. 3 to \(\kappa =0.6\) with \(k=5\), and \(N\) in Eq. 2 to 512. We train our network with a weighted sum of \(L_{desc}\) and \(L_{det}\) as the loss function: \(L = L_{desc} + \lambda L_{det}\).

To train the local part of our network, we use Oxford RobotCar and follow the data processing procedures of [23]. We use 35 traversals and for each create 3D submaps using the provided GPS/INS poses, with a 20 m trajectory length and a 10 m interval. The resulting submaps are downsampled using a voxel grid with a grid size of 0.2 m. In total, 20,731 point clouds are collected for training. As the provided ground truth poses are not accurate enough to obtain cross-sequence point-to-point correspondences, we generate training samples with synthetic transformations: for a given point cloud, we create another one by applying an arbitrary rotation around the upright axis and then adding Gaussian noise \(\mathcal {N}(0,\sigma _{noise})\) with \(\sigma _{noise}=0.02\,\mathrm{m}\), as sketched below. Note that as a point cloud is centered w.r.t. its centroid before entering the network, no translation is added to the synthetic transformations. For the global part, we use the dataset proposed in PointNetVLAD [1]. Specifically, for each of the 23 full traversals out of the 44 selected sequences from Oxford RobotCar, a testing reference map is generated, consisting of the submaps extracted in the testing part of the trajectory at 20 m intervals. More details on preparing the training data are given in the supplementary document.
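The following NumPy sketch illustrates the synthetic training-pair generation described above; it assumes the z-axis is the upright axis, and the function name is illustrative.

```python
import numpy as np

def make_training_pair(points, sigma_noise=0.02):
    """Sketch: create a positive pair from one submap by a random yaw rotation
    plus Gaussian noise, after centering the cloud at its centroid.

    points: (N, 3) submap already voxel-downsampled with a 0.2 m grid.
    """
    anchor = points - points.mean(axis=0)                     # center w.r.t. centroid

    yaw = np.random.uniform(-np.pi, np.pi)                    # rotation around the upright (z) axis
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

    positive = anchor @ R.T + np.random.normal(0.0, sigma_noise, anchor.shape)
    return anchor.astype(np.float32), positive.astype(np.float32), R
```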

Runtime. For a point cloud with 8192 points, our local features (including feature description and keypoint detection) and global descriptor can be extracted in one forward pass in 80 ms. For comparison, 3DFeatNet takes 400 ms (detection) + 510 ms (NMS) + 18 ms (512 local descriptors); 3DSmoothNet needs 270 ms (preprocessing) + 144 ms (512 local descriptors).

Fig. 4.

Relative repeatability for different numbers of detected keypoints.

Fig. 5.

Qualitative point cloud registration on Oxford RobotCar. Green lines show the inliers of RANSAC. (Color figure online)

Table 1. Point cloud registration performance on Oxford RobotCar. Each row in the table corresponds to a keypoint detector and each column refers to a local 3D descriptor. In each cell, we show the Relative Translational Error (RTE), Relative Rotation Error (RRE), the registration success rate and the average number of RANSAC iterations. The methods are evaluated on the testing set provided by [23]. The top three results of each metric are highlighted.

4.1 3D Keypoint Repeatability

We use relative repeatability to quantify the performance of our keypoint detector. Given two point clouds \(\{\textit{\textbf{P}}, {\textit{\textbf{P}}'}\}\) related by a transformation T, a keypoint detector detects keypoints \(K = [K_1, K_2, \cdots , K_m]\) and \({K'} = [{K}_1', {K}_2', \cdots , {K}_m']\) from them. \(K_i \in K\) is repeatable if the distance between \(\text {T}({K}_i)\) and its nearest neighbor \({K}_j' \in {K}'\) is less than 0.5 m. Relative repeatability is then defined as \(|K_{rep}|/|K|\), with \(K_{rep}\) the set of repeatable keypoints (see the sketch below). We use the Oxford RobotCar testing set provided by 3DFeatNet [23], which contains 3426 point cloud pairs constructed from 794 point clouds. We compare to three handcrafted 3D detectors, ISS [67], SIFT-3D [29] and Harris-3D [48], and two learned ones, 3DFeatNet [23] and USIP [27]. The results are presented in Fig. 4. As the most recently proposed learning-based 3D detector, designed specifically for feature repeatability, USIP clearly dominates this benchmark. It is worth noting that the keypoints detected by USIP are highly clustered, which partially favors a high repeatability. Moreover, USIP is a pure detector, while 3DFeatNet and our method learn detection and description at the same time. Our detector outperforms all the other competitors by a large margin when detecting more than 64 keypoints. In the case of 256 keypoints, our repeatability is roughly 1.75x that of the best follower, 3DFeatNet. This clearly demonstrates that, when learning detector and descriptors together, describe-and-detect is superior to detect-then-describe. It remains interesting to explore how the key ideas of USIP [27] could be merged into this concept.
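For reference, a small NumPy/SciPy sketch of the repeatability metric used here follows; the function name is illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def relative_repeatability(kp_a, kp_b, T, radius=0.5):
    """Sketch of the relative repeatability of Sect. 4.1.

    kp_a, kp_b: (m, 3) keypoints detected in P and P'; T: 4x4 transform P -> P'.
    """
    kp_a_h = np.hstack([kp_a, np.ones((len(kp_a), 1))])       # homogeneous coordinates
    kp_a_in_b = (T @ kp_a_h.T).T[:, :3]                       # T(K_i)
    dists, _ = cKDTree(kp_b).query(kp_a_in_b)                 # nearest neighbor in K'
    return float(np.mean(dists < radius))                     # |K_rep| / |K|
```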

Table 2. Point cloud registration performance on ETH. The top three results of each metric are highlighted. Note that RANSAC does not converge within the maximum number of iterations (10000) with FCGF.

4.2 Point Cloud Registration

Geometric registration is used to evaluate 3D feature matching. An SE3 transformation is estimated from the matched keypoints using RANSAC. We compute the Relative Translational Error (RTE) and Relative Rotation Error (RRE) and consider a registration successful when RTE and RRE are below 2 m and \(5^\circ \), respectively (a sketch of these metrics is given below). We compare to two handcrafted (ISS [67] and Harris-3D [48]) and two learned (3DFeatNet [23] and USIP [27]) detectors, and to three handcrafted (SI [24], USC [52] and FPFH [42]) and three learned (3DSmoothNet [18], 3DFeatNet [23] and FCGF [8]) descriptors. The average RTE, RRE, the success rate and the average number of RANSAC iterations on Oxford RobotCar for each detector-descriptor combination are shown in Table 1. Note that due to space limitations, only the best performing handcrafted descriptor is shown (the same applies to Tables 2 and 4). Although USIP shows significantly better repeatability, our method delivers competitive or even better results when applied to registration, where both a keypoint detector and a local feature encoder are needed. On the one hand, this demonstrates the strong discriminative power of our local descriptors; on the other hand, it supports our idea of learning detector and descriptors in the describe-and-detect manner. Another point worth noting is that FCGF was trained on KITTI, which might explain its relatively poor results in this evaluation. Some qualitative results can be found in Fig. 5.
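For completeness, a small NumPy sketch of the RTE/RRE computation and the success criterion used above is given below; the function name is illustrative.

```python
import numpy as np

def registration_errors(T_est, T_gt):
    """Sketch of the registration metrics of Sect. 4.2 for 4x4 SE3 matrices."""
    rte = np.linalg.norm(T_est[:3, 3] - T_gt[:3, 3])            # translational error [m]
    R_delta = T_est[:3, :3].T @ T_gt[:3, :3]                    # residual rotation
    cos_angle = np.clip((np.trace(R_delta) - 1.0) / 2.0, -1.0, 1.0)
    rre = np.degrees(np.arccos(cos_angle))                      # rotation error [deg]
    success = (rte < 2.0) and (rre < 5.0)                       # thresholds from the text
    return rte, rre, success
```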

Unlike Oxford RobotCar, the ETH dataset [36] contains largely unstructured vegetation and much denser point clouds, and is therefore used to test generalizability. The same detectors and descriptors as above are tested and the results are shown in Table 2. We notice that 3DSmoothNet shows the best performance on ETH. One important reason is that 3DSmoothNet adopts a voxel grid of size 0.02 m to downsample the point clouds, while our DH3D and the other methods use a size of 0.1 m. 3DSmoothNet thus has a finer resolution and is more likely to produce smaller errors. Apart from that, our detector performs fairly well (last row), and when combined with our descriptor, it achieves the smallest rotation error. With all the detectors, FCGF descriptors cannot make RANSAC converge within the maximum number of iterations. The poor performance of FCGF was also noticed by the authors of [4] and discussed on their GitHub page.

4.3 Point Cloud Retrieval

Table 3. Average recall (%) at top 1% and top 1 for Oxford RobotCar.
Fig. 6.

Average recall of the top 25 retrievals on Oxford RobotCar.

We compare our method against two state-of-the-art approaches, PCAN [65] and PointNetVLAD (PN_VLAD) [1]. We also report the results of PN_MAX presented in PN_VLAD, which consists of the original PointNet architecture with a max-pooling layer and a fully connected layer to produce a global descriptor. Note that both PN_VLAD and PCAN take submaps of size 4096, whereas ours takes 8192 to favor local feature extraction. We therefore add a downsampling layer before the final NetVLAD layer to make sure that the same number of 4096 points enters the final aggregation procedure. For completeness, we also report the results of our default setting. We first evaluate the average recall at top 1% and top 1 and show the results in Table 3. Our method outperforms all the other methods in both settings. We further show the recall curves of the top 25 retrieval matches in Fig. 6, where both of our networks consistently outperform the two state-of-the-art approaches. These results show that our network can effectively take advantage of the informative local features and produce more discriminative global descriptors. Also note that even with the number of points halved before entering the NetVLAD layer, the performance drop of our method is very small, which shows that our local features integrate sufficient contextual information for global retrieval. Some qualitative retrieval results are provided in the supplementary document.
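To make the retrieval protocol explicit, a NumPy sketch of the recall-at-top-N evaluation is given below; the is_match predicate (e.g., a distance threshold between submap locations, following the protocol of [1]) and all names are illustrative.

```python
import numpy as np

def retrieval_recalls(query_desc, db_desc, is_match, top_ns=(1, 25)):
    """Sketch of average recall at top-N for global descriptor retrieval.

    query_desc: (Q, D), db_desc: (M, D) global descriptors;
    is_match(q, idx) -> bool decides whether database entry idx is a true
    positive for query q (e.g., based on submap distance, as in [1]).
    """
    dists = np.linalg.norm(query_desc[:, None, :] - db_desc[None, :, :], axis=-1)
    ranking = np.argsort(dists, axis=1)                       # ranked retrievals per query
    recalls = {}
    for n in top_ns:
        hits = [any(is_match(q, idx) for idx in ranking[q, :n]) for q in range(len(query_desc))]
        recalls[n] = float(np.mean(hits))                     # fraction of queries with a hit
    return recalls
```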

4.4 Application to Visual SLAM

Table 4. Generalization of point cloud registration for visual SLAM. In this experiment, point clouds are generated by running Stereo DSO [55] on Oxford RobotCar. For learning-based methods, the models trained on LiDAR points are used without fine-tuning. The top three results of each metric are highlighted.
Fig. 7.

Registration on point clouds generated by Stereo DSO  [55]. The first two columns display frames from the reference and the query sequences. The last two columns show the matches found by RANSAC and the point clouds after alignment.

In this section, we demonstrate the generalization capability of our method to a different sensor modality by evaluating its performance on the point clouds generated by Stereo DSO [55]. As a direct method, Stereo DSO has the advantage of delivering relatively dense 3D reconstructions, which provide stable geometric structures that are less affected by changes in image appearance. We therefore believe it is worth exploring the extraction of 3D descriptors from such reconstructions, which can be helpful for loop closure and relocalization in visual SLAM. To justify this idea, Stereo DSO is used to generate point clouds from eight traversals of Oxford RobotCar, covering a wide range of daytime and weather conditions. This gives us 318 point cloud pairs with manually annotated relative poses. Each point cloud is cropped with a radius of 30 m and randomly rotated around the vertical axis. We use the same parameters as in Sect. 4.2 without fine-tuning our network and evaluate the geometric registration performance against other methods in Table 4. As shown in the table, our approach achieves the best rotation error (1.58\(^{\circ }\)) and success rate (90.6%), and the second-best translation error (0.36 m) among all the evaluated methods. It can also be noticed that most evaluated methods show significantly inferior performance compared to the results in Table 1, e.g., the success rates of 3DFeatNet+3DFeatNet, USIP+3DSmoothNet and USIP+3DFeatNet drop from 98.1%, 98.0% and 99.1% to 84.1%, 77.3% and 69.9%, respectively. This is largely because the point clouds extracted from LiDAR scans have quite different distributions than those from Stereo DSO. Our model is still able to achieve a success rate of 90.6%, showing the smallest degradation. This further demonstrates the good generalization ability of the proposed method. Some qualitative results are shown in Fig. 7.

4.5 Ablation Study

Effectiveness of Different Components. We carry out three experiments to explore the contributions of different components of our method: (1) We remove the detector module and only \(L_{desc}\) is used to train the local feature encoder; (2) The weak supervision at the submap level proposed by 3DFeatNet  [23] is used (details in the supplementary material); (3) We remove the SE blocks. As shown in Table 5, the largest performance decrease comes with (2), which verifies our idea of generating a supervision signal by synthesizing transformations. Results of (1) indicate that learning an effective confidence map \(S\) can improve the quality of the learned local descriptors for matching. The results of (3) show that SE blocks contribute to learning more informative local descriptors and therefore are helpful to 3D feature matching.

Robustness Test. We assess the robustness of our model for both point cloud retrieval and registration against three factors, i.e., noise, rotation and downsampling: we add Gaussian noise \(\mathcal {N}(0,\sigma _{noise})\) to the point clouds; the rotation test covers the range from 0 to 90\(^\circ \); point clouds are downsampled by a set of factors \(\alpha \). For the local part, as shown in Fig. 8, our descriptors show excellent rotation invariance. When noise is added, our method still achieves a >90% success rate for \(\sigma _{noise}<\) 0.15 m. The performance drops significantly for \(\sigma _{noise}>\) 0.2 m, possibly because the training samples are filtered by a voxel grid of size 0.2 m, so strong noise heavily changes the underlying point distribution. A similar explanation applies to downsampling with a factor \(\alpha > 2\). Nevertheless, our model can still guarantee a 90% success rate for \(\alpha \leqslant 1.5\). We conduct the same robustness tests for our global descriptors. Figure 9(a) demonstrates that our global descriptors possess good robustness to noise. In contrast to the local descriptors, the global descriptor seems to be less robust against rotations, which needs further investigation. Similar to the local descriptor, the quality of the global descriptor is not affected much for \(\alpha \leqslant 1.5\).

Table 5. Effects of different components for point cloud registration on Oxford RobotCar.
Fig. 8.

Local detector and descriptor robustness test evaluated by the success rate of point cloud registration.

Fig. 9.

Global descriptor robustness against random noise, rotation and downsampling. The \(x\) axes show the number of top retrieved matches.

5 Conclusion

We introduced a hierarchical network for the task of large-scale point cloud based relocalization. Rather than pursuing the traditional strategy of detect-then-describe or computing local and global descriptors separately, our network performs local feature detection, local feature description and global descriptor extraction in a single forward pass. Experimental results demonstrate the state-of-the-art performance of both our local and global descriptors across multiple benchmarks. Our model trained on LiDAR points also shows favorable generalization when applied to point clouds generated by visual SLAM methods. Future work will focus on further exploring the robustness of the learned descriptors to various perturbations.