
1 Introduction

The task of odometry is to estimate the 3D translation and orientation of an autonomous vehicle, which is one of the key steps in SLAM. Autonomous vehicles usually perceive the surrounding environment in real time with on-board sensors such as lidar, Inertial Measurement Units (IMU), or cameras to estimate their 3D translation and orientation. Lidar provides high-precision 3D measurements and does not depend on ambient light. However, if the vehicle undergoes a large translation or rotation within a short time, consecutively generated point clouds share only a few matching points, which degrades the accuracy of odometry. The IMU has the advantages of a high output frequency and directly outputting 6DOF information; it can predict an initial translation and orientation so that localization failures are reduced when the lidar undergoes a large translation or rotation.

The traditional methods [1, 18, 19, 23] are mainly based on point registration and work well in ideal environments. However, due to the sparseness and irregularity of point clouds, it is difficult for these methods to obtain enough matching points. Typically, ICP [1] and its variants [14, 18] iteratively find matching points by nearest-neighbor search and optimize the translation and orientation over the matches. This optimization procedure is sensitive to noise and dynamic objects and prone to getting stuck in local minima.

Thanks to recent advances in deep learning, many approaches adopt deep neural networks for lidar odometry and achieve more promising performance than traditional methods. Most of them are supervised [10, 12, 13, 20, 21] and therefore require ground-truth poses, which are costly to collect. Due to the scarcity of ground truth, unsupervised methods have recently been proposed [5, 15, 22], but some of them obtain unsatisfactory performance, while others consume large amounts of GPU memory and training time.

Two issues exist in these methods. First, they ignore the IMU, which often brings fruitful clues for accurate lidar odometry. Second, they do not make full use of normals and take only the point clouds as inputs. Normals of a point cloud indicate the relationship between a point and its surrounding points. Even the approaches [12] that do use normals as network input simply concatenate points and normals and feed them into the network; yet only the orientation between two point clouds relates to the normals, so normals should not be used to estimate translation.

To circumvent the dependence on expensive ground truth, we propose a novel framework termed UnDeepLIO, which makes full use of the IMU and normals for more accurate odometry. We compare against various baselines using point clouds from the KITTI Vision Benchmark Suite [7], which are collected with a \(360^\circ \) Velodyne laser scanner.

Our main contributions are as follows:

  • We present a self-supervised learning-based approach for robot pose estimation. Our method outperforms [5, 15].

  • We use the IMU to assist odometry. Our IMU feature extraction module can be embedded in most network structures [5, 12, 15, 21].

  • Both points and their normals are used as network inputs. We use point features to estimate translation, and both point and normal features to estimate orientation.

2 Related Work

2.1 Model-Based Odometry Estimation

Gauss-Newton-style iterative methods have a long-standing history in the odometry task. Model-based methods generally solve the odometry problem by iteratively adjusting the transformation between frames so that the “gap” between them keeps shrinking. They can be categorized into two-frame methods [1, 14, 18] and multi-frame methods [19, 23].

Point registration is the most common technique for two-frame methods, of which ICP [1] and its variants [14, 18] are typical examples. ICP iteratively searches for key points and their correspondences to estimate the transformation between two point clouds until convergence. Moreover, most of these methods need many iterations and a large amount of computation, which makes it difficult to meet the real-time requirements of the system.

Multi-frame algorithms [2, 19, 23] often rely on two-frame estimation. They improve the steps of selecting key points and finding matching points, and use an additional mapping step to further optimize the pose estimate. Their computation is generally more complicated and runs at a lower frequency.

2.2 Learning-Based Odometry Estimation

In the last few years, the development of deep learning has greatly advanced the state of the art in odometry estimation. Learning-based models provide a solution that only requires uniformly downsampling the point clouds, without manually selecting key points. They can be classified into supervised and unsupervised methods.

Supervised methods appeared relatively early. Lo-net [12] maps the point clouds to a 2D “image” by spherical projection. Wang et al. [21] adopt a dual-branch architecture to infer 3D translation and orientation separately instead of with a single network. Velas et al. [20] use point clouds to assist 3D motion estimation and regard it as a classification problem. Differently, Li et al. [13] do not estimate the 3D motion with a fully connected layer but with Singular Value Decomposition (SVD).

Unsupervised methods appeared later. Cho et al. [5] first applied an unsupervised approach to deep-learning-based lidar odometry as an extension of their previous work [4]. The inspiration for their loss function comes from point-to-plane ICP [14]. Nubert et al. [15] then reported a method with a similar model and loss function, but they use a different way to calculate the normal of each point and to find matching points between two consecutive point clouds.

3 Methods

3.1 Data Preprocess

Data Input. At every timestamp \(k\in \mathbb {R}^+\), we obtain one point cloud \(P_k\) of \(N\times 3\) dimensions, and between every two timestamps we get S frames of IMU measurements \(I_{k,k+1}\) of \(S\times 6\) dimensions, consisting of 3D angular velocity and 3D linear acceleration. We take the above data as the inputs.

Vertex Map. To circumvent the disordered nature of point clouds, we project each point cloud into the 2D image coordinate system according to its horizontal and vertical angles. We employ the projection function \(\Pi :{\mathbb {R}}^3\mapsto {\mathbb {R}}^2\). Each 3D point \(\boldsymbol{p}=(p_x,p_y,p_z)\) in a point cloud \(P_k\) is mapped onto the 2D image plane at \((w, h)\), represented as

$$\begin{aligned} \begin{aligned} \left( \begin{array}{c} w \\ h \\ \end{array} \right) =&\left( \begin{array}{c} (f_w-\arctan (\frac{p_y}{p_x}))/\eta _w \\ (f_h-\arcsin (\frac{p_z}{d}))/\eta _h\\ \end{array} \right) , \\&H> h \ge 0, W > w \ge 0, \end{aligned} \end{aligned}$$
(1)

where the depth is \(d=\sqrt{{p_x}^2+{p_y}^2+{p_z}^2}\), \(f_w\) and \(f_h\) are the maximum horizontal and vertical angles, and H and W are the height and width of the vertex map. \(f_h\) depends on the type of lidar. \(\eta _w\) and \(\eta _h\) control the horizontal and vertical sampling density. If several 3D points fall into the same pixel, we keep the point with the minimum depth. If a pixel has no matching 3D point, its value is set to (0, 0, 0). We define the resulting 2D image as the vertex map \(\boldsymbol{V}\).
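As a concrete illustration of Eq. (1), the sketch below builds a vertex map with NumPy using the parameter values from Sect. 4.1; the function name, the use of arctan2 for full \(360^\circ \) coverage, and the degree units are our own choices rather than the paper's exact implementation.

```python
import numpy as np

def build_vertex_map(points, H=52, W=720, f_w=180.0, f_h=23.0, eta_w=0.5, eta_h=0.5):
    """Project an (N, 3) point cloud onto an (H, W, 3) vertex map (angles in degrees)."""
    px, py, pz = points[:, 0], points[:, 1], points[:, 2]
    d = np.sqrt(px**2 + py**2 + pz**2) + 1e-12                    # depth of each point
    w = np.floor((f_w - np.degrees(np.arctan2(py, px))) / eta_w).astype(np.int64)
    h = np.floor((f_h - np.degrees(np.arcsin(pz / d))) / eta_h).astype(np.int64)

    vertex_map = np.zeros((H, W, 3), dtype=np.float32)            # empty pixels stay (0, 0, 0)
    depth_map = np.full((H, W), np.inf, dtype=np.float32)
    valid = (w >= 0) & (w < W) & (h >= 0) & (h < H)
    for i in np.flatnonzero(valid):
        # keep the point with the minimum depth when several points share a pixel
        if d[i] < depth_map[h[i], w[i]]:
            depth_map[h[i], w[i]] = d[i]
            vertex_map[h[i], w[i]] = points[i]
    return vertex_map
```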

Normal Map. The normal vector of a point encodes its relationship with the surrounding points, so we compute a normal map \(\boldsymbol{N}\), which consists of normals \(\boldsymbol{n}\) and has the same shape as the corresponding vertex map \(\boldsymbol{V}\). We adopt operations similar to Cho et al. [5] and Li et al. [12] to calculate the normal vectors. Each normal vector \(\boldsymbol{n}\) corresponds to the vertex \(\boldsymbol{v}\) with the same image coordinate. Due to the sparse and discontinuous nature of point clouds, we pay more attention to surrounding vertices with a small depth difference from the central vertex via a pre-defined weight \(w_{a,b}=e^{-0.5|d(v_a)-d(v_b)|}\). Each normal vector \(\boldsymbol{n}\) is represented as

$$\begin{aligned} \boldsymbol{n}_p=\sum _{i\in [0,3]}w_{p_i,p}(v_{p_i}-v_p)\times w_{p_{i+1},p}(v_{p_{i+1}}-v_p), \end{aligned}$$
(2)

where \(p_i\) represents the neighbor of the central vertex p in one of 4 directions (0-up, 1-right, 2-down, 3-left).
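A minimal (unoptimized) sketch of Eq. (2): for each pixel, the normal is the sum of cross products of depth-weighted differences to its four neighbors. The wrap-around of \(p_4\) to \(p_0\) and the final unit normalization are our assumptions.

```python
import numpy as np

def build_normal_map(vertex_map):
    H, W, _ = vertex_map.shape
    depth = np.linalg.norm(vertex_map, axis=-1)                  # per-pixel depth d(v)
    normal_map = np.zeros_like(vertex_map)
    offsets = [(-1, 0), (0, 1), (1, 0), (0, -1)]                 # 0-up, 1-right, 2-down, 3-left
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            v_p, d_p = vertex_map[y, x], depth[y, x]
            if d_p == 0:                                         # skip empty pixels
                continue
            n = np.zeros(3)
            for i in range(4):
                (dy0, dx0), (dy1, dx1) = offsets[i], offsets[(i + 1) % 4]
                v_a, v_b = vertex_map[y + dy0, x + dx0], vertex_map[y + dy1, x + dx1]
                w_a = np.exp(-0.5 * abs(np.linalg.norm(v_a) - d_p))   # weight w_{a,b}
                w_b = np.exp(-0.5 * abs(np.linalg.norm(v_b) - d_p))
                n += np.cross(w_a * (v_a - v_p), w_b * (v_b - v_p))
            if np.linalg.norm(n) > 0:
                normal_map[y, x] = n / np.linalg.norm(n)
    return normal_map
```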

3.2 Network Structure

Network Input. Our pipeline is shown in Fig. 1. Each point cloud is associated with a vertex/normal map of \((3, H, W)\) dimensions, so we concatenate the vertex/normal maps of timestamps k and \(k+1\) to get a vertex/normal pair of \((2, 3, H, W)\) dimensions. We take a pair of vertex/normal maps and the IMU measurements between timestamps k and \(k+1\) as the inputs of our model, where the IMU data consist of linear acceleration and angular velocity, each of \((S, 3)\) dimensions, with S the length of the IMU sequence. Our model outputs the relative pose \(T_{k,k+1}\), where \(R_{k,k+1}\) is the orientation and \(t_{k,k+1}\) is the translation.

$$\begin{aligned} T_{k,k+1}^{4\times 4}= \left[ \begin{array}{cc} R_{k,k+1}^{3\times 3} & t_{k,k+1}^{3\times 1} \\ 0 & 1 \end{array} \right] , \end{aligned}$$
(3)
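For completeness, a small helper that assembles the \(4\times 4\) pose of Eq. (3) from predicted Euler angles and translation; the ZYX Euler convention used here is our assumption. The final pose in Fig. 1 is then the matrix product \(T=\delta T\,\hat{T}\).

```python
import numpy as np

def pose_matrix(q, t):
    """q: (roll, pitch, yaw) Euler angles in radians; t: (3,) translation."""
    rx, ry, rz = q
    Rx = np.array([[1, 0, 0], [0, np.cos(rx), -np.sin(rx)], [0, np.sin(rx), np.cos(rx)]])
    Ry = np.array([[np.cos(ry), 0, np.sin(ry)], [0, 1, 0], [-np.sin(ry), 0, np.cos(ry)]])
    Rz = np.array([[np.cos(rz), -np.sin(rz), 0], [np.sin(rz), np.cos(rz), 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx        # R_{k,k+1}
    T[:3, 3] = t                    # t_{k,k+1}
    return T

# composing the final relative pose as in Fig. 1: T = delta_T @ T_hat
```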
Fig. 1.

The proposed network and our unsupervised training scheme. FC denotes a fully connected layer, t the translation, and q the Euler angles of the orientation. The LSTM takes consecutive frames of IMU measurements as input and outputs the initial relative pose \(\hat{T}\), which is used to transform the two maps of the current frame into the last frame. We then send the remapped maps into the ResNet encoder, which outputs feature maps including vertex and normal features. From these features, our attention layer estimates the residual pose \(\delta T\). The final output is their composition \(T=\delta T\hat{T}\).

Estimating Initial Relative Pose from IMU. The linear acceleration is used to estimate translation and the angular velocity to estimate orientation. We employ an LSTM to extract features from the IMU sequence. The features are then forwarded into an FC layer to estimate the initial relative translation or orientation.
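A minimal PyTorch sketch of this IMU branch, assuming one LSTM per modality followed by FC heads; the hidden size and layer layout are our guesses, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ImuPoseHead(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.lstm_gyro = nn.LSTM(input_size=3, hidden_size=hidden, batch_first=True)
        self.lstm_acc = nn.LSTM(input_size=3, hidden_size=hidden, batch_first=True)
        self.fc_rot = nn.Linear(hidden, 3)     # initial Euler angles q_hat
        self.fc_trans = nn.Linear(hidden, 3)   # initial translation t_hat

    def forward(self, gyro, acc):              # both of shape (B, S, 3)
        _, (h_g, _) = self.lstm_gyro(gyro)     # angular velocity -> orientation
        _, (h_a, _) = self.lstm_acc(acc)       # linear acceleration -> translation
        return self.fc_rot(h_g[-1]), self.fc_trans(h_a[-1])
```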

Mapping the Point Clouds of the Current Frame to the Last Frame. Each vertex/normal pair consists of a last frame and a current frame. They are not in the same coordinate system because of the motion between them. The initial relative pose maps the current frame from its own coordinate system into that of the last frame, yielding a remapped current map with the same size as the original. The relationship between the two maps is given by formulas (4) and (5). Take \(\boldsymbol{v}_{k+1,p}^{k}\) for example: it is the vertex of timestamp \(k+1\) mapped to timestamp k via the initial pose.

$$\begin{aligned} \boldsymbol{v}_{k+1,p}^{k}=R_{k,k+1}\boldsymbol{v}_{k+1,p}^{k+1}+t_{k,k+1}, \end{aligned}$$
(4)
$$\begin{aligned} \boldsymbol{n}_{k+1,p}^{k}=R_{k,k+1}\boldsymbol{n}_{k+1,p}^{k+1}. \end{aligned}$$
(5)
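A hedged sketch of Eqs. (4)-(5): every pixel of the current vertex/normal map is transformed with the initial pose \((\hat{R},\hat{t})\) while the original pixel layout is kept (whether an additional re-projection is performed afterwards is not stated, so it is omitted here).

```python
import torch

def remap_to_last_frame(vertex_map, normal_map, R_hat, t_hat):
    """vertex_map, normal_map: (B, 3, H, W); R_hat: (B, 3, 3); t_hat: (B, 3)."""
    B, _, H, W = vertex_map.shape
    v = vertex_map.flatten(2)                              # (B, 3, H*W)
    n = normal_map.flatten(2)
    v_remap = torch.bmm(R_hat, v) + t_hat.unsqueeze(-1)    # Eq. (4)
    n_remap = torch.bmm(R_hat, n)                          # Eq. (5): rotation only
    return v_remap.view(B, 3, H, W), n_remap.view(B, 3, H, W)
```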

Estimating Residual Relative Pose from the Remapped Maps. We use a ResNet encoder (see Fig. 2) as our map feature extractor. ResNet [8] was designed for image recognition, where the input is a 2D image similar to ours; this structure can therefore extract features in our task as well. We send the remapped vertex/normal map pair into a siamese ResNet encoder, which outputs feature maps including vertex and normal features. From these features, we propose an attention layer (formula (6), where x is the input), inspired by the LSTM [9], to estimate the residual pose \(\delta T\) between the last frame and the remapped current frame. The vertex and normal features are combined to estimate the orientation, but only the vertex features are used to estimate the translation, because a change of translation does not change the normal vectors. Together with the initial relative pose, we obtain the final relative pose T.

$$\begin{aligned} \begin{aligned}&i=\sigma (W_{i}x+b_i), \\&g=\tanh (W_{g}x+b_g), \\&o=\sigma (W_{o}x+b_o), \\&out=o*\tanh (i*g) . \end{aligned} \end{aligned}$$
(6)
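A possible PyTorch realization of the attention layer of formula (6); implementing the gates as fully connected layers and the choice of dimensions are our assumptions.

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W_i = nn.Linear(in_dim, out_dim)
        self.W_g = nn.Linear(in_dim, out_dim)
        self.W_o = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        i = torch.sigmoid(self.W_i(x))   # input gate
        g = torch.tanh(self.W_g(x))      # candidate features
        o = torch.sigmoid(self.W_o(x))   # output gate
        return o * torch.tanh(i * g)     # formula (6)
```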
Fig. 2.

The detailed structure of the ResNet Encoder + Avgpool + FC part.

Fig. 3.

Point downsampling process for points (top) and normals (bottom).

3.3 Loss Function

For unsupervised training, we use a combination of geometric losses in our deep learning framework. Unlike Cho et al. [5], who use pixel locations as the correspondence between two point clouds, we search for correspondences over the whole point cloud. To speed up the calculation, we first compute the normals \(NP_k\) of the whole point cloud \(P_k\) by plane fitting \(\Phi \) [17], then remove the ground points by RANSAC [6], and finally perform voxel grid filtering \(\Downarrow \) (each voxel is represented by the arithmetic average of its points; the normal vectors are averaged in the same way and re-normalized after downsampling) to downsample to about K points (the process is shown in Fig. 3). Given the predicted relative pose \(T_{k,k+1}\), we apply it to the preprocessed current point cloud \(DP_{k+1}\) and its normals \(NP_{k+1}\). For the correspondence search, we use a KD-Tree [3] to find, for each point of the transformed current point cloud \(\overline{DP}_{k+1}\), the nearest point in the last point cloud \(DP_k\).

$$\begin{aligned} \begin{aligned}&DP_{k}=\Downarrow (\mathsf {RANSAC}(P_{k})), \\&NP_{k}=\Downarrow (\mathsf {RANSAC}(\Phi (P_{k}))), \end{aligned} \end{aligned}$$
(7)
$$\begin{aligned} \begin{aligned}&\overline{NP}_{k+1}=R_{k,k+1}NP_{k+1}, \\&\overline{DP}_{k+1}=R_{k,k+1}DP_{k+1}+t_{k,k+1}. \end{aligned} \end{aligned}$$
(8)
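The preprocessing and correspondence search could look like the sketch below. The paper does not name a library, so Open3D and SciPy are used here purely for illustration, and parameters such as the RANSAC thresholds are placeholders.

```python
import numpy as np
import open3d as o3d
from scipy.spatial import cKDTree

def preprocess(points, voxel_size=0.3):
    """Normals by local plane fitting, RANSAC ground removal, voxel downsampling."""
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points))
    pcd.estimate_normals(o3d.geometry.KDTreeSearchParamKNN(knn=10))      # plane fitting Phi
    _, ground_idx = pcd.segment_plane(distance_threshold=0.3, ransac_n=3,
                                      num_iterations=100)                # RANSAC ground plane
    pcd = pcd.select_by_index(ground_idx, invert=True)                   # drop ground points
    pcd = pcd.voxel_down_sample(voxel_size)                              # voxel grid filter
    pcd.normalize_normals()                                              # re-standardize normals
    return np.asarray(pcd.points), np.asarray(pcd.normals)

def correspondences(DP_k, DP_k1_transformed):
    """Index of the nearest point in DP_k for each transformed current point."""
    return cKDTree(DP_k).query(DP_k1_transformed)[1]
```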

Point-to-Plane ICP Loss. For every point \(\overline{dp}_{k+1}\) in the transformed current point cloud \(\overline{DP}_{k+1}\), we use its corresponding point \({dp}_k\) and normal vector \({np}_k\) in the last point cloud \({DP}_{k}\) to compute the distance between the point and its matching plane. The loss function \(\mathcal {L}_{po2pl}\) is represented as

$$\begin{aligned} \mathcal {L}_{po2pl}=\sum _{\overline{dp}_{k+1}\in \overline{DP}_{k+1}} |(\overline{dp}_{k+1}-{dp}_k)\centerdot {np}_k|_1, \end{aligned}$$
(9)

where \(\centerdot \) denotes inner product.

Plane-to-Plane ICP Loss. Similarly to the point-to-plane ICP, we use the normal \(\overline{np}_{k+1}\) of every point in \(\overline{NP}_{k+1}\) and the corresponding normal vector \({np}_k\) in \({NP}_{k}\) to measure the angle between each pair of matching planes. The loss function \(\mathcal {L}_{pl2pl}\) is represented as

$$\begin{aligned} \mathcal {L}_{pl2pl}=\sum _{\overline{np}_{k+1}\in \overline{NP}_{k+1}} |\overline{np}_{k+1}-{np}_k|^2_2. \end{aligned}$$
(10)

Overall Loss. Finally, the overall unsupervised loss is obtained as

$$\begin{aligned} \mathcal {L}=\alpha \mathcal {L}_{po2pl}+\lambda \mathcal {L}_{pl2pl}, \end{aligned}$$
(11)

where \(\alpha \) and \(\lambda \) are balancing factors.
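Given already-matched points and normals, the overall loss of Eqs. (9)-(11) can be written compactly as below; the tensor shapes \((K, 3)\) and the reduction by summation follow our reading of the equations.

```python
import torch

def unsupervised_loss(dp_k1_tf, dp_k, np_k1_tf, np_k, alpha=1.0, lam=0.1):
    """All inputs are (K, 3) tensors of matched points/normals."""
    po2pl = ((dp_k1_tf - dp_k) * np_k).sum(dim=-1).abs().sum()    # Eq. (9)
    pl2pl = (np_k1_tf - np_k).pow(2).sum(dim=-1).sum()            # Eq. (10)
    return alpha * po2pl + lam * pl2pl                            # Eq. (11)
```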

4 Experiments

In this section, we first introduce the benchmark dataset used in our experiments and the implementation details of the proposed model. Then, we compare our model with existing lidar odometry methods and show that it obtains competitive results. Finally, we conduct ablation studies to verify the effectiveness of the innovative parts of our model.

4.1 Implementation Details

The proposed network is implemented in PyTorch [16] and trained on a single NVIDIA Titan RTX GPU. We optimize the parameters with the Adam optimizer [11] with hyperparameters \(\beta _1 = 0.9\), \(\beta _2 = 0.99\), and \(w_{decay} = 10^{-5}\). We adopt a step scheduler with a step size of 20 and \(\gamma = 0.5\) to control the training procedure; the initial learning rate is \(10^{-4}\) and the batch size is 20. The length S of the IMU sequence is 15. The maximum horizontal and vertical angles of the vertex map are \(f_w=180^\circ \) and \(f_h=23^\circ \), and their sampling densities are \(\eta _w = \eta _h = 0.5\). The shape of the input maps is \(H = 52\) and \(W = 720\). The loss weights of formula (11) are set to \(\alpha = 1.0\) and \(\lambda = 0.1\). The initial side length of the voxel downsampling is set to 0.3 m and is adjusted according to the number of points after downsampling: if there are too many points we increase the side length, otherwise we reduce it, by 0.01 m per adjustment. The number of points after downsampling is kept within \(K\pm 100\) with \(K=10240\).
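One reading of the adaptive voxel size rule is sketched below; `voxel_filter` stands for any voxel-grid downsampling routine and is a placeholder, and whether the adjustment loops within a single frame or carries over between frames is our assumption.

```python
def adaptive_voxel_downsample(points, normals, voxel_filter, K=10240, tol=100,
                              side=0.3, step=0.01):
    """Adjust the voxel side length until the downsampled cloud has K +/- tol points."""
    while True:
        dp, dn = voxel_filter(points, normals, side)
        if len(dp) > K + tol:        # too many points: enlarge the voxels
            side += step
        elif len(dp) < K - tol:      # too few points: shrink the voxels
            side -= step
        else:
            return dp, dn, side
```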

4.2 Datasets

The KITTI odometry dataset [7] has 22 different sequences with images, 3D lidar point clouds, IMU measurements and other data. Only sequences 00-10 have official public ground truth. Among them, only sequence 03 does not provide IMU data. Therefore, we do not use sequence 03 when the IMU assists our method.

4.3 Evaluation on the KITTI Dataset

We compare our method with the following methods, which can be divided into two types. The model-based methods are LOAM [23] and LeGO-LOAM [19]; the learning-based methods are Nubert et al. [15], Cho et al. [5] and SelfVoxeLO [22].

For the model-based methods, we report their lidar odometry results both with and without the mapping step.

For the learning-based methods, we use two splits of the training and test sets. First, we use sequences 00-08 for training and 09-10 for testing, since Cho et al. [5] and Nubert et al. [15] use sequences 00-08 as their training set; we name this split “Ours-easy”. Then, we use sequences 00-06 for training and 07-10 for testing to compare with SelfVoxeLO, which uses sequences 00-06 as its training set; we name this split “Ours-hard”.

Table 1 contains the details of the results: \(t_{rel}\) is the average translational RMSE (%) and \(r_{rel}\) the average rotational RMSE (\(^\circ \)/100 m), both over lengths of 100 m–800 m. LeGO-LOAM is not always more precise when the IMU is added; the traditional method is more sensitive to the quality of the IMU (in sequence 00 some IMU data are missing), which is most likely the reason for its accuracy drop. Even when the IMU improves the estimation, the effect is not obvious, especially after the mapping step. Our method gains a significant improvement from the IMU on the test set, has a clear advantage over the traditional methods without mapping, and is not much worse than those with mapping. In the easy task (for trajectory results, see Fig. 4), our method without IMU assistance is also competitive with Cho et al. [5] and Nubert et al. [15], which likewise project the point clouds into the 2D image coordinate system, and it gains a large improvement with the IMU. In the hard task, compared with the most advanced method SelfVoxeLO [22], which applies 3D convolutions on voxels and consumes much GPU memory and training time, our method also obtains comparable results with the IMU. Since they did not publish their code, we are unable to run their method with the IMU.

Fig. 4.

2D estimated trajectories of our method on sequences 09 and 10.

Table 1. KITTI odometry evaluation.

4.4 Ablation Study

To prove the effectiveness of each proposed module, we conduct ablation experiments on KITTI, using sequences 00-08 as the training set and sequences 09-10 as the test set.

IMU. As mentioned earlier, the IMU can greatly improve the accuracy of odometry, but different ways of using it contribute differently. If the IMU is only used to extract features through the network, which are directly merged with the point cloud features, the effect is limited (see Fig. 5). Our method instead uses the IMU and an LSTM network to estimate an initial relative pose, remaps the vertex and normal maps of the original current frame with it, and then sends the remapped maps into the point cloud feature extraction network, so that the IMU not only has a direct connection with the final odometry estimation network but also brings the coordinate systems of two consecutive frames closer. The comparison is shown in Table 2.

Fig. 5.

Using the IMU only as a feature.

Different Operations to Obtain the Rotation and Translation Features. The normal vector encodes the relationship between a point and its surrounding points and can be used as a feature for pose estimation just like the point itself. From the calculation formula of the normal vector, we know that a change of the normal vector is only related to the orientation; translation does not change the normal vectors. Therefore, we only use the point features to estimate the translation. We compare the original method with two alternatives: not using normal vectors as network input, and not distinguishing the features of normals and points (see Fig. 6). The comparison is shown in Table 3.
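A minimal sketch of this feature split (module names and dimensions are ours): the translation head sees only the vertex features, while the orientation head sees the vertex and normal features concatenated.

```python
import torch
import torch.nn as nn

class PoseHeads(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.fc_trans = nn.Linear(feat_dim, 3)      # translation from vertex features only
        self.fc_rot = nn.Linear(2 * feat_dim, 3)    # orientation from vertex + normal features

    def forward(self, vertex_feat, normal_feat):
        t = self.fc_trans(vertex_feat)
        q = self.fc_rot(torch.cat([vertex_feat, normal_feat], dim=-1))
        return t, q
```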

Table 2. Comparison among different ways to preprocess the IMU data and whether the IMU is used.
Fig. 6.

The network structure that learns translation and rotation features from concatenated vertex and normal features simultaneously (left) and the network structure without the normal features (right).

Table 3. Comparison of whether features are distinguished (dist) and whether normals are used.

Attention. After extracting the features of the vertex map and the normal map, we add a self-attention module to improve the accuracy of pose estimation. The attention module learns the importance of features and gives higher weight to the more important ones. We verify its effectiveness by comparing against a model that replaces the self-attention module with a single FC block with activation functions (formula (12)). The comparison is shown in Table 4.

$$\begin{aligned} out=\tanh (W_2(\tanh (W_{1}x+b_1))+b_2). \end{aligned}$$
(12)
Table 4. Comparison of whether the attention module is used.

Loss Function. Cho et al. [5] adopt the strategy of using points at the same pixel in the last and current vertex maps as matching points. Although this is fast to compute, the matching points found this way are likely to be incorrect. Therefore, like Nubert et al. [15], we imitate the ICP algorithm and use the nearest neighbor as the matching point (see Fig. 7). Although we use the same loss functions and the same matching-point search strategy (nearest neighbor) as Nubert et al. [15], we search in the entire point cloud space and keep the search space small by removing most of the ground points and applying voxel grid downsampling; the number of points is only about 1/3 of the points sampled by the 2D projection used in [15]. Table 5 shows the necessity of both loss terms and of the strategy of searching for matching points in the entire point cloud.

Fig. 7.

Matching point search strategies of Cho et al. (pixel-to-pixel) and of ours and Nubert et al. (point-to-point).

Table 5. Comparison among different loss functions and matching point search strategies.

5 Conclusion

In this paper, we proposed UnDeepLIO, an unsupervised learning-based odometry network. Different from other unsupervised lidar odometry methods, we additionally use the IMU to assist the odometry task. There are already many IMU-lidar fusion algorithms for odometry in the traditional field, and using both kinds of information at the same time has become a trend. Moreover, we conducted extensive experiments on the KITTI dataset, which verify that our method is competitive with the most advanced methods. In the ablation study, we validated the effectiveness of each component of our model. In the future, we will study how to incorporate the mapping step into our network framework and conduct online tests.