
1 Introduction

The scope of object pose estimation ranges from medical data processing to industrial automation. For example, position-based visual servoing (PBVS) [1] is one of the two basic approaches in the field of visual servo control, and it requires the pose of the robot with respect to a specific coordinate frame to be known before subsequent execution. Variations in illumination, background clutter, and occlusion make conventional image-based techniques ineffective. Since 3D LiDAR scanners have become far more accessible in recent years, the 3D point cloud of an object can be obtained much more easily than before, and it has become very attractive to perform registration on 3D point clouds as well as on images [2,3,4,5]. Among the many approaches, iterative closest point (ICP) [6] is a well-known method that solves the registration problem numerically. However, it often suffers from local minima because of the non-convexity of the problem and the iterative nature of the approach. [4] provides a globally optimal ICP solution based on a branch-and-bound method, at the cost of longer computation time. Here we focus on a coarse registration method that provides an initial transformation before ICP is applied.

Some research focuses on learning-based methods for point cloud processing, e.g., convolutional neural networks (CNNs). In most cases, a CNN deals with feature maps that have an intuitive interpretation, such as images [7]. To process point clouds with a CNN, an elaborate feature map for point clouds has to be generated. In [8], a Hough accumulator is designed for every point in the 3D point cloud for normal estimation, and the image-like structure of the accumulator is amenable to CNNs.

Since it is difficult to find point-to-point correspondences between massive unstructured target and source point clouds, point detectors are often designed to reduce computational complexity [9]. Interest points are selected by detectors according to a specific criterion that is invariant to rigid transformation, and a correspondence is identified by a point descriptor if the similarity between two points is greater than a threshold. Much research focuses on designing distinctive point descriptors for 3D point clouds [3]. In [10], the geodesic graph model (GGM) was proposed; the method exploits the fact that geodesic-like distance is an invariant structural feature under non-rigid deformation.

Once the interest points are detected, a typical method for estimating the transformation is Random Sample Consensus (RANSAC) [11]. RANSAC iteratively estimates a transformation for a given set of correspondences and keeps the best one, i.e., the one that eliminates the most outliers. In this paper, instead of using RANSAC for transformation estimation, we treat the correspondence matching problem as a classification task solved with a CNN. Owing to the effectiveness of our point detector, only a few points are needed for transformation estimation. As mentioned before, a new feature map associated with the interest points is derived and fed to the CNN. After matching the correspondences predicted by the CNN, singular value decomposition (SVD) is used for transformation estimation, and ICP is used for fine registration.

2 Methodology

2.1 Registration Problem

Given two 3D point clouds, referred to as the source point cloud \( S \) and the target point cloud \( T \) respectively (the source point cloud is available as a reference, and the target point cloud is often acquired by a 3D scanner), we want to find a rigid transformation \( \mu (R,p) \) that minimizes the error \( E \):

$$ E(R,p) = \sum\nolimits_{i = 1}^{N} {\left\| {(Rs_{i} + p) - t_{i} } \right\|}^{2} $$
(1)

where the set \( \{ (s_{i} ,t_{i} )\,\text{with}\,s_{i} \in S,\,t_{i} \in T,\,i \in 1 \cdots N\} \) forms the correspondences between the source and target point clouds. In cases where the two point clouds contain different numbers of points (e.g., partially missing data in the target point cloud), only a subset of matches is expected, and a rejection scheme that discards points without counterparts is sometimes desirable [9]. In addition, accurate pair-wise matching of all points is infeasible in practice due to the high cardinality of point clouds.
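For concreteness, the residual in (1) can be evaluated in a few lines of NumPy; the sketch below is our own illustration (not part of the original pipeline), assuming the matched points are stacked as (N, 3) arrays:

```python
import numpy as np

def registration_error(R, p, src, tgt):
    """Residual E(R, p) of Eq. (1): the sum of squared distances between
    the transformed source points R s_i + p and their targets t_i."""
    # src, tgt: (N, 3) arrays of corresponding points s_i and t_i
    residuals = (src @ R.T + p) - tgt
    return float(np.sum(residuals ** 2))
```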

3D interest point detectors are often designed to reduce the complexity of correspondence matching [12,13,14]. The consistency of the detected interest points should be guaranteed in the presence of noise and outliers under rigid transformation, i.e., the point detector has to be as discriminative as possible, keeping the local shape information invariant to rigid transformation and robust to other disturbances. Interest points are detected in the source and target point clouds respectively, yielding the interest point sets \( P^{S} = \{ p_{1}^{S} , \ldots ,p_{{K_{S} }}^{S} \} \subset S \) and \( P^{T} = \{ p_{1}^{T} , \ldots ,p_{{K_{T} }}^{T} \} \subset T \).

A correspondence is usually identified using point descriptors, which describe the local neighborhood of each interest point [15]. A correspondence \( (a,b) \) holds if:

$$ S(D(a),D(b)) > \tau $$
(2)

where \( D \) is the descriptor function mapping the local neighborhood of a point to a set of scalars, \( S \) is a similarity measure, and \( \tau \) is a predefined threshold. We do not require correspondence identification by a descriptor function in this paper; only interest points are required to match correspondences.

We propose an interest point detector that preserves the information underlying the point cloud transformation while suppressing noise. To this end, region growing clustering is implemented to ensure the consistency of the detected interest points.

2.2 Point Detector

A set of interest points is detected to represent the pose of the point cloud. First, a sampling strategy selects the salient points of the point cloud, so that the original cloud is represented by a small number of points; then a region growing clustering is carried out, and the interest points are defined as the centers of the clusters containing the most points. Details are described as follows.

First, we down-sample the source and target point clouds respectively by choosing the salient points according to the significance metric proposed in [12].

For every point in the point cloud, the covariance matrix is computed according to its K nearest neighbors:

$$ COV(p_{i} ) = \sum\nolimits_{j = 1}^{K} {(p_{j} - p_{i} )(p_{j} - p_{i} )^{T} } $$
(3)

The smallest eigenvalue of \( COV(p_{i} ) \) is chosen as the significance assigned to each point; it measures the variance of the point's neighborhood. The salient points are then selected as the top \( \eta_{s} \times n \) points in terms of significance, where \( \eta_{s} \) is the sampling rate and n is the total number of points. Second, the salient points are gathered into clusters by a region growing method [16]: a seed point that has not yet been clustered is randomly chosen, and its neighboring points are gathered into a cluster. An intra-cluster distance threshold \( T_{i} \) keeps the differences between points within a cluster small, and an inter-cluster distance threshold \( T_{c} \) prevents the differences between clusters from being too small, thus decreasing the ambiguity in correspondence matching. Finally, the interest points are set to the centers of the L clusters containing the most points. Note that while increasing the number of interest points can increase the robustness of the pose representation to noise and occlusion, it also increases the computational cost during registration. Figure 1 shows the process of interest point detection.
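The salient-point selection described above can be sketched as follows; this is a minimal NumPy illustration with a brute-force neighbor search (a k-d tree would replace it for large clouds), where `k` and `eta_s` stand for the neighborhood size and the sampling rate \( \eta_{s} \):

```python
import numpy as np

def salient_points(points, k=8, eta_s=0.2):
    """Select salient points: for each point, form the covariance of its
    k nearest neighbors (Eq. 3) and take the smallest eigenvalue as the
    significance; keep the top eta_s * n points."""
    n = len(points)
    # brute-force pairwise distances; fine for small clouds
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    significance = np.empty(n)
    for i in range(n):
        nbrs = points[np.argsort(d[i])[1:k + 1]]      # k nearest neighbors of p_i
        diffs = nbrs - points[i]
        cov = diffs.T @ diffs                         # COV(p_i) of Eq. (3)
        significance[i] = np.linalg.eigvalsh(cov)[0]  # smallest eigenvalue
    keep = max(1, int(eta_s * n))
    return points[np.argsort(-significance)[:keep]]
```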

Fig. 1.

Visualization of interest point detection. From left to right: original point cloud; selected salient points; clusters from region growing clustering (shown as colorful blobs) and the 7 detected interest points (shown as red crosses). (Color figure online)
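The region growing step can be sketched as below; this simplified greedy version (our own reading of [16]) uses only the intra-cluster threshold \( T_{i} \), omitting the inter-cluster check against \( T_{c} \) for brevity:

```python
import numpy as np

def region_growing(points, t_i):
    """Greedy region growing: repeatedly pick an unclustered seed and
    absorb every point within distance t_i of any point already in the
    cluster; return the clusters as arrays of member points."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            q = frontier.pop()
            near = [i for i in unvisited
                    if np.linalg.norm(points[i] - points[q]) <= t_i]
            for i in near:
                unvisited.remove(i)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(points[cluster])
    return clusters
```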

2.3 Matching with CNN

Because the result of region growing clustering deviates slightly depending on the initial state of the iteration and other disturbances, the final clusters are not identical for the same point cloud in every experiment. A deterministic algorithm that sorts interest points into a canonical order for correspondence matching therefore does not work. Instead, we propose a CNN classification model to achieve automatic correspondence matching. A representation of the internal relationships among the interest points is used as the input feature map of the CNN. Since the internal relationships between interest points are invariant to rigid transformation, the CNN can recover the complex mapping from this representation to the correct correspondences.

For the source and target point clouds, interest points are computed as the source set \( \{ p_{1}^{S} \ldots p_{{K_{S} }}^{S} \} \) and the target set \( \{ p_{1}^{T} \ldots p_{{K_{T} }}^{T} \} \) respectively. In the training step, \( K_{T} \) interest points in the source set are randomly selected, and each such selection is assigned to a given category, which serves as the training target of the CNN. The categorization procedure is specified as follows.

Assume that every detected point in the target set can be matched to a detected point in the source set; there is then a total of \( C_{{K_{S} }}^{{K_{T} }} \) possible combinations. Each possible combination is assigned one of the \( C_{{K_{S} }}^{{K_{T} }} \) categories, and the mapping from selection to category is trivial. Note that although the rapid growth of the number of possible combinations with the number of interest points increases the computational complexity considerably, a relatively small number of interest points (at least three for a rigid transformation) is chosen in practice, which makes the approach feasible for point cloud registration.
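One simple way to realize this "trivial" mapping is lexicographic enumeration of the combinations; the sketch below is our own illustration of one such scheme, not necessarily the paper's exact implementation:

```python
from itertools import combinations

def selection_to_category(selection, k_s):
    """Map a K_T-subset of source-point indices to a category label,
    one of C(K_S, K_T) classes, by lexicographic enumeration."""
    all_selections = list(combinations(range(k_s), len(selection)))
    return all_selections.index(tuple(sorted(selection)))

def category_to_selection(category, k_s, k_t):
    """Inverse mapping: category label back to the source-index tuple."""
    return list(combinations(range(k_s), k_t))[category]
```

With \( K_{S} = 10 \) and \( K_{T} = 3 \), for instance, the classifier distinguishes \( C_{10}^{3} = 120 \) categories.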

Instead of feeding the raw point coordinates to the neural network, the weighted adjacency matrix of the interest points is computed as the input feature map of the CNN. Regarding the interest points as vertices of a complete graph, the \( K_{T} \) interest points are mapped to a weighted adjacency matrix \( M_{t} \) using the Euclidean distances between interest points as weights:

$$ M_{t} = (m_{ij}^{T} )_{{K_{T} \times K_{T} }} , \quad m_{ij}^{T} = \left\| {p_{i}^{T} - p_{j}^{T} } \right\| $$
(4)

Figure 2 illustrates the matching procedure with CNN.
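A minimal NumPy sketch of Eq. (4), assuming the interest points are stacked as a (K_T, 3) array:

```python
import numpy as np

def adjacency_matrix(points):
    """Weighted adjacency matrix M_t of Eq. (4): the symmetric matrix of
    pairwise Euclidean distances between interest points."""
    diff = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=2)
```

Because pairwise distances are preserved by rotation and translation, the same matrix is obtained before and after any rigid transformation of the points, which is exactly the invariance required of the feature map.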

Fig. 2.

Illustration of our proposed CNN matching process. 10 and 5 interest points are detected in the source and target point clouds respectively. Here \( m_{ij}^{T} \) denotes a weight in the target cloud. The prediction made by the CNN is a set of source interest points. We treat the graph as undirected, so \( m_{ij}^{T} \) = \( m_{ji}^{T} \).

The reason for this procedure is twofold. First, in order to match correspondences between the source and target point clouds, the feature map should be invariant to rigid transformation and robust to noise and outliers, and the Euclidean distances between points meet these requirements. Second, the dataset is transformed from raw coordinate arrays into an organized feature map that is amenable to CNNs. By exploiting the local conjunction detection and shared weights of CNNs, the point correspondences, which are originally woven together in a tangled way, can be recovered correctly from the information encoded in the weighted complete graph.

2.4 Pose Estimation

In the online registration step, the weighted adjacency matrix is computed from the target set by the same pipeline. Applying the CNN prediction, a set of points in the source set is assigned as correspondences. The least-squares transformation is then estimated by the SVD method [17]: the correspondences are associated under all possible permutations, and the permutation with the least error in (1) is selected.
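This step can be sketched as follows; the code is a standard SVD-based least-squares rigid alignment in the style of [17], with a brute-force permutation search over the small set of predicted points (the function names and structure are our own illustration):

```python
import numpy as np
from itertools import permutations

def estimate_rigid_transform(src, tgt):
    """Least-squares (R, p) minimizing Eq. (1) for matched (N, 3) point
    sets, via SVD of the cross-covariance matrix."""
    cs, ct = src.mean(axis=0), tgt.mean(axis=0)
    H = (src - cs).T @ (tgt - ct)                       # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])  # reflection guard
    R = Vt.T @ D @ U.T
    return R, ct - R @ cs

def best_permutation_transform(src_pts, tgt_pts):
    """Associate the CNN-predicted source points with the target points by
    trying every permutation and keeping the transform with least error."""
    best = None
    for perm in permutations(range(len(src_pts))):
        s = src_pts[list(perm)]
        R, p = estimate_rigid_transform(s, tgt_pts)
        err = np.sum(((s @ R.T + p) - tgt_pts) ** 2)
        if best is None or err < best[0]:
            best = (err, R, p)
    return best[1], best[2]
```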

After coarse registration computed by SVD, a fine registration is performed by implementing the ICP method.

3 Experiments

3.1 Region Growing Cluster

We choose the Stanford Happy Buddha model and a valve model to evaluate the proposed method. Figure 3(b) and (c) show the results of region growing clustering. The sampling rate for choosing the salient points is set to 20%, and the clustering thresholds \( T_{i} \) and \( T_{c} \) are defined according to the range \( R_{d} \) of the point cloud.

Fig. 3.

Registration results for the Happy Buddha. (a) Model demonstration. (b) and (c) Interest point detection for the source and target point clouds respectively. Colorful blobs show the top 15 clusters; the final interest points after rejection are shown as red crosses. Note that three correspondences can be found, since interest points are detected in the same regions of the source and target point clouds. (d) Initial state before registration; the target and source point clouds are shown in red and green respectively. (e) Estimated coarse registration. (f) Estimated fine registration using ICP. (Color figure online)

$$ R_{d} = \mathop {\hbox{max} }\limits_{i,j} \left\| {p_{i}^{S} - p_{j}^{S} } \right\| $$
(5)

Here we set \( T_{i} = (1/30)R_{d} \) and \( T_{c} = (1/10)R_{d} \). Figures 3(b) and (c) show the clustering results for the Happy Buddha; clusters are shown as colorful blobs. The target point cloud was scanned by a laser scanner, so only the points on the front of the model are present. A rejection scheme is implemented that rejects clusters according to two parameters computed for each cluster. The first, \( \phi_{j} = \sum\nolimits_{i = 1}^{L} {\left\| {p_{i} - p_{j} } \right\|^{2} } \), measures the total distance of cluster j from the other clusters; the second, \( \psi_{j} = \hbox{max} \left\| {p_{m} - p_{n} } \right\| \) with \( m,n \in j \), indicates the diameter of cluster j. We compute the two parameters for all clusters and reject the clusters whose ratios \( \psi_{j} /\phi_{j} \) are larger than the others'. After rejection, the final detected interest points are expected to be the most distinguishable points; they are shown as red crosses in Figs. 3(b) and (c).
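The rejection scheme can be sketched as below; this is a minimal NumPy illustration in which each cluster is given as an array of its member points and \( \phi_{j} \) is computed over the cluster centers (our reading of the text):

```python
import numpy as np

def reject_clusters(clusters, n_keep):
    """Keep the n_keep clusters with the smallest ratio psi_j / phi_j,
    where phi_j sums the squared distances from cluster j's center to the
    other centers and psi_j is the diameter of cluster j; return the
    surviving centers as interest points."""
    centers = np.array([c.mean(axis=0) for c in clusters])
    ratios = []
    for j, cluster in enumerate(clusters):
        phi = np.sum(np.linalg.norm(centers - centers[j], axis=1) ** 2)
        d = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
        ratios.append(d.max() / phi)                  # psi_j / phi_j
    keep = np.argsort(ratios)[:n_keep]
    return centers[keep]
```

A loose, spread-out cluster gets a large diameter \( \psi_{j} \) relative to its separation \( \phi_{j} \) and is therefore rejected first.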

3.2 CNN Architecture

The architecture of the CNN classification model is shown in Fig. 4. The input to the network is the \( K_{T} \times K_{T} \) weighted adjacency matrix of interest points; we choose \( K_{T} = 3 \) here for a preliminary experiment.

Fig. 4.

Architecture of the CNN classification model. Layer sizes may change according to the input size \( K_{T} \times K_{T} \).

The first hidden layer convolves 10 filters of kernel size 2 × 2 with stride 1 and zero padding 1 over the input feature map, and then applies a rectified linear unit (ReLU).

The second layer is a convolutional layer with kernel size 2 × 2, stride 1, and zero padding 1. Two fully-connected layers with 120 neurons each follow, and the final layer is a softmax layer.

The training data were generated from the source point cloud by randomly choosing 10000 weighted adjacency matrices with the corresponding categories described in Sect. 2.3. Normalization is also implemented to reduce the influence of variations in point cloud scale: the input feature map is multiplied by a constant \( \alpha \) inversely proportional to \( R_{d} \), so that \( \alpha R_{d} = 200 \).

3.3 Performance Analysis and Results

The proposed CNN model was trained from scratch, and applying the proposed method to the test data yields a matching accuracy of 91%. Figures 3 and 5 show the registration results of the proposed method on both the Happy Buddha and the valve model. Since the number of interest points is relatively small, predicting correspondences with the CNN requires less than 0.1 s on a 3.3 GHz Core i5 machine with 8 GB of memory. On the samples with correct CNN predictions, we reach an average final RMS error of 0.0082 (dividing by R d gives a relative error of 4.1%) and an angle-axis error of 0.0837 without fine registration. Accuracy could conceivably be improved by increasing the number of interest points, in exchange for more computation time and a more complex CNN model. Tables 1 and 3 present examples of the detected interest points for the Happy Buddha and valve models respectively; the prediction made by the CNN consists of the No. 2, No. 6, and No. 8 points in the source set for the Happy Buddha, and the No. 1, No. 6, and No. 8 points for the valve model. Tables 2 and 4 present the corresponding matrices of the target set and the predicted points in the source set. Compared with the ground-truth rigid transformation, the transformation computed using SVD has an RMS error of 0.0077 and an angle-axis error of 0.0923.

Fig. 5.

Registration results for the valve model. (a) Model demonstration. (b) Detected interest points, shown as red crosses. (c) Initial state before registration; the source and target point clouds are shown as '+' and 'o' respectively. (d) Estimated coarse registration. (e) Estimated fine registration using ICP. (Color figure online)

Table 1. Interest points detected in happy Buddha example.
Table 2. Weighted adjacency matrices in happy Buddha example. (Predicted points are the No.2, No.6, and No.8 points in the source set of Table 1).
Table 3. Interest points detected in valve model example.
Table 4. Weighted adjacency matrices in valve model. (Predicted points are the No.1, No.6, and No.8 points in the source set of Table 3).

Previous research points out that point detectors may have the drawback of being sensitive to noise [5]. We conducted experiments on the valve model: a considerable number of noise points were randomly generated within the bounding box of the point cloud, and Fig. 6 shows the linear growth of the error with the noise level. The experiments indicate that, with the help of the CNN, correspondence matching using local interest points can be robust to noise.

Fig. 6.

Results of the noise-sensitivity test. (a) Average RMS error and angle-axis error against noise; bars indicate the range of the error. (b) Registration result with 20% noise.

4 Conclusion

We proposed a 3D point cloud registration method that uses a convolutional neural network for correspondence matching. In this method, only interest points need to be detected; no correspondence identification by point descriptors is required. The feature map fed to the CNN is the weighted adjacency matrix of the complete graph generated by the detected interest points. Experimental results show the effectiveness of the proposed method, which presents a new potential application of CNNs in correspondence matching: virtually limitless ground-truth data can be generated to train the CNN, and a set of interest points detected in the target point cloud can be matched to the correct counterparts. Our future research includes utilizing other local descriptors, other feature map representations, and strategies for rejecting interest points.