
1 Introduction

The scope of object pose estimation ranges from medical data processing to industrial automation. For example, position-based visual servoing (PBVS) [1] is one of the two basic approaches in the field of visual servo control, and it requires the pose of the robot with respect to a specific coordinate frame to be known before subsequent execution. Variations in illumination, background clutter, and occlusion make conventional image-based techniques ineffective. Since 3D LiDAR scanners have become far more accessible in recent years, the 3D point cloud of an object can be obtained much more easily than before, and it has become very attractive to perform registration on 3D point clouds as well as on images [2,3,4,5]. Among the many approaches, iterative closest point (ICP) [6] is a well-known method that solves the registration problem numerically. However, it often suffers from local minima because of the non-convexity of the problem and the iterative nature of the approach. [4] provides a globally optimal ICP solution based on a branch-and-bound method, at the cost of longer computation time. Here we focus on a coarse registration method that provides an initial transformation before ICP is applied.

Some research focuses on learning-based methods for point cloud processing, e.g., convolutional neural networks (CNNs). In most cases, a CNN deals with feature maps that have an intuitive interpretation, such as images [7]. To process point clouds with a CNN, an elaborate feature map for point clouds has to be generated. In [8], a Hough accumulator is designed for every point in the 3D point cloud for normal estimation, and the image-like structure of the accumulator is amenable to CNNs.

Since it is difficult to find point-to-point correspondences between massive unstructured target and source point clouds, point detectors are often designed to reduce computational complexity [9]. Interest points are selected by detectors according to a specific criterion that is invariant to rigid transformation, and a correspondence is identified by a point descriptor if the similarity between two points is greater than a threshold. Much research focuses on designing distinctive point descriptors for 3D point clouds [3]. In [10], the geodesic graph model (GGM) was proposed; the method exploits the fact that geodesic-like distance is an invariant structural feature under non-rigid deformation.

Once the interest points are detected, a typical method for estimating the transformation is Random Sample Consensus (RANSAC) [11]. RANSAC iteratively estimates a transformation for a given set of correspondences and keeps the best one, i.e., the one that eliminates the most outliers. In this paper, instead of using RANSAC for transformation estimation, we treat the correspondence matching problem as a classification task solved with a CNN. Owing to the effectiveness of our point detector, only a few points are needed for transformation estimation. As mentioned before, a new feature map associated with the interest points is derived and fed to the CNN. After matching the correspondences predicted by the CNN, singular value decomposition (SVD) is used for transformation estimation, and ICP is used for fine registration.

2 Methodology

2.1 Registration Problem

Given two 3D point clouds, referred to as the source point cloud \( S \) and the target point cloud \( T \) respectively (the source point cloud is available as a reference, and the target point cloud is often acquired by a 3D scanner), we want to find a rigid transformation \( \mu (R,p) \) that minimizes the error \( E \):

$$ E(R,p) = \sum\nolimits_{i = 1}^{N} {\left\| {(Rs_{i} + p) - t_{i} } \right\|}^{2} $$
(1)

where the set \( \{ (s_{i} ,t_{i} )\,\text{with}\,s_{i} \in S,\,t_{i} \in T,\,i \in 1 \cdots N\} \) forms the correspondences between the source and target point clouds. In cases where the two point clouds contain different numbers of points (e.g., partially missing data in the target point cloud), only a subset of matches is expected, and a rejection scheme that discards points without counterparts is sometimes desirable [9]. In addition, accurate pair-wise matching of all points is infeasible in practice due to the high cardinality of point clouds.
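For concreteness, the residual in (1) can be evaluated in a few lines of NumPy; the sketch below is our own illustration (not part of the original pipeline), assuming the matched points are stacked as (N, 3) arrays:

```python
import numpy as np

def registration_error(R, p, src, tgt):
    """Residual E(R, p) of Eq. (1): the sum of squared distances between
    the transformed source points R s_i + p and their targets t_i."""
    # src, tgt: (N, 3) arrays of corresponding points s_i and t_i
    residuals = (src @ R.T + p) - tgt
    return float(np.sum(residuals ** 2))
```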

3D interest point detectors are often designed to reduce the complexity of correspondence matching [12,13,14]. The consistency of the detected interest points should be guaranteed in the presence of noise and outliers under rigid transformation, i.e., the point detector has to be as discriminative as possible, keeping the local shape information invariant to rigid transformation and robust to other disturbances. Interest points are detected in the source and target point clouds respectively, yielding the interest point sets \( P^{S} = \{ p_{1}^{S} , \ldots ,p_{{K_{S} }}^{S} \} \subset S \) and \( P^{T} = \{ p_{1}^{T} , \ldots ,p_{{K_{T} }}^{T} \} \subset T \).

A correspondence is usually identified using point descriptors, which describe the local neighborhood of each interest point [15]. A correspondence \( (a,b) \) holds if:

$$ S(D(a),D(b)) > \tau $$
(2)

where \( D \) is the descriptor function mapping the local neighborhood of a point to a set of scalars, \( S \) is a similarity measure, and \( \tau \) is a predefined threshold. We do not require correspondence identification by a descriptor function in this paper; only interest points are required to match correspondences.

We propose an interest point detector that preserves the information underlying the point cloud transformation while suppressing noise. To this end, region growing clustering is implemented to ensure the consistency of the detected interest points.

2.2 Point Detector

A set of interest points is detected to represent the pose of the point cloud. First, a sampling strategy selects the salient points of the point cloud, so that the original cloud is represented by a small number of points; then a region growing clustering is carried out, and the interest points are defined as the centers of the clusters containing the most points. Details are described as follows.

First, we down-sample the source and target point clouds respectively by choosing the salient points according to the significance metric proposed in [12].

For every point in the point cloud, the covariance matrix is computed according to its K nearest neighbors:

$$ COV(p_{i} ) = \sum\nolimits_{j = 1}^{K} {(p_{j} - p_{i} )(p_{j} - p_{i} )^{T} } $$
(3)

The smallest eigenvalue of \( COV(p_{i} ) \) is chosen as the significance assigned to each point; it measures the variance of the point's neighborhood. The salient points are then selected as the top \( \eta_{s} \times n \) points in terms of significance, where \( \eta_{s} \) is the sampling rate and n is the total number of points. Second, the salient points are gathered into clusters by a region growing method [16]: a seed point that has not yet been clustered is randomly chosen, and its neighboring points are gathered into a cluster. An intra-cluster distance threshold \( T_{i} \) keeps the differences between points within a cluster small, and an inter-cluster distance threshold \( T_{c} \) prevents the differences between clusters from being too small, thus decreasing the ambiguity in correspondence matching. Finally, the interest points are set to the centers of the L clusters containing the most points. Note that while increasing the number of interest points can increase the robustness of the pose representation to noise and occlusion, it also increases the computational cost during registration. Figure 1 shows the process of interest point detection.
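The salient-point selection described above can be sketched as follows; this is a minimal NumPy illustration with a brute-force neighbor search (a k-d tree would replace it for large clouds), where `k` and `eta_s` stand for the neighborhood size and the sampling rate \( \eta_{s} \):

```python
import numpy as np

def salient_points(points, k=8, eta_s=0.2):
    """Select salient points: for each point, form the covariance of its
    k nearest neighbors (Eq. 3) and take the smallest eigenvalue as the
    significance; keep the top eta_s * n points."""
    n = len(points)
    # brute-force pairwise distances; fine for small clouds
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    significance = np.empty(n)
    for i in range(n):
        nbrs = points[np.argsort(d[i])[1:k + 1]]      # k nearest neighbors of p_i
        diffs = nbrs - points[i]
        cov = diffs.T @ diffs                         # COV(p_i) of Eq. (3)
        significance[i] = np.linalg.eigvalsh(cov)[0]  # smallest eigenvalue
    keep = max(1, int(eta_s * n))
    return points[np.argsort(-significance)[:keep]]
```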

Fig. 1.

Visualization of interest point detection. From left to right: original point cloud; selected salient points; clusters from region growing clustering (shown as colorful blobs) and the 7 detected interest points (shown as red crosses). (Color figure online)
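The region growing step can be sketched as below; this simplified greedy version (our own reading of [16]) uses only the intra-cluster threshold \( T_{i} \), omitting the inter-cluster check against \( T_{c} \) for brevity:

```python
import numpy as np

def region_growing(points, t_i):
    """Greedy region growing: repeatedly pick an unclustered seed and
    absorb every point within distance t_i of any point already in the
    cluster; return the clusters as arrays of member points."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            q = frontier.pop()
            near = [i for i in unvisited
                    if np.linalg.norm(points[i] - points[q]) <= t_i]
            for i in near:
                unvisited.remove(i)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(points[cluster])
    return clusters
```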

2.3 Matching with CNN

Because the result of region growing clustering deviates slightly depending on the initial state of the iteration and other disturbances, the final clusters are not identical for the same point cloud in every experiment. A deterministic algorithm that sorts interest points into a canonical order for correspondence matching therefore does not work. Instead, we propose a CNN classification model to achieve automatic correspondence matching. A representation of the internal relationships among the interest points is used as the input feature map of the CNN. Since the internal relationships between interest points are invariant to rigid transformation, the CNN can recover the complex mapping from this representation to the correct correspondences.

For the source and target point clouds, interest points are computed as the source set \( \{ p_{1}^{S} \ldots p_{{K_{S} }}^{S} \} \) and the target set \( \{ p_{1}^{T} \ldots p_{{K_{T} }}^{T} \} \) respectively. In the training step, \( K_{T} \) interest points in the source set are randomly selected, and each such selection is assigned to a given category, which serves as the training target of the CNN. The categorization procedure is specified as follows.

Assume that every detected point in the target set can be matched to a detected point in the source set; there is then a total of \( C_{{K_{S} }}^{{K_{T} }} \) possible combinations. Each possible combination is assigned one of the \( C_{{K_{S} }}^{{K_{T} }} \) categories, and the mapping from selection to category is trivial. Note that although the rapid growth of the number of possible combinations with the number of interest points increases the computational complexity considerably, a relatively small number of interest points (at least three for a rigid transformation) is chosen in practice, which makes the approach feasible for point cloud registration.
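One simple way to realize this "trivial" mapping is lexicographic enumeration of the combinations; the sketch below is our own illustration of one such scheme, not necessarily the paper's exact implementation:

```python
from itertools import combinations

def selection_to_category(selection, k_s):
    """Map a K_T-subset of source-point indices to a category label,
    one of C(K_S, K_T) classes, by lexicographic enumeration."""
    all_selections = list(combinations(range(k_s), len(selection)))
    return all_selections.index(tuple(sorted(selection)))

def category_to_selection(category, k_s, k_t):
    """Inverse mapping: category label back to the source-index tuple."""
    return list(combinations(range(k_s), k_t))[category]
```

With \( K_{S} = 10 \) and \( K_{T} = 3 \), for instance, the classifier distinguishes \( C_{10}^{3} = 120 \) categories.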

Instead of feeding the raw point coordinates to the neural network, the weighted adjacency matrix of the interest points is computed as the input feature map of the CNN. Regarding the interest points as vertices of a complete graph, the \( K_{T} \) interest points are mapped to a weighted adjacency matrix \( M_{t} \) using the Euclidean distances between interest points as weights:

$$ M_{t} = (m_{ij}^{T} )_{{K_{T} \times K_{T} }} , \quad m_{ij}^{T} = \left\| {p_{i}^{T} - p_{j}^{T} } \right\| $$
(4)

Figure 2 illustrates the matching procedure with CNN.
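A minimal NumPy sketch of Eq. (4), assuming the interest points are stacked as a (K_T, 3) array:

```python
import numpy as np

def adjacency_matrix(points):
    """Weighted adjacency matrix M_t of Eq. (4): the symmetric matrix of
    pairwise Euclidean distances between interest points."""
    diff = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=2)
```

Because pairwise distances are preserved by rotation and translation, the same matrix is obtained before and after any rigid transformation of the points, which is exactly the invariance required of the feature map.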

Fig. 2.

Illustration of our proposed CNN matching process. 10 and 5 interest points are detected in the source and target point clouds respectively. Here \( m_{ij}^{T} \) denotes a weight in the target cloud. The prediction made by the CNN is a set of source interest points. We treat the graph as undirected, so \( m_{ij}^{T} \) = \( m_{ji}^{T} \).

The reason for this procedure is twofold. First, in order to match correspondences between the source and target point clouds, the feature map should be invariant to rigid transformation and robust to noise and outliers, and the Euclidean distances between points meet these requirements. Second, the dataset is transformed from raw coordinate arrays into an organized feature map that is amenable to CNNs. By exploiting the local conjunction detection and shared weights of CNNs, the point correspondences, which are originally woven together in a tangled way, can be recovered correctly from the information encoded in the weighted complete graph.

2.4 Pose Estimation

In the online registration step, the weighted adjacency matrix is computed from the target set by the same pipeline. Applying the CNN prediction, a set of points in the source set is assigned as correspondences. The least-squares transformation is then estimated by the SVD method [17]: the correspondences are associated under all possible permutations, and the permutation with the least error in (1) is selected.
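This step can be sketched as follows; the code is a standard SVD-based least-squares rigid alignment in the style of [17], with a brute-force permutation search over the small set of predicted points (the function names and structure are our own illustration):

```python
import numpy as np
from itertools import permutations

def estimate_rigid_transform(src, tgt):
    """Least-squares (R, p) minimizing Eq. (1) for matched (N, 3) point
    sets, via SVD of the cross-covariance matrix."""
    cs, ct = src.mean(axis=0), tgt.mean(axis=0)
    H = (src - cs).T @ (tgt - ct)                       # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])  # reflection guard
    R = Vt.T @ D @ U.T
    return R, ct - R @ cs

def best_permutation_transform(src_pts, tgt_pts):
    """Associate the CNN-predicted source points with the target points by
    trying every permutation and keeping the transform with least error."""
    best = None
    for perm in permutations(range(len(src_pts))):
        s = src_pts[list(perm)]
        R, p = estimate_rigid_transform(s, tgt_pts)
        err = np.sum(((s @ R.T + p) - tgt_pts) ** 2)
        if best is None or err < best[0]:
            best = (err, R, p)
    return best[1], best[2]
```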

After coarse registration computed by SVD, a fine registration is performed by implementing the ICP method.

3 Experiments

3.1 Region Growing Cluster

We choose the Stanford Happy Buddha model and a valve model to evaluate the proposed method. Figure 3(b) and (c) show the results of region growing clustering. The sampling rate for choosing the salient points is set to 20%, and the clustering thresholds \( T_{i} \) and \( T_{c} \) are defined according to the range \( R_{d} \) of the point cloud.

Fig. 3.

Registration results for the Happy Buddha. (a) Model demonstration. (b) and (c) Interest point detection for the source and target point clouds respectively. Colorful blobs show the top 15 clusters; the final interest points after rejection are shown as red crosses. Note that three correspondences can be found, since interest points are detected in the same regions of the source and target point clouds. (d) Initial state before registration; the target and source point clouds are shown in red and green respectively. (e) Estimated coarse registration. (f) Estimated fine registration using ICP. (Color figure online)

$$ R_{d} = \mathop {\hbox{max} }\limits_{i,j} \left\| {p_{i}^{S} - p_{j}^{S} } \right\| $$
(5)

Here we set \( T_{i} = (1/30)R_{d} \) and \( T_{c} = (1/10)R_{d} \). Figures 3(b) and (c) show the clustering results for the Happy Buddha; clusters are shown as colorful blobs. The target point cloud was scanned by a laser scanner, so only the points on the front of the model are present. A rejection scheme is implemented that rejects clusters according to two parameters computed for each cluster. The first, \( \phi_{j} = \sum\nolimits_{i = 1}^{L} {\left\| {p_{i} - p_{j} } \right\|^{2} } \), measures the total distance of cluster j from the other clusters; the second, \( \psi_{j} = \hbox{max} \left\| {p_{m} - p_{n} } \right\| \) with \( m,n \in j \), indicates the diameter of cluster j. We compute the two parameters for all clusters and reject the clusters whose ratios \( \psi_{j} /\phi_{j} \) are larger than the others'. After rejection, the final detected interest points are expected to be the most distinguishable points; they are shown as red crosses in Figs. 3(b) and (c).
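The rejection scheme can be sketched as below; this is a minimal NumPy illustration in which each cluster is given as an array of its member points and \( \phi_{j} \) is computed over the cluster centers (our reading of the text):

```python
import numpy as np

def reject_clusters(clusters, n_keep):
    """Keep the n_keep clusters with the smallest ratio psi_j / phi_j,
    where phi_j sums the squared distances from cluster j's center to the
    other centers and psi_j is the diameter of cluster j; return the
    surviving centers as interest points."""
    centers = np.array([c.mean(axis=0) for c in clusters])
    ratios = []
    for j, cluster in enumerate(clusters):
        phi = np.sum(np.linalg.norm(centers - centers[j], axis=1) ** 2)
        d = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
        ratios.append(d.max() / phi)                  # psi_j / phi_j
    keep = np.argsort(ratios)[:n_keep]
    return centers[keep]
```

A loose, spread-out cluster gets a large diameter \( \psi_{j} \) relative to its separation \( \phi_{j} \) and is therefore rejected first.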

3.2 CNN Architecture

The architecture of the CNN classification model is shown in Fig. 4. The input to the network is the \( K_{T} \times K_{T} \) weighted adjacency matrix of interest points; we choose \( K_{T} = 3 \) here for a preliminary experiment.

Fig. 4.

Architecture of the CNN classification model. Layer sizes may change according to the input size \( K_{T} \times K_{T} \).

The first hidden layer convolves 10 filters of kernel size 2 × 2 with stride 1 and zero padding 1 over the input feature map, and then applies a rectified linear unit (ReLU).

The second layer is a convolutional layer with kernel size 2 × 2, stride 1, and zero padding 1. Two fully-connected layers with 120 neurons each follow, and the final layer is a softmax layer.

The training data were generated from the source point cloud by randomly choosing 10000 weighted adjacency matrices with the corresponding categories described in Sect. 2.3. Normalization is also implemented to reduce the influence of variations in point cloud scale: the input feature map is multiplied by a constant \( \alpha \) inversely proportional to \( R_{d} \), so that \( \alpha R_{d} = 200 \).

3.3 Performance Analysis and Results

The proposed CNN model was trained from scratch, and applying the proposed method to the test data yields a matching accuracy of 91%. Figures 3 and 5 show the registration results of the proposed method on both the Happy Buddha and the valve model. Since the number of interest points is relatively small, predicting correspondences with the CNN requires less than 0.1 s on a 3.3 GHz Core i5 machine with 8 GB of memory. On the samples with correct CNN predictions, we reach an average final RMS error of 0.0082 (dividing by R d gives a relative error of 4.1%) and an angle-axis error of 0.0837 without fine registration. Accuracy could conceivably be improved by increasing the number of interest points, in exchange for more computation time and a more complex CNN model. Tables 1 and 3 present examples of the detected interest points for the Happy Buddha and valve models respectively; the prediction made by the CNN consists of the No. 2, No. 6, and No. 8 points in the source set for the Happy Buddha, and the No. 1, No. 6, and No. 8 points for the valve model. Tables 2 and 4 present the corresponding matrices of the target set and the predicted points in the source set. Compared with the ground-truth rigid transformation, the transformation computed using SVD has an RMS error of 0.0077 and an angle-axis error of 0.0923.

Fig. 5.

Registration results for the valve model. (a) Model demonstration. (b) Detected interest points, shown as red crosses. (c) Initial state before registration; the source and target point clouds are shown as '+' and 'o' respectively. (d) Estimated coarse registration. (e) Estimated fine registration using ICP. (Color figure online)

Table 1. Interest points detected in happy Buddha example.
Table 2. Weighted adjacency matrices in happy Buddha example. (Predicted points are the No.2, No.6, and No.8 points in the source set of Table 1).
Table 3. Interest points detected in valve model example.
Table 4. Weighted adjacency matrices in valve model. (Predicted points are the No.1, No.6, and No.8 points in the source set of Table 3).

Previous research points out that point detectors may have the drawback of being sensitive to noise [5]. We conducted experiments on the valve model: a considerable number of noise points were randomly generated within the bounding box of the point cloud, and Fig. 6 shows the linear growth of the error with the noise level. The experiments indicate that, with the help of the CNN, correspondence matching using local interest points can be robust to noise.

Fig. 6.

Results of the noise-sensitivity test. (a) Average RMS error and angle-axis error against noise; bars indicate the range of the error. (b) Registration result with 20% noise.

4 Conclusion

We proposed a 3D point cloud registration method that uses a convolutional neural network for correspondence matching. In this method, only interest points need to be detected; no correspondence identification by point descriptors is required. The feature map fed to the CNN is the weighted adjacency matrix of the complete graph generated by the detected interest points. Experimental results show the effectiveness of the proposed method, which presents a new potential application of CNNs in correspondence matching: virtually limitless ground-truth data can be generated to train the CNN, and a set of interest points detected in the target point cloud can be matched to the correct counterparts. Our future research includes utilizing other local descriptors, other feature map representations, and strategies for rejecting interest points.