1 INTRODUCTION

Registration of 3D point clouds plays a key role in robotics and computer vision: it identifies a rigid transformation that aligns two point clouds with unknown point correspondences. Such registration is widely applied in various fields, including reconstruction of 3D scenes [1–3], localization [4], and autonomous driving [5]. The conventional iterative closest point (ICP) method has long been the registration standard [6–14]; it alternates between establishing point correspondences and estimating the rigid transformation. However, ICP relies on a good initial alignment and often converges to local minima. The recent advent of deep learning models [15–23] has revolutionized computer vision and led to point cloud registration algorithms that are faster and more reliable than the classical methods. Several deep learning algorithms (e.g., deep closest point (DCP) [18], PRNet [19], RPMNet [20], and IDAM [21]) employ neural networks to establish correspondences using multidimensional descriptors. The DCP network uses a Dynamic Graph CNN (DGCNN) to extract local features from the point clouds for soft matching and a least-squares solution, assuming a one-to-one correspondence between the two point clouds. PRNet extends DCP with a keypoint detection module for partial registration. RPMNet uses annealing to obtain soft point correspondences from hybrid features computed from both spatial coordinates and local geometry. IDAM combines learned features and Euclidean information into a correspondence matrix and uses a two-stage learning method to eliminate points that are unsuitable for registration.

In contrast, the proposed method employs a two-branch strategy for feature description that combines location information with rotation-invariant local features. This strategy aims to obtain a multidimensional point cloud embedding and differs from existing methods, which rely substantially on the similarity of feature descriptors and face problems with large rotations and significant differences in cloud coordinates. In this work, we propose a simple neural network architecture that uses a soft alignment of the points of the source and target clouds and a weighted point-to-point ICP functional instead of the conventional point-to-point ICP functional [6, 24]. The weight of a pair of points is chosen as the probability that the pair belongs to the common subset of the cloud points. The point weights are calculated from the point descriptors computed by the proposed architecture. The proposed neural network algorithm uses elements of the PointNet++ network [25]. The network was trained on the ModelNet40 database [26]. Computer simulation illustrates the operation of the proposed algorithm.

The structure of the article is as follows. Section 2 formulates the problem and describes the proposed neural network architecture and registration algorithm. Section 3 presents the results of computer simulation. Section 4 contains the conclusions.

2 NEURAL NETWORK ARCHITECTURE

This section describes a simple neural network architecture that uses elements of the PointNet++ network [25]. The proposed algorithm calculates descriptors of all points in the point clouds, determines the point alignment, and computes a transformation from the SE(3) group that aligns the two given point clouds.

Let P = \(\{ {{p}_{1}},...,{{p}_{{{{s}_{p}}}}}\} \) and Q = \(\{ {{q}_{1}},...,{{q}_{{{{s}_{q}}}}}\} \) be the source and target point clouds (\({{p}_{i}},{{q}_{j}} \in {{R}^{3}},i = 1,...,{{s}_{p}},\) and \(j = 1,...,{{s}_{q}}\)). We consider a point \({{p}_{i}} \in P\) and its neighborhood in P with radius r. In this neighborhood, we consider K points: additional points are ignored if the neighborhood contains more than K points, and copies of a point from the neighborhood are added if the neighborhood contains fewer than K points. Thus, we map a set containing K elements to point pi and do this for all points in P and Q; a minimal sketch of this grouping step is given below. The descriptors of all points in P are then calculated layer by layer as follows.
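The following PyTorch sketch illustrates this fixed-size neighborhood grouping; the function name group_neighbors, the brute-force distance computation, and the padding rule (repeating the first neighbor found) are illustrative assumptions rather than the authors' implementation.

```python
import torch

def group_neighbors(points, radius, K):
    """Gather exactly K neighbor coordinates within `radius` of every point.

    points: (s, 3) tensor of a single cloud. Returns a (s, K, 3) tensor.
    Surplus neighbors are ignored; if fewer than K are found, the first
    neighbor (the point itself, at distance zero) is repeated as padding.
    """
    s = points.shape[0]
    dists = torch.cdist(points, points)                 # (s, s) pairwise distances
    grouped = torch.empty(s, K, 3, dtype=points.dtype)
    for i in range(s):
        idx = torch.nonzero(dists[i] <= radius, as_tuple=False).squeeze(1)
        if idx.numel() >= K:
            idx = idx[:K]                               # ignore additional points
        else:
            pad = idx[:1].repeat(K - idx.numel())       # add copies of a neighbor
            idx = torch.cat([idx, pad])
        grouped[i] = points[idx]
    return grouped

# Example usage (radius value is illustrative):
# cloud = torch.rand(1024, 3)
# neighborhoods = group_neighbors(cloud, radius=0.2, K=32)   # (1024, 32, 3)
```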

Let B be the batch size.

First layer. The original tensor has dimension (B, 3, sp). We use the previously calculated neighborhoods of all sp points and obtain a tensor of dimension (B, sp, K, 3); after rearranging the dimensions, the tensor dimension is (B, 3, K, sp). We use a 2D convolution with parameters in_channels = 3, out_channels = 64, kernel_size = 1, stride = 1, and padding = 0, so that the tensor dimension after convolution is (B, 64, K, sp). The ReLU activation function and batch normalization are used. After convolution, the maximum of the tensor along the dimension with index 2 (of size K) is calculated, and the output tensor has dimension (B, 64, sp).

Second layer. The input tensor has dimension (B, 64, sp). We use the previously calculated neighborhoods of all sp points and obtain a tensor of dimension (B, sp, K, 64); after rearranging the dimensions, the tensor dimension is (B, 64, K, sp). We use a 2D convolution with parameters in_channels = 64, out_channels = 128, kernel_size = 1, stride = 1, and padding = 0, so that the tensor dimension after convolution is (B, 128, K, sp). The ReLU activation function and batch normalization are used. After convolution, the maximum of the tensor along the dimension with index 2 (of size K) is calculated, and the output tensor has dimension (B, 128, sp).

Third layer. The input tensor has dimension (B, 128, sp). We use the previously calculated neighborhoods of all sp points and obtain a tensor of dimension (B, sp, K, 128); after rearranging the dimensions, the tensor dimension is (B, 128, K, sp). We use a 2D convolution with parameters in_channels = 128, out_channels = 256, kernel_size = 1, stride = 1, and padding = 0, so that the tensor dimension after convolution is (B, 256, K, sp). The ReLU activation function and batch normalization are used. After convolution, the maximum of the tensor along the dimension with index 2 (of size K) is calculated, and the output tensor has dimension (B, 256, sp).

Fourth layer. The input tensor has dimension (B, 256, sp). We use the previously calculated neighborhoods of all sp points and obtain a tensor of dimension (B, sp, K, 256); after rearranging the dimensions, the tensor dimension is (B, 256, K, sp). We use a 2D convolution with parameters in_channels = 256, out_channels = 256, kernel_size = 1, stride = 1, and padding = 0, so that the tensor dimension after convolution is (B, 256, K, sp). The ReLU activation function and batch normalization are used. After convolution, the maximum of the tensor along the dimension with index 2 (of size K) is calculated, and the output tensor has dimension (B, 256, sp).

Fifth layer. The input tensor has dimension (B, 256, sp). We use the previously calculated neighborhoods of all sp points and obtain a tensor of dimension (B, sp, K, 256); after rearranging the dimensions, the tensor dimension is (B, 256, K, sp). We use a 2D convolution with parameters in_channels = 256, out_channels = 512, kernel_size = 1, stride = 1, and padding = 0, so that the tensor dimension after convolution is (B, 512, K, sp). The ReLU activation function and batch normalization are used. After convolution, the maximum of the tensor along the dimension with index 2 (of size K) is calculated, and the output tensor has dimension (B, 512, sp).
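A compact PyTorch sketch of these five layers could look as follows; the precomputed (B, sp, K) neighbor-index tensor, the helper names, and the convolution → batch normalization → ReLU ordering are assumptions (the text states only that both batch normalization and ReLU are used).

```python
import torch
import torch.nn as nn

class DescriptorNet(nn.Module):
    """Sketch of the five per-point feature layers described above."""

    def __init__(self, channels=(3, 64, 128, 256, 256, 512)):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=1, stride=1, padding=0),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
            )
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )

    @staticmethod
    def group(feat, neighbor_idx):
        # feat: (B, C, s), neighbor_idx: (B, s, K) -> grouped features (B, C, K, s)
        B, C, s = feat.shape
        K = neighbor_idx.shape[-1]
        idx = neighbor_idx.reshape(B, 1, s * K).expand(B, C, s * K)
        grouped = torch.gather(feat, 2, idx).reshape(B, C, s, K)
        return grouped.permute(0, 1, 3, 2)             # (B, C, K, s)

    def forward(self, xyz, neighbor_idx):
        # xyz: (B, 3, s) point coordinates of one cloud.
        feat = xyz
        for block in self.blocks:
            grouped = self.group(feat, neighbor_idx)    # gather neighborhood features
            grouped = block(grouped)                    # 1x1 conv + BN + ReLU
            feat = grouped.max(dim=2).values            # max over the K neighbors
        return feat                                     # (B, 512, s) point descriptors
```

Stacking these blocks reproduces the channel progression 3 → 64 → 128 → 256 → 256 → 512 described above and yields a 512-dimensional descriptor for every point.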

The resulting vector of size 512 is considered the descriptor of point pi. Similarly, we calculate the descriptors of the points in point cloud Q. When the descriptors of all points in P and Q have been calculated, we generate a matrix W of size sp × sq whose elements Wij are the probabilities of alignment of points pi and qj (i = 1, ..., sp, j = 1, ..., sq). Matrix W is used to form a virtual point cloud \(({v}Q) = \{ {{({v}q)}_{1}},...,{{({v}q)}_{{{{s}_{p}}}}}\} \):

$${{({v}q)}_{i}} = \sum\limits_{j = 1}^{{{s}_{q}}} {{{W}_{{ij}}}{{q}_{j}}} ,$$
(1)

where i = 1, ..., sp. We assume that point pi corresponds to point \({{({v}q)}_{i}}\) (i = 1, ..., sp); thus, we set up a soft alignment of point clouds P and Q. Then, the maximum value in the ith row of matrix W is taken as the weight coefficient wi of the pair (\({{p}_{i}},{{({v}q)}_{i}}\)) (i = 1, ..., sp). We consider the following functional:

$$J(R,T) = \sum\limits_{i = 1}^{{{s}_{p}}} {{{w}_{i}}{{{\left\| {R{{p}_{i}} + T - {{{({v}q)}}_{i}}} \right\|}}^{2}}} ,$$
(2)

where \(R \in SO(3)\) and \(T \in {{R}^{3}}\), and pose the conditional variational problem

$$({{R}_{*}},{{T}_{*}}) = \mathop {\arg \min }\limits_{(R,T)} J(R,T),$$
(3)

provided that \(R \in SO(3)\). Let Cp be the weighted center of mass of cloud P

$${{C}_{p}} = \left( {{1 \mathord{\left/ {\vphantom {1 {\sum\limits_{i = 1}^{{{s}_{p}}} {{{w}_{i}}} }}} \right. \kern-0em} {\sum\limits_{i = 1}^{{{s}_{p}}} {{{w}_{i}}} }}} \right)\sum\limits_{j = 1}^{{{s}_{p}}} {{{w}_{j}}{{p}_{j}}} ,$$
(4)

and let the centered points of cloud P be defined as

$$p_{i}^{'} = {{p}_{i}} - {{C}_{p}},$$
(5)

where i = 1, ..., sp. Let \({\mathbf{P}}{\kern 1pt} '\) be the 3 × sp matrix of the centered points of cloud P

$${\mathbf{P}}{\kern 1pt} ' = \left( {\begin{array}{*{20}{c}} {p_{{11}}^{'}}&{}&{p_{{{{s}_{p}}1}}^{'}} \\ {p_{{12}}^{'}}& \cdots &{p_{{{{s}_{p}}2}}^{'}} \\ {p_{{13}}^{'}}&{}&{p_{{{{s}_{p}}3}}^{'}} \end{array}} \right).$$
(6)

Similarly, let \(({\mathbf{vQ}}){\kern 1pt} '\) be the matrix of the centered points \(({v}q)_{i}^{'} = {{({v}q)}_{i}} - {{C}_{{{v}q}}}\) of the virtual cloud (vQ), where \({{C}_{{{v}q}}}\) is the weighted center of mass of (vQ) defined analogously to (4):

$$({\mathbf{vQ}}){\kern 1pt} ' = \left( {\begin{array}{*{20}{c}} {({v}q)_{{11}}^{'}}&{}&{({v}q)_{{{{s}_{p}}1}}^{'}} \\ {({v}q)_{{12}}^{'}}& \cdots &{({v}q)_{{{{s}_{p}}2}}^{'}} \\ {({v}q)_{{13}}^{'}}&{}&{({v}q)_{{{{s}_{p}}3}}^{'}} \end{array}} \right).$$
(7)

The following matrix is also introduced:

$${\mathbf{P}}{\kern 1pt}^ {"} = \left( {\begin{array}{*{20}{c}} {{{w}_{1}}p_{{11}}^{'}}&{}&{{{w}_{{{{s}_{p}}}}}p_{{{{s}_{p}}1}}^{'}} \\ {{{w}_{1}}p_{{12}}^{'}}& \cdots &{{{w}_{{{{s}_{p}}}}}p_{{{{s}_{p}}2}}^{'}} \\ {{{w}_{1}}p_{{13}}^{'}}&{}&{{{w}_{{{{s}_{p}}}}}p_{{{{s}_{p}}1}}^{'}} \end{array}} \right).$$
(8)

Variational problem (3) can be reduced to the conditional variational problem

$${{R}_{*}} = \mathop {\arg \max }\limits_R \left\langle {R{\mathbf{P}}{\kern 1pt} '',({\mathbf{vQ}}){\kern 1pt} '} \right\rangle ,$$
(9)

provided that \(R \in SO(3)\), where \(\left\langle { \cdot , \cdot } \right\rangle \) denotes the Frobenius inner product of matrices. Matrix M is given by

$$M = ({\mathbf{vQ}}){\kern 1pt} '{{({\mathbf{P}}{\kern 1pt} '')}^{t}}.$$
(10)

Matrix M can be factorized using the singular value decomposition (SVD):

$$M = US{{V}^{t}},$$
(11)

where U and Vt are orthogonal matrices and S is a diagonal matrix of singular values. Note that PyTorch supports SVD and backpropagation through it. The solution to variational problem (9) is written as

$${{R}_{*}} = \left\{ \begin{gathered} U{{V}^{t}},\quad {\text{if}}\,\,\det (U)\det ({{V}^{t}}) = 1, \hfill \\ U\,{\text{diag}}(1,1, - 1){{V}^{t}},\quad {\text{if}}\,\,\det (U)\det ({{V}^{t}}) = - 1, \hfill \\ \end{gathered} \right.$$
(12)

and the parallel translation vector can be calculated as

$${{T}_{*}} = \left( {{1 \mathord{\left/ {\vphantom {1 {\sum\limits_{i = 1}^{{{s}_{p}}} {{{w}_{i}}} }}} \right. \kern-0em} {\sum\limits_{i = 1}^{{{s}_{p}}} {{{w}_{i}}} }}} \right)\sum\limits_{j = 1}^{{{s}_{p}}} {{{w}_{j}}({{{({v}q)}}_{j}} - {{R}_{*}}{{p}_{j}})} .$$
(13)
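The soft alignment and the closed-form solution of Eqs. (1), (4)–(13) can be summarized in the following sketch. The softmax over negative descriptor distances in soft_assign is an assumption (the text states only that Wij is the probability of aligning pi and qj), and the function names are illustrative.

```python
import torch

def soft_assign(desc_p, desc_q):
    """Soft-assignment matrix W; the softmax over negative descriptor
    distances is an assumption about how the probabilities are formed."""
    d = torch.cdist(desc_p, desc_q)              # (s_p, s_q) descriptor distances
    return torch.softmax(-d, dim=1)              # each row sums to one

def weighted_alignment(P, Q, W):
    """Closed-form solution of the weighted problem (2)-(3) via Eqs. (4)-(13).

    P: (s_p, 3) source points, Q: (s_q, 3) target points,
    W: (s_p, s_q) soft-assignment matrix. Returns rotation R and translation T."""
    vQ = W @ Q                                   # virtual target cloud, Eq. (1)
    w = W.max(dim=1).values                      # pair weights w_i (row maxima)
    w_sum = w.sum()

    Cp = (w[:, None] * P).sum(dim=0) / w_sum     # weighted centers of mass, Eq. (4)
    Cvq = (w[:, None] * vQ).sum(dim=0) / w_sum
    Pc, vQc = P - Cp, vQ - Cvq                   # centered clouds, Eq. (5)

    M = vQc.T @ (w[:, None] * Pc)                # M = (vQ)'(P'')^t, Eq. (10)
    U, S, Vt = torch.linalg.svd(M)               # Eq. (11)
    d = torch.det(U) * torch.det(Vt)             # +1 or -1, reflection test, Eq. (12)
    D = torch.diag(torch.stack([torch.ones_like(d), torch.ones_like(d), d]))
    R = U @ D @ Vt
    T = (w[:, None] * (vQ - P @ R.T)).sum(dim=0) / w_sum   # Eq. (13)
    return R, T

# W = soft_assign(descriptors_P, descriptors_Q)   # (s_p, 512) and (s_q, 512) descriptors
# R, T = weighted_alignment(P, Q, W)
```

Because torch.linalg.svd supports backpropagation, the whole pipeline remains differentiable, in line with the remark above.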

The resulting network has 480 344 trainable weights. The loss function is a weighted sum of the mean-squared errors (MSE) between the estimated and true rotation and translation.
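A minimal sketch of such a loss is given below; the weighting coefficients w_rot and w_trans are illustrative, since their values are not specified in the text.

```python
import torch.nn.functional as F

def registration_loss(R_est, T_est, R_true, T_true, w_rot=1.0, w_trans=1.0):
    """Weighted sum of MSE terms between estimated and true rotation and
    translation; the weights w_rot and w_trans are illustrative."""
    return w_rot * F.mse_loss(R_est, R_true) + w_trans * F.mse_loss(T_est, T_true)
```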

The Adam algorithm was used with a learning rate parameter of 0.001.

The batch size is B = 10, and the neighborhood size is K = 32. The network was trained using pairs of point clouds from part of the ModelNet40 database [26].

The network was trained for five epochs. The loss value was 0.66 before training and 0.0339, 0.0114, 0.0081, 0.0076, and 0.0075 after the first, second, third, fourth, and fifth epochs, respectively.
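A training loop with these settings might look as follows; RegistrationNet and train_loader are hypothetical names standing for the network described above and for a loader of ModelNet40 cloud pairs with their ground-truth transforms, and registration_loss is the sketch from the previous section.

```python
import torch

model = RegistrationNet()                                   # descriptor net + soft alignment + weighted SVD
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam with learning rate 0.001

for epoch in range(5):                                      # five training epochs
    for P, Q, R_true, T_true in train_loader:               # batches of B = 10 cloud pairs
        R_est, T_est = model(P, Q)
        loss = registration_loss(R_est, T_est, R_true, T_true)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```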

3 COMPUTER SIMULATION

We apply the proposed neural network to the ModelNet40 database [26]. The database contains 12311 mesh CAD models belonging to 40 classes. The training part of the database consists of 9843 point clouds, and the testing part consists of 2468 point clouds. For each model, 1024 points are uniformly sampled from the mesh faces; only the point coordinates from the database are used. For each cloud P = \(\{ {{p}_{1}},...,{{p}_{{1024}}}\} \), rotation angles about the x, y, and z axes are randomly selected from the interval [–π/8; π/8], and the coordinates of the translation vector are randomly selected from the interval [−0.25; 0.25]. The matrix (in homogeneous coordinates) of this transformation is denoted by Mtrue. We apply matrix Mtrue to the matrix P associated with cloud P and obtain the matrix Q associated with cloud Q = \(\{ {{q}_{1}},...,{{q}_{{1024}}}\} \):

$${\mathbf{Q}} = {{M}_{{{\text{true}}}}}{\mathbf{P}}.$$
(14)
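Such random transformations can be generated, for example, as in the following NumPy sketch; the composition order of the axis rotations is an assumption, and P stands for the 1024 × 3 array of a sampled ModelNet40 cloud.

```python
import numpy as np

def random_transform(rng):
    """Random rigid transform with the ranges used in the experiments:
    Euler angles in [-pi/8, pi/8] and translation components in [-0.25, 0.25]."""
    ax, ay, az = rng.uniform(-np.pi / 8, np.pi / 8, size=3)
    tx, ty, tz = rng.uniform(-0.25, 0.25, size=3)
    Rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
    Ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
    Rz = np.array([[np.cos(az), -np.sin(az), 0], [np.sin(az), np.cos(az), 0], [0, 0, 1]])
    M = np.eye(4)                       # homogeneous transform M_true
    M[:3, :3] = Rz @ Ry @ Rx            # rotation composition order is an assumption
    M[:3, 3] = (tx, ty, tz)
    return M

rng = np.random.default_rng(0)
M_true = random_transform(rng)
P_hom = np.vstack([P.T, np.ones((1, P.shape[0]))])   # cloud P as a 4 x 1024 homogeneous matrix
Q = (M_true @ P_hom)[:3].T                           # target cloud, Eq. (14)
```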

We use the proposed neural network (CNN) as an algorithm for rough alignment of point clouds and point-to-point ICP (PtP) as a refinement algorithm. To demonstrate the efficiency of the proposed method, ten experiments were performed, and the results are presented in the following figures and tables.
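A minimal point-to-point ICP refinement, initialized with the coarse estimate produced by the network, could look like the following sketch; the inputs are NumPy arrays, brute-force nearest neighbors are used for clarity, and the number of iterations is illustrative.

```python
import numpy as np

def icp_point_to_point(P, Q, R_init, T_init, n_iters=30):
    """Point-to-point ICP refinement starting from a coarse estimate (R_init, T_init)."""
    R, T = R_init.copy(), T_init.copy()
    for _ in range(n_iters):
        P_tr = P @ R.T + T                                   # current alignment of the source
        d = np.linalg.norm(P_tr[:, None, :] - Q[None, :, :], axis=2)
        corr = Q[np.argmin(d, axis=1)]                       # closest target point for each source point
        cp, cq = P.mean(axis=0), corr.mean(axis=0)
        H = (corr - cq).T @ (P - cp)                         # cross-covariance matrix
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.linalg.det(U) * np.linalg.det(Vt)])
        R = U @ D @ Vt                                       # reflection-corrected rotation
        T = cq - R @ cp
    return R, T

# R_ref, T_ref = icp_point_to_point(P, Q, R_cnn, T_cnn)   # refine the coarse CNN estimate
```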

Figure 1a shows the initial position of source point cloud P and target point cloud Q, Fig. 1b shows the position of the point clouds after CNN, and Fig. 1c shows the position of the point clouds after CNN + PtP. Table 1 presents the errors of the considered algorithms with respect to the L1 and L2 norms.

Fig. 1.

(a) Point clouds: (gray dots) source and (black dots) target; (b) results of the CNN registration; and (c) result of the CNN + PtP registration.

Table 1. Accuracy of calculation of rotation and parallel translation with respect to norms L1 and L2

Figure 2a shows the initial position of source point cloud P and target point cloud Q, Fig. 2b shows the position of the point clouds after CNN, and Fig. 2c shows the position of the point clouds after CNN + PtP. Table 2 presents the errors of the considered algorithms with respect to the L1 and L2 norms.

Fig. 2.

(a) Point clouds: (gray dots) source and (black dots) target; (b) results of the CNN registration; and (c) result of the CNN + PtP registration.

Table 2. Accuracy of calculation of rotation and parallel translation with respect to norms L1 and L2

Figure 3a shows the initial position of source point cloud P and target point cloud Q, Fig. 3b shows the position of the point clouds after CNN, and Fig. 3c shows the position of the point clouds after CNN + PtP. Table 3 presents the errors of the considered algorithms with respect to the L1 and L2 norms.

Fig. 3.

(a) Point clouds: (gray dots) source and (black dots) target; (b) results of the CNN registration; and (c) result of the CNN + PtP registration.

Table 3. Accuracy of calculation of rotation and parallel translation with respect to norms L1 and L2

Figure 4a shows the initial position of source point cloud P and target point cloud Q, Fig. 4b shows the position of the point clouds after CNN, and Fig. 4c shows the position of the point clouds after CNN + PtP. Table 4 presents the errors of the considered algorithms with respect to the L1 and L2 norms.

Fig. 4.

(a) Point clouds: (gray dots) source and (black dots) target; (b) results of the CNN registration; and (c) result of the CNN + PtP registration.

Table 4. Accuracy of calculation of rotation and parallel translation with respect to norms L1 and L2
Table 5. Accuracy of calculation of rotation and parallel translation with respect to norms L1 and L2

Figure 5a shows the initial position of source point cloud P and target point cloud Q, Fig. 5b shows the position of the point clouds after CNN, and Fig. 5c shows the position of the point clouds after CNN + PtP. Table 5 presents the errors of the considered algorithms with respect to the L1 and L2 norms.

Fig. 5.

(a) Point clouds: (gray dots) source and (black dots) target; (b) results of the CNN registration; and (c) result of the CNN + PtP registration.

Figure 6a shows the initial position of source point cloud P and target point cloud Q, Fig. 6b shows the position of the point clouds after CNN, and Fig. 6c shows the position of the point clouds after CNN + PtP. Table 6 presents the errors of the considered algorithms with respect to the L1 and L2 norms.

Fig. 6.

(a) Point clouds: (gray dots) source and (black dots) target; (b) results of the CNN registration; and (c) result of the CNN + PtP registration.

Table 6. Accuracy of calculation of rotation and parallel translation with respect to norms L1 and L2

Figure 7a shows the initial position of source point cloud P and target point cloud Q, Fig. 7b shows the position of the point clouds after CNN, and Fig. 7c shows the position of the point clouds after CNN + PtP. Table 7 presents the errors of the considered algorithms with respect to the L1 and L2 norms.

Fig. 7.

(a) Point clouds: (gray dots) source and (black dots) target; (b) results of the CNN registration; and (c) result of the CNN + PtP registration.

Table 7. Accuracy of calculation of rotation and parallel translation with respect to norms L1 and L2

Figure 8a shows the initial position of source point cloud P and target point cloud Q, Fig. 8b shows the position of the point clouds after CNN, and Fig. 8c shows the position of the point clouds after CNN + PtP. Table 8 presents the errors of the considered algorithms with respect to the L1 and L2 norms.

Fig. 8.

(a) Point clouds: (gray dots) source and (black dots) target; (b) results of the CNN registration; and (c) result of the CNN + PtP registration.

Table 8. Accuracy of calculation of rotation and parallel translation with respect to norms L1 and L2

Figure 9a shows the initial position of source point cloud P and target point cloud Q, Fig. 9b shows the position of the point clouds after CNN, and Fig. 9c shows the position of the point clouds after CNN + PtP. Table 9 presents the errors of the considered algorithms with respect to the L1 and L2 norms.

Fig. 9.

(a) Point clouds: (gray dots) source and (black dots) target; (b) results of the CNN registration; and (c) result of the CNN + PtP registration.

Table 9. Accuracy of calculation of rotation and parallel translation with respect to norms L1 and L2

Figure 10a shows the initial position of source point cloud P and target point cloud Q, Fig. 10b shows the position of the point clouds after CNN, and Fig. 10c shows the position of the point clouds after CNN + PtP. Table 10 presents the errors of the considered algorithms with respect to the L1 and L2 norms.

Fig. 10.

(a) Point clouds: (gray dots) source and (black dots) target; (b) results of the CNN registration; and (c) result of the CNN + PtP registration.

Table 10. Accuracy of calculation of rotation and parallel translation with respect to norms L1 and L2

In experiment 10, we use large rotation angles, and the algorithm yields an incorrect transformation, since the proposed CNN is trained on relatively small rotation angles.

4 CONCLUSIONS

We proposed a simple neural network algorithm for registering point clouds in 3D space. The proposed neural network is based on a soft alignment of the source and target point clouds and employs a weighted point-to-point ICP functional to estimate an orthogonal transformation that matches the point clouds. The soft alignment of the point clouds using a probability matrix describes the semantic similarity of points. The proposed network together with point-to-point ICP shows good results in the reconstruction of a 3D model. The efficiency of the proposed algorithm is evaluated on the ModelNet40 database.