1 Introduction

The widespread availability of affordable 3D sensing devices has encouraged many enthusiasts to contribute new solutions for 3D reconstruction [1]. These solutions require data alignment tools that recover the 6DOF poses of the viewpoints from which the different scans were captured. In principle, each viewpoint has its own coordinate system; knowledge of the transformation that maps a given 3D point from one frame to another therefore becomes necessary.

In practice, the alignment requires some keypoints from a source and a target point cloud. Hence, the alignment problem amounts to determining the mapping between the source and the target frames. To this end, we assume the keypoints to be available and focus on 3D registration. Generally, the determination of the best transformation is based on \(L_2\) norm minimisation. However, \(L_2\) optimisers assume that the entire datasets are available before processing takes place. In practice, such an assumption is too optimistic given the sizeable noisy data streamed at relatively high frame rates. For this reason, our novel 3D registration solution delivers the 6DOF pose between viewpoints recursively and handles the noise and uncertainty of the 3D points for a more efficient estimation.

The remainder of this paper is organised as follows: In the first section, related work on 3D registration is discussed and the different alignment solutions proposed so far are analysed. In the following section, the 3D registration problem is formulated in a Least Squares (LS) form. In the next section, the link between 3D registration and RLS is established and fitted into the Kalman filter (KF) equations [2]. The parametric uncertainty of the 3D feature points is then determined and later used in the Robust \(H_\infty \) (RF) modelling for sparse alignment. Our contribution is validated on both synthetic and real datasets. Lastly, the paper is concluded and potential future works are recommended.

2 Related works

Since its introduction by Besl and McKay [3], the Iterative Closest Point (ICP) algorithm has been considered a reference in the point cloud alignment literature. However, a good initial guess or some feature correspondences are necessary to avoid local minima. Newer variants of the algorithm have been proposed to deal with its limitations, such as EM-ICP [4] and Softassign [5]. Unlike the original implementation, which assigns to every point in the source its closest correspondent in the target, subsequent variants allow each point to be checked against the entire target dataset. To this end, weighting coefficients are associated with the elements to describe their quality [5]. Other variants inspired by the original algorithm were further proposed, such as nonlinear ICP [6], generalised ICP [7] and non-rigid ICP [8]. Larusso et al. [9] showed that all closed-form solutions are computationally similar. However, performance can differ significantly from one solution to another; thus, no single algorithm is exclusively optimal for all scenarios. Umeyama [10] states that Horn's and Arun's algorithms fail when the datasets become highly corrupted with noise. He further proposed an alternative solution that utilises Lagrange multipliers [11].

A solution for the recursive estimation of rigid body transformations with the Extended Kalman Filter (EKF) was first proposed by Pennec and Thirion [12]. Ma and Ellis [13] followed the same strategy to align datasets contaminated with isotropic Gaussian noise using the Unscented Particle Filter (UPF) [14]. This algorithm can accurately estimate the parameters for very small datasets (fewer than one hundred elements). An Unscented Kalman Filter (UKF) algorithm was also adapted by Julier and Uhlmann [15] to align two datasets through sequential estimation. All these recursive algorithms minimise the \({\varvec{L}}_\mathbf{2}\) norm but consider the parameters to be accurately determined beforehand. Nevertheless, it is impossible to assert the certainty of the parameters in real scenarios. In our solution, however, we consider the parameters to be uncertain, and we confine the estimation error to a small range by optimising the \({\varvec{L}}_\infty \) norm instead of \({\varvec{L}}_\mathbf{2}\).

Micusik and Pflugfelder [16] used second-order cone programming (SOCP) to minimise the \({\varvec{L}}_\infty \) norm for non-overlapping cameras. They showed good performance with a fairly small error magnitude. Lee et al. [17] further claimed that, by using \({\varvec{L}}_\infty \), a number of computer vision problems such as homography estimation can be formulated and solved using the bisection method.

In the light of this background, our work takes advantage of the mature recursive estimation framework to compute a robust and optimal solution to the 3D registration problem by means of \({\varvec{L}}_\infty \) norm minimisation.

3 Problem statement

Let \({\varvec{Q}}=\left\{ {{\varvec{q}}_1, \ldots , {\varvec{q}}_{\varvec{n}}}\right\} \) and \({\varvec{P}}=\left\{ {{\varvec{p}}_\mathbf{1}, \ldots , {\varvec{p}}_{\varvec{n}}}\right\} \) be the source and target 3D point clouds, respectively. Each element has three components \({\varvec{p}}_{\varvec{i}} =({{\varvec{x}}_{\varvec{p}}, {\varvec{y}}_{\varvec{p}}, {\varvec{z}}_{\varvec{p}}})_{\varvec{i}}\) and \({\varvec{q}}_{\varvec{i}}=({{\varvec{x}}_{\varvec{q}}, {\varvec{y}}_{\varvec{q}}, {\varvec{z}}_{\varvec{q}}})_{\varvec{i}}\). The \({\varvec{k}}\)th point \({\varvec{q}}_{\varvec{k}}\) in the source point cloud has been matched a priori with the \({\varvec{k}}\)th point \({\varvec{p}}_{\varvec{k}}\) in the target point cloud. The purpose of 3D registration is to find the rigid body transformation (\({\varvec{R}}\): rotation, \({\varvec{t}}\): translation) that maps the source \({\varvec{Q}}\) onto the target \({\varvec{P}}\). The determination of such a mapping can be modelled as an optimisation problem [18]. Nevertheless, due to the noisy outputs streamed by the sensor, an exact solution is very unlikely to exist. Thus, a realistic model must take into account an alignment error \({\varvec{e}}_{\varvec{i}}\) as follows:

$$\begin{aligned} p_i =Rq_i +t+e_i \end{aligned}$$
(1)

The rigid body transformation \([{\varvec{R, t}}]\) is optimal when the sum of the squares of errors \(({\varvec{e}}_i)\) becomes minimal:

$$\begin{aligned} e^{2}=\mathop {{\hbox {argmin}}}\limits _{R,t} \sum _{i=1}^n \Vert p_i -({Rq_i +t})\Vert ^{2} \end{aligned}$$
(2)

where

$$\begin{aligned} R={\left[ {{\begin{array}{c@{\quad }c@{\quad }c} {r_{11}} &{} {r_{12}} &{} {r_{13}} \\ {r_{21}} &{} {r_{22}} &{} {r_{23}} \\ {r_{31}} &{} {r_{32}} &{} {r_{33}} \\ \end{array}}} \right] }; \quad t= {\left[ {{\begin{array}{c} {t_x} \\ {t_y} \\ {t_z} \\ \end{array}}} \right] } \end{aligned}$$
(3)

It is possible to simplify the problem of Eq. (2) by decoupling the translation vector \({\varvec{t}}\) and eliminating scale difference as follows:

$$\begin{aligned} {\hat{t}}= & {} {\bar{p}} -{\hat{R}} {\bar{q}}; \quad {\hat{s}} =\sum \limits _{i=1}^n \left( {\frac{{\bar{p}}_i^{T} {\hat{R}} {\bar{q}}_i}{\Vert {\bar{q}}_i\Vert ^{2}}} \right) \end{aligned}$$
(4)
$$\begin{aligned} {\tilde{t}}= & {} {\bar{p}} -{\bar{q}};\quad {\tilde{s}} =\sum \limits _{i=1}^n \left( {\frac{{\bar{p}}_i^{T} {\bar{q}}_i}{\Vert {\bar{q}}_i\Vert ^{2}}} \right) \end{aligned}$$
(5)

As stated by Horn et al. [19], \({\bar{{\varvec{q}}}}\) and \({\bar{{\varvec{p}}}}\) are the centroids of the source and the target point clouds, respectively; \({\hat{{\varvec{t}}}}, {\hat{{\varvec{s}}}}\) are the optimal translation and scale between the two datasets, whereas \({\tilde{{\varvec{t}}}}, {\tilde{{\varvec{s}}}}\) are their respective initial guesses when the initial rotation is assumed to be \({\varvec{R}}_{\varvec{0}} ={\varvec{I}}_\mathbf{3}\). As a result of this simplification, the pose estimation problem of Eq. (2) reduces to:

$$\begin{aligned} e^{2}=\mathop {{\hbox {argmin}}}\limits _R \sum _{i=1}^n \Vert {\bar{p}}_i -R{\bar{q}}_i\Vert ^{2} \end{aligned}$$
(6)

Once the optimal rotation \({\hat{{\varvec{R}}}}\) is computed, \({\hat{{\varvec{t}}}}\) and \({\hat{{\varvec{s}}}}\) can be deduced using Eq. (4). The optimal rotation \({\hat{R}} \) itself can be obtained by minimising \(\sum \nolimits _{i=1}^n \Vert {\bar{p}}_i -R{\bar{q}}_i\Vert ^{2}\) with an LS optimiser. The resulting estimation is sufficient for most applications as long as robustness is not a determining factor. However, if the inputs become significantly contaminated with noise, the result becomes unstable (i.e., very sensitive to perturbations in the data) and more likely to drift away from the optimal solution.
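For reference, the closed-form LS optimiser mentioned above admits a compact SVD-based implementation. The following is a Kabsch-style sketch, equivalent in spirit to the closed forms of Horn and Arun (not the recursive method proposed in this paper):

```python
import numpy as np

def ls_rigid_registration(Q, P):
    """Closed-form LS estimate of (R, t) mapping source Q onto target P.

    Q, P: (n, 3) arrays of pre-matched keypoints. SVD-based (Kabsch-style)
    solution; one of several computationally similar closed forms.
    """
    q_bar, p_bar = Q.mean(axis=0), P.mean(axis=0)
    Qc, Pc = Q - q_bar, P - p_bar            # centring decouples translation
    H = Qc.T @ Pc                            # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # sign correction guarantees a proper rotation (det R = +1)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = p_bar - R @ q_bar                    # Eq. (4): t_hat = p_bar - R_hat q_bar
    return R, t
```

With noise-free correspondences the pair (R, t) is recovered exactly; with noisy inputs it minimises the LS cost of Eq. (6) but offers no robustness guarantee, which motivates the recursive formulation below.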

4 3D registration with RLS

Despite the performance of time-varying filters, 3D registration has benefited very little from their assets, even after closed-form methods were proven weak in various practical situations. Moreover, the authors of a number of recent image registration surveys did not even allude to the possibility of solving 3D alignment with recursive filtering tools [20]. Beyond the shortcomings noted above, recursive solutions also make it possible for different registration instances to cooperate: they can share their most up-to-date estimates instantaneously and thus benefit from each other's contributions, which in turn reduces the probability of falling into a local minimum.

4.1 Recursive modelling of 3D registration

In order to express the cost function of Eq. (6) in a recursive fashion, the original problem should be rewritten as shown in Eqs. (7)–(10). Such a transformation allows us to fit the 3D registration problem into a recursive least squares framework.

$$\begin{aligned}&p=Rq+e \end{aligned}$$
(7)
$$\begin{aligned}&\left\{ {{\begin{array}{l} {p_x =r_{11} q_x +r_{12} q_y +r_{13} q_z +e_x} \\ {p_y =r_{21} q_x +r_{22} q_y +r_{23} q_z +e_y} \\ {p_z =r_{31} q_x +r_{32} q_y +r_{33} q_z +e_z} \\ \end{array}}} \right. \end{aligned}$$
(8)

By analogy, the state variable \({\varvec{x}}_{\varvec{k}}\) now represents the rotation matrix \({\varvec{R}}\) of Eq. (7). The optimiser uses pairs of corresponding points to refine the state vector \({\mathcal {R}}_9\), which contains the entries of the rotation matrix. At every time-step \({\varvec{k}}\) we have:

$$\begin{aligned} v= & {} \left[ {{\begin{array}{ccc} {q_x} &{} {q_y} &{} {q_z} \\ \end{array}}} \right] \nonumber \\ {\left[ {{\begin{array}{c} {p_x} \\ {p_y} \\ {p_z} \\ \end{array}}}\right] }= & {} {\left[ {{\begin{array}{c@{\quad }c@{\quad }c} v&{} &{} \\ &{} v&{} \\ &{} &{} v \\ \end{array}}} \right] }{\left[ {{\begin{array}{c} {{\begin{array}{l} {r_{11}} \\ {r_{12}} \\ {r_{13}} \\ \end{array}}} \\ {{\begin{array}{c} {r_{21}} \\ {r_{22}} \\ {r_{23}} \\ \end{array}}} \\ {{\begin{array}{c} {r_{31}} \\ {r_{32}} \\ {r_{33}} \\ \end{array}}} \\ \end{array}}} \right] }+ {\left[ {{\begin{array}{c} {e_x} \\ {e_y} \\ {e_z} \\ \end{array}}} \right] }\end{aligned}$$
(9)
$$\begin{aligned} p= & {} H_R \mathcal {R}_9 +e \end{aligned}$$
(10)

\({\varvec{x}}_{\varvec{k}} =\left[ \begin{array}{c@{\quad }c@{\quad }c@{\quad }c@{\quad }c@{\quad }c@{\quad }c@{\quad }c@{\quad }c} {\varvec{r}}_{\mathbf{11}}&{{\varvec{r}}}_{\mathbf{12}}&{{\varvec{r}}}_{\mathbf{13}}&{\varvec{r}}_{\mathbf{21}}&{\varvec{r}}_{\mathbf{22}}&{\varvec{r}}_{\mathbf{23}}&{\varvec{r}}_{\mathbf{31}}&{\varvec{r}}_{\mathbf{32}}&{\varvec{r}}_{\mathbf{33}} \end{array}\right] ^{{\varvec{T}}}\in {\varvec{\mathfrak {R}}}^{9}\); \({\varvec{A}}_{\varvec{k}} ={\varvec{I}}_\mathbf{9}\) and \({\varvec{B}}_{\varvec{k}} =\mathbf 0 _\mathbf{9}\), as no control variable is required. \({\varvec{w}}_{\varvec{k}} \sim {\varvec{\mathcal {N}}}\left( {\mathbf{0},{\varvec{Q}}_{\varvec{k}}}\right) \) is a random variable representing the process noise, for which \({\varvec{Q}}_{\varvec{k}} = {\varvec{\sigma }}_{\varvec{k}}^\mathbf{2} {\varvec{I}}_\mathbf{9}\); \({\varvec{\sigma }}_{\varvec{k}} >\mathbf{0}\) should be small because the process is accurately determined. \({\varvec{z}}_{\varvec{k}} \in {\varvec{\mathfrak {R}}}^{\mathbf{3}}\) is the actual noisy measurement vector whose elements are the coordinates of the target feature point. \({\varvec{y}}_{\varvec{k}} \in {\varvec{\mathfrak {R}}}^{\mathbf{3}}\) is the predicted observation vector containing the 3D position of the target feature point. \({\varvec{v}}_{\varvec{k}}\) is a random variable for which \({\varvec{R}}_{\varvec{k}} =\left[ \begin{array}{ccc} {\varvec{\sigma }}_{\varvec{x}}&{\varvec{\sigma }}_{\varvec{y}}&{\varvec{\sigma }}_{\varvec{z}}\end{array}\right] {\varvec{I}}_\mathbf{3}\); it represents the noise process contaminating the target feature point localisation. The complete scheme of KF-based registration is given in Algorithm 1, which works as follows: (1) initialise the state vector (rotation matrix) with the entries of \({\varvec{I}}_\mathbf{9}\); if available, an initial guess is preferable. (2) Iterate over the feature points: acquire a new target feature \({\varvec{z}}_{\varvec{k}}\) and build \({\varvec{H}}_{\varvec{k}}\). (3) KF prediction. (4) KF correction, where the estimate \({\varvec{x}}_{\varvec{k}}\) and the covariance of the estimation error \({\varvec{P}}_{\varvec{k}}\) are corrected with the gain \({\varvec{K}}_{\varvec{k}}\).

Algorithm 1 (KF-based 3D registration)
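The four steps above can be sketched in a few lines of code. This is a minimal illustration, assuming the point clouds are already centred (so the translation is decoupled as in Eq. (4)); the noise levels `sigma_proc` and `sigma_meas` are illustrative values, not those used in the experiments:

```python
import numpy as np

def kf_register(Q, P, sigma_proc=1e-6, sigma_meas=1e-2):
    """Sketch of Algorithm 1: recursive KF estimation of the rotation R.

    Q, P: (n, 3) matched source/target keypoints (assumed centred).
    Returns the estimated 3x3 rotation matrix.
    """
    x = np.eye(3).flatten()                  # step (1): state init, R0 = I3
    Pk = np.eye(9)                           # estimation-error covariance
    Qk = sigma_proc**2 * np.eye(9)           # process noise (small: static R)
    Rk = sigma_meas**2 * np.eye(3)           # measurement noise
    for q, z in zip(Q, P):                   # step (2): one feature per step
        Hk = np.kron(np.eye(3), q)           # block-diagonal [v; v; v], Eq. (9)
        Pk = Pk + Qk                         # step (3): prediction (A = I9)
        S = Hk @ Pk @ Hk.T + Rk              # step (4): correction
        K = Pk @ Hk.T @ np.linalg.inv(S)
        x = x + K @ (z - Hk @ x)
        Pk = (np.eye(9) - K @ Hk) @ Pk
    return x.reshape(3, 3)
```

Note that the estimate is refined one correspondence at a time, so new keypoints streamed by the sensor can be incorporated without reprocessing the whole dataset.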

The computational complexity of KF registration is proportional to \({\varvec{O}}({\varvec{n}}\times \mathbf{9}^{\mathbf{3}})\) in the worst case, where \({\varvec{n}}\) is the number of keypoints used to compute the optimal registration and 9 is the size of the state vector. By contrast, the best-case complexity of alternative registration algorithms such as ICP, EMICP and WICP is proportional to \({\varvec{O}}({{\varvec{n}}^{2}\times \mathbf{3}^{\mathbf{2.37}}})\). KF 3D registration can easily be expanded to include the three components of the translation vector in \({\varvec{H}}_{\varvec{k}}\).

5 Robust \({\varvec{H}}_\infty \) registration

5.1 3D points uncertainty

In order to handle instability in parameter estimation, the uncertainties should be confined to a small range. To this end, the behaviour of the noisy inputs must be thoroughly studied. Uncertainties are modelled empirically by examining how the 3D points are distributed and how 3D sensors perceive the real world.

5.2 z-Resolution of RGBD cameras

The authors have shown in previous research [21] that the points within a 3D image lie on parallel clusters named “Z-levels”. Such a structure allows us to correctly quantify the amount of uncertainty in every feature point.

5.3 Depth noise statistics

The measurement noise of RGBD sensors follows a Gaussian distribution whose standard deviation depends on the range between the sensor and the scene. The standard deviation \({\varvec{\sigma }}_{{\varvec{zk}}}\) of a given Z-level \({\varvec{z}}_{\varvec{k}}\) is defined by the length of the interval over which \({\varvec{z}}_{\varvec{k}}\) is expected to vary, as shown below:

$$\begin{aligned} \sigma _{z_k} =( Z_{k+i} - Z_{k-i} ) / 2 \end{aligned}$$
(17)

Here, \({\varvec{\sigma }}_{{\varvec{z}}_{\varvec{k}}}\) represents the average distance separating the two Z-levels \({\varvec{Z}}_{{\varvec{k+i}}}\) and \({\varvec{Z}}_{{\varvec{k-i}}}\) from the central one, \({\varvec{z}}_{\varvec{k}}\). Empirically, the best estimate of the standard deviation of the noise affecting the 3D points lying on \({\varvec{z}}_{\varvec{k}}\) is obtained with \({\varvec{i}}=\mathbf{3}\). That is, the true depth \(\hat{{\varvec{z}}}_{\varvec{k}}\) of a given Z-level is expected to be equal to \({\varvec{z}}_{\varvec{k}} \pm (({\varvec{Z}}_{{\varvec{k+3}}} - {\varvec{Z}}_{{\varvec{k-3}}})/\mathbf{2})\). The standard deviations of the remaining two coordinates \(({\varvec{x}}_{\varvec{k}}, {\varvec{y}}_{\varvec{k}})\) are deduced from the intrinsic parameters of the camera \(({\varvec{f}}_{\varvec{x}}, {\varvec{f}}_{\varvec{y}}, {\varvec{c}}_{\varvec{x}}, {\varvec{c}}_{\varvec{y}})\) and \({\varvec{\sigma }}_{{\varvec{z}}_{\varvec{k}}}\) as follows:

$$\begin{aligned}&\left\{ {{\begin{array}{l} { u_i =(f_x /z_i) x_i +c_x} \\ {v_i =(f_y /z_i) y_i +c_y} \\ \end{array}}} \right. \end{aligned}$$
(18)
$$\begin{aligned}&\left\{ {{\begin{array}{l} {x_i =\left( {z_i /f_x}\right) \left( {u_i -c_x}\right) } \\ {y_i =\left( {z_i /f_y}\right) \left( {v_i -c_y}\right) } \\ \end{array}}} \right. \end{aligned}$$
(19)
$$\begin{aligned}&\left\{ {{\begin{array}{l} {\sigma _{z_k} =0.5(Z_{k+i} -Z_{k-i} )} \\ {\sigma _{x_k} =\left( {\sigma _{z_k} /f_x}\right) \left( {u_k -c_x}\right) } \\ {\sigma _{y_k} =\left( {\sigma _{z_k} /f_y}\right) \left( {v_k -c_y}\right) } \\ \end{array}}} \right. \end{aligned}$$
(20)

Every point is, therefore, affected by a certain amount of noise characterised by the standard deviations \({\varvec{\sigma }}_{{\varvec{x}}_{\varvec{k}}}, {\varvec{\sigma }}_{{\varvec{y}}_{\varvec{k}}}, {\varvec{\sigma }}_{{\varvec{z}}_{\varvec{k}}}\) along the \({\varvec{x}}, {\varvec{y}}\) and \({\varvec{z}}\) axes, respectively. Hence, the covariance matrix attributed to each point \({\varvec{p}}({\varvec{x, y, z}})\) is:

$$\begin{aligned} C\left( {x,y,z} \right) = {\left[ {{\begin{array}{c@{\quad }c@{\quad }c} {\sigma _x^2} &{} {\sigma _x \sigma _y} &{} {\sigma _x \sigma _z} \\ {\sigma _y \sigma _x} &{} {\sigma _y^2} &{} {\sigma _y \sigma _z} \\ {\sigma _z \sigma _x} &{} {\sigma _z \sigma _y} &{} {\sigma _z^2} \\ \end{array}}} \right] } \end{aligned}$$
(21)

\({\varvec{C}}\) represents the spread of uncertainty around the point \({\varvec{p}}({{\varvec{x, y, z}}})\). As can be seen in Fig. 1a, the projection of the covariance ellipsoid of a given 3D point onto the planes \({\varvec{zx, zy, yx}}\) yields three ellipses. The more accurately a feature point is captured, the smaller the norm of its covariance matrix (blue point in Fig. 1a); the less accurately it is captured, the larger that norm (red point in Fig. 1a).
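Eqs. (17)–(21) can be combined in a short routine. The following sketch assumes the Z-levels and the camera intrinsics are given; note that Eq. (21) is exactly the outer product of the vector of per-axis standard deviations with itself:

```python
import numpy as np

def point_covariance(u, v, z_levels, k, fx, fy, cx, cy, i=3):
    """Covariance of a 3D point lying on Z-level k, per Eqs. (17)-(21).

    z_levels: sorted array of depth quantisation levels (Z-levels);
    (u, v): pixel coordinates of the feature; (fx, fy, cx, cy): intrinsics.
    """
    sz = 0.5 * (z_levels[k + i] - z_levels[k - i])   # Eq. (17) / (20)
    sx = (sz / fx) * (u - cx)                        # Eq. (20)
    sy = (sz / fy) * (v - cy)
    s = np.array([sx, sy, sz])
    return np.outer(s, s)                            # Eq. (21)
```

The resulting matrix feeds directly into the uncertainty terms of the robust filter described in the next section.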

Fig. 1
figure 1

a 3D points uncertainty ellipses, b ground truth (OptiTrack) and real data acquisition with Kinect in an indoor scene, c, d 3D registration RMSE (m) of the new and the old Kinect, respectively; e, f time elapsed during registration for the new and the old Kinect, respectively. EMICP (pink), WICP (green), Horn (red), RF (black), KF (blue) (colour figure online)

Kanazawa and Kanatani [22] claimed that the incorporation of feature uncertainty does not contribute any further improvement to the estimation. In contrast, Brooks et al. [23], as well as our own previous work [24], observed a reduced estimation error after considering uncertainty. Based on the experiments conducted with registration algorithms and the fact that Weighted-ICP (WICP, which takes data uncertainty into account) outperforms ICP, as will be shown in the results, the incorporation of feature-location uncertainty clearly improves pose estimation considerably.

6 Robust \({\varvec{H}}_{\infty }\) (RF) filter for 3D registration

In this section, we propose a time-varying registration algorithm that incorporates modelling and measurement uncertainties as follows:

$$\begin{aligned} x_k= & {} \left( {A_k +{\Delta } A_k}\right) {\hat{x}}_{k-1} +B_k u_k +w_k \end{aligned}$$
(22)
$$\begin{aligned} y_k= & {} (H_k +{\Delta } H_k) x_k +v_k \end{aligned}$$
(23)

\({\varvec{{\Delta }}} {\varvec{H}}_{\varvec{k}}\) represents the uncertainty in the observation model, whereas \({\varvec{{\Delta }}} {\varvec{A}}_{\varvec{k}}\) is the uncertainty in the process model. In our case, the two matrices take the values:

$$\begin{aligned}&{\Delta } A_k =\sigma _A I_9 \nonumber \\&\sigma _A =\left[ {\sigma _{r_{11}} \sigma _{r_{12}} \sigma _{r_{13}} \sigma _{r_{21}} \sigma _{r_{22}} \sigma _{r_{23}} \sigma _{r_{31}} \sigma _{r_{32}} \sigma _{r_{33}}} \right] \nonumber \\&V_k =\left[ {{\begin{array}{c@{\quad }c@{\quad }c} {q_x +\sigma _x}&{q_y +\sigma _y}&{q_z +\sigma _z} \end{array}}} \right] \end{aligned}$$
(24)
$$\begin{aligned}&{{\Delta } H}_k = {\left[ {{\begin{array}{ccc} {V_k} &{} &{} \\ &{} {V_k} &{} \\ &{} &{} {V_k} \\ \end{array}}} \right] } \end{aligned}$$
(25)
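A possible construction of the two uncertainty terms of Eqs. (24)–(25) is sketched below. Interpreting \(\sigma _A\) as a diagonal matrix of the entry-wise standard deviations is an assumption on our part, since the text writes it as a row vector scaling \(I_9\):

```python
import numpy as np

def uncertainty_matrices(sigma_r, q, sigma_p):
    """Build Delta A_k and Delta H_k following Eqs. (24)-(25).

    sigma_r: 9 std-devs of the rotation entries; q: source point (3,);
    sigma_p: per-axis std-devs (sx, sy, sz) of the point, e.g. from Eq. (20).
    Illustrative construction, not the paper's exact implementation.
    """
    dA = np.diag(sigma_r)               # Delta A_k (9x9), diagonal assumption
    Vk = q + sigma_p                    # V_k = [qx+sx, qy+sy, qz+sz], Eq. (24)
    dH = np.kron(np.eye(3), Vk)         # Delta H_k, block-diagonal (3x9), Eq. (25)
    return dA, dH
```

These matrices perturb the nominal \(A_k\) and \(H_k\) of the KF model in Eqs. (22)–(23).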

If these matrices cannot be determined, RF is still able to control the instability disturbing its parameters [25] by assuming the uncertainty to be of the form:

$$\begin{aligned} {\left[ {{\begin{array}{c} {{\Delta } A_k} \\ {{\Delta } H_k} \\ \end{array}}} \right] } = {\left[ {{\begin{array}{c} {M_{1k}} \\ {M_{2k}} \\ \end{array}}} \right] }\varGamma _k N_k \end{aligned}$$
(26)

\({\varvec{M}}_{{\varvec{1k}}}, {\varvec{M}}_{{\varvec{2k}}}\) and \({\varvec{N}}_{\varvec{k}}\) are known matrices; \({\varvec{\varGamma }}_{\varvec{k}}\) is unknown but must satisfy the bound:

$$\begin{aligned} \varGamma _k^{T}\varGamma _k \le I \end{aligned}$$
(27)

Our purpose is to design a state estimator of the form:

$$\begin{aligned} x_{k+1} = \tilde{A}_k x_k +\tilde{K}_k t_k \end{aligned}$$
(28)

The estimator must be stable (the eigenvalues of \({\tilde{\varvec{A}}}_{\varvec{k}}\) must be less than one in magnitude). The filter parameters can be determined through the procedure described in our previous work [24].

The RF adaptation is flexible and capable of delivering accurate state estimates however uncertain the system's parameters are. Estimation error compared with ground truth measurements will show the effectiveness of RF 3D registration against non-robust alternatives such as KF and the more established algorithms in the literature. In real scenarios, the exact model is very unlikely to be determined [26], yet non-robust tools do not consider uncertainties in their parameters. Hence, if by chance the parameters are accurate, these tools perform as well as RF; when the system is not precisely characterised, however, they become significantly unstable. RF registration combines the robustness of \({\varvec{H}}_\infty \) (it is less affected by the accuracy of the system's parameters) with the optimality of KF on linear systems to produce an accurate and stable estimate. This quality guarantees high estimation precision and greater stability against input perturbations.

7 Results and discussion

In this section, the results regarding KF and RF registration are validated with tests on real and synthetic 3D data. Our test benchmark includes: WICP [27]; Expectation Maximisation ICP algorithm (EMICP) [28] and Horn’s closed-form solution based on quaternions (HORN) [29].

Here, accuracy is measured by the distance separating the target and the source point clouds after registration. In order to assess every algorithm fairly, the processing time elapsed to find the best pose is also recorded. Throughout the experiments, it is noticeable that the plotted metrics (processing time and \(\hbox {RMSE}=\sqrt{\frac{\mathbf{1}}{{\varvec{n}}}\sum \nolimits _{{\varvec{i=1}}}^{\varvec{n}} \Vert {\hat{{\varvec{R}}}} {\varvec{Q}}_{\varvec{i}} +{\hat{{\varvec{t}}}} -{\varvec{P}}_{\varvec{i}}\Vert ^{\mathbf{2}}}\)) are not homogeneous. For this reason, a logarithmic scale is used to cope with the difference of scale within the same plot.
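The RMSE metric above is straightforward to compute; a minimal sketch:

```python
import numpy as np

def registration_rmse(R, t, Q, P):
    """RMSE between the transformed source and the target, as defined above.

    Q, P: (n, 3) matched keypoints; R, t: estimated rigid transformation.
    """
    residuals = Q @ R.T + t - P                       # rows: R q_i + t - p_i
    return np.sqrt(np.mean(np.sum(residuals**2, axis=1)))
```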

Fig. 2
figure 2

ac 3D registration RMSE of the small, average and large noise, respectively; df time elapsed during registration for small, average and large noise, respectively. EMICP (pink), WICP (green), Horn (red), RF (black), KF (blue) (colour figure online)

The number of keypoints extracted from every point cloud is about 400; in practice, an average-sized point cloud in a single frame contains up to 400 useful keypoints. Computation time was measured for the five algorithms running on an i7-2670QM at 2.2 GHz with 12.0 GB of memory. A sample is a set of 400 pairs of corresponding \({<}source, target{>}\) keypoints; 30 samples were tested in each of the following five scenarios (two with real and three with synthetic data).
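The construction of one synthetic sample (used in the synthetic-data experiments of Sect. 7.2) can be sketched as follows; the keypoint ranges and the anisotropy model are illustrative assumptions:

```python
import numpy as np

def make_sample(n=400, sigma=0.01, rng=None):
    """Generate one test sample: n matched <source, target> keypoint pairs.

    A random rigid pose [R, t] is drawn and the targets are built as
    P = R Q + t plus anisotropic Gaussian noise of magnitude sigma (metres).
    """
    if rng is None:
        rng = np.random.default_rng()
    Q = rng.uniform(-1.0, 1.0, size=(n, 3))          # source keypoints
    A = rng.normal(size=(3, 3))
    R, _ = np.linalg.qr(A)                           # random orthonormal basis
    if np.linalg.det(R) < 0:
        R[:, 0] *= -1                                # ensure det(R) = +1
    t = rng.uniform(-0.5, 0.5, size=3)
    # anisotropic white noise: a different effective scale per axis
    noise = rng.normal(scale=sigma, size=(n, 3)) * rng.uniform(0.5, 1.5, size=3)
    P = Q @ R.T + t + noise
    return Q, P, R, t
```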

7.1 Real data

In this experiment, image data are delivered by two versions of the Kinect sensor (Kinect 1 is based on the structured light principle, whereas Kinect 2 is a time-of-flight camera). In addition, the SIFT3D extractor and the CSHOT [21] descriptor were used to obtain feature points from the real data.

In order to collect real 3D point clouds, the camera was carried and moved around in an infinity-shaped \((\infty )\) trajectory within the arena of our autonomous navigation laboratory. Simultaneously, a high-quality tracking system (OptiTrack) was used as a ground truth reference (Fig. 1b). One hundred and twenty pairs of overlapping point clouds were captured by each of the two Kinects. RGBD image data acquisition runs as the robot moves around; at each time-step, we acquire a single pair of colour and depth images (together constituting a single point cloud) of the indoor scene. Hence, a total of 120 pairs of point clouds are aligned in a pairwise manner \(({\varvec{C}}_{\varvec{i}}, {\varvec{C}}_{{\varvec{i}}+\mathbf{1}})\). The last sample \({\varvec{C}}_{\mathbf{120}}\) is registered against both \({\varvec{C}}_{\mathbf{119}}\) and \({\varvec{C}}_\mathbf{1}\) to test the loop closure.

Scenario 1: New Kinect

RMSE The average RMSE for the five algorithms (see Fig. 1c) was as follows: 0.27 m for EMICP (pink), 0.13 m for WICP (green), 0.28 m for Horn (red), 0.15 mm for RF (black) and 0.7 mm for KF (blue).

Scenario 2: Old Kinect

RMSE was 0.28 m for EMICP, 0.22 m for WICP, 0.3 m for Horn, 0.95 mm for RF and 1.13 mm for KF (see Fig. 1d).

Average processing time for both scenarios was 114.3 ms for EMICP, 26.7 ms for WICP, 1.05 ms for Horn, 23.1 ms for RF and 11.64 ms for KF (see Fig. 1e, f).

7.2 Synthetic data

In this experiment, we consider only artificial 3D keypoints: the source keypoints \({\varvec{Q}}_{\varvec{i}}\) and a 3D transformation \([{{\varvec{R}}_{\varvec{i}}, {\varvec{t}}_{\varvec{i}}}]\) were generated randomly, and the target keypoints were built as \({\varvec{P}}_{\varvec{i}} = {\varvec{R}}_{\varvec{i}}\,{\varvec{Q}}_{\varvec{i}} +{\varvec{t}}_{\varvec{i}}\). To simulate physical data realistically, normally distributed anisotropic white noise was added to the clean datasets, with three magnitude ranges \({\varvec{\sigma }}_{\varvec{i}}\): large (\(20\,\mathrm{mm}\le \sigma _i \le 80\,\mathrm{mm})\), average (\(10\,\mathrm{mm}\le \sigma _i \le 20\,\mathrm{mm})\) and small (\(0.1\,\mathrm{mm}\le \sigma _i \le 10\,\mathrm{mm})\). For each magnitude, 1000 point clouds were generated; the results were as follows:

Scenario 1: Small noise magnitude

RMSE was 0.42 m for EMICP, 0.18 m for WICP, 0.46 m for Horn, 0.18 mm for RF and 0.44 mm for KF (see Fig. 2a).

Scenario 2: Average noise magnitude

RMSE was 0.54 m for EMICP, 0.48 m for WICP, 0.56 m for Horn, 0.22 mm for RF and finally, 0.43 mm for KF (see Fig. 2b).

Scenario 3: Large noise magnitude

RMSE was 0.49 m for EMICP, 0.35 m for WICP, 0.51 m for Horn, 0.63 mm for RF and 0.89 mm for KF (see Fig. 2c).

Average processing time for all three scenarios was 115.1 ms for EMICP, 27.03 ms for WICP, 1.08 ms for Horn, 22.8 ms for RF and 10.43 ms for KF (see Fig. 2d–f).

As illustrated in Table 1, EMICP and Horn perform noticeably poorly. This drawback often occurs when the shapes present some symmetry. WICP, on the other hand, is better equipped to cope with such situations since it leverages knowledge about the quality of the features, which helps it discard noisy elements. More importantly, KF and RF are both comparably superior in terms of accuracy, but RF is more precise owing to its control of parameter uncertainty.

Table 1 RMSE (mm) for the whole sets of samples: 1000 for each simulation scenario and 120 for every version of Kinects

8 Conclusion and future works

A novel approach to robust 3D point cloud registration was presented, based on recursive optimal state estimation. After establishing the link between RLS and its original counterpart (LS), the 3D point cloud registration problem was fitted to the KF scheme. However, since the KF parameters for 3D registration (state and projection matrices) are built from noisy data, a non-negligible estimation instability was noticed. Consequently, we modelled the uncertainty and overcame it with an RF-based solution.

The accuracy of the proposed solution was tested on numerous synthetic as well as real 3D samples delivered by the Kinect. Its precision, on the other hand, can be seen in the relatively small difference in accuracy among comparably noisy samples (red error bars in Figs. 1c, d and 2a–c on the black line).

The proposed solution requires some feature points to be extracted from the source and the target point clouds before the alignment is carried out; the number of keypoints is relatively small compared to the size of the point clouds. In addition, our solution can be extended to any dimension and to data such as point clouds, meshes and surfaces, given that some distinctive features are available.

As future work, we intend to investigate alternative applications of recursive filtering algorithms in the field of computer vision. It would also be interesting to implement RF registration on the graphics processor to reach higher frame rates. In addition, in a multiview registration scenario (many sensors streaming images concurrently), data fusion algorithms open a new perspective for reconstructing 3D scenes and tracking moving objects cooperatively. This horizon is particularly relevant to virtual and augmented reality technologies.