
1 Introduction

The trajectory estimation problem has been in the spotlight for many years, yielding many different solutions. The most accurate approaches belong to the dense group and use all the information seen by the camera. Because they are computationally expensive and require GPU acceleration, such approaches are hardly usable in mobile robotics on a larger scale. For these reasons, we focus on the existing feature-based (sparse) approaches. With a single passive (RGB) camera it is impossible to obtain full depth information about the scene, and in turn to accurately determine the scale of the translation between two frames. However, it is still possible to compute the rotational components of the trajectory and to use them to constrain the orientation of the robot/camera. This is of high importance, as small orientation errors can result in large translational errors. It was shown in [2] that a SLAM system can benefit from being augmented with a visual estimate of the frame-to-frame orientation change.

In the last few years, the focus in visual trajectory estimation has been put on Visual SLAM and Visual Odometry (VO). The former type of approach builds a map and then estimates the robot position with respect to it; the first working real-time Visual SLAM was demonstrated by Davison and Murray [11]. The latter type of approach pursues the same goal without building a map, at the cost of a larger number of frame-to-frame measurements. A comprehensive overview of VO methods is presented in [12, 13], and in [14], where VO and SLAM algorithms are briefly characterized.

The rest of this paper is structured as follows: Sect. 2 presents the used libraries and related work, Sect. 3 briefly describes the n-point relative pose computation algorithms, Sect. 4 details the methodology used to obtain both the quantitative and qualitative results, Sect. 5 describes and comments on the obtained results, and Sect. 6 concludes the paper.

2 State of the Art

Currently, several libraries help in obtaining rotation from RGB images: OpenCV [8], OpenMVG and OpenGV. For finding essential matrices from 2D-2D correspondences, OpenMVG and OpenGV contain the 5-point Nister [3, 4] and Stewenius [5] methods, as well as the 7-point and 8-point methods; the latter two are also available in OpenCV, but there they return the fundamental matrix only. OpenMVG contains a 5-point method that uses the Nister and Stewenius constraint expansions to find the essential matrix. The richest library of this kind is OpenGV, which allows for [9]: computing the pose of a camera from 2D-3D correspondences between points in the world frame and bearing vectors in the camera frame (the central absolute pose problem), also for multiple camera frames (the non-central absolute pose case), and finding the pose of one camera with respect to another given a number of 2D-2D correspondences between bearing vectors in the camera frames (the central relative pose problem), also for multiple camera frames (the non-central relative pose problems). Depending on the algorithm, the output may be: only a rotation, only a translation, several probable rotations, a rotation and a translation, several probable rotations and translations, an essential matrix, several probable essential matrices, or several probable complex essential matrices.

In the past, Murphy et al. [14] presented an evaluation of three VO techniques: patch-based, Structure from Motion, and a monocular version of the VO approach of [15]. Recently, there has been work on semi-dense approaches [1] that label themselves as VO methods, claiming frame-to-frame trajectory estimation based on semi-dense inverse depth maps. Nonetheless, to create those semi-dense depth maps the approach uses the history of RGB frames to find the possibly oldest frame in which a pixel from the current frame was seen. That leads to building a form of a map, and to its indirect use. Consequently, this type of solution does not fit the VO definition well, being partially a SLAM system, yet without trajectory optimization. Worth mentioning are the experiments trying to combine both worlds of visual SLAM and VO. Earlier, Schmidt et al. [2] proposed to augment a SLAM method by replacing, every few steps, the orientation change in the EKF prediction function with the one provided by a separately running VO system. A comparison of SLAM and VO approaches for RGB-D data was presented in [6].

As for now, there is very little work concerning the evaluation of methods that estimate the frame-to-frame orientation change. Recently, Hartman et al. [7] used the 5-point Stewenius algorithm with a modified RANSAC approach. Their ground truth data was collected with IMU and GPS devices, while the data used in the experiments presented here were collected by a high-resolution multi-camera system [21], or the Vicon motion capture system [19].

3 Used Algorithms

In this section, we briefly describe the algorithms used for solving the geometric vision problems. For preliminary image processing we used the SURF, ORB and AKAZE feature detector-descriptor pairs. The 8-point algorithm [16, 17] is the oldest algorithm considered here; it computes the essential or fundamental matrix (depending on whether the camera calibration is known) from 8 or more matched points. The essential and fundamental matrices are defined by the epipolar constraint:

$$\begin{aligned} \mathbf{u}^{\prime T} \mathbf{E}\, \mathbf{u} = 0, \qquad \mathbf{u}^{\prime T} \mathbf{F}\, \mathbf{u} = 0 \end{aligned}$$
(1)

where u’ and u are matched homogeneous image points (hence their last element equals 1), and E and F stand for the essential and the fundamental matrix (the latter used when no calibration data is available). The input points are normalized to ensure that all entries of the respective matrices are treated approximately equally. After rewriting the matrix Eq. 1 in vector form, a set of linear equations is obtained:

$$\begin{aligned} \mathbf{A}\, \mathbf{e} = 0 \end{aligned}$$
(2)

where A is the equation matrix and e and f are 9-vectors containing the entries of the matrix E or F, respectively. The system is solved under the normalization constraint ‖e‖ = 1 or ‖f‖ = 1, where ‖e‖ and ‖f‖ are the vector norms. The resulting matrix is then checked for singularity, i.e. whether it is of rank 2, and the constraint is enforced if this is not the case: the result is replaced by the closest singular matrix under the Frobenius norm. A minimal numerical sketch of this procedure is given below.
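The following NumPy sketch illustrates the normalized 8-point procedure described above (Eqs. 1-2 plus the rank-2 enforcement). It is an illustration under the stated equations, not the exact implementation used in the experiments; all function and variable names are ours:

```python
import numpy as np

def eight_point(u1, u2):
    """Normalized 8-point algorithm for N >= 8 matches.
    u1, u2: Nx2 arrays of matched image points. Returns F (or E, when
    the inputs are already calibrated/normalized coordinates)."""
    def normalize(pts):
        # Hartley normalization: move the centroid to the origin and
        # scale so that the mean distance from it is sqrt(2)
        c = pts.mean(axis=0)
        s = np.sqrt(2.0) / np.mean(np.linalg.norm(pts - c, axis=1))
        T = np.array([[s, 0.0, -s * c[0]],
                      [0.0, s, -s * c[1]],
                      [0.0, 0.0, 1.0]])
        ph = np.column_stack([pts, np.ones(len(pts))]) @ T.T
        return ph, T

    p1, T1 = normalize(u1)
    p2, T2 = normalize(u2)
    # One row of the equation matrix A (Eq. 2) per correspondence
    A = np.column_stack([p2[:, :1] * p1, p2[:, 1:2] * p1, p1])
    # Solve A f = 0 under ||f|| = 1: right singular vector of the
    # smallest singular value
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    # Enforce the rank-2 (singularity) constraint: closest singular
    # matrix under the Frobenius norm
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    return T2.T @ F @ T1  # undo the normalization
```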

Nister’s algorithm [3, 4] computes the essential matrix given 5 correspondences, which gives rise to an essential matrix constraint of the following form:

$$\begin{aligned} \mathbf{E} \mathbf{E}^{T} \mathbf{E} - \frac{1}{2}\, \mathrm{trace}(\mathbf{E} \mathbf{E}^{T})\, \mathbf{E} = 0 \end{aligned}$$
(3)

Stacking the vectorized epipolar constraints (Eq. 1) of the five correspondences yields a \(5\times 9\) equation matrix. Next, four vectors \(\tilde{\mathbf{X}}\), \(\tilde{\mathbf{Y}}\), \(\tilde{\mathbf{Z}}\), \(\tilde{\mathbf{W}}\) that span the right null space of that matrix are calculated.

$$\begin{aligned} \mathbf{E} = x \tilde{\mathbf{X}} + y \tilde{\mathbf{Y}} + z \tilde{\mathbf{Z}} + w \tilde{\mathbf{W}} \end{aligned}$$
(4)

These vectors correspond directly to four matrices constraining the form of the essential matrix via Eq. 4, which is in turn inserted into the nine cubic constraints of Eq. 3. The calculations are performed using a Gröbner basis. The scalars x, y, z, w are defined up to a common scale factor, so it is assumed that w = 1. As output, the algorithm can return up to 10 essential matrices, but some of them are internally rejected; a brief library-level example follows.
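As a side note, Nister’s five-point solver is also conveniently accessible from Python: OpenCV 3.x and later implement it in findEssentialMat (a newer feature set than the OpenCV functionality surveyed in Sect. 2). A minimal, hedged sketch, in which the variable names and parameter values are our assumptions:

```python
import cv2
import numpy as np

# pts1, pts2: Nx2 float32 arrays of matched image points;
# K: 3x3 camera intrinsic matrix.
E, inlier_mask = cv2.findEssentialMat(pts1, pts2, K,
                                      method=cv2.RANSAC,
                                      prob=0.999,     # RANSAC confidence
                                      threshold=1.0)  # inlier threshold [px]
# Resolve the remaining ambiguity by the cheirality check (triangulated
# points must lie in front of both cameras); t is recovered up to scale
n_inliers, R, t, pose_mask = cv2.recoverPose(E, pts1, pts2, K,
                                             mask=inlier_mask)
```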

Kneip [18] proposed a direct method for finding the exact rotation between two images independently of the translation. This results from the fact that the rotational optical flow is fundamentally different from the translational one. The constraint on the rotation is then as follows:

$$\begin{aligned} \left| (f_1 \times \mathbf{R} f^{\prime }_{1}) \; (f_2 \times \mathbf{R} f^{\prime }_{2}) \; (f_3 \times \mathbf{R} f^{\prime }_{3}) \right| = 0 \end{aligned}$$
(5)

Because the rotation encodes 3 degrees of freedom, at least 3 such epipolar plane normal coplanarity constraints are required to fully constrain it. Two additional features allow for building the necessary system of equations:

$$\begin{aligned} \left| (f_1 \times \mathbf{R} f^{\prime }_{1}) \; (f_2 \times \mathbf{R} f^{\prime }_{2}) \; (f_3 \times \mathbf{R} f^{\prime }_{3}) \right| = 0 \nonumber \\ \left| (f_1 \times \mathbf{R} f^{\prime }_{1}) \; (f_2 \times \mathbf{R} f^{\prime }_{2}) \; (f_4 \times \mathbf{R} f^{\prime }_{4}) \right| = 0 \nonumber \\ \left| (f_1 \times \mathbf{R} f^{\prime }_{1}) \; (f_2 \times \mathbf{R} f^{\prime }_{2}) \; (f_5 \times \mathbf{R} f^{\prime }_{5}) \right| = 0 \end{aligned}$$
(6)

To solve Eq. 6, a Gröbner basis is used. As output, the algorithm can return up to 20 different rotation matrices, resulting from 10 essential matrices, but some of them are internally rejected. A small numerical check of the underlying coplanarity constraint is sketched below.
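To make Eqs. 5 and 6 concrete, the following self-contained NumPy snippet generates a synthetic rigid motion and verifies that the epipolar plane normals \(f_i \times \mathbf{R} f^{\prime }_{i}\) are coplanar at the true rotation (they are all orthogonal to the translation, so their determinant vanishes). The whole setup is illustrative and ours:

```python
import numpy as np

rng = np.random.default_rng(42)

Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R_true = Q * np.sign(np.linalg.det(Q))       # a random proper rotation
t_true = rng.normal(size=3)                  # camera 2 position in frame 1

P1 = rng.normal(size=(3, 3)) + np.array([0.0, 0.0, 5.0])  # points, frame 1
P2 = (P1 - t_true) @ R_true                  # the same points in frame 2

f = P1 / np.linalg.norm(P1, axis=1, keepdims=True)   # bearings f_i
fp = P2 / np.linalg.norm(P2, axis=1, keepdims=True)  # bearings f'_i

normals = np.cross(f, fp @ R_true.T)         # rows: f_i x (R f'_i)
print(np.linalg.det(normals))                # ~1e-16, i.e. zero (Eq. 5)
```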

The Eigensolver method [18], in turn, seeks the rotation that minimizes the smallest eigenvalue of a matrix M, as expressed by the following function:

$$\begin{aligned} \mathbf{R} = \mathrm{argmin}_{\mathbf{R}}\, \lambda _{M,min} \end{aligned}$$
(7)

where R is the rotation transforming the i-th bearing vector \(f_{i}\) to the corresponding one seen from the second frame, \(f_{i}^{\prime }\). M is a real, symmetric, positive semi-definite matrix of rank at most 2, of the following form: \( \mathbf{M} = \sum _{i=1}^{n}(f_i \times \mathbf{R} f_i^{\prime })(f_i \times \mathbf{R} f_i^{\prime })^{T}\). Because the rotation has 3 degrees of freedom, only a non-linear optimization over three parameters is needed. To solve the problem, Kneip et al. [18] enforce the first-order partial derivatives of \(\lambda _{M,min}\) to be zero and then use a Levenberg-Marquardt scheme. This algorithm also computes the translation direction (although it does not explicitly return it), as it is automatically given by the eigenvector corresponding to the smallest eigenvalue. Finally, we have implemented the Kabsch algorithm [22, 23] with the help of the information found in [24]. As this algorithm computes the transformation aligning two point sets, the camera motion is its inverse; a sketch of the computation follows.
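A compact NumPy sketch of the Kabsch computation, following the formulation in [24]; this is a simplified illustration of the procedure, not a verbatim copy of our code:

```python
import numpy as np

def kabsch(P, Q):
    """Kabsch alignment: finds R, t such that R @ p + t ~= q for
    matched 3D point sets P, Q (Nx3), e.g. points obtained from the
    depth data of two consecutive frames."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)               # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t  # aligns P onto Q; the camera motion is the inverse
```

The sign correction on the smallest singular direction guarantees a proper rotation (det R = +1) even for noisy, approximately rigid point sets.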

4 Experimental Methodology

The whole program is constructed similarly to the work presented in [10]: first, a detector/descriptor algorithm is used to extract salient features. Then the RANSAC technique, with a varying rejection threshold (and, for the Eigensolver method, also a varying confidence), is performed twice to obtain the best results. At this stage, two different error measures are used: the reprojection error and a simple Euclidean distance between bearing vectors. The second measure omits the translation, assuming it is sufficiently small. Finally, the trajectory is computed with all the remaining matches. The Kneip algorithm is an exception: it only accepts from 5 to 8 bearing vector pairs. It is also the only algorithm that does not compute the translation; therefore, to be able to compute the reprojection error, we employ a two-point algorithm that is also contained in the OpenGV library. Because some algorithms, namely Kneip, Nister and Eightpoint, return several results, the best one is chosen based on the appropriate error measure. In the case of an empty result set, resulting from in-algorithm suppression, 8 points are randomly chosen from the inlier set to compute the rotation and translation, as long as the maximal number of RANSAC iterations is not exceeded. If the number of remaining inliers is insufficient for trajectory estimation, the whole pipeline aborts as unable to correctly recover the trajectory. A sketch of such a frame-to-frame pipeline is given at the end of this section.

For the experiments we used 2 different trajectories: the putkk_Dataset_1_Kin_1 data set described in [20], further referred to as dataset1, and fr3_long_office_household from the TUM RGB-D Benchmark [19], further referred to as dataset2. Dataset1 was collected by a Kinect sensor mounted on a wheeled robot moving in a laboratory. Ground truth was obtained from the multi-camera vision system PUT Ground Truth (PUT GT) [21]. The acquisition times of the robot and GT cameras were synchronized, so no interpolation was performed. Dataset2 was collected by a hand-held Kinect camera. Ground truth was also acquired by a multi-camera system, but the acquisition times were not enforced, so the timestamps of the ground truth and of the data collected by the camera are not perfectly aligned, which resulted in the need for data interpolation.

To measure the quality of the retrieved trajectory we use the well-known Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) metrics, as described in [19]. The first error is the Euclidean distance between corresponding points of the ground truth trajectory and the estimated one. RPE is the relative translational or rotational error of successive frames. For evaluation, the Root Mean Squared Error (RMSE) of the mentioned metrics is used.
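The sketch below shows how one frame-to-frame step of such a pipeline can be assembled with OpenCV. It illustrates the processing flow only and is not our exact implementation (which relies on the OpenGV solvers and the two error measures described above); names and parameter values are our assumptions:

```python
import cv2
import numpy as np

def frame_to_frame(img1, img2, K, thresh=1.0, conf=0.999):
    """One frame-to-frame step: detect/describe, match, RANSAC over the
    epipolar constraint, recover the up-to-scale motion."""
    det = cv2.AKAZE_create()            # or cv2.ORB_create(); SURF needs
                                        # the opencv-contrib package
    k1, d1 = det.detectAndCompute(img1, None)
    k2, d2 = det.detectAndCompute(img2, None)
    # Hamming norm for the binary AKAZE/ORB descriptors (L2 for SURF)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([k2[m.trainIdx].pt for m in matches])
    # RANSAC over the epipolar constraint; `thresh` plays the role of
    # the varying rejection threshold mentioned above
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=conf, threshold=thresh)
    if E is None:                       # too few inliers: the pipeline
        return None                     # aborts, as described above
    # Final motion from the remaining inliers; t is up to scale
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t
```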

Table 1. Comparison of absolute trajectory errors (ATE RMSE) and relative pose errors (RPE RMSE) for the putkk_Dataset_1_Kin_1

5 Results

In this section, in order to fit the data into the tables, we abbreviated some names. The translation recovery approaches based on the use of the centroid are abbreviated to C., and the scale-based methods (depth is used only in the first two frames, to determine the scale) are denoted as S. Consequently, the reprojection error is denoted as R. and the Euclidean error as E.

In the first experiment, it turned out that the scene was too demanding, or the found points were insufficient, for the Eigensolver and Nister algorithms to retrieve the full trajectory in 3 out of 4 cases. It is possible that those algorithms could recover the trajectory if a different set of points was provided, but regardless of this fact, it shows that they are not as robust as the rest of the investigated methods. With respect to the rotational RPE RMSE errors, it is questionable whether the Kneip algorithm managed to retrieve the trajectory at all. The best of the n-point relative pose algorithms was the 8-point one, achieving its best results with a constant scale, mainly when it was ignoring the translational part in the RANSAC scheme. While this depends on the initialization stage and the detected points, the differences are not large. Still, it is outperformed by the Kabsch algorithm, which uses full depth information, as demonstrated in Fig. 1. This method combined with the SURF detector/descriptor achieves roughly half the ATE error obtained with the other detector/descriptor pairs. The full comparison is given in Table 1.

In the second experiment, all configurations but Nister coupled with the AKAZE detector/descriptor were able to reconstruct the trajectory. The best results were achieved by the Kabsch algorithm with the use of depth data. The best algorithm for finding the rotation from solely RGB data is the 8-point algorithm if the ATE error is considered, but if the RPE metrics are evaluated, the Eigensolver seems to be slightly better. The 8-point algorithm appears to work slightly better with the proposed in-RANSAC Euclidean error measure than with the reprojection error. This is not the case for the Eigensolver algorithm, as it benefits when combined with SURF, and suffers when combined with AKAZE in the constant-scale setting. Also, the Eigensolver sometimes returns no solution and thus is not as reliable as the 8-point algorithm. The Kneip algorithm has the biggest RPE error of all the investigated algorithms. This is probably due to its inability to perform computations on more than 8 points. It can be assumed that it “reconstructed” the trajectory, as Fig. 1 shows, when working without depth data, having only a scale and the reprojection error, but this is not the case when it is used with depth information in the centroid fashion or with the Euclidean error. Still, the RPE errors are high. The reason for this is the residual outliers, which gain a lot of importance in the final computation of the estimate; the other algorithms perform some form of least-squares optimization, and thus the residual outliers are strongly suppressed. The results of all the investigated algorithms for the fr3_long_office_household sequence are collected in Table 2 and illustrated in Fig. 2.
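For reference, the ATE and RPE RMSE values reported in Tables 1 and 2 follow the definitions of [19]; a minimal sketch of their computation, assuming the trajectories have already been associated, interpolated and aligned as described in Sect. 4:

```python
import numpy as np

def ate_rmse(gt_pos, est_pos):
    """ATE RMSE: gt_pos, est_pos are Nx3 arrays of associated (and
    already aligned) trajectory positions."""
    return np.sqrt(np.mean(np.sum((gt_pos - est_pos) ** 2, axis=1)))

def rpe_trans_rmse(gt_poses, est_poses, delta=1):
    """Translational RPE RMSE over a step of `delta` frames; the pose
    lists contain 4x4 homogeneous camera poses."""
    errs = []
    for i in range(len(gt_poses) - delta):
        rel_gt = np.linalg.inv(gt_poses[i]) @ gt_poses[i + delta]
        rel_est = np.linalg.inv(est_poses[i]) @ est_poses[i + delta]
        e = np.linalg.inv(rel_gt) @ rel_est    # residual relative motion
        errs.append(np.linalg.norm(e[:3, 3]))  # translational part
    return np.sqrt(np.mean(np.square(errs)))
```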

Fig. 1. Trajectories estimated for the putkk_Dataset_1_Kin_1: (a) ORB ATE C.E. Eigensolver, (b) SURF ATE S.E. Nister, (c) SURF ATE Kabsch, (d) ORB RPE C.E. Eigensolver, (e) SURF RPE S.E. Nister, (f) SURF RPE Kabsch

Table 2. Comparison of absolute trajectory errors (ATE RMSE) and relative pose errors (RPE RMSE) for the fr3_long_office_household sequence
Fig. 2. Trajectories estimated for the fr3_long_office_household: (a) AKAZE ATE S.E. 8-point, (b) AKAZE ATE S.R. Kneip, (c) AKAZE ATE C.R. Kneip, (d) AKAZE RPE S.E. 8-point, (e) AKAZE RPE S.R. Kneip, (f) AKAZE RPE C.R. Kneip, (g) SURF ATE S.E. Eigensolver, (h) ORB ATE S.E. Nister, (i) AKAZE ATE Kabsch, (j) SURF RPE S.E. Eigensolver, (k) ORB RPE S.E. Nister, (l) AKAZE RPE Kabsch, (m) SURF RPE C.E. 8-point, (n) ORB RPE S.R. Kneip, (o) SURF RPE Kabsch

6 Conclusions

Taking into account the presented experimental results, we can conclude that the best algorithm for solving the frame-to-frame motion estimation problem is the Kabsch algorithm, which, however, requires depth data, e.g. from an RGB-D sensor. Among the passive monocular approaches, the 8-point algorithm won the competition, as the Eigensolver, although quite accurate, was not always able to deliver a solution. Whenever no depth data is available, the 8-point algorithm can restore a reasonable trajectory in comparison with the other methods. The remaining algorithms usually performed far worse than the two previously mentioned methods. The superiority of the 8-point algorithm may result from a better least-squares optimization. Also, when working with real, imperfect data, the assumptions underlying the 8-point algorithm are more realistic than the stricter assumptions of the 5-point algorithm.

It seems that the choice of the detector/descriptor has little influence compared to the error metric used in RANSAC, and there is no significant difference in performance resulting from using different detector/descriptor algorithms. The Kabsch algorithm works better with SURF in one scene, but in the other scene AKAZE and ORB perform slightly better. This shows that different scenes call for different approaches to selecting the salient point features, and that there is no perfect feature detector yet. The 8-point algorithm seems to work better with AKAZE or SURF features than with ORB. ORB is known for its fast computation and is thus often chosen for computational efficiency, but its performance in our experiments was rather disappointing. Taking into account that AKAZE is open-source software, we also advise considering this detector/descriptor pair instead of SURF. In our future work, we plan to also consider some detector/descriptor combinations, such as ORB-AKAZE, which seem to be interesting.