
1 Introduction

Egomotion estimation is an intensively studied problem in computer vision, which aims at recovering the six-degree-of-freedom (6-DoF) transformation (three for the rotation and three for the translation) of a visual sensor from an input sequence of images. It has drawn a lot of attention in numerous applications such as augmented reality, motion control and autonomous navigation [3, 4, 14, 20]. The term visual odometry (VO) was originally introduced in [18] in 2004 and denotes the process of estimating the egomotion of an agent, such as a mobile robot, using only the input of one or more cameras mounted on it. VO approaches fall into two major categories: direct methods based on pixel information such as optical flow [2, 5, 15], and feature-based indirect methods [8, 22]. Compared with the former, the feature-based indirect methods are considerably more robust because they rely on discernible feature points extracted from the images. Hence, this paper focuses on feature-based egomotion estimation, which belongs to the second category.

For a calibrated camera, only five point correspondences are needed to estimate the 6-DoF pose between two consecutive views [17], whereas seven or eight points [6] are required if the camera is not calibrated. Specifically, for a 99% probability of success on data with 50% outliers, the linear 8-point essential matrix algorithm [7] requires about 1177 RANSAC samples, whereas the 5-point essential matrix algorithm [11, 13, 23] only needs about 145 trials. Therefore, an approach using a minimal number of points is necessary to meet real-time requirements. To reduce the number of point correspondences needed between frames, reasonable hypotheses combining additional sensor data are required. Consequently, exploiting the availability of low-cost RGB-D cameras, we mainly investigate the motion of a mobile robot in indoor environments where at least one plane is present in the scene. The RGB-D camera is rigidly mounted on the mobile robot, and the robot always undergoes planar motion on the flat indoor floor, so the pitch and roll angles remain constant during the whole process. Assuming the roll and pitch angles of the camera are known, we rectify the RGB-D camera by rotating the generated point clouds so that the roll and pitch angles become approximately zero. In this case, we only need to solve a three-degree-of-freedom egomotion estimation problem, consisting of two horizontal translations and one yaw angle.
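As a sanity check on these numbers, the required trial count follows the standard RANSAC formula \(N=\log (1-p)/\log (1-(1-\epsilon )^{s})\); the short Python sketch below (ours, not part of the cited papers) reproduces the figures quoted above.

```python
import math

def ransac_trials(success_prob, outlier_ratio, sample_size):
    """Expected number of RANSAC iterations to draw at least one
    all-inlier sample with the requested success probability."""
    w = (1.0 - outlier_ratio) ** sample_size   # probability that one sample is all inliers
    return math.log(1.0 - success_prob) / math.log(1.0 - w)

# 99% confidence, 50% outliers:
print(round(ransac_trials(0.99, 0.5, 8)))  # ~1177 samples for the 8-point algorithm
print(round(ransac_trials(0.99, 0.5, 5)))  # ~145 samples for the 5-point algorithm
print(round(ransac_trials(0.99, 0.5, 2)))  # ~16 samples for a 2-point minimal solver
```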

Despite these many developments, efficient and robust egomotion estimation remains challenging in structured, low-texture scenes (e.g., walls or hallways). To address this problem, several SLAM systems using higher-level geometric features such as lines and planes have been proposed recently [1, 9, 10, 12]. Kim et al. [10] estimated a drift-free rotation by applying a mean-shift algorithm to the surface normal vector distribution; however, this method requires at least two orthogonal planes to demonstrate superior rotation estimation. Kaess [9] introduced a minimal representation of planar features from a hand-held RGB-D sensor and presented a relative plane formulation that improves convergence for faster pose optimization. However, this method requires plane extraction and matching at every frame to construct the optimization function, and additional odometry sensors are utilized to perform plane matching, which increases the complexity of the VO system.

In contrast to these methods, in this paper we obtain a rough plane-segmentation result using only an RGB-D frame, which is fast enough for real-time applications, instead of segmenting the scene into very precise planar regions. We directly estimate the subsequent egomotion by extracting the normal of each plane from inverse-depth-induced histograms, and we adopt different strategies according to the prior knowledge about the 3D scene. The major contributions of this work are twofold. First, if at least one vertical plane exists, we estimate the pose and location from the normal of the vertical plane and one point correspondence, which we call the plane-plus-point algorithm. Second, if the plane orientation is not available at all, we propose an efficient 2-point minimal-case algorithm for the homography-based method to estimate the egomotion. Compared with the classical 5-point essential matrix estimation [17], our method needs only two matched points between views instead of five, which speeds up the iterative optimization for egomotion estimation. Finally, we evaluate our algorithms on both synthetic and real datasets.

The rest of the paper is organized as follows. Section 2 describes two efficient algorithms for estimating the egomotion under the weak Manhattan-world assumption. Section 3 evaluates the performance of our solutions on synthetic and real data and compares them quantitatively with the classical method. Finally, conclusions are drawn in Sect. 4.

2 Ego-Motion Estimation

In VO, the estimation of rotation deserves more attention than the estimation of translation, which is comparatively simple: the majority of experimental errors derive from inaccurate rotation estimates, so reducing the accumulated error caused by rotation can greatly improve the performance of the algorithm. In this work, if at least one vertical plane exists in the scene, we estimate the motion by decoupling rotation and translation, so that a drift-free rotation estimate can be derived from the alignment of local Manhattan frames. Based on this accurate rotational motion, we obtain a robust estimate of the translation with a 1-point RANSAC approach. In addition, we propose a new 2-point minimal-case algorithm with a simplified motion model for the case when no vertical plane is available.

2.1 The Plane-Plus-Point Algorithm

We design a fast plane segmentation method that uses only the depth image of an RGB-D camera: we extract inverse-depth-induced horizontal and vertical histograms to detect planes instead of segmenting the huge point clouds directly. We view the whole indoor scene as a composition of one or several local Manhattan structures, so at any time we can recognize at least one local Manhattan coordinate frame from the detected vertical planes. The pose estimation is simplified once the vertical direction in the scene is known. By aligning the measured vertical direction with the camera coordinate system, the y-axis of the camera becomes parallel to the vertical planes while the x-z-plane of the camera becomes parallel to the ground plane (illustrated in Fig. 1). This alignment reduces the relative motion to a 3-DoF motion, consisting of 1 DoF for the remaining rotation and 2 DoF for the translation (i.e., a 3D translation vector up to scale).
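For illustration, the alignment step can be realized with the standard two-vector Rodrigues construction; the following sketch is our own and assumes the measured vertical direction comes from the detected ground-plane normal (or an IMU).

```python
import numpy as np

def align_to_vertical(g):
    """Rotation that maps the measured up-direction g onto the y-axis [0, 1, 0]."""
    g = np.asarray(g, dtype=float)
    g = g / np.linalg.norm(g)
    y = np.array([0.0, 1.0, 0.0])
    v = np.cross(g, y)                  # rotation axis (unnormalized)
    s, c = np.linalg.norm(v), np.dot(g, y)
    if s < 1e-12:                       # already aligned, or pointing straight down
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    # Rodrigues' formula for the rotation taking g onto y about the axis v
    return np.eye(3) + K + K @ K * ((1.0 - c) / s**2)

R_align = align_to_vertical([0.05, 0.99, -0.12])   # hypothetical measured vertical
# Applying R_align to the point cloud makes the ground plane parallel to the x-z plane.
```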

Fig. 1. Alignment of the camera with the normal of the ground plane.

Fig. 2. Rotation between two successive times.

In general, only one ground plane is detected, but multiple vertical planes, corresponding to different Cartesian coordinate frames, may be obtained at the same time. Therefore, we need to distinguish the dominant Manhattan frame from the minor ones among the candidate Cartesian coordinate frames attached to the walls. For each candidate frame, we compute an evaluation score as the area, in the RGB image, of all the vertical planes whose normals (from the depth image) are approximately parallel or perpendicular to the frame's axes. We then choose the frame with the largest score as the dominant local Manhattan frame.
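A possible implementation of this scoring step is sketched below (our own interpretation; `planes` is a hypothetical list of (unit normal, image area) pairs for the detected vertical planes, each of which defines a candidate frame).

```python
import numpy as np

def dominant_frame(planes, tol_deg=10.0):
    """Pick the candidate frame whose parallel/perpendicular vertical planes
    cover the largest image area; returns the index of the defining plane."""
    cos_tol, sin_tol = np.cos(np.radians(tol_deg)), np.sin(np.radians(tol_deg))
    best_idx, best_score = -1, -1.0
    for i, (n_i, _) in enumerate(planes):
        score = 0.0
        for n_j, area_j in planes:
            c = abs(float(np.dot(n_i, n_j)))
            if c > cos_tol or c < sin_tol:   # roughly parallel or perpendicular
                score += area_j
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx
```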

Hence, a drift-free rotation between two successive views can be obtained as soon as the dominant Manhattan frame has been determined at each moment. As shown in Fig. 2, assuming the robot observes the same Manhattan structure (frame) at two different times \(t_{i}\) and \(t_{j}\), we only need to estimate the single rotation \(R_{i,j}\) of \(C_{i}\) with respect to \(C_{j}\):

$$\begin{aligned} R_{i,j}=(R_{i}^{wc})^{T}R_{j}^{wc} \end{aligned}$$
(1)

where \(C_{i}\) and \(C_{j}\) are the camera coordinate frames at the consecutive times \(t_{i}\) and \(t_{j}\), respectively, and \(R_{i}^{wc}\) and \(R_{j}^{wc}\) denote the rotations of \(C_{i}\) and \(C_{j}\) with respect to the local Manhattan coordinate frame.

Given the rotation obtained from the Manhattan-world constraints, the translation must be estimated by another means, such as a 1-point RANSAC method. We detect and match corner points in two successive RGB images and obtain the corresponding 3D points through the depth image. The translation T is then estimated from a single 3D point correspondence:

$$\begin{aligned} T=P^{'}-(R_{i,j})^{T}P \end{aligned}$$
(2)

where P and \(P^{'}\) are the 3D points in the current and previous frames, respectively.
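For illustration, the translation estimation of Eq. (2) can be wrapped in a 1-point RANSAC loop as follows (a minimal sketch of ours; the inlier threshold and iteration count are hypothetical parameters).

```python
import numpy as np

def translation_1pt_ransac(P_curr, P_prev, R_ij, thresh=0.05, n_iter=50, rng=None):
    """Estimate T from Eq. (2), T = P' - R_ij^T P, with a 1-point RANSAC.

    P_curr, P_prev: (N, 3) arrays of matched 3D points (current / previous frame).
    """
    rng = np.random.default_rng() if rng is None else rng
    P_rot = P_curr @ R_ij                # each row equals R_ij^T P for that point
    best_T, best_inliers = None, np.zeros(len(P_curr), dtype=bool)
    for _ in range(n_iter):
        k = rng.integers(len(P_curr))
        T = P_prev[k] - P_rot[k]                       # hypothesis from one correspondence
        residuals = np.linalg.norm(P_prev - P_rot - T, axis=1)
        inliers = residuals < thresh
        if inliers.sum() > best_inliers.sum():
            best_T, best_inliers = T, inliers
    # refine with all inliers (the least-squares solution is the mean offset)
    best_T = (P_prev[best_inliers] - P_rot[best_inliers]).mean(axis=0)
    return best_T, best_inliers
```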

2.2 The 2-Point Homography-Based Algorithm

In real-world indoor environments, a vertical plane is not always available; in that case the plane-plus-point algorithm proposed in the previous section does not work. We therefore propose a new 2-point minimal-case algorithm for the homography-based method, and we perform a local refinement between the current and previous planes according to the Manhattan assumption to eliminate drift whenever a vertical plane is detected again. Given \(p_{i}=[x_{i},y_{i},1]^{T}\) and \(p_{j}=[x_{j},y_{j},1]^{T}\), the normalized coordinates of a corresponding point in the first and second camera coordinate frames, the homography constraint is defined as:

$$\begin{aligned} \sigma p_{j}=Hp_{i} \end{aligned}$$
(3)

with

$$\begin{aligned} H=R-\frac{t}{d}N^{T} \end{aligned}$$
(4)

where \(\sigma \) is a scale factor, R and t are the rotation matrix and the translation vector, respectively, N is the normal vector of the 3D plane, and d is the distance from the camera to that plane. Given two 3D points in the world coordinate frame, they uniquely define a virtual vertical plane (provided they are not vertically aligned), and the unit normal vector of this virtual plane with respect to the \(i\)-th view is \(n=[n_{x},0,n_{z}]^{T}\). Let \(\frac{t}{d}=[t_x\ 0\ t_z]^T\). Since the motion is planar, the rotation is about the y-axis by an angle \(\theta \), and the homography induced by this virtual vertical plane can be written as:

$$\begin{aligned} H =R-\frac{t}{d}n^{T}={ \begin{bmatrix} \cos \theta &{} 0 &{} \sin \theta \\ 0 &{} 1 &{} 0\\ -\sin \theta &{} 0 &{} \cos \theta \end{bmatrix}}-{\begin{bmatrix} t_x\\ 0\\ t_z \end{bmatrix}}{\begin{bmatrix} n_x&{}0&{}n_z \end{bmatrix}}, \end{aligned}$$
(5)

$$\begin{aligned} H ={ \begin{bmatrix} \cos \theta - n_x t_x &{} 0 &{} \sin \theta - n_z t_x\\ 0 &{} 1 &{} 0\\ -\sin \theta - n_x t_z &{} 0 &{} \cos \theta - n_z t_z \end{bmatrix}}. \end{aligned}$$
(6)

The 3 \(\times \) 3 homography matrix in (6) thus has only four unknown elements and can be parametrized as:

$$\begin{aligned} H ={ \begin{bmatrix} h_1 &{} 0 &{} h_2\\ 0 &{} 1 &{} 0\\ h_3 &{} 0 &{} h_4 \end{bmatrix}}. \end{aligned}$$
(7)

To eliminate the scale factor in (3), we take the cross product with \(p_j\) and obtain:

$$\begin{aligned} p_j \times Hp_i = 0. \end{aligned}$$
(8)

Combining (7) and (8), we have the following relation:

$$\begin{aligned} {\begin{bmatrix} x_j\\ y_j\\ 1 \end{bmatrix}}\times { \begin{bmatrix} h_1 &{} 0 &{} h_2\\ 0 &{} 1 &{} 0\\ h_3 &{} 0 &{} h_4 \end{bmatrix}}{\begin{bmatrix} x_i\\ y_i\\ 1 \end{bmatrix}}=0. \end{aligned}$$
(9)

This gives three linear equations; however, since the cross product can be expressed as multiplication by the skew-symmetric matrix \([p_j]_\times \), whose rank is two, only two of the equations are linearly independent. Selecting two independent equations, (9) can be rearranged into:

$$\begin{aligned} {\begin{bmatrix} x_iy_j &{} y_j &{} 0 &{} 0\\ 0 &{} 0 &{} x_iy_j &{} y_j \end{bmatrix}}{\begin{bmatrix} h_1\\ h_2\\ h_3\\ h_4 \end{bmatrix}}={\begin{bmatrix} x_jy_i\\ y_i \end{bmatrix}}. \end{aligned}$$
(10)

One point correspondence gives two constraints, so \([h_1, h_2, h_3, h_4]\) can be uniquely determined from two point correspondences, provided the two points are not vertically aligned. Note that, for different pairs of point correspondences, the parameters \([d, n_x, n_z]\) of the plane determined by the two points are not the same. Once \([h_1,h_2,h_3,h_4]\) has been obtained from two point correspondences, consider the following relations:

$$\begin{aligned} \left\{ \begin{aligned} n_x t_x&=\cos \theta - h_1 ,\\ n_z t_x&=\sin \theta - h_2 ,\\ n_x t_z&=-\sin \theta - h_3 ,\\ n_z t_z&=\cos \theta - h_4 , \end{aligned} \right. \end{aligned}$$
(11)

By multiplying the first equation by the fourth one and the second equation by the third one, we obtain:

$$\begin{aligned} \left\{ \begin{aligned} n_x n_z t_x t_z&=(\cos \theta - h_1)(\cos \theta - h_4) ,\\ n_x n_z t_x t_z&=(\sin \theta - h_2)(-\sin \theta - h_3) , \end{aligned} \right. \end{aligned}$$
(12)

The left-hand sides of the two equations in (12) are identical; therefore, the following relation is obtained by equating the right-hand sides:

$$\begin{aligned} (h_1+h_4)\cos \theta + (h_2-h_3)\sin \theta +h_2h_3-h_1h_4-1=0, \end{aligned}$$
(13)

with:

$$\begin{aligned} \sin ^2 \theta +\cos ^2 \theta = 1. \end{aligned}$$
(14)

Using (13) and (14), we can compute \(\cos \theta \) and \(\sin \theta \), which yields two possible solutions. The rotation of the camera motion can then be derived directly from \(\cos \theta \) and \(\sin \theta \).
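One way to solve (13) together with (14) is to rewrite the left-hand side of (13) as \(r\cos (\theta -\phi )\); the sketch below (our own formulation) returns the two yaw candidates.

```python
import numpy as np

def solve_theta(h1, h2, h3, h4):
    """Two yaw candidates satisfying Eqs. (13) and (14)."""
    a, b = h1 + h4, h2 - h3
    c = h2 * h3 - h1 * h4 - 1.0
    # a*cos(theta) + b*sin(theta) = r*cos(theta - phi)
    r, phi = np.hypot(a, b), np.arctan2(b, a)
    delta = np.arccos(np.clip(-c / r, -1.0, 1.0))
    return phi + delta, phi - delta
```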

The normal vector can then be obtained by dividing the first equation of (11) by the second one:

$$\begin{aligned} \frac{n_x}{n_z} = \frac{\cos \theta -h_1}{\sin \theta -h_2}, \end{aligned}$$
(15)

with:

$$\begin{aligned} n_x^2 + n_z^2 = 1. \end{aligned}$$
(16)

Finally, the translation up to scale is given by:

$$\begin{aligned} t =d{ \begin{bmatrix} \dfrac{\cos \theta - h_1}{n_x}&0&\dfrac{\cos \theta - h_4}{n_z} \end{bmatrix}^T}. \end{aligned}$$
(17)
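Putting the pieces together, the following end-to-end sketch of the 2-point solver is our own illustration of Eqs. (10)-(17); it assumes normalized image coordinates and a non-degenerate configuration (the two points not vertically aligned and \(n_x, n_z \ne 0\)), and it does not disambiguate the two candidate solutions.

```python
import numpy as np

def two_point_homography(pi1, pj1, pi2, pj2):
    """Minimal 2-point solver sketch: returns candidate (theta, n, t/d) tuples
    for two normalized correspondences (xi, yi) <-> (xj, yj)."""
    # Stack the two rows of Eq. (10) per correspondence into a 4x4 system for h1..h4.
    A, b = [], []
    for (xi, yi), (xj, yj) in [(pi1, pj1), (pi2, pj2)]:
        A.append([xi * yj, yj, 0.0, 0.0]); b.append(xj * yi)
        A.append([0.0, 0.0, xi * yj, yj]); b.append(yi)
    h1, h2, h3, h4 = np.linalg.solve(np.array(A), np.array(b))
    # Two yaw candidates from Eqs. (13)-(14), as in the previous snippet.
    r = np.hypot(h1 + h4, h2 - h3)
    phi = np.arctan2(h2 - h3, h1 + h4)
    delta = np.arccos(np.clip(-(h2 * h3 - h1 * h4 - 1.0) / r, -1.0, 1.0))
    solutions = []
    for theta in (phi + delta, phi - delta):
        c, s = np.cos(theta), np.sin(theta)
        n = np.array([c - h1, 0.0, s - h2])                     # Eqs. (15)-(16), up to sign
        n /= np.linalg.norm(n)
        t = np.array([(c - h1) / n[0], 0.0, (c - h4) / n[2]])   # Eq. (17), up to scale d
        solutions.append((theta, n, t))
    return solutions
```

In practice, the physically consistent candidate can be selected with additional correspondences, and the solver is embedded in a 2-point RANSAC loop as described in Sect. 3.1.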

3 Experiments

To evaluate the performance of the proposed egomotion estimation method, both synthetic data and real-world data are used in the experiments. The synthetic data, with ground truth, are used to compare our 2-point algorithm with another minimal solution, the 5pt-essential method [17]. The real-world datasets cover two scenes, one in a laboratory building and the other in a dormitory building. Each scene satisfies the weak Manhattan constraints and is captured with a Kinect v2 mounted on a robot.

3.1 Test with Synthetic Data

The synthetic data sets are generated with the following setup. The scene contains 500 randomly sampled 3D points in total, and the focal length of the camera is set to 1000 pixels with a field of view of 50\(^\circ \). The average distance from the first camera to the scene is set to 1, and the baseline between the two cameras is set to 15% of the average scene distance. Since we focus on indoor robot motion estimation, the second camera is rotated around the \(y\)-axis with the relative rotation angle varying from \(-15^\circ \) to \(15^\circ \). The translation is set parallel to the ground, and the moving direction covers two cases: along the x-axis (sideways) and along the z-axis (forward). This is similar to Nistér's test scene in [19], which has also been used in [21].
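The following sketch illustrates one way to generate such a two-view configuration (our own code, not the authors'; the pose convention \(X_{2}=R^{T}(X_{1}-t)\) and the sampling box are assumptions).

```python
import numpy as np

def make_synthetic_pair(n_pts=500, yaw_deg=10.0, forward=True, noise_px=0.5,
                        focal=1000.0, baseline=0.15, seed=0):
    """Two-view synthetic scene with planar camera motion and pixel noise,
    roughly matching the setup described in the text."""
    rng = np.random.default_rng(seed)
    # 3D points at an average distance of ~1 from the first camera
    X = rng.uniform([-0.4, -0.4, 0.7], [0.4, 0.4, 1.3], size=(n_pts, 3))
    th = np.radians(yaw_deg)
    R = np.array([[np.cos(th), 0.0, np.sin(th)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(th), 0.0, np.cos(th)]])            # rotation about the y-axis
    t = baseline * (np.array([0.0, 0.0, 1.0]) if forward      # forward (z) or sideways (x)
                    else np.array([1.0, 0.0, 0.0]))
    def project(P):                                           # pinhole projection + noise
        uv = focal * P[:, :2] / P[:, 2:3]
        return uv + rng.normal(scale=noise_px, size=uv.shape)
    x1 = project(X)                                           # first view
    x2 = project((X - t) @ R)                                  # second view: R^T (X - t)
    return x1, x2, R, t
```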

As the estimated translation is only determined up to a scale factor, we measure the angle between the ground-truth and estimated translation vectors. The errors are defined as follows:

$$\begin{aligned} \left\{ \begin{aligned} \xi _R&=\left| \theta _g -\theta _e \right| ,\\ \xi _t&=\arccos ((t_g^T t_e)/(\Vert t_g \Vert \Vert t_e \Vert )), \end{aligned} \right. \end{aligned}$$
(18)
Fig. 3. Rotation and translation error for forward motion, comparing the 5pt-essential matrix algorithm with our 2pt-homography method. Left column: rotation error; right column: translation error. (a) Varying image noise. (b) Increasing pitch noise with 0.5 pixel standard deviation image noise. (c) Increasing roll noise with 0.5 pixel standard deviation image noise.

Fig. 4. Rotation and translation error for sideways motion, comparing the 5pt-essential matrix algorithm with our 2pt-homography method. Left column: rotation error; right column: translation error. (a) Varying image noise. (b) Increasing pitch noise with 0.5 pixel standard deviation image noise. (c) Increasing roll noise with 0.5 pixel standard deviation image noise.

where \(\xi _R\) is the rotation error and \(\xi _t\) is the translation error; these error metrics are similar to those in [17]. Here \(\theta _g\) and \(t_g\) denote the ground-truth rotation angle and translation, and \(\theta _e\) and \(t_e\) are the corresponding estimated rotation angle and translation vector, respectively.
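For concreteness, the error metrics of (18) can be computed as follows (a small sketch of ours; angles are in degrees and \(\xi _R\) is wrapped into \([0,180]\)).

```python
import numpy as np

def rotation_error_deg(theta_g, theta_e):
    """xi_R = |theta_g - theta_e|, wrapped into [0, 180] degrees."""
    d = abs(theta_g - theta_e) % 360.0
    return min(d, 360.0 - d)

def translation_error_deg(t_g, t_e):
    """xi_t = angle between ground-truth and estimated translation directions."""
    c = np.dot(t_g, t_e) / (np.linalg.norm(t_g) * np.linalg.norm(t_e))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
```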

We evaluate each algorithm under image noise (on the corner locations) with different standard deviations and under increasing roll/pitch noise; the latter can be regarded as error in the normal of the ground plane. In our experiments, we assume that there are enough feature points lying on the ground: half of the points are randomly generated on the ground plane and the rest in the 3D space above it. We use the least-squares solution with all the inliers and plot the mean value over 1000 trials with different points and different transformations.

Figures 3 and 4 show the results of the 5pt-essential matrix algorithm and our 2pt-homography method. The experiments show that our 2pt-homography algorithm outperforms the state-of-the-art 5pt-essential method in terms of both rotation and translation error. The 5pt-essential method appears more sensitive to pitch/roll noise, while our 2pt-homography algorithm is more robust. Note that when the algorithm is used in a real application, a 2-point RANSAC scheme can be employed to reject outliers, and the final solution is given by the least-squares method with all the inliers.

3.2 Performance with Real Data

To demonstrate the efficiency of the proposed frame-to-frame visual odometry framework, several real datasets were collected in two different indoor corridor-like environments using an RGB-D camera mounted on a robot, as shown in Fig. 5. The scenes are full of low-textured walls, and the image resolution used is \(960\times 540\) pixels. All the experiments run at 10 FPS on an Intel Core i5-4460 desktop computer with a 3.20 GHz CPU, without GPU acceleration.

Fig. 5. (a) The laboratory building. (b) The dormitory building. First column: Example images in the school building. Second column: Reconstructed point cloud.

Fig. 6. Comparison between the ORB-RGBD SLAM and our proposed frame-to-frame visual odometry framework. (a) Results on the laboratory building. (b) Results on the dormitory building. First column: Trajectories of ORB-RGBD SLAM. Second column: Trajectories of our method.

We perform a comparison with the state-of-the-art ORB-SLAM2 [16]. As can be seen in Fig. 6, ORB-SLAM2 fails to complete the entire image sequence because of the lack of features on the low-textured walls, and it subsequently performs a false relocalization. Our method achieves better results: although it uses only frame-to-frame camera pose estimation, its performance is comparable to that of algorithms employing non-linear refinement or loop-closure detection. The overall errors of our proposed method on the laboratory building and the dormitory building are only \(1.21\%\) and \(1.33\%\), respectively.

4 Conclusion

In this paper, a new method for accurate egomotion estimation in indoor Manhattan-like scenes using an RGB-D camera is proposed. The proposed method differs from previous algorithms by using directions and points jointly to estimate the pose. First, if at least one vertical plane is available, the vertical planes are detected from the RGB-D data, the normal of the vertical plane is obtained directly from the inverse-depth-induced histograms, and the pose is estimated through a novel 3-DoF VO. Second, we propose a new minimal-case algorithm to estimate the egomotion when the plane orientation is completely unknown. Finally, we propose a frame-to-frame visual odometry framework based on these algorithms. Experiments with synthetic and real data validate that the proposed methods are comparable or even superior to state-of-the-art algorithms while maintaining high efficiency under planar motion. Our method has so far been tested in indoor scenes with an RGB-D camera; in future work, we will implement the proposed algorithm with other sensors and possibly extend it to different environments.