
1 Introduction

Over the past years, Simultaneous Localization and Mapping (SLAM) has received extensive research interest because it serves as a basic methodology for robots moving autonomously in an unknown environment. Current SLAM methods have been shown to achieve accurate mapping over extended periods of time, especially when using laser sensors, with well-bounded results in both indoor and outdoor environments [1].

Since vision systems are low cost, lightweight, and provide rich information compared to traditional robotic sensors such as laser scanners or sonars, vision-based systems are employed in a wide range of robotic applications. These applications include object recognition, obstacle avoidance, navigation, topological global localization and, more recently, simultaneous localization and mapping, in which case it is called visual SLAM.

Visual SLAM approaches can be distinguished as either monocular or binocular. The first remarkable work in monocular visual SLAM was done by Davison et al. [2], in which a single camera is used under the Extended Kalman filter (EKF) framework. Since a camera is a bearing-only sensor, a crucial limitation of monocular SLAM is the unobservability of the scale, which causes the scale of the map to drift slowly in large environments. On the other hand, stereo vision systems are often applied because absolute measurements of 3D space, especially the feature depths, can be estimated directly to avoid the scale ambiguity.

In stereo systems, the observation pair (u_l, v_l, d) contains both bearing and range information, where (u_l, v_l) is the left image coordinate and d is the disparity. By projecting this observation pair through the pinhole model one can obtain the 3D Euclidean XYZ position of a landmark relative to the camera [3]. Many stereo SLAM systems, both indoor and outdoor, build the map using this 3D Euclidean representation as the landmark model [3–6].

The standard pinhole projection equations used in vision systems under the EKF framework suffer from nonlinearity [7, 8]. Due to this nonlinearity, the true uncertainty of a 3D Euclidean XYZ landmark can be modeled as Gaussian only for nearby features. The true distributions of faraway features are non-Gaussian, which makes the EKF estimate inconsistent [9]. Other work tried a UKF (Unscented Kalman Filter) for a stereo system on an unmanned aerial vehicle [10] to address the nonlinearity issue, but a better way is to find a landmark model with a high degree of linearity. Therefore, Montiel et al. [11] proposed an inverse depth parameterization for the landmark model. The key concept is to directly parameterize the inverse depth of features relative to the camera locations from which they were first viewed. This parameterization achieves a high degree of linearity; furthermore, features are initialized with no delay, and both near and distant features can be estimated successfully.

The drawback of the inverse depth parameterization is that the 6-D state vector representation is computationally intensive. Therefore, Civera et al. [12] proposed a linearity index by which the inverse depth representation can be safely converted to the Euclidean XYZ form once the depth estimate of a feature has converged. The speed of convergence is therefore important.

The most closely related work is by Paz et al. [13], who proposed a binocular stereo-based EKF SLAM. They combine both the inverse depth and the Euclidean XYZ parameterizations in the map, which alleviates the nonlinearity issue effectively. The system can map both near and far features with proper uncertainty distributions, and it can be used in large-scale outdoor environments.

However, the observation model in previous studies is bearing-only. When a 3D feature is acquired simultaneously in the left and right camera images, the stereo system is treated as two bearing-only observers. At each instance, an EKF update is performed once for the left camera and once for the right camera without using any range information. Thus, range information such as disparity does not directly contribute to the estimation of camera poses and map spatial locations. In contrast, we wish to incorporate not just bearing but also range information.

In this research, we focus on using the EKF to solve the stereo-based SLAM problem. It is essential to find an appropriate probabilistic model for the observations of a stereo camera that still preserves the linearity property. Thus, our contribution is to derive a new stereo observation model that incorporates the inverse depth parameterization with the observation pair (u_l, v_l, d). In this way, both range and bearing information can be injected directly into the EKF estimation process to handle the nonlinear projection issue. In order to develop the new observation model, two Jacobians need to be derived for the EKF framework: the first in the feature initialization step and the second in the feature prediction step. To the best of our knowledge, this is the first time this observation model has been proposed in stereo-based EKF SLAM.

2 Stereo-Based EKF SLAM System Models

2.1 State Vector Definition

Following the standard EKF-based approach to SLAM, the system state vector x consists of the current estimated camera pose and the physical locations of the features. x changes size dynamically as features are added to or deleted from the map.

$$ \mathbf{x} = (\mathbf{x}_C, \mathbf{y}_1, \ldots, \mathbf{y}_i, \ldots, \mathbf{y}_n)^T $$
(1)

The camera state \( x_C \) is composed of the position \( r^{WC} \) with respect to a world reference frame W, the quaternion \( q^{WC} \) for orientation, and the linear and angular velocities \( v^{W} \) and \( \omega^{C} \) relative to the world frame W and the camera frame C, respectively.

$$ \mathbf{x}_C = (\mathbf{r}^{WC} \quad \mathbf{q}^{WC} \quad \mathbf{v}^{W} \quad \boldsymbol{\omega}^{C})^T $$
(2)

The feature yi is defined by the inverse depth parameterization [11] using a 6-D state vector:

$$ \mathbf{y}_i = (x_i \quad y_i \quad z_i \quad \theta_i \quad \phi_i \quad \rho_i)^T $$
(3)

The vector \( y_i \) encodes the ray from the camera position at which the feature was first observed: \( x_i, y_i, z_i \) is the camera optical center, and \( \theta_i, \phi_i \) are the azimuth and elevation (coded in the world frame) that define the unit directional vector \( m(\theta_i, \phi_i) \). The feature point's depth \( d_i \) along the ray is encoded by its inverse \( \rho_i = 1/d_i \).

yi models a three-dimensional point located at

$$ \begin{pmatrix} X_i \\ Y_i \\ Z_i \end{pmatrix} = \begin{pmatrix} x_i \\ y_i \\ z_i \end{pmatrix} + \frac{1}{\rho_i}\, m(\theta_i, \phi_i) $$
(4)
$$ m(\theta_i, \phi_i) = (\cos\phi_i \sin\theta_i,\; -\sin\phi_i,\; \cos\phi_i \cos\theta_i)^T $$
(5)
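To make the landmark model concrete, the following is a minimal sketch (not the authors' code) of Eqs. (4)–(5) in Python/NumPy: it recovers the Euclidean XYZ point encoded by an inverse depth feature \( y_i = (x_i, y_i, z_i, \theta_i, \phi_i, \rho_i) \). All function names are ours.

```python
import numpy as np

def unit_direction(theta, phi):
    """Unit ray direction m(theta, phi) in the world frame, Eq. (5)."""
    return np.array([np.cos(phi) * np.sin(theta),
                     -np.sin(phi),
                     np.cos(phi) * np.cos(theta)])

def inverse_depth_to_xyz(y_i):
    """Eq. (4): camera optical center plus (1/rho) times the ray direction."""
    x, y, z, theta, phi, rho = y_i
    return np.array([x, y, z]) + (1.0 / rho) * unit_direction(theta, phi)
```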

2.2 Motion Model

The motion model in this work describes ego motion with 6 DOF. The camera orientation is represented by a quaternion, which avoids the gimbal lock issue of Euler angles. The camera is assumed to move with constant linear and angular velocity, subject to zero-mean Gaussian acceleration noise \( n = (a^{W} \;\; \alpha^{C})^T \). At each step, there is an impulse of linear velocity \( V^{W} = a^{W} \Delta t \) and angular velocity \( \Omega^{C} = \alpha^{C} \Delta t \), with zero mean and known Gaussian distribution.

$$ x_{C_{k+1}} = \begin{pmatrix} r_{k+1}^{WC} \\ q_{k+1}^{WC} \\ v_{k+1}^{W} \\ \omega_{k+1}^{C} \end{pmatrix} = f_v(x_{C_k}, n) = \begin{pmatrix} r_k^{WC} + (v_k^{W} + V^{W})\,\Delta t \\ q_k^{WC} \times q\big((\omega_k^{C} + \Omega^{C})\,\Delta t\big) \\ v_k^{W} + V^{W} \\ \omega_k^{C} + \Omega^{C} \end{pmatrix} $$
(6)

where \( q\big((\omega_k^{C} + \Omega^{C})\,\Delta t\big) \) is the quaternion defined by the rotation vector \( (\omega_k^{C} + \Omega^{C})\,\Delta t \).
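As a hedged illustration of Eq. (6), the sketch below implements one constant-velocity prediction step. The quaternion convention (w, x, y, z), the helper names, and the explicit renormalization of the quaternion are our own choices, not taken from the paper.

```python
import numpy as np

def quat_from_rotvec(w_dt):
    """Quaternion q(omega*dt) for a rotation vector (already multiplied by dt)."""
    angle = np.linalg.norm(w_dt)
    if angle < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = w_dt / angle
    return np.concatenate(([np.cos(angle / 2.0)], np.sin(angle / 2.0) * axis))

def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([aw*bw - ax*bx - ay*by - az*bz,
                     aw*bx + ax*bw + ay*bz - az*by,
                     aw*by - ax*bz + ay*bw + az*bx,
                     aw*bz + ax*by - ay*bx + az*bw])

def motion_model(r, q, v, w, dt, V=np.zeros(3), W=np.zeros(3)):
    """Eq. (6): V and W are the linear/angular velocity impulses of this step."""
    r_new = r + (v + V) * dt
    q_new = quat_mul(q, quat_from_rotvec((w + W) * dt))
    return r_new, q_new / np.linalg.norm(q_new), v + V, w + W
```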

2.3 Stereo Observation Model

In this work, we define a new nonlinear function \( h(x_C, y_i) \), which predicts the value of the measurement \( \hat{z}_i \) given the current camera pose estimate \( x_C \) and the ith feature \( y_i \) in the map. The observation model can be written in general form as:

$$ \hat{z}_i = \begin{bmatrix} u_{li} \\ v_{li} \\ d_i \end{bmatrix} = h(x_C, y_i) + w_{u_i v_i d_i} $$
(7)

The vector \( \hat{z}_i \) is an observation of the feature \( y_i \) relative to the camera pose \( x_C \), where \( w_{u_i v_i d_i} \) is a vector of uncorrelated observation errors with zero-mean Gaussian noise and covariance matrix \( R_{u_i v_i d_i} \).

The observation model can be divided into three steps. In step 1, through the inverse depth parameterization, feature \( y_i \) is transformed to the Euclidean XYZ landmark representation with respect to the world reference frame W:

$$ \begin{pmatrix} X_i \\ Y_i \\ Z_i \end{pmatrix} = \begin{pmatrix} x_i \\ y_i \\ z_i \end{pmatrix} + \frac{1}{\rho_i}\, m(\theta_i, \phi_i) $$
(8)

In step 2, the Euclidean XYZ point is transformed into the camera reference frame C using the camera pose \( x_C \):

$$ h^{C} = \begin{bmatrix} h_x^{C} \\ h_y^{C} \\ h_z^{C} \end{bmatrix} = (R^{WC})^T \left( \begin{pmatrix} X_i \\ Y_i \\ Z_i \end{pmatrix} - r^{WC} \right) $$
(9)

In step 3, using the typical pinhole camera model [3], we project this 3D position \( h^{C} \) to its expected image coordinates and compute its expected disparity in the new view:

$$ \hat{z}_i = \begin{bmatrix} \hat{u}_{li} \\ \hat{v}_{li} \\ \hat{d}_i \end{bmatrix} = \begin{bmatrix} \dfrac{h_x^{C}}{h_z^{C}} f + C_x \\[4pt] \dfrac{h_y^{C}}{h_z^{C}} f + C_y \\[4pt] \dfrac{f b}{h_z^{C}} \end{bmatrix} $$
(10)

where \( (C_x, C_y) \) is the camera center in pixels, f is the focal length, and b is the stereo baseline.
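The three steps above can be summarized in a short sketch, assuming the `inverse_depth_to_xyz` helper from the sketch in Sect. 2.1; `R_wc` stands for the rotation matrix of \( q^{WC} \), and f, cx, cy, b are the calibration values. All names are our own, not the authors'.

```python
import numpy as np

def observe_stereo(r_wc, R_wc, y_i, f, cx, cy, b):
    """Predict (u_l, v_l, d) for feature y_i seen from camera pose (r_wc, R_wc)."""
    p_w = inverse_depth_to_xyz(y_i)        # step 1, Eq. (8)
    h_c = R_wc.T @ (p_w - r_wc)            # step 2, Eq. (9)
    u_l = f * h_c[0] / h_c[2] + cx         # step 3, Eq. (10)
    v_l = f * h_c[1] / h_c[2] + cy
    d = f * b / h_c[2]
    return np.array([u_l, v_l, d])
```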

3 The Estimation Process of Stereo-Based EKF SLAM

3.1 The Prediction Step

In the prediction stage, the camera motion model of Eq. (6) is used to produce a state prediction from the previous state. Since the camera motion model only propagates the camera part of the previous state, the map states are left unchanged:

$$ x(k+1 \mid k) = f_v\big(x_C(k \mid k), n\big) $$
(11)

For the covariance prediction, the state-augmentation method [14] is used, which yields an optimal SLAM estimate with computation reduced from cubic to linear complexity, and has the form:

$$ P(k+1 \mid k) = F P_{xx}(k \mid k) F^T + G Q_k G^T $$
(12)

where \( F = \frac{\partial f_v}{\partial x_C} \) is the Jacobian of \( f_v(\cdot) \) evaluated at the estimate \( x_C(k \mid k) \), \( Q_k \) is the Gaussian noise covariance, and \( G = \frac{\partial f_v}{\partial n} \) is the Jacobian of \( f_v(\cdot) \) with respect to the noise n.
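A compact sketch of the covariance prediction is given below. Eq. (12) writes only the camera block; the cross-covariance rows and columns are propagated with the standard EKF-SLAM step, and the camera state dimension of 13 (3 position, 4 quaternion, 3 linear velocity, 3 angular velocity) is our assumption based on Eq. (2).

```python
import numpy as np

def predict_covariance(P, F, G, Q, n_cam=13):
    """Eq. (12) for the camera block plus the usual cross-covariance update."""
    P = P.copy()
    Pxx_new = F @ P[:n_cam, :n_cam] @ F.T + G @ Q @ G.T
    Pxm_new = F @ P[:n_cam, n_cam:]
    P[:n_cam, :n_cam] = Pxx_new
    P[:n_cam, n_cam:] = Pxm_new
    P[n_cam:, :n_cam] = Pxm_new.T
    return P
```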

3.2 The Update Step

In the update step, an observation \( z_i(k+1) = (u_{li}, v_{li}, d_i) \) of the ith feature becomes available. The stereo observation model of Eq. (7) is used to form an observation prediction \( \hat{z}_i(k+1 \mid k) \) and the innovation \( v_i(k+1) \):

$$ \hat{z}_i(k+1 \mid k) = h\big(x_C(k+1 \mid k),\, y_i(k \mid k)\big) $$
(13)
$$ v_i(k+1) = z_i(k+1) - \hat{z}_i(k+1 \mid k) $$
(14)

and then, one can calculate the innovation covariance matrix:

$$ S_i(k+1) = H P(k+1 \mid k) H^T + R_{u_i v_i d_i}(k+1) $$
(15)

where H is the Jacobian of \( h(\cdot) \) evaluated at \( x_C(k+1 \mid k) \) and \( y_i(k \mid k) \), and the Kalman gain \( K_i(k+1) \) can be obtained as

$$ K_i(k+1) = P(k+1 \mid k)\, H^T S_i(k+1)^{-1} $$
(16)

The observation \( z_i(k+1) \) is used to update the predictions and form a new estimate of the state by using the standard EKF update equations:

$$ x(k+1 \mid k+1) = x(k+1 \mid k) + K_i(k+1)\, v_i(k+1) $$
(17)
$$ P(k+1 \mid k+1) = P(k+1 \mid k) - K_i(k+1)\, H P(k+1 \mid k) $$
(18)

The main focus here is the innovation \( v_i(k+1) \) in Eq. (14), which represents the difference between the actual sensor measurement \( z_i(k+1) \) and the predicted measurement \( \hat{z}_i(k+1 \mid k) \), both containing range and bearing information. This means that when the innovation is multiplied by the Kalman gain \( K_i(k+1) \), both bearing and range information optimize the state estimate directly.
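For reference, a minimal EKF update for a single feature, following Eqs. (13)–(18), looks as follows. Here `z` is the measured (u_l, v_l, d), `z_hat` the prediction of Eq. (13), `H` the Jacobian of h(·) with respect to the full state, and `R` the 3 × 3 measurement noise covariance; this is a sketch, not the authors' implementation.

```python
import numpy as np

def ekf_update(x, P, z, z_hat, H, R):
    nu = z - z_hat                          # innovation, Eq. (14)
    S = H @ P @ H.T + R                     # innovation covariance, Eq. (15)
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain, Eq. (16)
    x_new = x + K @ nu                      # state update, Eq. (17)
    P_new = P - K @ H @ P                   # covariance update, Eq. (18)
    return x_new, P_new
```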

3.3 Landmark Initialization

The initialization process includes both the feature's initial state values and its covariance assignment. We use the inverse depth parameterization to represent the feature's initial values and derive the feature initialization model \( g\big(r^{WC}_{(k+1|k+1)}, q^{WC}_{(k+1|k+1)}, z_{i(k+1|k+1)}\big) \), which describes the initial values in terms of the current camera pose \( r^{WC}_{(k+1|k+1)} \), \( q^{WC}_{(k+1|k+1)} \) and a new sensor observation pair \( z_{i(k+1|k+1)} = (u_{li}, v_{li}, d_i) \):

$$ y_i = g\big(r^{WC}_{(k+1|k+1)},\, q^{WC}_{(k+1|k+1)},\, z_{i(k+1|k+1)}\big) = (x_i \quad y_i \quad z_i \quad \theta_i \quad \phi_i \quad \rho_i)^T $$
(19)

The end-point of the projection ray is taken from the camera location estimate:

$$ (x_i \quad y_i \quad z_i)^T = r^{WC}_{(k+1|k+1)} $$
(20)

The feature spatial location vector (in the camera reference frame) is computed from the observation pair \( z_i = (u_{li} \;\; v_{li} \;\; d_i)^T \) by rearranging the stereo observation model of Eq. (10):

$$ h^{C} = \begin{pmatrix} (u_{li} - C_x)\,\dfrac{b}{d_i} \\[4pt] (v_{li} - C_y)\,\dfrac{b}{d_i} \\[4pt] \dfrac{f b}{d_i} \end{pmatrix} $$
(21)

where \( (u_{li}, v_{li}) \) is the pixel location in the left image and \( d_i \) is the horizontal disparity.

The inverse depth prior \( \rho_i \) can be computed from \( h^{C} \):

$$ \rho_i = \frac{1}{\lVert h^{C} \rVert} $$
(22)

Using the current camera orientation estimate from the state vector, \( h^{C} \) can be transformed into the world reference frame, and the azimuth and elevation angles are extracted as:

$$ h^{W} = R^{WC}\big(q^{WC}_{(k+1|k+1)}\big)\, h^{C} $$
(23)
$$ \begin{pmatrix} \theta_i \\ \phi_i \end{pmatrix} = \begin{pmatrix} \arctan\!\big(h_x^{W},\, h_z^{W}\big) \\[4pt] \arctan\!\Big(-h_y^{W},\, \sqrt{(h_x^{W})^2 + (h_z^{W})^2}\Big) \end{pmatrix} $$
(24)

The newly initialized feature \( y_i = (x_i \;\; y_i \;\; z_i \;\; \theta_i \;\; \phi_i \;\; \rho_i)^T \) is added to the state vector \( x(k+1 \mid k+1) \).
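A hedged sketch of the initialization function g(·) of Eqs. (19)–(24) is given below: it builds a new inverse depth feature from the current camera pose and one (u_l, v_l, d) observation. `R_wc` stands for the rotation matrix of \( q^{WC}_{(k+1|k+1)} \); all names are ours.

```python
import numpy as np

def init_feature(r_wc, R_wc, u_l, v_l, d, f, cx, cy, b):
    """Return y_i = (x, y, z, theta, phi, rho) for a new (u_l, v_l, d) observation."""
    h_c = np.array([(u_l - cx) * b / d,                      # Eq. (21): back-projection
                    (v_l - cy) * b / d,
                    f * b / d])
    rho = 1.0 / np.linalg.norm(h_c)                          # Eq. (22): inverse depth prior
    h_w = R_wc @ h_c                                         # Eq. (23): rotate into world frame
    theta = np.arctan2(h_w[0], h_w[2])                       # Eq. (24): azimuth
    phi = np.arctan2(-h_w[1], np.hypot(h_w[0], h_w[2]))      # Eq. (24): elevation
    return np.concatenate((r_wc, [theta, phi, rho]))         # Eq. (20) plus angles and rho
```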

In order to model the uncertainty of the newly initialized feature, we derive the Jacobian matrices of the function in Eq. (19) and use first-order error propagation to approximate the distributions of the variables in Eq. (19) as multivariate Gaussians. The covariance matrix of the new feature is:

$$ P_{y_i y_i}(k+1 \mid k+1) = J_{x_C} P_{xx} J_{x_C}^T + J_R R J_R^T $$
(25)

R includes \( \sigma_u, \sigma_v, \sigma_d \), which represent the pixel uncertainties of the image location \( (u_{li}, v_{li}) \) and the disparity \( d_i \). In our experiments we use \( \sigma_u = 1 \) pixel, \( \sigma_v = 1 \) pixel, and \( \sigma_d = 1 \) pixel. Since R is the error covariance describing the noisy measurements of the stereo system, this uncertainty is propagated through \( J_R \) to the new feature \( y_i \) in the landmark model space; \( J_R \) is the Jacobian of \( g(\cdot) \) with respect to the observation pair \( z_i \). \( P_{xx} \) is the camera pose covariance matrix, representing the current pose estimation uncertainty; this uncertainty is propagated through \( J_{x_C} \) to the new feature \( y_i \), where \( J_{x_C} \) is the Jacobian of \( g(\cdot) \) with respect to \( r^{WC}_{(k+1|k+1)} \) and \( q^{WC}_{(k+1|k+1)} \).
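As an illustration of the first-order propagation in Eq. (25), the sketch below assembles the new feature covariance; the Jacobians are approximated numerically by central differences here, rather than derived analytically as in the paper, and all names are our own.

```python
import numpy as np

def numerical_jacobian(fun, x0, eps=1e-6):
    """Central-difference Jacobian of fun at x0 (a stand-in for the analytic Jacobians)."""
    y0 = fun(x0)
    J = np.zeros((y0.size, x0.size))
    for j in range(x0.size):
        dx = np.zeros_like(x0)
        dx[j] = eps
        J[:, j] = (fun(x0 + dx) - fun(x0 - dx)) / (2.0 * eps)
    return J

def feature_init_covariance(J_xc, P_xx, J_R, sigma_u=1.0, sigma_v=1.0, sigma_d=1.0):
    """Eq. (25): propagate pose and measurement uncertainty to the new feature."""
    R = np.diag([sigma_u**2, sigma_v**2, sigma_d**2])
    return J_xc @ P_xx @ J_xc.T + J_R @ R @ J_R.T
```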

4 Experimental Results

4.1 Analysis of Landmark Uncertainty

In order to validate that our proposed observation model describes the uncertainty of 3D points accurately, we have simulated an experiment where the true uncertainty of the landmark location (derived from a Monte Carlo simulation) is compared to the estimated uncertainty from Eq. (25) and the traditional Euclidean XYZ landmark model.

The actual intrinsic parameters of the stereo camera, such as the baseline, are accounted for in the simulation. The origin of the left camera is set as the reference frame, with the principal axis pointing along Z and the X axis pointing to the right.

Consider a landmark point in front of the left camera at a distance of 70 m along the Z axis. A Monte Carlo simulation was performed by drawing 10,000 samples from the Gaussian distributions of \( u_{li} \), \( v_{li} \), and \( d_i \) (assuming standard deviations of \( \sigma_u = \sigma_v = 1 \) and \( \sigma_d = 2 \) pixels, respectively) and projecting them through Eq. (21), yielding 10,000 samples of the landmark 3D position (X, Y, Z). In Fig. 1 (Left), the black sample points show the true measurement uncertainty of the stereo system, and the green point shows the true position of the landmark. Next, the estimated uncertainty is computed using first-order error propagation based on our observation model, Eq. (25), shown by the enclosing red lines; the traditional Euclidean XYZ model is shown by the enclosing blue lines. One can see that the red lines enclose the true uncertainty while the blue lines do not. In Fig. 1 (Right), histograms show the uncertainty distributions: gray rectangles show the true uncertainty of the landmark location, red rectangles show the uncertainty of our proposed observation model, and blue indicates the traditional Euclidean XYZ model. The red rectangles cover the true (gray) distribution more closely.
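The sampling part of this experiment can be reproduced with a few lines of code. The following is our own reconstruction with assumed calibration values (the focal length, principal point, and baseline are placeholders, not those of the dataset); it only shows how the 10,000 back-projected samples are generated.

```python
import numpy as np

rng = np.random.default_rng(0)
f, cx, cy, b = 700.0, 672.0, 195.0, 0.57     # assumed calibration values
Z_true = 70.0                                 # landmark 70 m ahead on the Z axis
d_true = f * b / Z_true
u_true, v_true = cx, cy                       # point on the optical axis

n = 10000
u = rng.normal(u_true, 1.0, n)                # sigma_u = 1 pixel
v = rng.normal(v_true, 1.0, n)                # sigma_v = 1 pixel
d = rng.normal(d_true, 2.0, n)                # sigma_d = 2 pixels

# Back-project through Eq. (21): the resulting depth samples are strongly skewed.
X = (u - cx) * b / d
Y = (v - cy) * b / d
Z = f * b / d
samples = np.stack([X, Y, Z], axis=1)
print(samples.mean(axis=0), samples.std(axis=0))
```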

Fig. 1.

(Left) The black point clouds represent the true uncertainty, i.e. samples of the distribution of a real landmark point position, given that the pixel noise in the images is Gaussian. The red-line-enclosed regions represent the estimated uncertainty using our proposed observation model; the blue-line-enclosed regions represent the estimated uncertainty using the traditional Euclidean XYZ model. (Right) Histograms of the uncertainty distributions: gray rectangles show the true uncertainty of the landmark location, red rectangles show the estimated uncertainty using our proposed observation model, and blue indicates the traditional Euclidean XYZ model. Our model describes the true noise distribution more accurately.

In Fig. 2(a), (b), (c), the black sample points indicate the true distributions at various distances (15 m, 30 m, and 45 m, respectively), and the red points show the true position of the landmark. The estimated uncertainty is computed using first-order error propagation with our observation model, shown by the blue enclosing lines. One can see that for any distance, near or far, the uncertainty region estimated by our model accurately bounds the true uncertainty.

Fig. 2.

Simulated experiment of point reconstruction from a stereo pair observation for a point at (a) 15 m, (b) 30 m, (c) 45 m distance. The point clouds are samples of the distribution of a real landmark point position, given that the pixel noise in the images is Gaussian; the black sample points show this distribution. The blue-line-enclosed regions represent the estimated uncertainty using our proposed observation model (color figure online).

From the simulation results, we have shown that the measurement error can be accurately estimated based on our proposed observation model, which helps keep the EKF estimate consistent and avoids filter divergence.

4.2 Real World Experiments

4.2.1 Dataset and Feature Points Matching

All experiments are verified using the Karlsruhe dataset [15], a set of real-world, large-scale, grayscale stereo sequences. Odometry data are available from an OXTS RT 3000 GPS/IMU system. The experimental vehicle is equipped with a stereo rig of Pointgrey Flea2 FireWire cameras and covers inner-city traffic. All frames are rectified, with a resolution of 1344 × 391 pixels at 10 fps. In our experiment we set the left image frame as the reference coordinate system. Our algorithm is implemented in LabVIEW on a computer with an Intel Core i5 at 1.7 GHz and 4 GB RAM.

In order to obtain the matching stereo observation pairs, stereo point matching based on LIBVISO2 [16], an open-source library, is used for real-time feature matching between the left and right images. This stereo pair matching is performed only in the first frame, for initialization.

We use a different method to track the initialized points in subsequent frames. When a stereo observation pair is initialized and saved in the map, the corresponding 11 × 11 surrounding patch is also saved and serves as a photometric identifier. When the next image arrives, the saved pairs in 3D space are projected back onto the image plane. A pair is deleted if it falls outside the visible field of view of the left and right images. If it is within the field of view, we use the active search concept to decide the search region, whose size is determined by the EKF innovation covariance. The corresponding feature patch is first warped according to the predicted camera motion, and then normalized cross-correlation is performed in the search region to find the matching point; for details see [17].
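A simplified sketch of the normalized cross-correlation search is shown below (our own illustration, not the authors' implementation): the saved 11 × 11 patch is compared at every position of a rectangular search window; the patch warping by the predicted camera motion and the elliptical gating by the innovation covariance are omitted here.

```python
import numpy as np

def ncc_match(image, patch, center, half_win):
    """Best normalized cross-correlation match of `patch` around pixel `center` = (u, v)."""
    ph, pw = patch.shape
    p = (patch - patch.mean()) / (patch.std() + 1e-9)
    cu, cv = center
    v_lo, v_hi = max(cv - half_win, ph // 2), min(cv + half_win, image.shape[0] - ph // 2 - 1)
    u_lo, u_hi = max(cu - half_win, pw // 2), min(cu + half_win, image.shape[1] - pw // 2 - 1)
    best_score, best_uv = -1.0, None
    for v0 in range(v_lo, v_hi + 1):
        for u0 in range(u_lo, u_hi + 1):
            win = image[v0 - ph // 2: v0 + ph // 2 + 1,
                        u0 - pw // 2: u0 + pw // 2 + 1]
            w = (win - win.mean()) / (win.std() + 1e-9)
            score = float((p * w).mean())
            if score > best_score:
                best_score, best_uv = score, (u0, v0)
    return best_uv, best_score
```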

4.3 Analysis of Features Location Estimation and Stability

In this part of the experiment, we compare a visual SLAM system based on two monocular observers with our proposed stereo-observer-based visual SLAM system with regard to the accuracy of feature spatial location estimation.

We used a subset of the Karlsruhe dataset, 2009_09_08_drive_0010 from frame 61 to frame 72. In this scenario the vehicle moves forward on the road and then turns right. The features are initialized in frame 61 (Fig. 3 Left) and then continuously tracked until frame 72 (Fig. 3 Right); the camera poses and feature locations are therefore updated 10 times. We recorded every state vector and covariance matrix along the way and selected a near feature (number 7) and a far feature (number 0) for analysis.

Fig. 3.

(Left) Features initialized at frame 60. (Right) Feature tracking result at frame 70.

Figure 4 (Left) and (Right) shows the depth uncertainty of each feature at every update. Whether the feature is far or near, it can be seen that with our proposed method the depth and its uncertainty converge faster.

Fig. 4.

(Left) Depth uncertainty of feature number 7 over 10 EKF updates. (Right) Depth uncertainty of feature number 0 over 10 EKF updates.

Figure 5 (Left) and (Right) shows the raw measurement of each feature's depth (green points) and the depth estimates. The blue points show the estimation result of the two-monocular-observer visual SLAM system, and the red points show the estimation result of the stereo-observer visual SLAM system.

Fig. 5.

(Left) The estimated depth of feature number 7 (nearby) over 10 EKF updates. (Right) The estimated depth of feature number 0 (far away) over 10 EKF updates. The proposed model (red line) is more stable and closer to the raw measurement mean (color figure online).

Figure 5 shows that the estimate of our proposed method is more stable, so the curve is smoother, and after 10 updates its estimate is also closer to the mean of the raw measurements. For the two-monocular-observer system, after 10 updates only the estimate of the near feature approaches the raw measurement mean (Fig. 5 Left), while the estimate of the far feature cannot get close to the raw measurement mean (Fig. 5 Right).

4.4 Motion Estimate Results

To evaluate how well the proposed algorithm can estimate the camera pose and velocity in a large-scale environment over an extended period of time, we use one of the Karlsruhe sequences, the 2009_09_08_drive_0015 images. From this dataset we pick 800 stereo frames, covering a route approximately 350 m long. These images are sequential scenes of inner-city urban driving, including two wide-angle turns.

Figure 6 depicts the trajectory estimated by our visual SLAM algorithm (in red) and the 'ground truth' output of an OXTS RT 3003 GPS/IMU system (in green). Note that the GPS/IMU system can only be considered 'weak' ground truth [16], because localization errors of up to two meters may occur in inner-city scenarios due to limited satellite availability. Instead, LIBVISO2 [16] visual odometry (in black) is arguably a better ground truth. As can be seen, the trajectory estimated with our proposed model is quite close to the ground truth.

Fig. 6.

(Left) The trajectory estimated by our visual SLAM algorithm (red) and the trajectory from the OXTS RT 3003 GPS/IMU system provided with the Karlsruhe dataset. (Right) The progression of vehicle orientation and velocity estimated by our visual SLAM algorithm. Our proposed method performs well in this large-scale scenario (color figure online).

In Fig. 6, because the car made a turn in the negative x direction at around frame 250, the yaw angle started to increase drastically. This caused the forward velocity Vz to decrease sharply while the velocity in the negative x direction, Vx, started to increase. Around frame 600, both Vz and Vx started to decrease, and at about frame 650 the car turned back toward the z-axis direction, making the yaw decrease. Finally, the car continued along the z-axis direction while Vz started to increase again.

From the experimental results, we can see that the proposed algorithm can be used in large-scale outdoor scenarios: the car pose and velocity are estimated well, close to the ground truth.

5 Conclusions and Future Works

The focus of this work was to develop a new probabilistic observation model for EKF-based stereo SLAM. The statistical behavior of stereo vision is known to be inherently nonlinear. Therefore, we use the inverse depth parameterization as the landmark model to deal with this nonlinearity and, in contrast to the binocular stereo-based approach, we use the stereo observation pair (u, v, d) to derive a new observation model, allowing both bearing and range information to be incorporated into the EKF estimation process. Furthermore, our new observation model is computationally faster than the previously mentioned binocular stereo-based approach: we perform the projection and update in the EKF framework only once, whereas the binocular approach requires projection and update twice each. Moreover, because the update step requires inverting a large innovation matrix, performing the update twice would be computationally expensive in large-scale SLAM. To the best of our knowledge, this is the first time this observation model with inverse depth parameterization has been proposed for stereo-based EKF SLAM.

Our experiments show that even in a large-scale outdoor environment there is only a little scale drift, which means that our observation model together with the inverse depth parameterization stays true to the real noise distribution of the landmark features. This also means that the proposed observation model, designed with the additional range information, has helped the filter work consistently under the strict requirements of the EKF framework. We also demonstrated from the experimental data that the proposed system converges faster and produces more stable depth estimates. These convergence and stability characteristics will be critical to our future work, in which we plan to integrate moving object detection together with the static map into a single EKF framework.