
1 Introduction

Recently, techniques for solving the problem of reconstructing a 3D model of the environment have drawn significant attention from both the robotics and computer vision communities. Many variants of these techniques have started to make an impact in a wide range of robotics applications such as robot navigation, obstacle avoidance, and visual simultaneous localization and mapping (VSLAM) based on stereo vision [1]. The structure-from-motion (SFM) problem is defined as the task of 3D reconstruction from a set of 2D images together with some geometrical constraints [2].

SFM methods perform a bundle adjustment (BA) optimization of the total geometry in order to obtain an accurate 3D model of the scene. However, this is computationally very expensive and cannot be implemented in real-time applications [3]. On the other hand, simultaneous localization and mapping (SLAM) is a problem recently taken up by the robotics community; it essentially addresses the hard real-time mapping and navigation problem by “sequential,” interactive local estimation of the structure and motion [4].

The SLAM problem tries to answer the following central question: “Is it possible for an autonomous robot starting at an unknown location in an unknown environment to build a map of the environment while simultaneously using this map to compute the vehicle’s location?” Solving this problem makes it possible to develop a truly autonomous robot that navigates safely around its environment [5].

The stochastic nature of mobile robot motion combined with noisy measurement data complicates the coupling between navigation and mapping that is inherent in SLAM [6]. Many successful SLAM algorithms address these issues by formulating the problem in a probabilistic manner, tracking the joint posterior over the robot pose and the map.

The probabilistic Bayesian filter is the strategy that incorporates this uncertainty: all possible robot poses are represented by a probability density function (pdf), which expresses the degree of belief (Bel) in the robot's motion [7].

The estimation problem can be solved by KF-based approaches or a particle filter (i.e., online VSLAM) for real-time applications, or by bundle adjustment (i.e., standard SFM) for offline applications [7]. In the medical field, stereo vision provides 3D perception, which improves the accuracy of surgery and reduces both the operating time and the errors that may occur. A novel SLAM algorithm proposed by [8] aims at advancing the state of the art in image-guided surgery, using stochastic models and the KF framework to recursively estimate the configuration of a high-degree-of-freedom snake surgical robot from stereo vision. VSLAM estimates the camera pose by applying epipolar geometry to the static feature correspondences as shown in Fig. 9.1. The dynamic features are regarded as outliers and excluded from the computation.

Fig. 9.1 Feature correspondence extraction

In Fig. 9.1, static corners have been extracted as correspondence points in order to estimate the camera orientation. The computer vision community has developed a large number of feature extraction techniques (e.g., the Harris corner detector, the scale-invariant feature transform (SIFT), and speeded-up robust features (SURF)). Unfortunately, these feature-matching techniques do not guarantee perfect correspondences, especially when the data contain outliers [3]. A robust estimator such as RANSAC (random sample consensus) is useful to reject outliers and handle false correspondences. Alternatively, deep learning techniques can process the image sequences directly to compute the correspondences in real time [3].

This chapter is structured as follows: Sect. 9.2 introduces the basic estimation techniques; Sect. 9.3 briefly presents stereo vision, camera calibration, projection, and epipolar geometry; Sect. 9.4 discusses the uncertainties and error sources in a vision system; finally, Sect. 9.5 gives examples of VSLAM based on the UKF algorithm and stereo vision.

2 Kalman Filter Framework-Based Probabilistic Inference

The following problem commonly recurs in computer vision and autonomous robotics algorithms: estimate the values of unknown parameters (robot posture, camera orientation, etc.), given a number of measurements (sensory data, images, feature points, etc.). These kinds of problems are called inverse problems because they involve estimating unknown model parameters instead of simulating the forward formation equations [9]. However, a model of the uncertainty sources needs to be introduced in order to obtain a reasonable algorithm. Such inference problems from noisy data are called probabilistic inference [10]. In this section, the measurement update equations for the KF, EKF, and UKF are derived starting from the maximum likelihood principle (the joint probability density function (pdf)), working through Bayes' rule, the Gauss–Newton iterated nonlinear least squares method, and practical nonlinear Bayes filters (i.e., the standard Kalman filter (SKF), the extended Kalman filter (EKF) based on Taylor linearization, and the unscented Kalman filter (UKF) based on stochastic linearization).

2.1 Maximum Likelihood Estimator (MLE)

The general form of noisy measurement model is given by (9.1):

$$ {\mathbf{z}}_{\mathbf{k}}=h\left({\mathbf{x}}_{\mathbf{k}}\right)+{\boldsymbol{v}}_{\mathbf{k}} $$
(9.1)

where z_k is the noisy measurement vector, x_k is the unknown state vector, h(⋅) is the associated nonlinear measurement model, which maps the unknown state into that particular measurement, and ν_k ~ N(0, R) is a Gaussian random variable (GRV) with zero mean and covariance matrix R. Given all of the noisy measurements z = {z_k}, the likelihood of having observed {z_k} given a particular value of x is given by (9.2):

$$ L=p\left(\mathbf{z}|\mathbf{x}\right)=\prod \limits_kp\left({\mathbf{z}}_{\mathbf{k}}|{\mathbf{x}}_{\mathbf{k}}\right)=\prod \limits_kp\left({\mathbf{z}}_{\mathbf{k}}|h\left({\mathbf{x}}_{\mathbf{k}}\right)\right)=\prod \limits_kp\left({\boldsymbol{\upnu}}_{\mathbf{k}}\right) $$
(9.2)

where p(z | x) is the joint probability distribution of the measurements z given the unknown vector x. To solve the inverse problem, the optimal estimate of the unknown state vector x in the absence of any prior model is the value that maximizes the likelihood function, provided the distribution is a unimodal Gaussian. However, if the distribution is multimodal, the likelihood has several local maxima and much more care is required [11].

The likelihood function can be written for the Gaussian noise as (9.3):

$$ {\displaystyle \begin{array}{l}L=\prod \limits_k{\left|2\pi \mathbf{R}\right|}^{-1/2}\exp \left(-\frac{1}{2}{\left({\mathbf{z}}_{\mathbf{k}}-h\left({\mathbf{x}}_{\mathbf{k}}\right)\right)}^T{\mathbf{R}}^{-\mathbf{1}}\left({\mathbf{z}}_{\mathbf{k}}-h\left({\mathbf{x}}_{\mathbf{k}}\right)\right)\right)\\ {}\kern0.75em =\prod \limits_k{\left|2\pi \mathbf{R}\right|}^{-1/2}\exp \left(-\frac{1}{2}{\left\Vert {\mathbf{z}}_{\mathbf{k}}-{\overline{\mathbf{z}}}_{\mathbf{k}}\right\Vert}_{{\mathbf{R}}^{-\mathbf{1}}}^2\right)\end{array}} $$
(9.3)

The norm \( {\left\Vert {\mathbf{z}}_{\mathbf{k}}-{\overline{\mathbf{z}}}_{\mathbf{k}}\right\Vert}_{{\mathbf{R}}^{-\mathbf{1}}}^2 \) is called the Mahalanobis distance [9]. It measures the distance between the measurement z_k at time step k and the mean \( {\overline{\mathbf{z}}}_{\mathbf{k}} \) of the Gaussian distribution. Usually, it is more convenient to work with the negative log-likelihood as a cost function [12] (9.4):

$$ {\displaystyle \begin{aligned}E=-\log L&=\frac{1}{2}\sum \limits_k{\left({\mathbf{z}}_{\mathbf{k}}-{\overline{\mathbf{z}}}_{\mathbf{k}}\right)}^T{\mathbf{R}}^{-\mathbf{1}}\left({\mathbf{z}}_{\mathbf{k}}-{\overline{\mathbf{z}}}_{\mathbf{k}}\right)+\log \left|2\pi \mathbf{R}\right|\\[-3pt] &=\frac{1}{2}\sum \limits_k{\left\Vert {\mathbf{z}}_{\mathbf{k}}-{\overline{\mathbf{z}}}_{\mathbf{k}}\right\Vert}_{{\mathbf{R}}^{-\mathbf{1}}}^2+K\end{aligned}} $$
(9.4)

where K = ∑_k log |2πR| is a constant independent of x and can be dropped. The inverse covariance matrix R^{-1} weights each measurement error residual (i.e., the difference between the actual measurement z_k and the predicted mean value \( {\hat{\mathbf{z}}}_{\mathbf{k}}={\overline{\mathbf{z}}}_{\mathbf{k}} \)).

Another form of the negative log-likelihood can be written as (9.5):

$$ E=-\log L=\sum \limits_k{\left\Vert {\mathbf{z}}_{\mathbf{k}}-{\hat{\mathbf{z}}}_{\mathbf{k}}\right\Vert}_{{\mathbf{R}}^{-\mathbf{1}}}^2 $$
(9.5)

Consider the case where the measurement noise is Gaussian and the measurement equation is linear (9.6):

$$ {\mathbf{z}}_{\mathbf{k}}=h\left({\mathbf{x}}_{\mathbf{k}}\right)={\mathbf{Hx}}_{\mathbf{k}} $$
(9.6)

where H is the measurement matrix; in this case, the maximum likelihood estimate is given by the minimization of the cost (9.7), a simple quadratic form in x_k that is solved using a linear least squares algorithm:

$$ E=\sum \limits_k{\left\Vert {\mathbf{z}}_{\mathbf{k}}-h\left({\mathbf{x}}_{\mathbf{k}}\right)\right\Vert}_{{\mathbf{R}}^{-\mathbf{1}}}^2=\sum \limits_k{\left({\mathbf{z}}_{\mathbf{k}}-{\mathbf{Hx}}_{\mathbf{k}}\right)}^T{\mathbf{R}}^{-\mathbf{1}}\left({\mathbf{z}}_{\mathbf{k}}-{\mathbf{Hx}}_{\mathbf{k}}\right) $$
(9.7)
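
To make (9.7) concrete, the following minimal Python sketch (assuming NumPy, with H, R, and the measurements chosen purely for illustration) computes the weighted least-squares MLE by solving the normal equations (∑_k HᵀR⁻¹H) x = ∑_k HᵀR⁻¹z_k:

```python
import numpy as np

def mle_weighted_least_squares(H, R, zs):
    """Weighted linear least-squares MLE for z_k = H x + v_k, v_k ~ N(0, R).

    Minimizes E = sum_k ||z_k - H x||^2_{R^-1} via the normal equations.
    """
    R_inv = np.linalg.inv(R)
    A = len(zs) * (H.T @ R_inv @ H)            # information matrix
    b = sum(H.T @ R_inv @ z for z in zs)       # information vector
    return np.linalg.solve(A, b)

# Hypothetical example: a constant 2D state observed through a 3x2 model.
rng = np.random.default_rng(0)
x_true = np.array([1.0, -2.0])
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
R = 0.1 * np.eye(3)
zs = [H @ x_true + rng.multivariate_normal(np.zeros(3), R) for _ in range(50)]
print(mle_weighted_least_squares(H, R, zs))    # close to x_true
```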

2.2 Probabilistic Inference and Bayesian Rule

In some cases, the range of possible solutions consistent with the measurements is too large to be useful, and no progress can be made [11]. For example, to solve the image filtering problem, MLE estimates each pixel separately based on just its noisy version [13]. The difference between Bayesian inference and the MLE method is that the starting point of Bayesian inference is to formally consider the unknown vector x_k as a random vector with a prior distribution p(x_k), called the degree of belief (Bel); the posterior distribution of x_k is then computed by multiplying the measurement likelihood p(z_k | x_k) by the prior Bel [14].

Consider the noisy measurement model given by (9.8):

$$ {\mathbf{z}}_{\mathbf{k}}=h\left({\mathbf{x}}_{\mathbf{k}}\right)+{\boldsymbol{\upnu}}_{\mathbf{k}} $$
(9.8)

where \( {\mathbf{x}}_{\mathbf{k}}\sim N\left(\overline{\mathbf{x}},\mathbf{P}\right) \) is the unknown Gaussian random state vector with mean \( \overline{\mathbf{x}} \) and state covariance matrix P. Bayes' rule is given by (9.9):

$$ p\left({\mathbf{x}}_{\mathbf{k}}|{\mathbf{z}}_{\mathbf{k}}\right)=\eta p\left({\mathbf{z}}_{\mathbf{k}}|{\mathbf{x}}_{\mathbf{k}}\right)p\left({\mathbf{x}}_{\mathbf{k}}\right) $$
(9.9)

where η is a normalizing constant [10]. The problem is to find the x_k that maximizes the posterior, i.e., the maximum a posteriori (MAP) estimate.

Assume that the distributions of x_k and z_k are Gaussian (9.10):

$$ {\displaystyle \begin{array}{c}p\left({\mathbf{z}}_{\mathbf{k}}|{\mathbf{x}}_{\mathbf{k}}\right)={\left|2\pi \mathbf{R}\right|}^{-1/2}\exp \left(-\frac{1}{2}{\left({\mathbf{z}}_{\mathbf{k}}-h\left({\mathbf{x}}_{\mathbf{k}}\right)\right)}^T{\mathbf{R}}^{-\mathbf{1}}\left({\mathbf{z}}_{\mathbf{k}}-h\left({\mathbf{x}}_{\mathbf{k}}\right)\right)\right)\\ {}\kern0.5em p\left({\mathbf{x}}_{\mathbf{k}}\right)={\left|2\pi {\mathbf{P}}_{\mathbf{k}}\right|}^{-1/2}\exp \left(-\frac{1}{2}{\left({\hat{\mathbf{x}}}_{\mathbf{k}}-{\hat{\mathbf{x}}}_{\mathbf{k}}^{-}\right)}^T{\mathbf{P}}_{\mathbf{k}}^{-\mathbf{1}}\left({\hat{\mathbf{x}}}_{\mathbf{k}}-{\hat{\mathbf{x}}}_{\mathbf{k}}^{-}\right)\right)\end{array}} $$
(9.10)

The solution that maximizes p(x_k | z_k) is the most probable value of the random vector; finding it is equivalent to minimizing the negative log of the posterior, which reduces to the quadratic form [11] (9.11):

$$ L=\frac{1}{2}\left({\left({\mathbf{z}}_{\mathbf{k}}-h\left({\mathbf{x}}_{\mathbf{k}}\right)\right)}^T{\mathbf{R}}^{-\mathbf{1}}\left({\mathbf{z}}_{\mathbf{k}}-h\left({\mathbf{x}}_{\mathbf{k}}\right)\right)+{\left({\hat{\mathbf{x}}}_{\mathbf{k}}-{\hat{\mathbf{x}}}_{\mathbf{k}}^{-}\right)}^T{\mathbf{P}}_{\mathbf{k}}^{-\mathbf{1}}\left({\hat{\mathbf{x}}}_{\mathbf{k}}-{\hat{\mathbf{x}}}_{\mathbf{k}}^{-}\right)\right) $$
(9.11)

An algebraically equivalent way to maximize the posterior likelihood is to treat the prior estimate as a pseudo-observation and write a new observation vector [12] (9.12):

$$ {\mathbf{Z}}_{\mathbf{k}}=\left[\begin{array}{c}{\mathbf{z}}_{\mathbf{k}}\\ {}{\hat{\mathbf{x}}}_{\mathbf{k}}^{-}\end{array}\right],\kern1em g\left({\mathbf{x}}_{\mathbf{k}}\right)=\left[\begin{array}{c}h\left({\mathbf{x}}_{\mathbf{k}}\right)\\ {}{\mathbf{x}}_{\mathbf{k}}\end{array}\right],\kern1em \mathbf{C}=\left[\begin{array}{cc}\mathbf{R}& \mathbf{0}\\ {}\mathbf{0}& {\mathbf{P}}_{\mathbf{k}}\end{array}\right] $$
(9.12)

which gives (9.13):

$$ L=\frac{1}{2}{\left({\mathbf{Z}}_{\mathbf{k}}-g\left({\mathbf{x}}_{\mathbf{k}}\right)\right)}^T{\mathbf{C}}^{-\mathbf{1}}\left({\mathbf{Z}}_{\mathbf{k}}-g\left({\mathbf{x}}_{\mathbf{k}}\right)\right) $$
(9.13)

This is a nonlinear least squares problem of the form (9.14):

$$ E=-\log L=\sum \limits_k{\left\Vert {\mathbf{z}}_{\mathbf{k}}-{\hat{\mathbf{z}}}_{\mathbf{k}}\right\Vert}_{{\mathbf{R}}^{-\mathbf{1}}}^2 $$
(9.14)

A useful approximation for small residual problems is the online stochastic Gauss–Newton method, which defines the sequence of iterates as [12] (9.15):

$$ {\hat{\mathbf{x}}}_{\mathbf{k}}={\hat{\mathbf{x}}}_{\mathbf{k}}^{-}+{\left({\mathbf{G}}_{\mathbf{k}}^{\mathbf{T}}{\mathbf{C}}^{-\mathbf{1}}{\mathbf{G}}_{\mathbf{k}}\right)}^{-\mathbf{1}}{\mathbf{G}}_{\mathbf{k}}^{\mathbf{T}}{\mathbf{C}}^{-\mathbf{1}}\left({\mathbf{Z}}_{\mathbf{k}}-g\left({\hat{\mathbf{x}}}_{\mathbf{k}}^{-}\right)\right) $$
(9.15)

where \( {\mathbf{G}}_{\mathbf{k}}={\left.\frac{\partial g\left({\mathbf{x}}_{\mathbf{k}}\right)}{\partial {\mathbf{x}}_{\mathbf{k}}}\right|}_{\mathbf{x}=\overline{\mathbf{x}}} \) is the Jacobian of g(x_k) with respect to the state vector x_k.

The Gauss–Newton update can be simplified by using the matrix inversion lemma [11] (9.16):

$$ {\displaystyle \begin{array}{l}{\left({\mathbf{H}}^{\mathbf{T}}{\mathbf{R}}^{-\mathbf{1}}\mathbf{H}+{\mathbf{P}}_{\mathbf{k}}^{-\mathbf{1}}\right)}^{-\mathbf{1}}{\mathbf{H}}^{\mathbf{T}}{\mathbf{R}}^{-\mathbf{1}}={\mathbf{P}}_{\mathbf{k}}{\mathbf{H}}^{\mathbf{T}}{\left({\mathbf{H}\mathbf{P}}_{\mathbf{k}}{\mathbf{H}}^{\mathbf{T}}+\mathbf{R}\right)}^{-\mathbf{1}}\\ {}{\left({\mathbf{H}}^{\mathbf{T}}{\mathbf{R}}^{-\mathbf{1}}\mathbf{H}+{\mathbf{P}}_{\mathbf{k}}^{-\mathbf{1}}\right)}^{-\mathbf{1}}={\mathbf{P}}_{\mathbf{k}}-{\mathbf{P}}_{\mathbf{k}}{\mathbf{H}}^{\mathbf{T}}{\left({\mathbf{H}\mathbf{P}}_{\mathbf{k}}{\mathbf{H}}^{\mathbf{T}}+\mathbf{R}\right)}^{-\mathbf{1}}{\mathbf{H}\mathbf{P}}_{\mathbf{k}}\end{array}} $$
(9.16)

Using (9.16), the Kalman equation is given by (9.17):

$$ {\hat{\mathbf{x}}}_{\mathbf{k}}={\hat{\mathbf{x}}}_{\mathbf{k}}^{-}+{\mathbf{K}}_{\mathbf{k}}\left({\mathbf{z}}_{\mathbf{k}}-h\left({\hat{\mathbf{x}}}_{\mathbf{k}}^{-}\right)\right) $$
(9.17)

where K_k is the Kalman gain (9.18):

$$ {\mathbf{K}}_{\mathbf{k}}={\mathbf{P}}_{\mathbf{k}}{\mathbf{H}}^{\mathbf{T}}{\left({\mathbf{H}\mathbf{P}}_{\mathbf{k}}{\mathbf{H}}^{\mathbf{T}}+\mathbf{R}\right)}^{-\mathbf{1}} $$
(9.18)

The updated covariance matrix is approximated using (9.19):

$$ {\mathbf{P}}_{\mathbf{k}+\mathbf{1}}={\left({\mathbf{G}}_{\mathbf{k}}^{\mathbf{T}}{\mathbf{C}}^{-\mathbf{1}}{\mathbf{G}}_{\mathbf{k}}\right)}^{-\mathbf{1}}={\left({\mathbf{H}}^{\mathbf{T}}{\mathbf{R}}^{-\mathbf{1}}\mathbf{H}+{\mathbf{P}}_{\mathbf{k}}^{-\mathbf{1}}\right)}^{-\mathbf{1}} $$
(9.19)

where \( {\mathbf{H}}_{\mathbf{k}}={\left.\frac{\partial h\left({\mathbf{x}}_{\mathbf{k}}\right)}{\partial {\mathbf{x}}_{\mathbf{k}}}\right|}_{\mathbf{x}=\overline{\mathbf{x}}} \) is the Jacobian of h(x_k) with respect to the state vector x_k.

Using the matrix inversion lemma (9.16), the posterior covariance matrix is given by (9.20):

$$ {\mathbf{P}}_{\mathbf{k}+\mathbf{1}}=\left(\mathbf{I}-{\mathbf{K}}_{\mathbf{k}}\mathbf{H}\right){\mathbf{P}}_{\mathbf{k}} $$
(9.20)

Many applications, such as the KF, require an estimate of the uncertainty in this solution, in the form of a posterior covariance matrix, in order to optimally integrate new measurements with previously computed estimates [9].
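
The equivalence established by (9.16) through (9.20) is easy to check numerically. The sketch below (with arbitrary illustrative matrices, not the chapter's data) computes the linear MAP update once in information form and once in Kalman-gain form and confirms that the two agree:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 2
H = rng.standard_normal((m, n))                 # linear measurement matrix
R = np.diag([0.2, 0.5])                         # measurement covariance
P = np.eye(n)                                   # prior covariance P_k
x_prior = rng.standard_normal(n)                # prior mean
z = rng.standard_normal(m)                      # measurement z_k

# Information (Gauss-Newton) form, Eq. (9.19)
P_info = np.linalg.inv(H.T @ np.linalg.inv(R) @ H + np.linalg.inv(P))
x_info = x_prior + P_info @ H.T @ np.linalg.inv(R) @ (z - H @ x_prior)

# Kalman-gain form, Eqs. (9.17), (9.18), and (9.20)
K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
x_kf = x_prior + K @ (z - H @ x_prior)
P_kf = (np.eye(n) - K @ H) @ P

assert np.allclose(x_info, x_kf) and np.allclose(P_info, P_kf)
```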

2.3 Bayes Filter and Belief Update

In this section, the formulation of optimal recursive discrete-time Bayesian filters (e.g., KF, EKF, and UKF) is presented, treating each as a practical instance of the Bayes filter.

The basic elements of the Bayesian filter are the initial belief Bel(x_{k−1}) containing preliminary information about the unknown vector x_{k−1}, the motion model p(x_k | x_{k−1}, u_k) as a probabilistic model of the discrete-time state space, and the measurement model p(z_k | x_k) determining the stochastic mapping from the state vector to the measurement, where

  • x_k ∈ ℝ^n is the unknown state vector at time step k.

  • u_k ∈ ℝ^L is the control vector at time step k.

  • z_k ∈ ℝ^m is the observation vector at time step k.

Under the Markov assumption, these vectors are conditionally independent of past values. The Bayes filter applies two rules successively to estimate the system state [10] (9.21):

$$ {\displaystyle \begin{array}{l}\mathrm{Prediction}:\kern0.5em \overline{\mathrm{Bel}}\left({\mathbf{x}}_{\mathbf{k}}\right)=\sum \limits_{{\mathbf{x}}_{\mathbf{k}-\mathbf{1}}}p\left({\mathbf{x}}_{\mathbf{k}}|{\mathbf{x}}_{\mathbf{k}-\mathbf{1}},{\mathbf{u}}_{\mathbf{k}}\right)\mathrm{Bel}\left({\mathbf{x}}_{\mathbf{k}-\mathbf{1}}\right)\\ {}\mathrm{Correction}:\kern0.5em \mathrm{Bel}\left({\mathbf{x}}_{\mathbf{k}}\right)=\eta p\left({\mathbf{z}}_{\mathbf{k}}|{\mathbf{x}}_{\mathbf{k}}\right)\overline{\mathrm{Bel}}\left({\mathbf{x}}_{\mathbf{k}}\right)\end{array}} $$
(9.21)

The prior predictive \( \overline{\mathrm{Bel}}\left({\mathbf{x}}_{\mathbf{k}}\right) \) is calculated just before the measurement z_k using the control vector u_k; this step is called prediction (or the action phase). Next, the state belief produced in the action phase is corrected using the sensor measurement; this step is called correction (or the perception phase).
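
As a concrete, if simplified, instance of (9.21), the sketch below implements one cycle of a histogram Bayes filter over a discrete one-dimensional grid; the motion and sensor models here are illustrative assumptions rather than models from the chapter:

```python
import numpy as np

def bayes_filter_step(bel, transition, likelihood):
    """One predict/correct cycle of the discrete Bayes filter, Eq. (9.21).

    bel        : Bel(x_{k-1}), a probability vector over grid cells
    transition : p(x_k | x_{k-1}, u_k) as a column-stochastic matrix
    likelihood : p(z_k | x_k) evaluated at each grid cell
    """
    bel_bar = transition @ bel                 # prediction (action phase)
    bel_new = likelihood * bel_bar             # correction (perception phase)
    return bel_new / bel_new.sum()             # eta normalizes the posterior

# Illustrative 5-cell world: the robot commands one step to the right.
bel = np.array([1.0, 0.0, 0.0, 0.0, 0.0])      # starts in cell 0 with certainty
T = 0.8 * np.roll(np.eye(5), 1, axis=0) + 0.2 * np.eye(5)   # noisy shift
z_like = np.array([0.05, 0.80, 0.05, 0.05, 0.05])           # sensor favors cell 1
print(bayes_filter_step(bel, T, z_like))
```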

2.3.1 KF Framework

In many vision applications, an object is tracked from frame to frame as it moves. The Kalman filter (1960) has been regarded as the optimal solution to many visual motion tracking and data prediction tasks [13].

The standard KF derivation is given here from the practical viewpoint of probabilistic inference [15]. Consider a noisy linear system given by (9.22):

$$ {\displaystyle \begin{array}{l}{\mathbf{x}}_{\mathbf{k}}={\mathbf{Ax}}_{\mathbf{k}-\mathbf{1}}+{\mathbf{B}}_1{\mathbf{u}}_{\mathbf{k}}+{\mathbf{B}}_2{\boldsymbol{\upomega}}_{\mathbf{k}}\\ {}{\mathbf{z}}_{\mathbf{k}}={\mathbf{Hx}}_{\mathbf{k}}+{\boldsymbol{\upnu}}_{\mathbf{k}}\end{array}} $$
(9.22)

where x_k, x_{k−1} are the current and previous state vectors, A (n × n) is the linear state transition matrix of the dynamic model, B_1 (n × L) is the control matrix, B_2 (n × L) is the input noise matrix, H (m × n) is the measurement model matrix, ω_k ~ N(0, Q) is an additive Gaussian state noise, and ν_k ~ N(0, R) is the Gaussian measurement noise. The KF equations can be derived as follows [16]:

  1. Prediction phase: The motion model causes a drift in the previous estimate, while the additive noise increases the system disbelief.

    First, apply the motion model and compute the predictive distribution of the Gaussian state x_k given the initial state \( {\mathbf{x}}_{\mathbf{k}-\mathbf{1}}\sim N\left({\overline{\mathbf{x}}}_{\mathbf{k}-\mathbf{1}},{\mathbf{P}}_{\mathbf{k}-\mathbf{1}}\right) \) and the input u_k by (9.23):

    $$ {\hat{\mathbf{x}}}_{\mathbf{k}}^{-}\sim N\left(\mathbf{A}{\overline{\mathbf{x}}}_{\mathbf{k}-\mathbf{1}}+{\mathbf{B}}_{\mathbf{1}}{\mathbf{u}}_{\mathbf{k}},{\mathbf{A}\mathbf{P}}_{\mathbf{k}-\mathbf{1}}{\mathbf{A}}^{\mathbf{T}}+\mathbf{Q}\right) $$
    (9.23)

    Then, apply the measurement model and compute the predictive distribution of the measurement z_k given the predicted state \( {\hat{\mathbf{x}}}_{\mathbf{k}}^{-} \) by (9.24):

$$ {\hat{\mathbf{z}}}_{\mathbf{k}}\sim N\left(\mathbf{H}{\hat{\mathbf{x}}}_{\mathbf{k}}^{-},{\mathbf{H}\mathbf{P}}_{\mathbf{k}}^{-}{\mathbf{H}}^{\mathbf{T}}+\mathbf{R}\right) $$
(9.24)
  2. Correction phase: New measurements from the current frame introduce additional information that updates the prior estimate \( {\hat{\mathbf{x}}}_{\mathbf{k}}^{-} \) and restores some of the belief, by computing the Kalman gain K_k and updating the covariance matrix (9.25):

$$ {\displaystyle \begin{array}{l}{\mathbf{K}}_{\mathbf{k}}={\mathbf{P}}_{\mathbf{k}}^{-}{\mathbf{H}}^{\mathbf{T}}{\left[{\mathbf{H}\mathbf{P}}_{\mathbf{k}}^{-}{\mathbf{H}}^{\mathbf{T}}+\mathbf{R}\right]}^{-1}\\ {}{\hat{\mathbf{x}}}_{\mathbf{k}}={\hat{\mathbf{x}}}_{\mathbf{k}}^{-}+{\mathbf{K}}_{\mathbf{k}}\left({\mathbf{z}}_{\mathbf{k}}-{\hat{\mathbf{z}}}_{\mathbf{k}}\right)\\ {}{\mathbf{P}}_{\mathbf{k}}=\left(\mathbf{I}-{\mathbf{K}}_{\mathbf{k}}\mathbf{H}\right){\mathbf{P}}_{\mathbf{k}}^{-}\end{array}} $$
(9.25)
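
A compact sketch of one full KF cycle, following (9.23)–(9.25), is given below; the model matrices are placeholders to be supplied by the application:

```python
import numpy as np

def kf_step(x, P, u, z, A, B1, H, Q, R):
    """One Kalman filter predict/correct cycle, Eqs. (9.23)-(9.25)."""
    # Prediction: the motion model drifts the estimate; noise adds disbelief
    x_pred = A @ x + B1 @ u
    P_pred = A @ P @ A.T + Q
    # Correction: fuse the new measurement through the Kalman gain
    S = H @ P_pred @ H.T + R                   # innovation covariance, Eq. (9.24)
    K = P_pred @ H.T @ np.linalg.inv(S)        # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```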

2.3.2 EKF Linearization Technique

For nonlinear problems such as SLAM, an EKF, which linearizes the motion and measurement models around the current estimate, is used. The important drawback of the EKF approach is its reliance on Taylor linearization of the dynamic model [17]. For instance, even when the robot drives along a straight path, the distribution produced by the planar motion model is observed to take on a “banana shape.” As uncertainty increases, the algorithm becomes inconsistent because the normality assumption breaks down [18]. Consider a noisy nonlinear system given by (9.26):

$$ {\displaystyle \begin{array}{c}{\mathbf{x}}_{\mathbf{k}}=f\left({\mathbf{x}}_{\mathbf{k}-\mathbf{1}},{\mathbf{u}}_{\mathbf{k}},{\boldsymbol{\upomega}}_{\mathbf{k}}\right)\\ {}{\mathbf{z}}_{\mathbf{k}}=h\left({\mathbf{x}}_{\mathbf{k}}\right)+{\boldsymbol{\upnu}}_{\mathbf{k}}\end{array}} $$
(9.26)

where f(⋅) is the nonlinear motion model, corrupted by the non-additive Gaussian noise ω_k ~ N(0, Q), and h(⋅) is the nonlinear measurement model, corrupted by the additive noise ν_k ~ N(0, R).

Table 9.1 gives the pseudo code for the EKF algorithm viewed as a Bayes filter.

Table 9.1 Extended Kalman filter (x_{k−1}, P_{k−1}, u_k, z_k)
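
Since Table 9.1 is compact, a Python rendering in the same spirit is sketched below. Here f and h are user-supplied models, the process noise is assumed additive for simplicity (whereas (9.26) allows non-additive noise), and the Jacobians required by the EKF are approximated numerically for brevity:

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    """Forward-difference Jacobian (stand-in for the analytic Jacobian)."""
    fx = np.asarray(func(x))
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros(x.size)
        dx[i] = eps
        J[:, i] = (np.asarray(func(x + dx)) - fx) / eps
    return J

def ekf_step(x, P, u, z, f, h, Q, R):
    """EKF cycle for x_k = f(x_{k-1}, u_k) + w_k, z_k = h(x_k) + v_k."""
    # Prediction: propagate the mean and linearize f around the estimate
    x_pred = f(x, u)
    F = numerical_jacobian(lambda s: f(s, u), x)
    P_pred = F @ P @ F.T + Q
    # Correction: linearize h around the predicted mean
    H = numerical_jacobian(h, x_pred)
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = (np.eye(x.size) - K @ H) @ P_pred
    return x_new, P_new
```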

2.3.3 UKF Stochastic Linearization Technique

The UKF is a Gaussian recursive Bayesian filtering algorithm that solves the probabilistic inference problem practically. It propagates and updates the system state using a set of deterministically chosen points called sigma points [11]. These points capture the mean and covariance of the state distribution. Each sigma point is propagated through the nonlinear motion and measurement models via the unscented transform [14], and the posterior state mean and covariance are recovered, accurate to the third order of the nonlinear system. This is a form of statistical local linearization, which produces more accurate estimates than the analytic local linearization employed by the EKF [19]. The UKF algorithm includes three main stages. Table 9.2 shows the pseudo code for the UKF algorithm [16].

Table 9.2 Unscented Kalman filter (χ_{k−1}, P_{k−1}, u_k, z_k)
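
To make the stochastic linearization concrete, a minimal sigma-point sketch is given below. It follows the standard unscented transform with commonly used default weights and is an illustration under those assumptions, not a reproduction of Table 9.2:

```python
import numpy as np

def sigma_points(x, P, alpha=0.5, beta=2.0, kappa=0.0):
    """Deterministically chosen sigma points and weights for N(x, P)."""
    n = x.size
    lam = alpha**2 * (n + kappa) - n
    S = np.linalg.cholesky((n + lam) * P)       # matrix square root
    pts = np.vstack([x, x + S.T, x - S.T])      # 2n + 1 points
    Wm = np.full(2 * n + 1, 0.5 / (n + lam))
    Wm[0] = lam / (n + lam)
    Wc = Wm.copy()
    Wc[0] += 1.0 - alpha**2 + beta
    return pts, Wm, Wc

def unscented_transform(pts, Wm, Wc, func, noise_cov):
    """Propagate the sigma points through func; recover mean and covariance."""
    Y = np.array([func(p) for p in pts])
    mean = Wm @ Y
    cov = sum(w * np.outer(y - mean, y - mean) for w, y in zip(Wc, Y)) + noise_cov
    return Y, mean, cov

def ukf_correct(x_pred, P_pred, z, h, R):
    """UKF measurement update using the cross-covariance P_xz."""
    pts, Wm, Wc = sigma_points(x_pred, P_pred)
    Z, z_mean, S = unscented_transform(pts, Wm, Wc, h, R)
    Pxz = sum(w * np.outer(p - x_pred, zi - z_mean)
              for w, p, zi in zip(Wc, pts, Z))
    K = Pxz @ np.linalg.inv(S)                  # Kalman gain
    return x_pred + K @ (z - z_mean), P_pred - K @ S @ K.T
```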

3 Stereo Vision System

Stereo vision is considered one of the most important recent applications [21], and it is still developing, especially in robotics vision [22]. Stereo vision is used to form a 3D map of the robot's environment, and landmarks in this map are used for localization and exploration [21].

Figure 9.2 explains the low- and high-level image processing stages. In the low-level stage, camera calibration with distortion removal is carried out. Camera calibration includes the determination of the camera's intrinsic and extrinsic parameters. Accurate estimates of this geometry are necessary in order to relate image information to an external world coordinate system. In the high-level stage, correspondence points are determined using advanced feature extraction algorithms [23]. These points are then used to calculate the relative orientation (RO) as well as the absolute orientation (AO) using a set of control points whose relative coordinates and corresponding image points are known. A stereo vision system with a SLAM algorithm enables the robot to perceive the environment around and within the robot playing area [24]. The relative camera pose can be computed from two views by enforcing epipolar geometry, using 7-point correspondences for an uncalibrated camera or 5-point correspondences for a calibrated camera. If the image correspondences are known, the relative pose between two images can be recovered up to a scale factor [13]. Once the camera pose is recovered, one can reconstruct 3D points of the scene by intersecting the two projection rays through triangulation [25]. As the rays do not always intersect because of erroneous correspondences, the midpoint method or a least-squares-based method is used to estimate the intersection. Then, to avoid the drifting problem, a UKF is employed to refine both the camera pose and the 3D points by minimizing re-projection errors [26].

Fig. 9.2 Stereo vision 3D reconstruction stage

3.1 Perspective Projection and Collinearity Constraint

The process by which 3D objects are mapped onto an image by a camera is approximated by the collinearity constraint [13].

Figure 9.3 shows the perspective projection geometry.

Fig. 9.3 Perspective projection model

As seen from Fig. 9.3, light falling on the image plane is assumed to have passed through a small pinhole. Therefore, each object point P_ω maps to a single point P_u on the image plane. Three coordinate systems are necessary to define a perspective camera model [23]: (a) the 3D world coordinate system {W}; (b) the camera coordinate system {C}, attached to the projection center of the camera; the sensor plane is parallel to its xy plane and displaced in the positive z direction, and this axis pierces the image plane at the principal point (u_0, ν_0), which acts as the origin of the image plane; (c) the 2D image coordinate system {I}, whose origin lies at the upper left corner of the image. Two sets of parameters are used for perspective camera modeling [13]:

  • Extrinsic parameters (extrinsics): These parameters describe the camera pose in the environment. The extrinsics comprise the six parameters of the exterior orientation (EO) of the projection center (i.e., three parameters for the translation and three for the rotation). They all vary with the camera motion in the environment.

  • Intrinsic parameters (intrinsics): These parameters model the camera physics and describe the interior orientation (IO) of the camera. The intrinsic parameters are determined by calibration and are usually fixed. With these parameters in place, perspective projection can be defined mathematically. The mapping with an ideal perspective camera can be decomposed into two steps [23]:

    1. Exterior orientation: Given the 3D object position in the world frame \( {{}^{\mathbf{w}}\mathbf{P}}_{\boldsymbol{\upomega}}={\left[\begin{array}{ccc}{{}^wx}_{\omega }& {{}^wy}_{\omega }& {{}^wz}_{\omega}\end{array}\right]}^T \), the 3D object position with respect to the camera frame, using the homogeneous transformation \( {{}^{\mathbf{C}}\mathbf{T}}_{\mathbf{W}} \) (4 × 4), composed of the rotation matrix \( {{}^{\mathbf{C}}\mathbf{R}}_{\mathbf{W}} \) (3 × 3) and the translation vector \( {{}^{\mathbf{C}}\mathbf{D}}_{\mathbf{W}} \) (3 × 1), is given by (9.27):

      $$ {{}^C\mathbf{P}}_{\boldsymbol{\upomega}}={{}^{\mathbf{C}}\mathbf{T}}_{\mathbf{W}}{{}^{\mathbf{W}}\mathbf{P}}_{\boldsymbol{\upomega}} $$
      (9.27)
    2. Interior orientation: The projection from the camera frame to the image frame, using the calibration matrix κ and the intrinsic parameters (the focal length f and the horizontal and vertical scale factors k_u, k_ν), is given by (9.28):

      $$ \boldsymbol{\upkappa} =\left(\begin{array}{cccc}{\alpha}_u& 0& {u}_0& 0\\ {}0& {\alpha}_{\nu }& {\nu}_0& 0\\ {}0& 0& 1& 0\end{array}\right) $$
      (9.28)

      where α_u = −k_u f and α_ν = −k_ν f. If a shear parameter s and a scale difference m are present, they amount to an affine distortion of the image frame. It is useful to model the lens distortions as corrections Δu, Δν of the image coordinates of a perspective camera, and the calibration matrix becomes (9.29):

      $$ \boldsymbol{\upkappa} =\left(\begin{array}{cccc}{\alpha}_u& s{\alpha}_{\nu }& {u}_0+\Delta u& 0\\ {}0& {\alpha}_{\nu}\left(1+m\right)& {\nu}_0+\Delta \nu & 0\\ {}0& 0& 1& 0\end{array}\right) $$
      (9.29)

      If κ is known, then the camera is considered to be calibrated. The final mathematical formulation of the collinearity constraint for a perspective projection with distortion, from the object frame to the image frame, is given by (9.30):

$$ {{}^{\mathbf{I}}\mathbf{P}}_{\mathbf{u}}=\boldsymbol{\upkappa} {{}^{\mathbf{C}}\mathbf{P}}_{\boldsymbol{\upomega}} $$
(9.30)
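
The two-step mapping (9.27)–(9.30) translates directly into code. The sketch below assumes an ideal distortion-free camera, uses the 3 × 3 part of κ acting on camera-frame coordinates, and takes purely illustrative parameter values:

```python
import numpy as np

def project(P_w, R_cw, t_cw, f, ku, kv, u0, v0):
    """Collinearity constraint: world point -> pixel, Eqs. (9.27)-(9.30).

    R_cw, t_cw : exterior orientation {W} -> {C} (rotation, translation)
    f, ku, kv  : focal length and scale factors; (u0, v0): principal point
    """
    # 1. Exterior orientation: express the point in the camera frame
    P_c = R_cw @ P_w + t_cw
    # 2. Interior orientation: calibration matrix kappa, Eq. (9.28)
    au, av = -ku * f, -kv * f
    kappa = np.array([[au, 0.0, u0],
                      [0.0, av, v0],
                      [0.0, 0.0, 1.0]])
    p = kappa @ P_c                             # homogeneous image point
    return p[:2] / p[2]                         # pixel coordinates (u, v)

# Illustrative use: a point 5 m in front of a camera at the world origin.
uv = project(np.array([0.1, 0.2, 5.0]), np.eye(3), np.zeros(3),
             f=0.008, ku=125000.0, kv=125000.0, u0=320.0, v0=240.0)
print(uv)
```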

3.2 Epipolar Geometry and Coplanarity Constraint

A 3D measurement cannot be derived from a single image of an unknown scene, because the depth along the Z axis is lost during projection. The principle used to solve this problem is to measure corresponding points acquired from two different viewpoints and reconstruct the 3D coordinates via triangulation. Some of these points are treated as control points [27]. Two problems now need to be solved:

  1. Determination of the image pair orientation (relative and absolute orientations).

  2. Reconstruction of the 3D scene coordinates.

Figure 9.4 shows what is known as epipolar geometry. e_L is the epipole, which is the image of the right camera center in the left camera; e_R is the epipole of the left camera center in the right camera. The plane formed by P and the two camera centers O_L, O_R is the epipolar plane.

Fig. 9.4 Epipolar geometry

This plane intersects the image planes in the epipolar lines l_L, l_R; these lines can be used for matching points. B is the baseline, the distance between the projection centers O_R and O_L. All epipolar lines converge at the epipole. At first sight the matching process seems to require searching across the whole image, but the epipolar restriction reduces this search to a single line [13]. The two projection rays must be coplanar because they intersect in the 3D point P. For any corresponding point P, coplanarity can be expressed by the triple product (9.31):

$$ \left[\begin{array}{ccc}{O}_L{P}_L& {O}_L{O}_R& {O}_R{P}_R\end{array}\right]=0 $$
(9.31)

That is, the three rays lie in one plane. The third dimension is extracted from a pair of images using the coplanarity constraint, given for the uncalibrated camera as follows (9.32):

$$ {p}_R^T\underset{\mathbf{F}}{\underbrace{{\boldsymbol{\upkappa}}_{\mathbf{R}}^{-\mathbf{T}}{\mathbf{R}}_{\mathbf{R}}^{-\mathbf{T}}{\mathbf{S}}_{\mathbf{b}}{\mathbf{R}}_{\mathbf{L}}^{-\mathbf{1}}{\boldsymbol{\upkappa}}_{\mathbf{L}}^{-\mathbf{1}}}}{p}_L=0 $$
(9.32)

where S_b is the skew-symmetric matrix resulting from the triple product given in (9.31), κ_R, κ_L are the right and left calibration matrices, and R_R, R_L are the rotation matrices of the right and left cameras. The fundamental matrix F (3 × 3) sums up everything that can be known about the relationship between two uncalibrated cameras. Using F, it is possible to calculate the positions of the epipoles and the epipolar line in one image associated with a point in the other. An alternative form of the epipolar geometry for the calibrated camera, using the essential matrix E (3 × 3), is given by (9.33):

$$ {p}_R^T\underset{\mathbf{E}}{\underbrace{{\mathbf{R}}_{\mathbf{R}}{\mathbf{S}}_{\mathbf{b}}{\mathbf{R}}_{\mathbf{L}}^T}}{p}_L=0 $$
(9.33)
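
The constraint (9.33) can be verified numerically. The sketch below builds the essential matrix from a known relative rotation and baseline (all values illustrative, with the left camera taken as the reference frame so that R_L = I) and checks that the triple product vanishes for a synthetic point; it is a consistency check, not a pose solver:

```python
import numpy as np

def skew(b):
    """Skew-symmetric matrix S_b such that S_b @ v = b x v."""
    return np.array([[0.0, -b[2], b[1]],
                     [b[2], 0.0, -b[0]],
                     [-b[1], b[0], 0.0]])

theta = np.deg2rad(5.0)                         # small relative rotation
R_R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                [0.0, 1.0, 0.0],
                [-np.sin(theta), 0.0, np.cos(theta)]])
b = np.array([0.3, 0.0, 0.0])                   # baseline in the left frame
E = R_R @ skew(b)                               # essential matrix, Eq. (9.33)

P_L = np.array([0.4, -0.2, 5.0])                # a 3D point in the left frame
P_R = R_R @ (P_L - b)                           # the same point, right frame
p_L, p_R = P_L / P_L[2], P_R / P_R[2]           # normalized image coordinates
print(p_R @ E @ p_L)                            # ~0: coplanarity holds
```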

The depth of the 3D point is calculated using the triangulation principle as shown in Fig. 9.5.

Fig. 9.5 Stereo vision triangulation [13]

The 3D coordinates of a landmark can be computed from two matched points in the left and right images by (9.34):

$$ X={x}_R\frac{B}{p_x},Y=\frac{y_L+{y}_R}{2}\frac{B}{p_x},Z=f\frac{B}{p_x} $$
(9.34)

where x_{R,L}, y_{R,L} are the coordinates of the point of interest in the right and left images, X, Y, Z represent the object coordinates in the world frame, B is the baseline, and f is the focal length. The x-parallax p_x = x_R − x_L is the offset between identical pixels when the two images are mounted on top of each other [13].
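
Equation (9.34) amounts to a few lines of code. The sketch below follows the parallax convention p_x = x_R − x_L used above, with image coordinates and focal length assumed to be in consistent (pixel) units and the numbers purely illustrative:

```python
import numpy as np

def triangulate(xL, yL, xR, yR, B, f):
    """Depth from x-parallax for a parallel-axis stereo rig, Eq. (9.34)."""
    px = xR - xL                                # x-parallax
    if abs(px) < 1e-12:
        raise ValueError("zero parallax: point at infinity")
    return np.array([xR * B / px,               # X
                     0.5 * (yL + yR) * B / px,  # Y
                     f * B / px])               # Z

# A point with 20-pixel parallax, 0.3 m baseline, f = 700 pixels
print(triangulate(xL=120.0, yL=80.0, xR=140.0, yR=80.0, B=0.3, f=700.0))
```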

4 Uncertainties in Stereo Vision System

Uncertainties are always present in the image acquisition and processing steps. Images are corrupted by various types of random noise, such as Gaussian noise, Poisson noise, quantization noise, salt-and-pepper noise, etc. These noises may be introduced by, for example, inaccurate image capturing devices, misaligned lenses, an imprecise focal length, faulty memory locations, etc. [28]. There are two main error categories, namely deterministic and nondeterministic [29]. The camera intrinsic parameter uncertainties are deterministic, since the camera is assumed to be calibrated. In a stereo vision system with parallel optical axes, as shown in Fig. 9.6, the epipolar constraint reduces to checking that both features lie in the same row of the image. Consider the uncertainty in the 3D landmark position due to errors in image quantization and in the feature detection process. Once the matching has been established, the most likely 3D coordinates of the landmark are estimated by projecting the features back into the environment [30]. Referring to Eq. (9.34), the errors in the variables x, y, p_x are usually modeled as uncorrelated zero-mean Gaussian random variables [13]. Using first-order error propagation to approximate the distribution of the variables in (9.34) as a multivariate Gaussian, the following covariance matrix for the X, Y, Z coordinates is obtained (9.35):

$$ \boldsymbol{\Sigma} \approx \mathbf{J}\kern0.5em \operatorname{diag}\left({\sigma}_x^2,{\sigma}_y^2,{\sigma}_{p_x}^2\right){\mathbf{J}}^{\mathbf{T}} $$
(9.35)

where J represents the Jacobian matrix of the functions in (9.34), and \( \left({\sigma}_x^2,{\sigma}_y^2,{\sigma}_{p_x}^2\right) \) are the variances of the corresponding variables.
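
The covariance (9.35) follows from differentiating (9.34). A short sketch with the analytic Jacobian (and illustrative noise variances) is:

```python
import numpy as np

def stereo_covariance(x, y, px, B, f, var_x, var_y, var_px):
    """First-order error propagation, Eq. (9.35): Sigma ~ J diag(var) J^T.

    Uses (X, Y, Z) = (x B/p_x, y B/p_x, f B/p_x), with y = (y_L + y_R)/2.
    """
    # Analytic Jacobian of (X, Y, Z) with respect to (x, y, p_x)
    J = np.array([[B / px, 0.0,    -x * B / px**2],
                  [0.0,    B / px, -y * B / px**2],
                  [0.0,    0.0,    -f * B / px**2]])
    return J @ np.diag([var_x, var_y, var_px]) @ J.T

Sigma = stereo_covariance(x=20.0, y=10.0, px=20.0, B=0.3, f=700.0,
                          var_x=0.25, var_y=0.25, var_px=0.5)
print(np.sqrt(np.diag(Sigma)))                  # sigma_X, sigma_Y, sigma_Z
```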

Fig. 9.6 Stereo vision uncertainty

The theoretical precision of the 3D points depends on the uncertainty of the relative orientation and the uncertainty of the measured corresponding points. Assume that the uncertainty of the relative orientation is negligible. Variance propagation then gives (9.36):

$$ {\sigma}_Z=\frac{Z}{p_x}{\sigma}_{p_x}=\frac{fB}{p_x^2}{\sigma}_{p_x}=\frac{Z^2}{fB}{\sigma}_{p_x} $$
(9.36)

where the Z coordinate is the distance of the point P from the principal plane, B is the baseline, f is the focal length, p_x is the x-parallax, and σ_{p_x} is the standard deviation of the measured x-parallax. The result shows that the depth uncertainty grows with the square of the distance Z from the baseline B for a given point P, as shown in Fig. 9.7.

Fig. 9.7 Stereo vision

5 Examples

5.1 Pose Tracking Using UKF and Stereo Vision

An important improvement for a surgical robot is to extract pose information about the robot's position relative to the patient. The absolute pose of the surgical robot relative to the patient cannot be observed directly, owing to the highly dynamic nature of the operating environment and the uncertainties in the robot kinematics model [31]. To solve this problem, VSLAM is applied using the endoscopic stereo camera to estimate the robot motion. In many applications, the KF framework is used to estimate the motion of the target object from the previous frame to the new frame [32]. In this example, a UKF approach is used to estimate the pose of the robot. Assuming that the feature points are observable throughout the sequence, the formulation of the UKF is as follows:

The state vector X_k is given by (9.37):

$$ {\mathbf{X}}_{\mathbf{k}}={\left[\begin{array}{cccccccccccc}{\mathrm{x}}_{\mathrm{k}}& {\dot{\mathrm{x}}}_{\mathrm{k}}& {\mathrm{y}}_{\mathrm{k}}& {\dot{\mathrm{y}}}_{\mathrm{k}}& {\mathrm{z}}_{\mathrm{k}}& {\dot{\mathrm{z}}}_{\mathrm{k}}& {\upalpha}_{\mathrm{k}}& {\dot{\upalpha}}_{\mathrm{k}}& {\upbeta}_{\mathrm{k}}& {\dot{\upbeta}}_{\mathrm{k}}& {\upvarphi}_{\mathrm{k}}& {\dot{\upvarphi}}_{\mathrm{k}}\end{array}\right]}^{\mathrm{T}} $$
(9.37)

where x_k, y_k, z_k, α_k, β_k, φ_k are the object's position and orientation along the x, y, and z axes, respectively, and \( {\dot{x}}_k,{\dot{y}}_k,{\dot{z}}_k,{\dot{\alpha}}_k,{\dot{\beta}}_k,{\dot{\varphi}}_k \) are the corresponding velocities.

The dynamic system equation is given by (9.38):

$$ {\displaystyle \begin{array}{l}{\mathbf{X}}_{\mathbf{k}}={\mathbf{AX}}_{\mathbf{k}-\mathbf{1}}+{\boldsymbol{\upomega}}_{\mathbf{k}}\\ {}\mathbf{A}=\operatorname{diag}\left[\left[\begin{array}{cc}1& {T}_s\\ {}0& 1\end{array}\right],\dots, \left[\begin{array}{cc}1& {T}_s\\ {}0& 1\end{array}\right]\right]\end{array}} $$
(9.38)

where T_s is the sample time and ω_k is zero-mean Gaussian noise.

The nonlinear measurement model is defined as (9.39):

$$ {\mathbf{z}}_{\mathbf{k}}=h\left({\mathbf{X}}_{\mathbf{k}}\right)+{\boldsymbol{\upnu}}_{\mathbf{k}} $$
(9.39)

where ν_k is a 4m × 1 zero-mean Gaussian noise vector imposed on the captured images, m is the number of feature points extracted from the tracked robot, and h(X_k) is the 4m × 1 point transfer function of the output stereo image pair. The estimated coordinates of the feature points at time step k are given by (9.40):

$$ h\left({\mathbf{X}}_{\mathbf{k}}\right)={\left[\begin{array}{ccccccccccc}{u}_{1,k}^L& {v}_{1,k}^L& \cdots & {u}_{m,k}^L& {v}_{m,k}^L& \cdots & {u}_{1,k}^R& {v}_{1,k}^R& \cdots & {u}_{m,k}^R& {v}_{m,k}^R\end{array}\right]}^T $$
(9.40)

The corresponding image points satisfy the following coplanarity constraint (9.41):

$$ {p}_R^T\mathbf{E}{p}_L=0 $$
(9.41)

The standard perspective projection for a single feature point is given by (9.42):

$$ {\mathbf{z}}_{\mathbf{k}}=\left[\begin{array}{c}{u}_k^L\\ {}{v}_k^L\\ {}{u}_k^R\\ {}{v}_k^R\end{array}\right]=\left[\begin{array}{c}f{x}_k/{z}_k\\ {}f{y}_k/{z}_k\\ {}f\left(B-{x}_k\right)/{z}_k\\ {}f{y}_k/{z}_k\end{array}\right] $$
(9.42)

where \( {\mathbf{z}}_{\mathbf{k}}={\left[\begin{array}{cccc}{u}_k^L& {v}_k^L& {u}_k^R& {v}_k^R\end{array}\right]}^T \) contains the measured pixels in the left and right images, B is the baseline, and f is the focal length. The UKF algorithm can be derived from Table 9.2, given the motion model (9.38) and the measurement model (9.39). As shown in Fig. 9.8, the robot is captured using the stereo camera with a location sensor.
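
For a single feature, the measurement model (9.42) is a short function; stacking m such blocks row by row yields the 4m × 1 vector h(X_k) of (9.40). The sketch below assumes the feature's left-camera-frame coordinates are available from the propagated state:

```python
import numpy as np

def stereo_measurement(p_cam, B, f):
    """Stereo projection of one feature point, Eq. (9.42).

    p_cam : feature position (x, y, z) in the left-camera frame
    Returns the pixel vector [u_L, v_L, u_R, v_R].
    """
    x, y, z = p_cam
    return np.array([f * x / z,
                     f * y / z,
                     f * (B - x) / z,           # right camera offset by baseline
                     f * y / z])
```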

Fig. 9.8 Using UKF and stereo vision for robot tracking

The important features of the robot are then extracted and passed to the deep learning network. The robot pose with respect to the image frame is then matched with the pose predicted by the UKF approach. The algorithm is initialized using the epipolar geometry of the first two images: the essential matrix E is computed using the 5-point algorithm combined with a robust estimator (the deep learning network), and the initial pose parameters are then extracted from E [33]. This initial guess of the pose is used to start the UKF.
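
In practice, the five-point initialization can be carried out with OpenCV. The following sketch is one plausible realization, assuming matched pixel arrays pts_L and pts_R (e.g., produced by the feature network) and an intrinsic matrix K, with RANSAC standing in for the robust estimator:

```python
import cv2

def initial_pose(pts_L, pts_R, K):
    """Initial relative pose from the first two views (5-point + RANSAC).

    pts_L, pts_R : Nx2 float arrays of matched pixel coordinates
    K            : 3x3 camera intrinsic matrix
    Returns (R, t), the rotation and unit-scale translation, as the
    initial guess for the UKF.
    """
    E, inliers = cv2.findEssentialMat(pts_L, pts_R, K,
                                      method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_L, pts_R, K, mask=inliers)
    return R, t
```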

5.2 Localization Approach-Based 2D Landmark Map

Landmarks are natural or artificial environment features that a robot can recognize from its sensory input and use to keep the uncertainty bounded [34]. Consider a 4-WDDMR (four-wheel differential-drive mobile robot) moving in a predefined environment in the global frame {G}, as shown in Fig. 9.9. A stereo vision system is used as an exteroceptive sensor to enable the robot to recognize the landmarks [36]. The predefined landmarks in the global frame {G} help the robot to localize itself correctly. The discrete kinematic equation of the 4-WDDMR is given by (9.43):

$$ {\mathbf{x}}_{\mathbf{k}}=f\left({\mathbf{x}}_{\mathbf{k}-\mathbf{1}},{\mathbf{u}}_{\mathbf{k}}\right)=\left[\begin{array}{c}{x}_{k-1}+{\nu}_k{T}_s\cos \left({\theta}_{k-1}\right)\\ {}{y}_{k-1}+{\nu}_k{T}_s\sin \left({\theta}_{k-1}\right)\\ {}{\theta}_{k-1}+{\omega}_k{T}_s\end{array}\right] $$
(9.43)

where T_s is the sample time and f(x_{k−1}, u_k) is a nonlinear function that relates the 4-WDDMR pose \( {\mathbf{x}}_{\mathbf{k}}={\left[\begin{array}{ccc}{x}_k& {y}_k& {\theta}_k\end{array}\right]}^T \) in the {G} frame to the input vector \( {\mathbf{u}}_{\mathbf{k}}={\left[\begin{array}{cc}{\nu}_k& {\omega}_k\end{array}\right]}^T \), which contains the translational and angular velocities in the robot frame {L}. However, the noisy 4-WDDMR motion model with non-additive Gaussian noise w_k is written as follows (9.44):

$$ {\mathbf{x}}_{\mathbf{k}}=f\left({\mathbf{x}}_{\mathbf{k}-\mathbf{1}},{\mathbf{u}}_{\mathbf{k}},{\mathbf{w}}_{\mathbf{k}}\right) $$
(9.44)
Fig. 9.9 Robot localization based on landmarks [35]

In the authors’ previous work [37], it is assumed that the landmarks have fixed and known positions (m_x, m_y).

The measurement model, giving the location z_k of the landmark m_k seen from the viewpoint of the robot at location x_k, is defined as follows (9.45):

$$ {\displaystyle \begin{array}{c}{\mathbf{z}}_{\mathbf{k}}=h\left({\mathbf{x}}_{\mathbf{k}},{\mathbf{m}}_{\mathbf{k}}\right)=\left[\begin{array}{c}{h}_x\left({\mathbf{x}}_{\mathbf{k}},{\mathbf{m}}_{\mathbf{k}}\right)\\ {}{h}_y\left({\mathbf{x}}_{\mathbf{k}},{\mathbf{m}}_{\mathbf{k}}\right)\end{array}\right]\\ {}=\left[\begin{array}{c}\left({m}_{xk}-{x}_k\right)\cos \left({\theta}_k\right)+\left({m}_{yk}-{y}_k\right)\sin \left({\theta}_k\right)\\ {}-\left({m}_{xk}-{x}_k\right)\sin \left({\theta}_k\right)+\left({m}_{yk}-{y}_k\right)\cos \left({\theta}_k\right)\end{array}\right]\end{array}} $$
(9.45)

where h(x_k, m_k) is the nonlinear measurement function used in the correction stage.
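
Putting (9.43) and (9.45) together, a hedged sketch of the motion and landmark measurement functions, ready to be plugged into the UKF of Table 9.2, is given below (noise terms omitted; the numeric values are illustrative only):

```python
import numpy as np

def motion_model(x, u, Ts):
    """Discrete 4-WDDMR kinematics, Eq. (9.43)."""
    px, py, th = x
    v, w = u                                    # translational, angular velocity
    return np.array([px + v * Ts * np.cos(th),
                     py + v * Ts * np.sin(th),
                     th + w * Ts])

def landmark_measurement(x, m):
    """Landmark position expressed in the robot frame {L}, Eq. (9.45)."""
    px, py, th = x
    mx, my = m
    return np.array([ (mx - px) * np.cos(th) + (my - py) * np.sin(th),
                     -(mx - px) * np.sin(th) + (my - py) * np.cos(th)])

# One prediction step and the expected view of a landmark at (2, 1)
x1 = motion_model(np.array([0.0, 0.0, 0.0]), np.array([0.5, 0.1]), Ts=0.1)
print(x1, landmark_measurement(x1, np.array([2.0, 1.0])))
```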

6 Conclusion

This book chapter has provided modern and advanced strategies for image filtering and image feature extraction. The Kalman filter, including the EKF and the UKF, has been described in detail and implemented for pose tracking and for localization assisted by a 2D landmark map of a mobile robot. The chapter has contributed detailed information on different modern approaches to vision-based localization and mapping for the mobile robot community. It has been pointed out that the UKF algorithm is superior to the EKF in terms of bias cancellation and provides higher accuracy in mobile robot pose estimation.