
With the rapid progress and cost reduction in digital imaging, cameras have become the standard and probably the cheapest sensor on a robot. Unlike positioning (global positioning system – GPS), inertial (IMU), and distance sensors (sonar, laser, infrared), cameras produce the highest bandwidth of data. Exploiting information useful for a robot from such a bit stream is less explicit than in the case of GPS or a laser scanner, but semantically richer. In the years since the first edition of the handbook, we have seen significant advances in hardware and algorithms. RGB-D sensors like the PrimeSense Kinect enabled a new generation of full model reconstruction systems [32.1] with an arbitrary camera motion. Google's project Tango [32.2] established the state of the art in visual odometry using the latest fusion methods between visual and inertial data [32.3]. 3-D modeling became commodity software (see, for example, the 123D Catch app from Autodesk), and the widely used open-source Bundler [32.4] was made possible by advances in wide baseline matching and bundle adjustment. Methods for wide baseline matching have been proposed for several variations of pose and structure from motion [32.5]. Last, the problem of local minima for nonminimal overconstrained solvers has been addressed by a group of methods using branch-and-bound global optimization of a sum of fractions subject to convex constraints [32.6] or an $L_\infty$-norm of the error function [32.7].

Let us consider the two main robot perception domains: navigation and grasping. Assume, for example, the scenario that a robot vehicle is given the task of going from place A to place B, given as instruction only intermediate visual landmarks and/or GPS waypoints. The robot starts at A and has to decide where a drivable path lies. Such a decision can be accomplished through the detection of obstacles from at least two images by estimating a depth or occupancy map with a stereo algorithm. While driving, the robot wants to estimate its trajectory, which can be accomplished with a matching and structure from motion algorithm. The resulting trajectory can be used to build a layout of the environment through dense matching and triangulation, which in turn can be used as a reference for subsequent pose estimation. At each time instance the robot has to parse the surrounding environment for risks like pedestrians, or for objects it is searching for, like a trash can. It has to become aware of loop closing, or of a reentry if the robot has been kidnapped or blind for a while. This can be accomplished through object and scene recognition yielding the what and where of objects around the robot. In an extreme scenario, a vehicle can be left to explore a city and build a semantic 3-D map as well as a trajectory of all places it visited, the ultimate visual simultaneous localization and semantic mapping problem. In the case of grasping, the robot detects an object given a learnt representation, and subsequently it has to estimate the pose of the object and, in some cases, its shape by triangulation. When a camera is not mounted on an end-effector, the absolute orientation between the hand and the object has to be found.

In the next section we present the geometric foundations for 3-D vision, and in the last section we describe approaches for grasping.

1 Geometric Vision

Let us start by introducing the projection of the world to an image plane. Assume that a point in the world $(X, Y, Z)$ has coordinates $(X_{c_i}, Y_{c_i}, Z_{c_i})$ with respect to the coordinate system of a camera $c_i$, related to each other by the following transformation

$$\begin{pmatrix} X_{c_i} \\ Y_{c_i} \\ Z_{c_i} \end{pmatrix} = R_i \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} + T_i ,$$
(32.1)

where $R_i$ is a rotation matrix whose columns are the world axes with respect to the camera. The translation vector $T_i$ starts at the origin of the camera and ends at the origin of the world coordinate system. The rotation matrix is orthogonal, $R^T R = I$, with determinant one. We assume that the center of projection is the origin of the coordinate system and that the optical axis is the $Z_{c_i}$ axis of the camera. If we assume that the image plane is the plane $Z_{c_i} = 1$, then the image coordinates $(x_i, y_i)$ read

$$x_i = \frac{X_{c_i}}{Z_{c_i}} , \quad y_i = \frac{Y_{c_i}}{Z_{c_i}} .$$
(32.2)

In practice, what we measure are the pixel coordinates $(u_i, v_i)$ in the image, which are related to the image coordinates $(x_i, y_i)$ by the affine transformation

$$u_i = f \alpha x_i + \beta y_i + c_u , \quad v_i = f y_i + c_v ,$$
(32.3)

where f is the distance of the image plane from the projection center, measured in pixels. It is also called the focal length, because the two are approximately equal. The aspect ratio α is a scaling induced by nonsquare sensor cells or by different sampling rates horizontally and vertically. The skew factor β accounts for a shearing induced by a nonperfectly frontal image plane. The image center $(c_u, c_v)$ is the point of intersection of the image plane with the optical axis. These five parameters are called intrinsic parameters, and the process of recovering them is called intrinsic calibration. Upon recovering them we can talk about a calibrated system and can work with the image coordinates $(x_i, y_i)$ instead of the pixel coordinates $(u_i, v_i)$. In many vision systems, in particular on mobile robots, wide-angle lenses introduce a radial distortion around the image center which can be modeled polynomially

$$x_i^{\rm dist} = x_i \left( 1 + k_1 r + k_2 r^2 + k_3 r^3 + \cdots \right) , \quad y_i^{\rm dist} = y_i \left( 1 + k_1 r + k_2 r^2 + k_3 r^3 + \cdots \right) , \quad \text{where } r^2 = x_i^2 + y_i^2 ,$$

where we temporarily assumed that the image center is at (0,0). The image coordinates $(x_i, y_i)$ in (32.3) then have to be replaced with the distorted coordinates $(x_i^{\rm dist}, y_i^{\rm dist})$.
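
As an illustration of (32.2), (32.3), and the distortion model, the following minimal sketch projects a point given in camera coordinates to pixel coordinates; the function name, the choice of three distortion coefficients, and the numerical values are ours, and the principal point is assumed to coincide with the distortion center.

```python
import numpy as np

def project_point(X_c, f, alpha, beta, c_u, c_v, k=(0.0, 0.0, 0.0)):
    """Project a point given in camera coordinates to pixel coordinates,
    following (32.2), (32.3) and the polynomial radial distortion model;
    k = (k1, k2, k3) are the distortion coefficients."""
    Xc, Yc, Zc = X_c
    # Perspective division onto the plane Z_c = 1, (32.2)
    x, y = Xc / Zc, Yc / Zc
    # Radial distortion around the (assumed centered) principal point
    r = np.hypot(x, y)
    d = 1.0 + k[0] * r + k[1] * r**2 + k[2] * r**3
    x_d, y_d = x * d, y * d
    # Affine mapping to pixel coordinates, (32.3)
    u = f * alpha * x_d + beta * y_d + c_u
    v = f * y_d + c_v
    return u, v

# Example: a point one meter in front of the camera, slightly off-axis
print(project_point((0.1, 0.05, 1.0), f=800, alpha=1.0, beta=0.0, c_u=320, c_v=240))
```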

1.1 Calibration

Recovering the intrinsic parameters from multiple views of a reference pattern, such as a checkerboard, under constant intrinsics has become a standard procedure using tools like the MATLAB calibration toolbox or Zhang's OpenCV calibration function [32.8]. When intrinsics like the focal length vary during operation and viewing reference patterns is not practically feasible, we rely on the state-of-the-art method by Pollefeys et al. [32.10, 32.9]. When all intrinsics are unknown, one relies on the Kruppa equations and several stratified self-calibration approaches [32.11, 32.12], which require at least three views. Apart from radial distortion, the projection relations shown above can be summarized in matrix form. By denoting

u i = ( u i , v i , 1 )

and

X = ( X , Y , Z , 1 )

we obtain

$$\lambda_i \mathbf{u}_i = K_i \begin{pmatrix} R_i & T_i \end{pmatrix} \mathbf{X} = P \mathbf{X} ,$$
(32.4)

where $\lambda_i = Z_{c_i}$ is the depth of the point $\mathbf{X}$ in camera coordinates and P is the 3 × 4 projection matrix. The depth $\lambda_i$ can be eliminated to obtain two equations relating the world to the pixel coordinates.
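
The following minimal sketch assembles P from assumed intrinsics and extrinsics and eliminates the depth as described above; the function names and numerical values are illustrative.

```python
import numpy as np

def projection_matrix(K, R, T):
    """Assemble the 3x4 projection matrix P = K [R | T] of (32.4)."""
    return K @ np.hstack([R, T.reshape(3, 1)])

def project(P, X):
    """Project a world point X = (X, Y, Z); dividing by lambda = Z_c
    eliminates the depth and yields the two pixel equations."""
    lam_u = P @ np.append(X, 1.0)      # lambda * (u, v, 1)
    return lam_u[:2] / lam_u[2]        # pixel coordinates (u, v)

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R, T = np.eye(3), np.array([0.1, 0.0, 0.0])
print(project(projection_matrix(K, R, T), np.array([0.5, 0.2, 2.0])))
```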

1.2 Pose Estimation or PnP

When we have landmarks in the world with known positions $\mathbf{X}$ and we can measure their projections, the problem of recovering the unknown rotation and translation in the calibrated case is called pose estimation or the Perspective-n-Point (PnP) problem. Of course, it presumes the identification of the world points in the image. In robotics, pose estimation is a variant of the localization problem in a known environment. When grasping objects of known shape, PnP yields the target pose for an end-effector modulo the grasping point positions. We assume that a camera is calibrated and that measurements of N points are given in world coordinates $\mathbf{X}_{j=1..N}$ and calibrated image coordinates $\mathbf{x}_{j=1..N}$. Let us assume two scene points and denote the known angle between their projections $\mathbf{x}_1$ and $\mathbf{x}_2$ as $\delta_{12}$ (Fig. 32.1). Let us denote the squared distance $\|\mathbf{X}_i - \mathbf{X}_j\|^2$ by $d_{ij}^2$ and the squared length of $\mathbf{X}_j$ by $d_j^2$. Then the cosine law reads

$$d_i^2 + d_j^2 - 2 d_i d_j \cos \delta_{ij} = d_{ij}^2 .$$
(32.5)

If we can recover $d_i$ and $d_j$, the rest is an absolute orientation problem

d j x j = R X j + T
(32.6)

to recover translation and rotation between camera and world coordinate system.

Fig. 32.1 Pose estimation problem: a camera seeing three points at unknown distances $d_1$, $d_2$, and $d_3$, with known angles between the rays and known point distances $d_{12}$, $d_{13}$, $d_{23}$

1.2.1 Minimal Solution

The cosine law has two unknowns, $d_1$ and $d_2$, so with three points we should be able to solve the pose estimation problem. Indeed, three points yield a system of three quadratic equations in three unknowns, so it has a maximum of eight solutions.

We follow here the analysis of the classic solution in [32.13]: set $d_2 = u d_1$ and $d_3 = v d_1$ and solve all three equations for $d_1^2$,

$$d_1^2 = \frac{d_{23}^2}{u^2 + v^2 - 2 u v \cos \delta_{23}} , \quad d_1^2 = \frac{d_{13}^2}{1 + v^2 - 2 v \cos \delta_{13}} , \quad d_1^2 = \frac{d_{12}^2}{u^2 + 1 - 2 u \cos \delta_{12}} ,$$

which is equivalent to two quadratic equations in u and v

d 12 2 ( 1 + v 2 - 2 v cos δ 13 ) = d 13 2 ( u 2 + 1 - 2 u cos δ 12 ) ,
(32.7)
d 13 2 ( u 2 + v 2 - 2 u v cos δ 23 ) = d 23 2 ( 1 + v 2 - 2 v cos δ 13 ) .
(32.8)

Solving (32.8) for $u^2$ and substituting into (32.7) allows solving (32.7) for u, because u then appears only linearly. Substituting u back into (32.8) yields a quartic in v which can have as many as four real roots. For each v we obtain two roots for u through any of the quadratic equations, yielding a maximum of eight solutions [32.13, 32.14]. Popular pose estimation algorithms are based either on an iterative method [32.15, 32.16] or on linear versions using auxiliary unknowns of higher dimension [32.17, 32.18].
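
To make the elimination concrete, the following sketch forms the two quadratics (32.7) and (32.8) symbolically, eliminates u via a resultant (an alternative to the substitution described above), and reads off the roots of the resulting quartic in v; all names and the synthetic test data are ours.

```python
import numpy as np
import sympy as sp

def p3p_quartic(d12, d13, d23, c12, c13, c23):
    """Eliminate u from (32.7)-(32.8) (with d2 = u d1, d3 = v d1 and
    c_ij = cos(delta_ij)) and return the quartic polynomial in v."""
    u, v = sp.symbols('u v', real=True)
    eq7 = d12**2 * (1 + v**2 - 2*v*c13) - d13**2 * (u**2 + 1 - 2*u*c12)
    eq8 = d13**2 * (u**2 + v**2 - 2*u*v*c23) - d23**2 * (1 + v**2 - 2*v*c13)
    return sp.Poly(sp.expand(sp.resultant(eq7, eq8, u)), v)

# Synthetic example: three points given in the camera frame (camera at origin)
Xc = np.array([[1.0, 0.0, 4.0], [0.0, 1.0, 5.0], [-1.0, 0.5, 4.5]])
d = np.linalg.norm(Xc, axis=1)                 # ground-truth distances d1, d2, d3
rays = Xc / d[:, None]                         # unit bearing vectors
c12, c13, c23 = rays[0] @ rays[1], rays[0] @ rays[2], rays[1] @ rays[2]
d12, d13, d23 = (np.linalg.norm(Xc[0] - Xc[1]),
                 np.linalg.norm(Xc[0] - Xc[2]),
                 np.linalg.norm(Xc[1] - Xc[2]))

quartic = p3p_quartic(d12, d13, d23, c12, c13, c23)
roots = np.roots([float(a) for a in quartic.all_coeffs()])
real_v = roots[np.abs(roots.imag) < 1e-8].real
print('candidate v (true d3/d1 = %.4f):' % (d[2] / d[0]), real_v)
```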

A more recent method [32.19] for n world points expresses the 3-D points in barycentric coordinates with respect to four virtual control points

$$\mathbf{X}_i = \sum_{j=1}^{4} \alpha_{ij} \mathbf{C}_j , \quad \text{where } \sum_{j=1}^{4} \alpha_{ij} = 1 .$$

A rigid transformation to the camera coordinate system leaves the barycentric coordinates invariant and a perspective projection yields

$$\lambda_i \mathbf{x}_i = \sum_{j=1}^{4} \alpha_{ij} \left( C^{c}_{x,j} ,\; C^{c}_{y,j} ,\; C^{c}_{z,j} \right)^T .$$

Eliminating λi yields two linear equations for each point

$$\sum_{j=1}^{4} \alpha_{ij} C^{c}_{x,j} = x_i \sum_{j=1}^{4} \alpha_{ij} C^{c}_{z,j} , \qquad \sum_{j=1}^{4} \alpha_{ij} C^{c}_{y,j} = y_i \sum_{j=1}^{4} \alpha_{ij} C^{c}_{z,j} ,$$

with the coordinate triples of the control points in the camera frame being the 12 unknowns. This is a linear homogeneous system whose solution is the nullspace of a $2n \times 12$ matrix. The unknown control points are found up to a scale factor, which is easily fixed because we know the inter-point distances. The pose is found from the absolute orientation between the control points in the camera and the world frame. This yields a very efficient solution for $n \geq 6$ points, but leaves the initial choice of the control points as a factor affecting the solution.
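
The linear core of this control-point method can be sketched as follows; the barycentric-coordinate helper, the matrix layout, and the function names are ours, and the subsequent scale fixing from inter-point distances and the absolute orientation step are omitted.

```python
import numpy as np

def barycentric_coords(Xw, C):
    """alpha (n x 4) with Xw[i] = sum_j alpha[i, j] C[j] and sum_j alpha[i, j] = 1,
    for four non-coplanar control points C (4 x 3)."""
    A = np.vstack([C.T, np.ones((1, 4))])                # 4 x 4
    b = np.vstack([Xw.T, np.ones((1, Xw.shape[0]))])     # 4 x n
    return np.linalg.solve(A, b).T                       # n x 4

def control_points_in_camera_frame(x_img, alpha):
    """Build the 2n x 12 homogeneous system in the camera-frame control-point
    coordinates and return its null vector (one-dimensional for n >= 6 points
    in general position); the result is defined up to scale and sign."""
    n = x_img.shape[0]
    M = np.zeros((2 * n, 12))
    for i in range(n):
        xi, yi = x_img[i]
        for j in range(4):
            a = alpha[i, j]
            # sum_j a_ij Cx_j - x_i sum_j a_ij Cz_j = 0   (and likewise for y)
            M[2 * i,     3 * j:3 * j + 3] = [a, 0.0, -a * xi]
            M[2 * i + 1, 3 * j:3 * j + 3] = [0.0, a, -a * yi]
    _, _, Vt = np.linalg.svd(M)
    return Vt[-1].reshape(4, 3)
```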

In case that $n \geq 4$ points lie on a plane, we can compute the homography H between the world and the camera plane [32.8]. Assuming that Z = 0 is the world plane, the homography reads

$$\begin{pmatrix} u \\ v \\ w \end{pmatrix} \approx \underbrace{K \begin{pmatrix} r_1 & r_2 & T \end{pmatrix}}_{H} \begin{pmatrix} X \\ Y \\ W \end{pmatrix} ,$$

where $r_{1,2}$ are the first two columns of the rotation matrix and $\approx$ denotes projective equivalence, namely, for any two points $\mathbf{p}$ and $\mathbf{p}'$ in the projective plane, $\mathbf{p} \approx \mathbf{p}'$ iff $\mathbf{p} = \lambda \mathbf{p}'$ for some real $\lambda \neq 0$. Hence, the first two columns of $K^{-1} H$,

$$K^{-1} H = \begin{pmatrix} h_1 & h_2 & h_3 \end{pmatrix} ,$$

have to be orthogonal. We thus seek the orthogonal matrix R that is closest to $\begin{pmatrix} h_1 & h_2 & h_1 \times h_2 \end{pmatrix}$,

$$\arg \min_{R \in SO(3)} \left\| R - \begin{pmatrix} h_1 & h_2 & h_1 \times h_2 \end{pmatrix} \right\|_F^2 .$$

If the singular value decomposition (SVD) of

$$\begin{pmatrix} h_1 & h_2 & h_1 \times h_2 \end{pmatrix} = U S V^T ,$$

then the solution is [32.20]

$$R = U \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \det ( U V^T ) \end{pmatrix} V^T .$$
(32.9)

The diagonal matrix is a projection from the orthogonal group  O ( 3 ) to the special orthogonal group  SO ( 3 ) .
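
A compact sketch of the projection (32.9) and of its use in the planar case is given below; rotation_from_planar_homography is a hypothetical helper of ours, and the sign of the homography is assumed to be such that the reconstructed plane lies in front of the camera.

```python
import numpy as np

def closest_rotation(M):
    """Project a 3x3 matrix onto SO(3) as in (32.9):
    R = U diag(1, 1, det(U V^T)) V^T, where M = U S V^T."""
    U, _, Vt = np.linalg.svd(M)
    return U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt

def rotation_from_planar_homography(K, H):
    """Recover R from a world-plane-to-image homography H = K (r1 r2 T):
    normalize the scale, append h1 x h2, and project onto SO(3)."""
    A = np.linalg.inv(K) @ H
    A = A / np.linalg.norm(A[:, 0])           # fix the unknown projective scale
    h1, h2 = A[:, 0], A[:, 1]
    return closest_rotation(np.column_stack([h1, h2, np.cross(h1, h2)]))
```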

Last, we present a method [32.21] for n points that computes all local minima of the overconstrained PnP problem. This involves solving the equations obtained by setting the first derivatives with respect to the pose unknowns to zero. To achieve this, the following observation allows the elimination of the depths λ and of the translation. The rigid transformation $\lambda \mathbf{x} = R \mathbf{X} + T$ can be written for n points as a linear system in the depths $\lambda_{j=1..n}$ and the translation T,

$$\begin{pmatrix} \mathbf{x}_1 & & & -I \\ & \ddots & & \vdots \\ & & \mathbf{x}_n & -I \end{pmatrix} \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_n \\ T \end{pmatrix} = \begin{pmatrix} R \mathbf{X}_1 \\ \vdots \\ R \mathbf{X}_n \end{pmatrix} .$$

We can solve for the unknown depths-translation vector and back-substitute it into a least-squares minimization problem with respect to the rotation parameters. It turns out that if we use the three Rodrigues parameters as the rotation parametrization, the necessary conditions for an extremum (vanishing derivatives) are three cubic equations [32.21]. Last, we would like to point out to the reader that the minimization of a nonlinear function of the rotation matrix can also be solved as an optimization problem on the Lie group SO(3) [32.22, 32.23, 32.24], in those works for the case of line correspondences.

1.3 Triangulation

When we know both the intrinsics and the extrinsics, or their summarization in the matrix P, and we measure a point, we cannot recover its depth from just one camera position. Assuming that we have the projection of the same point $\mathbf{X}$ in two cameras,

$$\lambda_1 \mathbf{u}_1 = P_1 \mathbf{X} , \quad \lambda_2 \mathbf{u}_2 = P_2 \mathbf{X} ,$$
(32.10)

with known projection matrices $P_1$ and $P_2$, we can recover the position $\mathbf{X}$ in space, a process well known as triangulation. Observe that we can achieve triangulation without decomposing the projection matrices into intrinsic and extrinsic parameters; we need, though, to remove the distortion in order to write them as above.

Having correspondences of the same point in two cameras with known projection matrices $P_l$ and $P_r$, we can solve the two projection equations for the world point $\mathbf{X}$. It is worth noting that each point provides two independent equations, so that triangulation becomes an overconstrained problem for two views. This is not a contradiction, since two rays do not in general intersect in space unless they satisfy the epipolar constraint presented in the next paragraph. The matrix on the left-hand side of the following equation has in general rank 4, unless the epipolar constraint is satisfied, in which case it has rank 3.

$$\begin{pmatrix} x_l P_l ( 3 , : ) - P_l ( 1 , : ) \\ y_l P_l ( 3 , : ) - P_l ( 2 , : ) \\ x_r P_r ( 3 , : ) - P_r ( 1 , : ) \\ y_r P_r ( 3 , : ) - P_r ( 2 , : ) \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = 0 ,$$
(32.11)

where  P ( i , : ) means the i-th row of matrix P.

Obviously, the homogeneous system above can be transformed into an inhomogeneous linear system with unknowns $(X, Y, Z)$. Otherwise, it can be solved by finding the vector closest to the nullspace of the 4 × 4 matrix above using the SVD. A thorough treatment of triangulation can be found in the classic paper [32.25].
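
A direct implementation of the homogeneous formulation (32.11) is sketched below; the function and variable names are ours.

```python
import numpy as np

def triangulate(Pl, Pr, xl, xr):
    """Linear triangulation of (32.11): xl = (x_l, y_l) and xr = (x_r, y_r) are
    the measured coordinates in the left and right view, Pl and Pr the 3x4
    projection matrices.  Returns the point (X, Y, Z)."""
    A = np.vstack([
        xl[0] * Pl[2] - Pl[0],
        xl[1] * Pl[2] - Pl[1],
        xr[0] * Pr[2] - Pr[0],
        xr[1] * Pr[2] - Pr[1],
    ])
    _, _, Vt = np.linalg.svd(A)       # vector closest to the nullspace of A
    Xh = Vt[-1]
    return Xh[:3] / Xh[3]             # dehomogenize
```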

1.4 Moving Stereo

Imagine now that a rigid stereo system consisting of cameras cl (left) and cr (right)

u l i P l X i ,
(32.12)
u r i P r X i ,
(32.13)

is attached to a moving robot, and let us observe this system at two time instances,

X 0 = R 1 X 1 + T 1 ,
(32.14)

where $\mathbf{X}_0$ are point coordinates with respect to the world coordinate system, usually assumed aligned with one of the camera instances, and $\mathbf{X}_1$ are the coordinates of the same point with respect to the camera rig after a motion $(R_1, T_1)$. To estimate the motion of the rig, we have to solve two correspondence problems: first, between the left and right image, and second, between the left (or right) image at the first time instance and the left (or right, respectively) image at the second time instance. The left-to-right correspondences enable the solution of the triangulation problem at each time instance. Motion can then be obtained by solving equations (32.14) for $(R_1, T_1)$, a problem called absolute orientation. Alternatively, one can avoid the second triangulation and solve the pose estimation problem between triangulated points in 3-D and points in the left image only. The most popular visual odometry system today is libviso [32.26], which is based on a moving stereo rig.

1.4.1 Absolute Orientation

The treatment of moving stereo here will be short, and the reader is referred to the similar treatment in the chapter about range sensing. We assume that correspondences between the two time instances have been established based on tracking in the images, so that we can formulate equations of the form

X 2 = R X 1 + T .

The standard way [32.20, 32.27] to solve this problem is to eliminate the translation by subtracting the centroids yielding

$$\mathbf{X}_2 - \bar{\mathbf{X}}_2 = R \left( \mathbf{X}_1 - \bar{\mathbf{X}}_1 \right) .$$

We need at least three points in total to obtain at least two noncollinear mean-free $\mathbf{X} - \bar{\mathbf{X}}$ vectors. If we concatenate the mean-free vectors for n points into $3 \times n$ matrices $A_{1,2}$, we can formulate the following minimization of the Frobenius norm

$$\min_{R \in SO(3)} \left\| A_2 - R A_1 \right\|_F ,$$

which is known as the Procrustes problem. It can be shown [32.20] that the solution is obtained through SVD as in (32.9) where U, V are obtained from the singular value decomposition

A 2 A 1 T = U S V T .

Solutions are usually obtained with RANSAC by sampling triples of points and verification with the Procrustes method.
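
A minimal sketch of the centroid-plus-Procrustes solution is given below, reusing the SO(3) projection of (32.9); the names are ours and the RANSAC loop around it is omitted.

```python
import numpy as np

def absolute_orientation(X1, X2):
    """Least-squares (R, T) such that X2 ~ R X1 + T for two n x 3 arrays of
    corresponding points: subtract centroids, solve the Procrustes problem via
    the SVD of A2 A1^T, and recover the translation from the centroids."""
    c1, c2 = X1.mean(axis=0), X2.mean(axis=0)
    A1, A2 = (X1 - c1).T, (X2 - c2).T                    # 3 x n mean-free matrices
    U, _, Vt = np.linalg.svd(A2 @ A1.T)
    R = U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt
    T = c2 - R @ c1
    return R, T
```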

1.5 Structure from Motion

Relax now the assumption that the projection matrices are known, and remain only with measuring and matching the corresponding points $\mathbf{u}_1$ and $\mathbf{u}_2$. This is the well-known structure from motion problem or, more precisely, structure and 3-D motion from 2-D motion. In photogrammetry it is well known as the relative orientation problem. Even after eliminating the λ's from equations (32.12), or by writing them in projective equivalence form,

$$\mathbf{u}_1 \approx P_1 \mathbf{X} , \quad \mathbf{u}_2 \approx P_2 \mathbf{X} ,$$
(32.15)

we realize that if $(\mathbf{X}, P_1, P_2)$ is a solution, then $(H\mathbf{X}, P_1 H^{-1}, P_2 H^{-1})$ is a solution, too, where H is an invertible 4 × 4 real matrix, in other words a collineation in $\mathbb{P}^3$. Even if we align the world coordinate system with the coordinate system of the first camera, which is common practice,

$$\mathbf{u}_1 \approx \begin{pmatrix} I & 0 \end{pmatrix} \mathbf{X} , \quad \mathbf{u}_2 \approx P_2 \mathbf{X} ,$$
(32.16)

we remain with the same ambiguity where H is of the form

$$H \approx \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ h_{41} & h_{42} & h_{43} & h_{44} \end{pmatrix} ,$$
(32.17)

with $h_{44} \neq 0$. This ambiguity is possible when the projection matrices are arbitrary rank-3 real matrices without any constraint on their elements. If we assume that we have calibrated our cameras, then the projection matrices depend only on displacements,

$$\mathbf{u}_1 \approx \begin{pmatrix} I & 0 \end{pmatrix} \mathbf{X} , \quad \mathbf{u}_2 \approx \begin{pmatrix} R & T \end{pmatrix} \mathbf{X} ,$$
(32.18)

and the only remaining ambiguity is the scale ambiguity, where H looks like an identity matrix except for $h_{44} = s \neq 1$ being the scale factor. In other words, if $(R, T, \mathbf{X})$ is a solution, then $(R, sT, \frac{1}{s}\mathbf{X})$ is a solution, too. These remarks generalize to multiple views. Because in robotics the $(R, T)$ matrices correspond to localization and $\mathbf{X}$ to mapping of the environment, the problem has the more proper term SLAM: simultaneous localization and mapping. However, because the term SLAM has been used with a variety of sensors like sonar and laser range scanners, the term monocular SLAM is better suited to describe structure from motion from multiple views [32.28].

1.5.1 Epipolar Geometry

This is probably one of the most studied problems in computer vision. We constrain ourselves to the calibrated case which is most relevant to robotics applications. The necessary and sufficient condition for the intersection of the two rays  R x 1 and  x 2 is that the two rays are coplanar with the baseline T

x 2 T ( T × R x 1 ) = 0 ,
(32.19)

which is the epipolar constraint (Fig. 32.2). To avoid the scale ambiguity we assume that T is a unit vector. We proceed by summarizing the unknowns into one matrix

E = T ^ R
(32.20)

where T ^ is the 3 × 3 skew-symmetric matrix to the vector T. The E matrix is called the essential matrix. The epipolar constraint reads then

x 2 T E x 1 = 0 ,
(32.21)

which is the equation of a line in the $\mathbf{x}_2$ plane with coefficients $E\mathbf{x}_1$, or of a line in the $\mathbf{x}_1$ plane with coefficients $E^T\mathbf{x}_2$. These lines are called epipolar lines and form pencils whose centers are the epipoles $\mathbf{e}_1$ and $\mathbf{e}_2$ in the first and second image plane, respectively. The epipoles are the intersections of the baseline with the two image planes, hence $\mathbf{e}_2 \approx T$ and $\mathbf{e}_1 \approx -R^T T$. Looking at the equations of the epipolar lines, we can immediately infer that $E \mathbf{e}_1 = 0$ and $E^T \mathbf{e}_2 = 0$.

Fig. 32.2 A point is perspectively projected to calibrated image vectors $R\mathbf{x}_1$ and $\mathbf{x}_2$, which are coplanar with the baseline T
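
The construction of the essential matrix and the relations derived above can be checked numerically with a few lines; the synthetic rotation, translation, and point below are ours.

```python
import numpy as np

def hat(t):
    """Skew-symmetric matrix such that hat(t) @ x = np.cross(t, x)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

theta = 0.1                                   # small rotation about the z-axis
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
T = np.array([1.0, 0.2, 0.0])
E = hat(T / np.linalg.norm(T)) @ R            # essential matrix (32.20)

X1 = np.array([0.3, -0.1, 4.0])               # point in the first camera frame
x1 = X1 / X1[2]                               # calibrated image coordinates
X2 = R @ X1 + T
x2 = X2 / X2[2]

print('epipolar residual (32.21):', x2 @ E @ x1)              # ~ 0
print('E e1 =', E @ (-R.T @ T), '  E^T e2 =', E.T @ T)         # both ~ 0
```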

The set of all essential matrices

$$\mathcal{E} = \left\{ E \in \mathbb{R}^{3 \times 3} \mid E = \hat{T} R , \ \text{where } T \in S^2 \text{ and } R \in SO(3) \right\}$$

has been characterized as a manifold of dimension 5 [32.29]. It has been proven [32.30] that

Proposition 32.1

A matrix $E \in \mathbb{R}^{3 \times 3}$ is essential if and only if it has two singular values equal to each other and the third singular value equal to zero.

We present here Nister's method [32.31] for recovering an essential matrix from five point correspondences, which has gained popularity because of its suitability for RANSAC methods.

1.5.2 Minimal Case

We expand the epipolar constraint in terms of homogeneous coordinates  x 1 = ( x 1 , y 1 , z 1 ) and  x 2 = ( x 2 , y 2 , z 2 ) (when the points are not at infinity z i = 1 ) and obtain

$$\begin{pmatrix} x_2 \mathbf{x}_1^T & y_2 \mathbf{x}_1^T & z_2 \mathbf{x}_1^T \end{pmatrix} E^s = 0 ,$$
(32.22)

where $E^s$ is the row-by-row stacked version of the matrix E. When we use only five point correspondences, the resulting linear homogeneous system has as a solution any vector in the four-dimensional kernel of the data matrix

E s = λ 1 u 1 + λ 2 u 2 + λ 3 u 3 + λ 4 u 4 .
(32.23)

At this point we want the matrix E resulting from Es to be an essential matrix satisfying Proposition 32.1. It has been proven [32.30] that

Proposition 32.2

A matrix  E R 3 × 3 is essential if and only if

$$E E^T E = \frac{1}{2} \operatorname{trace} \left( E E^T \right) E .$$
(32.24)

Though the $\det(E) = 0$ constraint can be inferred from (32.24), we still use it together with (32.24) to obtain ten cubic equations in the elements of E. As described in [32.31], one can then obtain a tenth-degree polynomial in $\lambda_4$. The number of real roots of this polynomial is computed with a Sturm sequence. There is no proof, beyond the physical plausibility of the existence of at least one solution, that a real root will exist at all. Several alternative five-point solvers have been proposed since Nister's paper [32.32, 32.33, 32.34, 32.35], and an extensive list including code has been compiled by Pajdla's group [32.36].
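
The linear part of the five-point method, namely the construction of the 5 × 9 data matrix of (32.22) and of the four-dimensional nullspace basis of (32.23), is easily sketched; enforcing the cubic constraints (32.24) on the linear combination is the involved part and is left to [32.31] or to existing solver implementations. The names below are ours.

```python
import numpy as np

def five_point_nullspace(x1, x2):
    """x1, x2: arrays of five corresponding calibrated homogeneous image points
    (5 x 3 each).  Returns four 3x3 matrices E_1..E_4 spanning the nullspace of
    the data matrix, so that E = sum_k lambda_k E_k as in (32.23)."""
    A = np.zeros((5, 9))
    for i in range(5):
        # coefficient of E_jk in x2^T E x1 = 0 is x2[i, j] * x1[i, k]
        A[i] = np.outer(x2[i], x1[i]).reshape(-1)
    _, _, Vt = np.linalg.svd(A)
    return [Vt[k].reshape(3, 3) for k in range(5, 9)]
```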

Assuming that we have recovered an essential matrix from point correspondences, the next task is to recover an orthogonal matrix R and a unit translation vector T from the essential matrix. If $E = U \operatorname{diag}(\sigma, \sigma, 0) V^T$, there are four solutions for the pair $(\hat{T}, R)$:

$$\begin{aligned} ( \hat{T}_1 , R_1 ) &= \left( U R_{z,+\pi/2} \Sigma U^T ,\ U R_{z,+\pi/2}^T V^T \right) , \\ ( \hat{T}_2 , R_2 ) &= \left( U R_{z,-\pi/2} \Sigma U^T ,\ U R_{z,-\pi/2}^T V^T \right) , \\ ( \hat{T}_1 , R_2 ) &= \left( U R_{z,+\pi/2} \Sigma U^T ,\ U R_{z,-\pi/2}^T V^T \right) , \\ ( \hat{T}_2 , R_1 ) &= \left( U R_{z,-\pi/2} \Sigma U^T ,\ U R_{z,+\pi/2}^T V^T \right) , \end{aligned}$$

where Rz denotes rotation around the z-axis. The four solutions can be split into two two-fold ambiguities:

  • Mirror ambiguity: If T is a solution, then −T is a solution, too. There is no way to disambiguate from the epipolar constraint: x 2 T ( ( - T ) × R x 1 ) = 0 .

  • Twisted pair ambiguity: If R is a solution, then $R_{T,\pi} R$ is a solution, too, where $R_{T,\pi}$ denotes a rotation by 180° around the baseline T. The first image is twisted around the baseline by 180 degrees.

These ambiguities are resolved by checking if depths of triangulated points are positive.
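
The disambiguation can be sketched as follows: enumerate the four factorizations of E and keep the one for which a triangulated correspondence has positive depth in both cameras. The single-point cheirality test and all names are ours; a robust implementation would vote over many correspondences.

```python
import numpy as np

def decompose_essential(E, x1, x2):
    """Return the (R, T) among the four factorizations of the essential matrix E
    for which the correspondence (x1, x2), given as calibrated homogeneous
    vectors, triangulates to positive depths in both cameras."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # R_z(+pi/2)
    for R in (U @ W @ Vt, U @ W.T @ Vt):
        for T in (U[:, 2], -U[:, 2]):
            # lambda_2 x2 = R (lambda_1 x1) + T  =>  [R x1, -x2] [l1, l2]^T = -T
            A = np.column_stack([R @ x1, -x2])
            lam, *_ = np.linalg.lstsq(A, -T, rcond=None)
            if lam[0] > 0 and lam[1] > 0:
                return R, T
    return None
```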

1.5.3 Critical Ambiguities

The approach with five point correspondences has a finite number of feasible solutions (feasible meaning that they may produce multiple interpretations of structures in front of the camera) when the points in the scene lie on a plane (a two-fold ambiguity) [32.37], or when the points in the scene and the camera centers lie on a double-sheet hyperboloid with the additional constraint that the camera centers lie symmetrically to the main generator of the hyperboloid [32.38]. These are inherent ambiguities which hold for any number of point correspondences when one seeks a solution for an exact essential matrix.

When one solves the linear least-squares system for the essential matrix, a planar scene, as well as the case of all points and the camera centers lying on a quadric, causes a rank deficiency of the system and thus infinitely many solutions for E.

Beyond the ambiguous situations, there is a considerable amount of literature regarding instabilities in the two-view problem. In particular, it has been shown [32.37, 32.39, 32.40] that a small field of view and insufficient depth variation can cause an indeterminacy in the estimation of the angle between translation and optical axis. If, in addition, the rotation is small, translation and rotation can be confounded [32.41]. Moreover, it has been shown that there exist local minima close to the global minimum that can fool any iterative scheme [32.42, 32.43].

1.5.4 3-Point SfM

Minimal solutions based on five points are still too slow to be used on mobile platforms, where, however, additional information like a reference gravity vector can be obtained from an IMU. We present here a recent solution using such a reference direction and only three points [32.44].

We are given three image correspondences from calibrated cameras, and a single directional correspondence like the gravity vector or a vanishing point. This problem is equivalent to finding the translation vector t and a rotation angle θ around an arbitrary rotation axis.

Let us choose the arbitrary rotation axis to be  e 2 = [ 0 , 1 , 0 ] T . After taking the directional constraint into account, from the initial five parameters in the essential matrix, we now only have to estimate three. We can use the axis-angle parameterization of a rotation matrix to rewrite the essential matrix constraint as follows

$${\mathbf{p}_2^i}^T \tilde{E} \, \mathbf{p}_1^i = 0 ,$$
(32.25)

where

$$\tilde{E} = \hat{\tilde{t}} \left( I + \sin\theta \, \hat{\mathbf{e}}_2 + ( 1 - \cos\theta ) \, \hat{\mathbf{e}}_2^2 \right) ,$$

and $\tilde{t} = ( x , y , 1 )^T$.

Each image point correspondence gives us one such equation, for a total of three equations in three unknowns (elements of t and θ). To create a polynomial system, we set s = sin θ and c = cos θ , and add the trigonometric constraint s 2 + c 2 - 1 = 0 , for a total of four equations in four unknowns. In order to reduce the number of unknowns, we choose the direction of the epipole by assuming that the translation vector  t ̃ has the form [ x , y , 1 ] . This means that for each  t ̃ that we recover, - t ̃ will also need to be considered as a possible solution.

Once we substitute for  E ̃ in (32.25), the resulting system of polynomial equations has the following form

$$a_{i1} x s + a_{i2} x c + a_{i3} y s + a_{i4} y c + a_{i5} x - a_{i2} s + a_{i1} c + a_{i6} = 0$$
(32.26)

for i = 1,..,3, and the equation

s 2 + c 2 - 1 = 0 .
(32.27)

This polynomial system can be solved in closed form and has up to four solutions. The total number of possible pose matrices arising from our formulation is therefore at most 8, when we take into account the fact that we have to consider the sign ambiguity in translation.
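
One possible route to the solution, not necessarily the one taken in [32.44], exploits the fact that (32.26) is linear in (x, y) once s and c are fixed: the consistency determinant of the three equations, combined with (32.27), reduces to a univariate polynomial. The sketch below follows this route symbolically; all names are ours, and the spurious roots introduced by the resultant have to be removed by back-substitution.

```python
import numpy as np
import sympy as sp

def vertical_3pt_candidates(p1, p2):
    """p1, p2: three corresponding calibrated homogeneous points (lists of
    length-3 vectors), expressed in frames already aligned with the known
    direction e2 = (0, 1, 0).  Returns candidate values of c = cos(theta)."""
    x, y, s, c = sp.symbols('x y s c', real=True)
    e2hat = sp.Matrix([[0, 0, 1], [0, 0, 0], [-1, 0, 0]])        # hat of e2
    R = sp.eye(3) + s * e2hat + (1 - c) * e2hat**2               # rotation about e2
    that = sp.Matrix([[0, -1, y], [1, 0, -x], [-y, x, 0]])       # hat of t = (x, y, 1)
    E = that * R
    eqs = [sp.expand((sp.Matrix(q2).T * E * sp.Matrix(q1))[0]) for q1, q2 in zip(p1, p2)]
    # Each equation is affine in (x, y); consistency of the 3x3 system requires det = 0
    M = sp.Matrix([[e.coeff(x), e.coeff(y), e.subs({x: 0, y: 0})] for e in eqs])
    det = sp.expand(M.det())
    # Eliminate s against s^2 + c^2 - 1 = 0 and solve the univariate polynomial in c
    poly_c = sp.Poly(sp.resultant(det, s**2 + c**2 - 1, s), c)
    roots = np.roots([float(a) for a in poly_c.all_coeffs()])
    return roots[np.abs(roots.imag) < 1e-8].real   # back-substitute to get s, x, y
```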

1.6 Multiple Views SfM

When we talk about simultaneous localization and mapping, we obviously mean estimation over a longer period of time. The question is how we integrate additional frames into our 3-D motion estimation (localization) process.

To exploit multiple frames we introduce rank constraints [32.45]. We assume that the world coordinate system coincides with the coordinate system of the first frame, that a scene point is projected to $\mathbf{x}_i$ in the i-th frame, and that its depth with respect to the first frame is $\lambda_1$,

λ i x i = R i ( λ 1 x 1 ) + T i .
(32.28)

Taking the cross product with xi and writing it for n frames yields a homogeneous system

$$\begin{pmatrix} \hat{\mathbf{x}}_2 R_2 \mathbf{x}_1 & \hat{\mathbf{x}}_2 T_2 \\ \vdots & \vdots \\ \hat{\mathbf{x}}_n R_n \mathbf{x}_1 & \hat{\mathbf{x}}_n T_n \end{pmatrix} \begin{pmatrix} \lambda_1 \\ 1 \end{pmatrix} = 0 ,$$
(32.29)

that has the depth of a point in the first frame as the only unknown. The $3(n-1) \times 2$ multiple view matrix has to have rank one [32.46], a constraint that implies both the epipolar and the trifocal constraints. The least-squares solution for the depth can easily be derived as

$$\lambda_1 = - \frac{ \sum_{i=2}^{n} \left( \mathbf{x}_i \times T_i \right)^T \left( \mathbf{x}_i \times R_i \mathbf{x}_1 \right) }{ \sum_{i=2}^{n} \left\| \mathbf{x}_i \times R_i \mathbf{x}_1 \right\|^2 } .$$
(32.30)
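
The depth formula translates directly into code; the helper and argument names are ours.

```python
import numpy as np

def hat(v):
    return np.array([[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]])

def depth_in_first_frame(x1, xs, Rs, Ts):
    """Least-squares depth lambda_1 of (32.30): x1 is the calibrated image point
    in the first frame; xs[i], Rs[i], Ts[i] are the point and the motion of
    frame i = 2..n relative to the first frame."""
    num, den = 0.0, 0.0
    for xi, Ri, Ti in zip(xs, Rs, Ts):
        a = hat(xi) @ (Ri @ x1)        # x_i x (R_i x_1)
        b = hat(xi) @ Ti               # x_i x T_i
        num += b @ a
        den += a @ a
    return -num / den
```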

Given a depth for each point we can solve for motion by rearranging the multiple views constraint (32.29) as

$$\begin{pmatrix} \lambda_1^1 \, \mathbf{x}_1^{1\,T} \otimes \hat{\mathbf{x}}_i^1 & \hat{\mathbf{x}}_i^1 \\ \vdots & \vdots \\ \lambda_1^n \, \mathbf{x}_1^{n\,T} \otimes \hat{\mathbf{x}}_i^n & \hat{\mathbf{x}}_i^n \end{pmatrix} \begin{pmatrix} R_i^{\rm stacked} \\ T_i \end{pmatrix} = 0 ,$$
(32.31)

where $\mathbf{x}_i^n$ is the n-th image point in the i-th frame, $(R_i, T_i)$ is the motion from the first to the i-th frame, and $R_i^{\rm stacked}$ is the $9 \times 1$ vector of the stacked elements of the rotation matrix $R_i$, so that the combined unknown vector has 12 elements. Suppose that $\mathbf{k}$ is the $12 \times 1$ kernel (or closest kernel in a least-squares sense) of the $3n \times 12$ matrix on the left-hand side, obtained through the singular value decomposition, and let us call A the 3 × 3 matrix obtained from the first nine elements of $\mathbf{k}$ and $\mathbf{a}$ the vector of elements 10–12. To obtain a rotation matrix, we follow the SVD steps used in the solution of absolute orientation, as in (32.9), to find the closest orthogonal matrix to an arbitrary invertible matrix.

1.6.1 Bundle Adjustment

On top of such an approach, a bundle adjustment [32.47] minimizes the sum of all deviations between image coordinates and the backprojections of the points to be reconstructed.

$$\arg \min_{R_f , T_f , \mathbf{X}_p} \; \boldsymbol{\epsilon}^T C^{-1} \boldsymbol{\epsilon} ,$$

minimized with respect to all 6 ( F - 1 ) motions and 3 N - 1 structure unknowns, where ϵ is the vector containing all errors

$$\boldsymbol{\epsilon}_p^f = \begin{pmatrix} x_p^f - \dfrac{ R_{11}^f X_p + R_{12}^f Y_p + R_{13}^f Z_p + T_x }{ R_{31}^f X_p + R_{32}^f Y_p + R_{33}^f Z_p + T_z } \\[2mm] y_p^f - \dfrac{ R_{21}^f X_p + R_{22}^f Y_p + R_{23}^f Z_p + T_y }{ R_{31}^f X_p + R_{32}^f Y_p + R_{33}^f Z_p + T_z } \end{pmatrix}$$

and C is the error covariance matrix. We will continue with the assumption that C = I.

Call the objective function  Φ ( u ) = ϵ ( u ) T ϵ ( u ) with u the vector of unknowns. Given a starting value for the vector of unknowns u we iterate with steps  Δ u by locally fitting a quadratic function to  Φ ( u )

$$\Phi ( \mathbf{u} + \Delta \mathbf{u} ) \approx \Phi ( \mathbf{u} ) + \Delta \mathbf{u}^T \nabla \Phi ( \mathbf{u} ) + \tfrac{1}{2} \Delta \mathbf{u}^T H ( \mathbf{u} ) \Delta \mathbf{u} ,$$

where $\nabla\Phi$ is the gradient and H is the Hessian of Φ. The minimum of this local quadratic approximation is at the $\Delta\mathbf{u}$ satisfying

$$H \Delta \mathbf{u} = - \nabla \Phi ( \mathbf{u} ) .$$

If  Φ ( u ) = ϵ ( u ) T ϵ ( u ) then

$$\nabla \Phi = 2 \sum_i \epsilon_i ( \mathbf{u} ) \, \nabla \epsilon_i ( \mathbf{u} ) = 2 J ( \mathbf{u} )^T \boldsymbol{\epsilon} ,$$

where the Jacobian J consists of elements

$$J_{ij} = \frac{\partial \epsilon_i}{\partial u_j} ,$$

and the Hessian reads

$$H = 2 \sum_i \left( \nabla \epsilon_i ( \mathbf{u} ) \, \nabla \epsilon_i ( \mathbf{u} )^T + \epsilon_i ( \mathbf{u} ) \, \frac{\partial^2 \epsilon_i}{\partial \mathbf{u}^2} \right) = 2 J ( \mathbf{u} )^T J ( \mathbf{u} ) + 2 \sum_i \epsilon_i ( \mathbf{u} ) \, \frac{\partial^2 \epsilon_i}{\partial \mathbf{u}^2} \approx 2 J ( \mathbf{u} )^T J ( \mathbf{u} ) ,$$

by omitting the second-order derivative terms inside the Hessian. This yields the Gauss–Newton iteration

( J T J ) Δ u = J T ϵ ,

involving the inversion of a $(6F + 3N - 7) \times (6F + 3N - 7)$ matrix. Bundle adjustment is about the art of inverting $(J^T J)$ efficiently.

Let us split the unknown vector u into u = ( a , b ) following [32.48] obtaining

  • 6 F - 6 motion unknowns a,

  • 3 P - 1 structure unknowns b,

and we will explain this case with the example of two motion parameter blocks $a_1$ and $a_2$, corresponding to two frames, and three unknown points $b_1, b_2, b_3$.

For keeping symmetry in writing we do not deal here with the global reference and the global scale ambiguity.

The Jacobian for two frames and three points has six pairs of rows (one pair for each image projection) and 21 columns corresponding to the 12 motion and nine structure unknowns

$$J = \nabla \boldsymbol{\epsilon} ( \mathbf{a} , \mathbf{b} ) = \Biggl( \underbrace{\begin{array}{cc} A_1^1 & 0 \\ 0 & A_1^2 \\ A_2^1 & 0 \\ 0 & A_2^2 \\ A_3^1 & 0 \\ 0 & A_3^2 \end{array}}_{\text{motion}} \ \underbrace{\begin{array}{ccc} B_1^1 & 0 & 0 \\ B_1^2 & 0 & 0 \\ 0 & B_2^1 & 0 \\ 0 & B_2^2 & 0 \\ 0 & 0 & B_3^1 \\ 0 & 0 & B_3^2 \end{array}}_{\text{structure}} \Biggr) ,$$

with the A matrices (2 × 6) and the B matrices (2 × 3) being the Jacobians of the error $\epsilon_i^f$ of the projection of the i-th point in the f-th frame. We observe now a pattern emerging in $J^T J$,

$$J^T J = \begin{pmatrix} U^1 & 0 & W_1^1 & W_2^1 & W_3^1 \\ 0 & U^2 & W_1^2 & W_2^2 & W_3^2 \\ \cdot & \cdot & V_1 & 0 & 0 \\ \cdot & \cdot & 0 & V_2 & 0 \\ \cdot & \cdot & 0 & 0 & V_3 \end{pmatrix} ,$$

with the block diagonals for motion and structure separated. Let us rewrite the basic iteration ( J T J ) Δ u = J T ϵ as

$$\begin{pmatrix} U & W \\ W^T & V \end{pmatrix} \begin{pmatrix} \Delta \mathbf{a} \\ \Delta \mathbf{b} \end{pmatrix} = \begin{pmatrix} \boldsymbol{\epsilon}_a \\ \boldsymbol{\epsilon}_b \end{pmatrix} ,$$

and premultiply both sides with an upper block-triangular matrix to obtain

$$\begin{pmatrix} I & - W V^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} U & W \\ W^T & V \end{pmatrix} \begin{pmatrix} \Delta \mathbf{a} \\ \Delta \mathbf{b} \end{pmatrix} = \begin{pmatrix} I & - W V^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} \boldsymbol{\epsilon}_a \\ \boldsymbol{\epsilon}_b \end{pmatrix} .$$

We find out that motion parameters can be updated separately by inverting a  6 F × 6 F matrix

$$\left( U - W V^{-1} W^T \right) \Delta \mathbf{a} = \boldsymbol{\epsilon}_a - W V^{-1} \boldsymbol{\epsilon}_b .$$

Each 3-D point can be updated separately by inverting a 3 × 3 matrix V

$$V \Delta \mathbf{b} = \boldsymbol{\epsilon}_b - W^T \Delta \mathbf{a} .$$

It is worth mentioning that bundle adjustment, though extremely slow, captures the correlation between motion estimates and structure (3-D point) estimates, which is artificially hidden in the iterative scheme of (32.29).
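
A minimal sketch of one such Gauss–Newton step with the Schur complement is given below. It assumes that the per-observation Jacobian blocks A (2 × 6) and B (2 × 3) and the residuals have been computed elsewhere; gauge fixing, damping (Levenberg–Marquardt), robust weighting, and sparse data structures are all omitted, the names are ours, and the sign convention of the right-hand side follows the equations above.

```python
import numpy as np

def schur_gauss_newton_step(obs, n_cams, n_pts, cam_dim=6, pt_dim=3):
    """obs: list of tuples (f, p, A, B, eps) with frame index f, point index p,
    the 2x6 and 2x3 Jacobian blocks, and the 2-vector reprojection error.
    Returns the motion update delta_a and the structure update delta_b."""
    U = np.zeros((n_cams, cam_dim, cam_dim))
    V = np.zeros((n_pts, pt_dim, pt_dim))
    W = np.zeros((n_cams, n_pts, cam_dim, pt_dim))
    ea = np.zeros((n_cams, cam_dim))
    eb = np.zeros((n_pts, pt_dim))
    for f, p, A, B, eps in obs:
        U[f] += A.T @ A
        V[p] += B.T @ B
        W[f, p] += A.T @ B
        ea[f] += A.T @ eps
        eb[p] += B.T @ eps
    Vinv = np.linalg.inv(V)                      # many small 3x3 inversions
    # Reduced camera system (U - W V^-1 W^T) delta_a = ea - W V^-1 eb
    S = np.zeros((n_cams * cam_dim, n_cams * cam_dim))
    rhs = ea.copy()
    for f in range(n_cams):
        S[f*cam_dim:(f+1)*cam_dim, f*cam_dim:(f+1)*cam_dim] += U[f]
        for p in range(n_pts):
            rhs[f] -= W[f, p] @ Vinv[p] @ eb[p]
            for g in range(n_cams):
                S[f*cam_dim:(f+1)*cam_dim, g*cam_dim:(g+1)*cam_dim] -= \
                    W[f, p] @ Vinv[p] @ W[g, p].T
    delta_a = np.linalg.solve(S, rhs.reshape(-1)).reshape(n_cams, cam_dim)
    # Back-substitute each 3-D point separately: V_p delta_b_p = eb_p - (W^T delta_a)_p
    delta_b = np.zeros((n_pts, pt_dim))
    for p in range(n_pts):
        delta_b[p] = Vinv[p] @ (eb[p] - sum(W[f, p].T @ delta_a[f] for f in range(n_cams)))
    return delta_a, delta_b
```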

The largest scale motion estimation and registration of views has been performed by Teller et al. [32.49] with a decoupled computation first of relative rotations and finally of relative translations. The above multiple view SfM techniques can also be applied in a sliding window mode in time. Davison et al. [32.28] showed the first real-time recursive approach by decoupling the direction of the viewing rays from the depth unknowns. For other recursive approaches the reader is referred to the corresponding SLAM chapter.

2 3-D Vision for Grasping

In this section we will move from the basic geometry required for grasping to the main 3-D vision challenges associated with the limited knowledge we might have about the shape of the object as well as the actual selection of 3-D grasping poses.

Naturally, object grasping and manipulation is closely related to general scene understanding and to problems such as object detection, recognition, categorization, and pose estimation. Taking all of the above into account, there are very few approaches that address all of these problems in a single system. One example, reported in [32.50], addresses the problem of enabling the transfer of grasp knowledge between object categories, defined using both their physical properties and their functionality. This is a challenging problem, given that a number of objects with similar physical properties afford different tasks. An example can be a screwdriver and a carrot, which are structurally alike but only the former can be used as a tool, or a ball and an orange, where only the latter affords eating (Fig. 32.3).

Fig. 32.3 Examples of physically similar objects that afford different tasks

In relation to object grasping in particular, there are methods that assume that a full 3-D model of the object is available and concentrate solely on grasp synthesis. In addition, many of the approaches conduct experiments in a simulated environment without working with real sensory data. However, the knowledge generated in simulation can also be applied later to sensory data. Another group of approaches considers grasp synthesis on real sensory data directly, dealing with problems such as noise, occlusions, and missing data.

If the object to be grasped is known, there are approaches that store a database of grasp hypotheses, generated either in simulation or through experiments in a real setting. Most of the approaches assume that a 3-D mesh of the object is available and the challenge is then to automatically generate a set of feasible grasp hypotheses. This involves sampling the infinite space of possible hand configurations and ranking the resulting grasps according to some quality metric.

To simplify the process, a common approach is to approximate the object's shape with a constellation of primitives such as spheres, cones, cylinders, boxes, or superquadrics [32.51, 32.52, 32.53, 32.54, 32.55]. The purpose of using shape primitives is to reduce the number of candidate grasps and thus prune the search space for finding the optimal set of grasp hypotheses.

One example, shown in Fig. 32.4 and reported in [32.52], decomposes a point cloud from a stereo camera into a constellation of boxes. Grasp planning is performed directly on the boxes, which reduces the number of potential grasps. El-Khoury and Sahbani [32.56] distinguish between graspable and nongraspable parts of an object, where each part is represented by fitting a superquadric to the point cloud data. Pelossof et al. [32.57] approximate an object with a single superquadric and use a support vector machine based approach to search for the grasp that maximizes the grasp quality. Boularias et al. [32.58] model an object as a Markov random field (MRF) in which the nodes are points from the point cloud and edges are spanned between the six nearest neighbors of a point. A node in the MRF carries one of two labels: a good or a bad grasp location. Detry et al. [32.59] model the object as a constellation of local multimodal contour descriptors. The set of associated grasp hypotheses is modeled as a nonparametric density function in the space of six-dimensional (6-D) gripper poses, referred to as a bootstrap density. Papazov et al. [32.60] demonstrate 3-D object recognition and pose estimation in a grasping scenario with cluttered scenes. Weisz and Allen [32.61] propose a metric suitable for predicting grasp stability under pose uncertainty.

Fig. 32.4 Generation of grasp candidates through object shape approximation and decomposition (after [32.52])

There are several approaches that deal specifically with incomplete point clouds. Marton et al. [32.63] exploit symmetry by fitting a curve to a cross section of the point cloud. Rao et al. [32.64] concentrate on depth segmentation and sample grasp points from the surface of a segmented object using surface normals. Bohg et al. [32.65] present a related approach that reconstructs the full object shape assuming planar symmetry and generates grasps based on the global shape of the object. Bone et al. [32.66] make no prior assumption about the shape of the object and apply shape carving for generating parallel-jaw gripper grasps. Hsiao et al. [32.67] employ heuristics for generating grasp hypotheses depending on the shape of the point cloud. Recent work in [32.68] identifies regions that afford force-closure grasps by evaluating the local curvature of the objects to create an initial opposing grasp with two or three fingers, depending on the relative size of the object with respect to the hand. Richtsfeld and Vincze [32.69] use a stereo-camera setup to generate a 3-D representation of a scene with several objects and then generate various top grasps on object candidates. Maldonado et al. [32.70] use time-of-flight range data, model objects using 3-D Gaussians, and rely on finger torque information during grasping to monitor the grasp execution. Stückler et al. [32.71] generate grasp hypotheses based on eigenvectors of the object's footprints, which are generated by projecting the 3-D object point cloud onto the supporting surface. The work of [32.72] presents a system for general scene understanding used for grasp planning and execution. The system uses a bottom-up grouping approach where contour and surface structures are used as the basis for grasp planning. The work builds upon previous work presented in [32.73].

Most of the recent work concentrates on grasp generalization, either by observing human grasping or through off- and on-line learning directly on the robot. Kroemer et al. [32.74] demonstrate generalization capabilities using a pouring task scenario. The goal of the approach is to find the part of the object that is most likely to afford the demonstrated action. The learning method is based on kernel logistic regression. Herzog et al. [32.75] store a set of local templates of objects that a human is interacting with. If a local part of an object segmented online is similar to a template in the database, the associated grasp hypothesis is executed. Song et al. [32.62] approach the problem of inferring a full grasp configuration in relation to a specific task the object is intended for. As in [32.76], the joint distribution over various grasping variables is modeled as a Bayesian network. Additional variables like task, object category, and task constraints are introduced. The structure of this model is learned from a large number of grasp examples generated in a simulator and annotated with grasp quality metrics as well as suitability for a specific task. The learned quality of grasps on specific objects given a task is visualized in Fig. 32.5.

Fig. 32.5 Ranking of approach vectors on different objects given a specific task. The brighter an area, the higher the rank; the darker an area, the lower the rank (after [32.62])

3 Conclusion and Further Reading

As main additional sources of reading, we recommend the textbooks by Hartley and Zisserman [32.12], Ma et al. [32.46], Faugeras [32.77], and Faugeras and Luong [32.11]. The reader is referred to Chap. 5 for fundamentals of estimation, to Chap. 35 for sensor fusion, to Chap. 34 for visual servoing, to Chap. 31 for range sensing, to Chap. 45 for 3-D models of the world, and to Chap. 46 for SLAM.

3-D vision is a rapidly advancing field, and in this chapter we have covered only geometric approaches based on RGB cameras. Although depth sensors will become ubiquitous indoors, and might become so outdoors as well, RGB cameras remain formidable because of the higher number and larger diversity of features that can be matched and used for pose estimation and 3-D modeling. Long-range sensing can still be covered from motion with large translation, while active sensors are constrained in terms of the energy reflected from the environment.