
With the rapid progress and cost reduction in digital imaging, cameras have become the standard and probably the cheapest sensor on a robot. Unlike positioning (global positioning system – GPS), inertial (IMU), and distance sensors (sonar, laser, infrared), cameras produce the highest bandwidth of data. Exploiting information useful for a robot from such a bit stream is less explicit than in the case of GPS or a laser scanner, but semantically richer. In the years since the first edition of the handbook, we have seen significant advances in hardware and algorithms. RGB-D sensors like the PrimeSense Kinect enabled a new generation of full model reconstruction systems [32.1] with an arbitrary camera motion. Google's project Tango [32.2] established the state of the art in visual odometry using the latest fusion methods between visual and inertial data [32.3]. 3-D modeling became commodity software (see, for example, the 123D Catch app from Autodesk), and the widely used open-source Bundler [32.4] was made possible by advances in wide baseline matching and bundle adjustment. Methods for wide baseline matching have been proposed for several variations of pose and structure from motion [32.5]. Last, the problem of local minima for nonminimal overconstrained solvers has been addressed by a group of methods using branch-and-bound global optimization of a sum of fractions subject to convex constraints [32.6] or an $L_\infty$-norm of the error function [32.7].

Let us consider the two main robot perception domains: navigation and grasping. Assume, for example, the scenario that a robot vehicle is given the task of going from place A to place B, given as instruction only intermediate visual landmarks and/or GPS waypoints. The robot starts at A and has to decide where a drivable path lies. Such a decision can be accomplished through the detection of obstacles from at least two images by estimating a depth or occupancy map with a stereo algorithm. While driving, the robot wants to estimate its trajectory, which can be accomplished with a matching and structure from motion algorithm. The resulting trajectory can be used to build a layout of the environment through dense matching and triangulation, which in turn can be used as a reference for subsequent pose estimation. At each time instance the robot has to parse the surrounding environment for risks like pedestrians, or for objects it is searching for, like a trash can. It has to become aware of loop closing, or of a reentry if the robot has been kidnapped or blind for a while. This can be accomplished through object and scene recognition yielding the what and where of objects around the robot. In an extreme scenario, a vehicle can be left to explore a city and build a semantic 3-D map as well as a trajectory of all places it visited, the ultimate visual simultaneous localization and semantic mapping problem. In the case of grasping, the robot detects an object given a learnt representation, and subsequently it has to estimate the pose of the object and, in some cases, its shape by triangulation. When a camera is not mounted on an end-effector, the absolute orientation between the hand and the object has to be found.

In the next section we present the geometric foundations for 3-D vision, and in the last section we describe approaches for grasping.

1 Geometric Vision

Let us start by introducing the projection of the world to an image plane. Assume that a point in the world $(X, Y, Z)$ has coordinates $(X_{c_i}, Y_{c_i}, Z_{c_i})$ with respect to the coordinate system of a camera $c_i$, related to each other by the following transformation

$$\begin{pmatrix} X_{c_i} \\ Y_{c_i} \\ Z_{c_i} \end{pmatrix} = R_i \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} + T_i ,$$
(32.1)

where $R_i$ is a rotation matrix whose columns are the world axes with respect to the camera. The translation vector $T_i$ starts at the origin of the camera and ends at the origin of the world coordinate system. The rotation matrix is orthogonal, $R^T R = I$, with determinant one. We assume that the center of projection is the origin of the coordinate system and that the optical axis is the $Z_{c_i}$ axis of the camera. If we assume that the image plane is the plane $Z_{c_i} = 1$, then the image coordinates $(x_i, y_i)$ read

$$x_i = \frac{X_{c_i}}{Z_{c_i}} , \quad y_i = \frac{Y_{c_i}}{Z_{c_i}} .$$
(32.2)

In practice, what we measure are the pixel coordinates $(u_i, v_i)$ in the image, which are related to the image coordinates $(x_i, y_i)$ by the affine transformation

$$u_i = f \alpha x_i + \beta y_i + c_u , \quad v_i = f y_i + c_v ,$$
(32.3)

where f is the distance of the image plane from the projection center, measured in pixels. It is also called the focal length, because the two are approximately equal. The aspect ratio α is a scaling induced by nonsquare sensor cells or by different sampling rates horizontally and vertically. The skew factor β accounts for a shearing induced by a nonperfectly frontal image plane. The image center $(c_u, c_v)$ is the point of intersection of the image plane with the optical axis. These five parameters are called intrinsic parameters, and the process of recovering them is called intrinsic calibration. Upon recovering them we can talk about a calibrated system and can work with the image coordinates $(x_i, y_i)$ instead of the pixel coordinates $(u_i, v_i)$. In many vision systems, in particular on mobile robots, wide-angle lenses introduce a radial distortion around the image center which can be modeled polynomially

$$x_i^{\rm dist} = x_i \left( 1 + k_1 r + k_2 r^2 + k_3 r^3 + \cdots \right) , \quad y_i^{\rm dist} = y_i \left( 1 + k_1 r + k_2 r^2 + k_3 r^3 + \cdots \right) , \quad \text{where } r^2 = x_i^2 + y_i^2 ,$$

where we temporarily assumed that the image center is at (0,0). The image coordinates $(x_i, y_i)$ in (32.3) then have to be replaced with the distorted coordinates $(x_i^{\rm dist}, y_i^{\rm dist})$.
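
As an illustration of (32.2), (32.3), and the distortion model, the following minimal sketch projects a point given in camera coordinates to pixel coordinates; the function name, the choice of three distortion coefficients, and the numerical values are ours, and the principal point is assumed to coincide with the distortion center.

```python
import numpy as np

def project_point(X_c, f, alpha, beta, c_u, c_v, k=(0.0, 0.0, 0.0)):
    """Project a point given in camera coordinates to pixel coordinates,
    following (32.2), (32.3) and the polynomial radial distortion model;
    k = (k1, k2, k3) are the distortion coefficients."""
    Xc, Yc, Zc = X_c
    # Perspective division onto the plane Z_c = 1, (32.2)
    x, y = Xc / Zc, Yc / Zc
    # Radial distortion around the (assumed centered) principal point
    r = np.hypot(x, y)
    d = 1.0 + k[0] * r + k[1] * r**2 + k[2] * r**3
    x_d, y_d = x * d, y * d
    # Affine mapping to pixel coordinates, (32.3)
    u = f * alpha * x_d + beta * y_d + c_u
    v = f * y_d + c_v
    return u, v

# Example: a point one meter in front of the camera, slightly off-axis
print(project_point((0.1, 0.05, 1.0), f=800, alpha=1.0, beta=0.0, c_u=320, c_v=240))
```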

1.1 Calibration

Recovering the intrinsic parameters from multiple views of a reference pattern, such as a checkerboard, under constant intrinsics has become a standard procedure using tools like the MATLAB calibration toolbox or Zhang's OpenCV calibration function [32.8]. When intrinsics like the focal length vary during operation and viewing reference patterns is not practically feasible, we rely on the state-of-the-art method by Pollefeys et al. [32.10, 32.9]. When all intrinsics are unknown, one relies on the Kruppa equations and several stratified self-calibration approaches [32.11, 32.12], which require at least three views. Apart from radial distortion, the projection relations shown above can be summarized in matrix form. By denoting

u i = ( u i , v i , 1 )

and

X = ( X , Y , Z , 1 )

we obtain

$$\lambda_i \mathbf{u}_i = K_i \begin{pmatrix} R_i & T_i \end{pmatrix} \mathbf{X} = P \mathbf{X} ,$$
(32.4)

where $\lambda_i = Z_{c_i}$ is the depth of the point $\mathbf{X}$ in camera coordinates and P is the 3 × 4 projection matrix. The depth $\lambda_i$ can be eliminated to obtain two equations relating the world to the pixel coordinates.
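
The following minimal sketch assembles P from assumed intrinsics and extrinsics and eliminates the depth as described above; the function names and numerical values are illustrative.

```python
import numpy as np

def projection_matrix(K, R, T):
    """Assemble the 3x4 projection matrix P = K [R | T] of (32.4)."""
    return K @ np.hstack([R, T.reshape(3, 1)])

def project(P, X):
    """Project a world point X = (X, Y, Z); dividing by lambda = Z_c
    eliminates the depth and yields the two pixel equations."""
    lam_u = P @ np.append(X, 1.0)      # lambda * (u, v, 1)
    return lam_u[:2] / lam_u[2]        # pixel coordinates (u, v)

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R, T = np.eye(3), np.array([0.1, 0.0, 0.0])
print(project(projection_matrix(K, R, T), np.array([0.5, 0.2, 2.0])))
```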

1.2 Pose Estimation or PnP

When we have landmarks in the world with known positions $\mathbf{X}$ and we can measure their projections, the problem of recovering the unknown rotation and translation in the calibrated case is called pose estimation or the Perspective-n-Point (PnP) problem. Of course, it presumes the identification of the world points in the image. In robotics, pose estimation is a variant of the localization problem in a known environment. When grasping objects of known shape, PnP yields the target pose for an end-effector modulo the grasping point positions. We assume that a camera is calibrated and that measurements of N points are given in world coordinates $\mathbf{X}_{j=1..N}$ and calibrated image coordinates $\mathbf{x}_{j=1..N}$. Let us assume two scene points and denote the known angle between their projections $\mathbf{x}_1$ and $\mathbf{x}_2$ as $\delta_{12}$ (Fig. 32.1). Let us denote the squared distance $\|\mathbf{X}_i - \mathbf{X}_j\|^2$ by $d_{ij}^2$ and the squared length of $\mathbf{X}_j$ by $d_j^2$. Then the cosine law reads

$$d_i^2 + d_j^2 - 2 d_i d_j \cos \delta_{ij} = d_{ij}^2 .$$
(32.5)

If we can recover $d_i$ and $d_j$, the rest is an absolute orientation problem

d j x j = R X j + T
(32.6)

to recover translation and rotation between camera and world coordinate system.

Fig. 32.1 Pose estimation problem: a camera seeing three points at unknown distances $d_1$, $d_2$, and $d_3$, with known angles between the rays and known point distances $d_{12}$, $d_{13}$, $d_{23}$

1.2.1 Minimal Solution

The cosine law has two unknowns, $d_1$ and $d_2$, so with three points we should be able to solve the pose estimation problem. Indeed, three points yield a system of three quadratic equations in three unknowns, so it has a maximum of eight solutions.

We follow here the analysis of the classic solution in [32.13]: set $d_2 = u d_1$ and $d_3 = v d_1$ and solve all three equations for $d_1^2$,

$$d_1^2 = \frac{d_{23}^2}{u^2 + v^2 - 2 u v \cos \delta_{23}} , \quad d_1^2 = \frac{d_{13}^2}{1 + v^2 - 2 v \cos \delta_{13}} , \quad d_1^2 = \frac{d_{12}^2}{u^2 + 1 - 2 u \cos \delta_{12}} ,$$

which is equivalent to two quadratic equations in u and v

d 12 2 ( 1 + v 2 - 2 v cos δ 13 ) = d 13 2 ( u 2 + 1 - 2 u cos δ 12 ) ,
(32.7)
d 13 2 ( u 2 + v 2 - 2 u v cos δ 23 ) = d 23 2 ( 1 + v 2 - 2 v cos δ 13 ) .
(32.8)

Solving (32.8) for $u^2$ and substituting into (32.7) allows solving (32.7) for u, because u then appears only linearly. Substituting u back into (32.8) yields a quartic in v which can have as many as four real roots. For each v we obtain two roots for u through any of the quadratic equations, yielding a maximum of eight solutions [32.13, 32.14]. Popular pose estimation algorithms are based either on an iterative method [32.15, 32.16] or on linear versions using auxiliary unknowns of higher dimension [32.17, 32.18].
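
To make the elimination concrete, the following sketch forms the two quadratics (32.7) and (32.8) symbolically, eliminates u via a resultant (an alternative to the substitution described above), and reads off the roots of the resulting quartic in v; all names and the synthetic test data are ours.

```python
import numpy as np
import sympy as sp

def p3p_quartic(d12, d13, d23, c12, c13, c23):
    """Eliminate u from (32.7)-(32.8) (with d2 = u d1, d3 = v d1 and
    c_ij = cos(delta_ij)) and return the quartic polynomial in v."""
    u, v = sp.symbols('u v', real=True)
    eq7 = d12**2 * (1 + v**2 - 2*v*c13) - d13**2 * (u**2 + 1 - 2*u*c12)
    eq8 = d13**2 * (u**2 + v**2 - 2*u*v*c23) - d23**2 * (1 + v**2 - 2*v*c13)
    return sp.Poly(sp.expand(sp.resultant(eq7, eq8, u)), v)

# Synthetic example: three points given in the camera frame (camera at origin)
Xc = np.array([[1.0, 0.0, 4.0], [0.0, 1.0, 5.0], [-1.0, 0.5, 4.5]])
d = np.linalg.norm(Xc, axis=1)                 # ground-truth distances d1, d2, d3
rays = Xc / d[:, None]                         # unit bearing vectors
c12, c13, c23 = rays[0] @ rays[1], rays[0] @ rays[2], rays[1] @ rays[2]
d12, d13, d23 = (np.linalg.norm(Xc[0] - Xc[1]),
                 np.linalg.norm(Xc[0] - Xc[2]),
                 np.linalg.norm(Xc[1] - Xc[2]))

quartic = p3p_quartic(d12, d13, d23, c12, c13, c23)
roots = np.roots([float(a) for a in quartic.all_coeffs()])
real_v = roots[np.abs(roots.imag) < 1e-8].real
print('candidate v (true d3/d1 = %.4f):' % (d[2] / d[0]), real_v)
```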

A more recent method [32.19] for n world points expresses the 3-D points in barycentric coordinates with respect to four virtual control points

$$\mathbf{X}_i = \sum_{j=1}^{4} \alpha_{ij} \mathbf{C}_j , \quad \text{where } \sum_{j=1}^{4} \alpha_{ij} = 1 .$$

A rigid transformation to the camera coordinate system leaves the barycentric coordinates invariant and a perspective projection yields

$$\lambda_i \mathbf{x}_i = \sum_{j=1}^{4} \alpha_{ij} \left( C^{c}_{x,j} ,\; C^{c}_{y,j} ,\; C^{c}_{z,j} \right)^T .$$

Eliminating λi yields two linear equations for each point

$$\sum_{j=1}^{4} \alpha_{ij} C^{c}_{x,j} = x_i \sum_{j=1}^{4} \alpha_{ij} C^{c}_{z,j} , \qquad \sum_{j=1}^{4} \alpha_{ij} C^{c}_{y,j} = y_i \sum_{j=1}^{4} \alpha_{ij} C^{c}_{z,j} ,$$

with the coordinate triples of the control points in the camera frame being the 12 unknowns. This is a linear homogeneous system whose solution is the nullspace of a $2n \times 12$ matrix. The unknown control points are found up to a scale factor, which is easily fixed because we know the inter-point distances. The pose is found from the absolute orientation between the control points in the camera and the world frame. This yields a very efficient solution for $n \geq 6$ points, but leaves the initial choice of the control points as a factor affecting the solution.
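
The linear core of this control-point method can be sketched as follows; the barycentric-coordinate helper, the matrix layout, and the function names are ours, and the subsequent scale fixing from inter-point distances and the absolute orientation step are omitted.

```python
import numpy as np

def barycentric_coords(Xw, C):
    """alpha (n x 4) with Xw[i] = sum_j alpha[i, j] C[j] and sum_j alpha[i, j] = 1,
    for four non-coplanar control points C (4 x 3)."""
    A = np.vstack([C.T, np.ones((1, 4))])                # 4 x 4
    b = np.vstack([Xw.T, np.ones((1, Xw.shape[0]))])     # 4 x n
    return np.linalg.solve(A, b).T                       # n x 4

def control_points_in_camera_frame(x_img, alpha):
    """Build the 2n x 12 homogeneous system in the camera-frame control-point
    coordinates and return its null vector (one-dimensional for n >= 6 points
    in general position); the result is defined up to scale and sign."""
    n = x_img.shape[0]
    M = np.zeros((2 * n, 12))
    for i in range(n):
        xi, yi = x_img[i]
        for j in range(4):
            a = alpha[i, j]
            # sum_j a_ij Cx_j - x_i sum_j a_ij Cz_j = 0   (and likewise for y)
            M[2 * i,     3 * j:3 * j + 3] = [a, 0.0, -a * xi]
            M[2 * i + 1, 3 * j:3 * j + 3] = [0.0, a, -a * yi]
    _, _, Vt = np.linalg.svd(M)
    return Vt[-1].reshape(4, 3)
```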

In case that $n \geq 4$ points lie on a plane, we can compute the homography H between the world and the camera plane [32.8]. Assuming that Z = 0 is the world plane, the homography reads

$$\begin{pmatrix} u \\ v \\ w \end{pmatrix} \approx \underbrace{K \begin{pmatrix} r_1 & r_2 & T \end{pmatrix}}_{H} \begin{pmatrix} X \\ Y \\ W \end{pmatrix} ,$$

where $r_{1,2}$ are the first two columns of the rotation matrix and $\approx$ denotes projective equivalence, namely, for any two points $\mathbf{p}$ and $\mathbf{p}'$ in the projective plane, $\mathbf{p} \approx \mathbf{p}'$ iff $\mathbf{p} = \lambda \mathbf{p}'$ for some real $\lambda \neq 0$. Hence, the first two columns of $K^{-1} H$,

$$K^{-1} H = \begin{pmatrix} h_1 & h_2 & h_3 \end{pmatrix} ,$$

have to be orthogonal. We thus seek the orthogonal matrix R that is closest to $\begin{pmatrix} h_1 & h_2 & h_1 \times h_2 \end{pmatrix}$,

$$\arg \min_{R \in SO(3)} \left\| R - \begin{pmatrix} h_1 & h_2 & h_1 \times h_2 \end{pmatrix} \right\|_F^2 .$$

If the singular value decomposition (SVD) of

$$\begin{pmatrix} h_1 & h_2 & h_1 \times h_2 \end{pmatrix} = U S V^T ,$$

then the solution is [32.20]

$$R = U \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \det ( U V^T ) \end{pmatrix} V^T .$$
(32.9)

The diagonal matrix is a projection from the orthogonal group  O ( 3 ) to the special orthogonal group  SO ( 3 ) .
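
A compact sketch of the projection (32.9) and of its use in the planar case is given below; rotation_from_planar_homography is a hypothetical helper of ours, and the sign of the homography is assumed to be such that the reconstructed plane lies in front of the camera.

```python
import numpy as np

def closest_rotation(M):
    """Project a 3x3 matrix onto SO(3) as in (32.9):
    R = U diag(1, 1, det(U V^T)) V^T, where M = U S V^T."""
    U, _, Vt = np.linalg.svd(M)
    return U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt

def rotation_from_planar_homography(K, H):
    """Recover R from a world-plane-to-image homography H = K (r1 r2 T):
    normalize the scale, append h1 x h2, and project onto SO(3)."""
    A = np.linalg.inv(K) @ H
    A = A / np.linalg.norm(A[:, 0])           # fix the unknown projective scale
    h1, h2 = A[:, 0], A[:, 1]
    return closest_rotation(np.column_stack([h1, h2, np.cross(h1, h2)]))
```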

Last, we present a method [32.21] for n points that computes all local minima of the overconstrained PnP problem. This involves solving the equations obtained by setting the first derivatives with respect to the pose unknowns to zero. To achieve this, the following observation allows the elimination of the depths λ and of the translation. The rigid transformation $\lambda \mathbf{x} = R \mathbf{X} + T$ can be written for n points as a linear system in the depths $\lambda_{j=1..n}$ and the translation T,

$$\begin{pmatrix} \mathbf{x}_1 & & & -I \\ & \ddots & & \vdots \\ & & \mathbf{x}_n & -I \end{pmatrix} \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_n \\ T \end{pmatrix} = \begin{pmatrix} R \mathbf{X}_1 \\ \vdots \\ R \mathbf{X}_n \end{pmatrix} .$$

We can solve for the unknown depths-translation vector and back-substitute it into a least-squares minimization problem with respect to the rotation parameters. It turns out that if we use the three Rodrigues parameters as the rotation parametrization, the necessary conditions for an extremum (vanishing derivatives) are three cubic equations [32.21]. Last, we would like to point out to the reader that the minimization of a nonlinear function of the rotation matrix can also be solved as an optimization problem on the Lie group SO(3) [32.22, 32.23, 32.24], in those works for the case of line correspondences.

1.3 Triangulation

When we know both the intrinsics and the extrinsics, or their summarization in the matrix P, and we measure a point, we cannot recover its depth from just one camera position. Assuming that we have the projection of the same point $\mathbf{X}$ in two cameras,

$$\lambda_1 \mathbf{u}_1 = P_1 \mathbf{X} , \quad \lambda_2 \mathbf{u}_2 = P_2 \mathbf{X} ,$$
(32.10)

with known projection matrices $P_1$ and $P_2$, we can recover the position $\mathbf{X}$ in space, a process well known as triangulation. Observe that we can achieve triangulation without decomposing the projection matrices into intrinsic and extrinsic parameters; we need, though, to remove the distortion in order to write them as above.

Having correspondences of the same point in two cameras with known projection matrices $P_l$ and $P_r$, we can solve the two projection equations for the world point $\mathbf{X}$. It is worth noting that each point provides two independent equations, so that triangulation becomes an overconstrained problem for two views. This is not a contradiction, since two rays do not in general intersect in space unless they satisfy the epipolar constraint presented in the next paragraph. The matrix on the left-hand side of the following equation has in general rank 4, unless the epipolar constraint is satisfied, in which case it has rank 3.

$$\begin{pmatrix} x_l P_l ( 3 , : ) - P_l ( 1 , : ) \\ y_l P_l ( 3 , : ) - P_l ( 2 , : ) \\ x_r P_r ( 3 , : ) - P_r ( 1 , : ) \\ y_r P_r ( 3 , : ) - P_r ( 2 , : ) \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = 0 ,$$
(32.11)

where  P ( i , : ) means the i-th row of matrix P.

Obviously, the homogeneous system above can be transformed into an inhomogeneous linear system with unknowns $(X, Y, Z)$. Otherwise, it can be solved by finding the vector closest to the nullspace of the 4 × 4 matrix above using the SVD. A thorough treatment of triangulation can be found in the classic paper [32.25].
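
A direct implementation of the homogeneous formulation (32.11) is sketched below; the function and variable names are ours.

```python
import numpy as np

def triangulate(Pl, Pr, xl, xr):
    """Linear triangulation of (32.11): xl = (x_l, y_l) and xr = (x_r, y_r) are
    the measured coordinates in the left and right view, Pl and Pr the 3x4
    projection matrices.  Returns the point (X, Y, Z)."""
    A = np.vstack([
        xl[0] * Pl[2] - Pl[0],
        xl[1] * Pl[2] - Pl[1],
        xr[0] * Pr[2] - Pr[0],
        xr[1] * Pr[2] - Pr[1],
    ])
    _, _, Vt = np.linalg.svd(A)       # vector closest to the nullspace of A
    Xh = Vt[-1]
    return Xh[:3] / Xh[3]             # dehomogenize
```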

1.4 Moving Stereo

Imagine now that a rigid stereo system consisting of cameras cl (left) and cr (right)

u l i P l X i ,
(32.12)
u r i P r X i ,
(32.13)

is attached to a moving robot, and let us observe this system at two time instances,

X 0 = R 1 X 1 + T 1 ,
(32.14)

where $\mathbf{X}_0$ are point coordinates with respect to the world coordinate system, usually assumed aligned with one of the camera instances, and $\mathbf{X}_1$ are the coordinates of the same point with respect to the camera rig after a motion $(R_1, T_1)$. To estimate the motion of the rig, we have to solve two correspondence problems: first, between the left and right image, and second, between the left (or right) image at the first time instance and the left (or right, respectively) image at the second time instance. The left-to-right correspondences enable the solution of the triangulation problem at each time instance. Motion can then be obtained by solving equations (32.14) for $(R_1, T_1)$, a problem called absolute orientation. Alternatively, one can avoid the second triangulation and solve the pose estimation problem between triangulated points in 3-D and points in the left image only. The most popular visual odometry system today is libviso [32.26], which is based on a moving stereo rig.

1.4.1 Absolute Orientation

The treatment of moving stereo here will be short, and the reader is referred to the similar treatment in the chapter about range sensing. We assume that correspondences between the two time instances have been established based on tracking in the images, so that we can formulate equations of the form

X 2 = R X 1 + T .

The standard way [32.20, 32.27] to solve this problem is to eliminate the translation by subtracting the centroids yielding

$$\mathbf{X}_2 - \bar{\mathbf{X}}_2 = R \left( \mathbf{X}_1 - \bar{\mathbf{X}}_1 \right) .$$

We need at least three points in total to obtain at least two noncollinear mean-free $\mathbf{X} - \bar{\mathbf{X}}$ vectors. If we concatenate the mean-free vectors for n points into $3 \times n$ matrices $A_{1,2}$, we can formulate the following minimization of the Frobenius norm

$$\min_{R \in SO(3)} \left\| A_2 - R A_1 \right\|_F ,$$

which is known as the Procrustes problem. It can be shown [32.20] that the solution is obtained through SVD as in (32.9) where U, V are obtained from the singular value decomposition

A 2 A 1 T = U S V T .

Solutions are usually obtained with RANSAC by sampling triples of points and verification with the Procrustes method.
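
A minimal sketch of the centroid-plus-Procrustes solution is given below, reusing the SO(3) projection of (32.9); the names are ours and the RANSAC loop around it is omitted.

```python
import numpy as np

def absolute_orientation(X1, X2):
    """Least-squares (R, T) such that X2 ~ R X1 + T for two n x 3 arrays of
    corresponding points: subtract centroids, solve the Procrustes problem via
    the SVD of A2 A1^T, and recover the translation from the centroids."""
    c1, c2 = X1.mean(axis=0), X2.mean(axis=0)
    A1, A2 = (X1 - c1).T, (X2 - c2).T                    # 3 x n mean-free matrices
    U, _, Vt = np.linalg.svd(A2 @ A1.T)
    R = U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt
    T = c2 - R @ c1
    return R, T
```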

1.5 Structure from Motion

Relax now the assumption that the projection matrices are known, and remain only with measuring and matching the corresponding points $\mathbf{u}_1$ and $\mathbf{u}_2$. This is the well-known structure from motion problem or, more precisely, structure and 3-D motion from 2-D motion. In photogrammetry it is well known as the relative orientation problem. Even after eliminating the λ's from equations (32.12), or by writing them in projective equivalence form,

$$\mathbf{u}_1 \approx P_1 \mathbf{X} , \quad \mathbf{u}_2 \approx P_2 \mathbf{X} ,$$
(32.15)

we realize that if $(\mathbf{X}, P_1, P_2)$ is a solution, then $(H\mathbf{X}, P_1 H^{-1}, P_2 H^{-1})$ is a solution, too, where H is an invertible 4 × 4 real matrix, in other words a collineation in $\mathbb{P}^3$. Even if we align the world coordinate system with the coordinate system of the first camera, which is common practice,

$$\mathbf{u}_1 \approx \begin{pmatrix} I & 0 \end{pmatrix} \mathbf{X} , \quad \mathbf{u}_2 \approx P_2 \mathbf{X} ,$$
(32.16)

we remain with the same ambiguity where H is of the form

$$H \approx \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ h_{41} & h_{42} & h_{43} & h_{44} \end{pmatrix} ,$$
(32.17)

with $h_{44} \neq 0$. This ambiguity is possible when the projection matrices are arbitrary rank-3 real matrices without any constraint on their elements. If we assume that we have calibrated our cameras, then the projection matrices depend only on displacements,

$$\mathbf{u}_1 \approx \begin{pmatrix} I & 0 \end{pmatrix} \mathbf{X} , \quad \mathbf{u}_2 \approx \begin{pmatrix} R & T \end{pmatrix} \mathbf{X} ,$$
(32.18)

and the only remaining ambiguity is the scale ambiguity, where H looks like an identity matrix except for $h_{44} = s \neq 1$ being the scale factor. In other words, if $(R, T, \mathbf{X})$ is a solution, then $(R, sT, \frac{1}{s}\mathbf{X})$ is a solution, too. These remarks generalize to multiple views. Because in robotics the $(R, T)$ matrices correspond to localization and $\mathbf{X}$ to mapping of the environment, the problem has the more proper term SLAM: simultaneous localization and mapping. However, because the term SLAM has been used with a variety of sensors like sonar and laser range scanners, the term monocular SLAM is better suited to describe structure from motion from multiple views [32.28].

1.5.1 Epipolar Geometry

This is probably one of the most studied problems in computer vision. We constrain ourselves to the calibrated case which is most relevant to robotics applications. The necessary and sufficient condition for the intersection of the two rays  R x 1 and  x 2 is that the two rays are coplanar with the baseline T

x 2 T ( T × R x 1 ) = 0 ,
(32.19)

which is the epipolar constraint (Fig. 32.2). To avoid the scale ambiguity we assume that T is a unit vector. We proceed by summarizing the unknowns into one matrix

E = T ^ R
(32.20)

where T ^ is the 3 × 3 skew-symmetric matrix to the vector T. The E matrix is called the essential matrix. The epipolar constraint reads then

x 2 T E x 1 = 0 ,
(32.21)

which is the equation of a line in the $\mathbf{x}_2$ plane with coefficients $E\mathbf{x}_1$, or of a line in the $\mathbf{x}_1$ plane with coefficients $E^T\mathbf{x}_2$. These lines are called epipolar lines and form pencils whose centers are the epipoles $\mathbf{e}_1$ and $\mathbf{e}_2$ in the first and second image plane, respectively. The epipoles are the intersections of the baseline with the two image planes, hence $\mathbf{e}_2 \approx T$ and $\mathbf{e}_1 \approx -R^T T$. Looking at the equations of the epipolar lines, we can immediately infer that $E \mathbf{e}_1 = 0$ and $E^T \mathbf{e}_2 = 0$.

Fig. 32.2 A point is perspectively projected to calibrated image vectors $R\mathbf{x}_1$ and $\mathbf{x}_2$, which are coplanar with the baseline T
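
The construction of the essential matrix and the relations derived above can be checked numerically with a few lines; the synthetic rotation, translation, and point below are ours.

```python
import numpy as np

def hat(t):
    """Skew-symmetric matrix such that hat(t) @ x = np.cross(t, x)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

theta = 0.1                                   # small rotation about the z-axis
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
T = np.array([1.0, 0.2, 0.0])
E = hat(T / np.linalg.norm(T)) @ R            # essential matrix (32.20)

X1 = np.array([0.3, -0.1, 4.0])               # point in the first camera frame
x1 = X1 / X1[2]                               # calibrated image coordinates
X2 = R @ X1 + T
x2 = X2 / X2[2]

print('epipolar residual (32.21):', x2 @ E @ x1)              # ~ 0
print('E e1 =', E @ (-R.T @ T), '  E^T e2 =', E.T @ T)         # both ~ 0
```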

The set of all essential matrices

$$\mathcal{E} = \left\{ E \in \mathbb{R}^{3 \times 3} \mid E = \hat{T} R , \ \text{where } T \in S^2 \text{ and } R \in SO(3) \right\}$$

has been characterized as a manifold of dimension 5 [32.29]. It has been proven [32.30] that

Proposition 32.1

A matrix $E \in \mathbb{R}^{3 \times 3}$ is essential if and only if it has two singular values equal to each other and the third singular value equal to zero.

We present here Nister's method [32.31] for recovering an essential matrix from five point correspondences, which has gained popularity because of its suitability for RANSAC methods.

1.5.2 Minimal Case

We expand the epipolar constraint in terms of homogeneous coordinates  x 1 = ( x 1 , y 1 , z 1 ) and  x 2 = ( x 2 , y 2 , z 2 ) (when the points are not at infinity z i = 1 ) and obtain

$$\begin{pmatrix} x_2 \mathbf{x}_1^T & y_2 \mathbf{x}_1^T & z_2 \mathbf{x}_1^T \end{pmatrix} E^s = 0 ,$$
(32.22)

where $E^s$ is the row-by-row stacked version of the matrix E. When we use only five point correspondences, the resulting linear homogeneous system has as a solution any vector in the four-dimensional kernel of the data matrix

E s = λ 1 u 1 + λ 2 u 2 + λ 3 u 3 + λ 4 u 4 .
(32.23)

At this point we want the matrix E resulting from Es to be an essential matrix satisfying Proposition 32.1. It has been proven [32.30] that

Proposition 32.2

A matrix  E R 3 × 3 is essential if and only if

$$E E^T E = \frac{1}{2} \operatorname{trace} \left( E E^T \right) E .$$
(32.24)

Though the $\det(E) = 0$ constraint can be inferred from (32.24), we still use it together with (32.24) to obtain ten cubic equations in the elements of E. As described in [32.31], one can then obtain a tenth-degree polynomial in $\lambda_4$. The number of real roots of this polynomial is computed with a Sturm sequence. There is no proof, beyond the physical plausibility of the existence of at least one solution, that a real root will exist at all. Several alternative five-point solvers have been proposed since Nister's paper [32.32, 32.33, 32.34, 32.35], and an extensive list including code has been compiled by Pajdla's group [32.36].
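
The linear part of the five-point method, namely the construction of the 5 × 9 data matrix of (32.22) and of the four-dimensional nullspace basis of (32.23), is easily sketched; enforcing the cubic constraints (32.24) on the linear combination is the involved part and is left to [32.31] or to existing solver implementations. The names below are ours.

```python
import numpy as np

def five_point_nullspace(x1, x2):
    """x1, x2: arrays of five corresponding calibrated homogeneous image points
    (5 x 3 each).  Returns four 3x3 matrices E_1..E_4 spanning the nullspace of
    the data matrix, so that E = sum_k lambda_k E_k as in (32.23)."""
    A = np.zeros((5, 9))
    for i in range(5):
        # coefficient of E_jk in x2^T E x1 = 0 is x2[i, j] * x1[i, k]
        A[i] = np.outer(x2[i], x1[i]).reshape(-1)
    _, _, Vt = np.linalg.svd(A)
    return [Vt[k].reshape(3, 3) for k in range(5, 9)]
```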

Assuming that we have recovered an essential matrix from point correspondences, the next task is to recover an orthogonal matrix R and a unit translation vector T from the essential matrix. If $E = U \operatorname{diag}(\sigma, \sigma, 0) V^T$, there are four solutions for the pair $(\hat{T}, R)$:

$$\begin{aligned} ( \hat{T}_1 , R_1 ) &= \left( U R_{z,+\pi/2} \Sigma U^T ,\ U R_{z,+\pi/2}^T V^T \right) , \\ ( \hat{T}_2 , R_2 ) &= \left( U R_{z,-\pi/2} \Sigma U^T ,\ U R_{z,-\pi/2}^T V^T \right) , \\ ( \hat{T}_1 , R_2 ) &= \left( U R_{z,+\pi/2} \Sigma U^T ,\ U R_{z,-\pi/2}^T V^T \right) , \\ ( \hat{T}_2 , R_1 ) &= \left( U R_{z,-\pi/2} \Sigma U^T ,\ U R_{z,+\pi/2}^T V^T \right) , \end{aligned}$$

where Rz denotes rotation around the z-axis. The four solutions can be split into two two-fold ambiguities:

  • Mirror ambiguity: If T is a solution, then −T is a solution, too. There is no way to disambiguate from the epipolar constraint: x 2 T ( ( - T ) × R x 1 ) = 0 .

  • Twisted pair ambiguity: If R is a solution, then $R_{T,\pi} R$ is a solution, too, where $R_{T,\pi}$ denotes a rotation by 180° around the baseline T. The first image is twisted around the baseline by 180 degrees.

These ambiguities are resolved by checking if depths of triangulated points are positive.
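
The disambiguation can be sketched as follows: enumerate the four factorizations of E and keep the one for which a triangulated correspondence has positive depth in both cameras. The single-point cheirality test and all names are ours; a robust implementation would vote over many correspondences.

```python
import numpy as np

def decompose_essential(E, x1, x2):
    """Return the (R, T) among the four factorizations of the essential matrix E
    for which the correspondence (x1, x2), given as calibrated homogeneous
    vectors, triangulates to positive depths in both cameras."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # R_z(+pi/2)
    for R in (U @ W @ Vt, U @ W.T @ Vt):
        for T in (U[:, 2], -U[:, 2]):
            # lambda_2 x2 = R (lambda_1 x1) + T  =>  [R x1, -x2] [l1, l2]^T = -T
            A = np.column_stack([R @ x1, -x2])
            lam, *_ = np.linalg.lstsq(A, -T, rcond=None)
            if lam[0] > 0 and lam[1] > 0:
                return R, T
    return None
```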

1.5.3 Critical Ambiguities

The approach with five point correspondences has a finite number of feasible solutions (feasible meaning that they may produce multiple interpretations of structures in front of the camera) when the points in the scene lie on a plane (a two-fold ambiguity) [32.37], or when the points in the scene and the camera centers lie on a double-sheet hyperboloid with the additional constraint that the camera centers lie symmetrically to the main generator of the hyperboloid [32.38]. These are inherent ambiguities which hold for any number of point correspondences when one seeks a solution for an exact essential matrix.

When one solves the linear least-squares system for the essential matrix, a planar scene, as well as the case of all points and the camera centers lying on a quadric, causes a rank deficiency of the system and thus infinitely many solutions for E.

Beyond the ambiguous situations, there is a considerable amount of literature regarding instabilities in the two-view problem. In particular, it has been shown [32.37, 32.39, 32.40] that a small field of view and insufficient depth variation can cause an indeterminacy in the estimation of the angle between translation and optical axis. If, in addition, the rotation is small, translation and rotation can be confounded [32.41]. Moreover, it has been shown that there exist local minima close to the global minimum that can fool any iterative scheme [32.42, 32.43].

1.5.4 3-Point SfM

Minimal solutions based on five points are still too slow to be used on mobile platforms, where, however, additional information like a reference gravity vector can be obtained from an IMU. We present here a recent solution using such a reference direction and only three points [32.44].

We are given three image correspondences from calibrated cameras, and a single directional correspondence like the gravity vector or a vanishing point. This problem is equivalent to finding the translation vector t and a rotation angle θ around an arbitrary rotation axis.

Let us choose the arbitrary rotation axis to be  e 2 = [ 0 , 1 , 0 ] T . After taking the directional constraint into account, from the initial five parameters in the essential matrix, we now only have to estimate three. We can use the axis-angle parameterization of a rotation matrix to rewrite the essential matrix constraint as follows

$${\mathbf{p}_2^i}^T \tilde{E} \, \mathbf{p}_1^i = 0 ,$$
(32.25)

where

$$\tilde{E} = \hat{\tilde{t}} \left( I + \sin\theta \, \hat{\mathbf{e}}_2 + ( 1 - \cos\theta ) \, \hat{\mathbf{e}}_2^2 \right) ,$$

and $\tilde{t} = ( x , y , 1 )^T$.

Each image point correspondence gives us one such equation, for a total of three equations in three unknowns (elements of t and θ). To create a polynomial system, we set s = sin θ and c = cos θ , and add the trigonometric constraint s 2 + c 2 - 1 = 0 , for a total of four equations in four unknowns. In order to reduce the number of unknowns, we choose the direction of the epipole by assuming that the translation vector  t ̃ has the form [ x , y , 1 ] . This means that for each  t ̃ that we recover, - t ̃ will also need to be considered as a possible solution.

Once we substitute for  E ̃ in (32.25), the resulting system of polynomial equations has the following form

$$a_{i1} x s + a_{i2} x c + a_{i3} y s + a_{i4} y c + a_{i5} x - a_{i2} s + a_{i1} c + a_{i6} = 0$$
(32.26)

for i = 1,..,3, and the equation

s 2 + c 2 - 1 = 0 .
(32.27)

This polynomial system can be solved in closed form and has up to four solutions. The total number of possible pose matrices arising from our formulation is therefore at most 8, when we take into account the fact that we have to consider the sign ambiguity in translation.
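
One possible route to the solution, not necessarily the one taken in [32.44], exploits the fact that (32.26) is linear in (x, y) once s and c are fixed: the consistency determinant of the three equations, combined with (32.27), reduces to a univariate polynomial. The sketch below follows this route symbolically; all names are ours, and the spurious roots introduced by the resultant have to be removed by back-substitution.

```python
import numpy as np
import sympy as sp

def vertical_3pt_candidates(p1, p2):
    """p1, p2: three corresponding calibrated homogeneous points (lists of
    length-3 vectors), expressed in frames already aligned with the known
    direction e2 = (0, 1, 0).  Returns candidate values of c = cos(theta)."""
    x, y, s, c = sp.symbols('x y s c', real=True)
    e2hat = sp.Matrix([[0, 0, 1], [0, 0, 0], [-1, 0, 0]])        # hat of e2
    R = sp.eye(3) + s * e2hat + (1 - c) * e2hat**2               # rotation about e2
    that = sp.Matrix([[0, -1, y], [1, 0, -x], [-y, x, 0]])       # hat of t = (x, y, 1)
    E = that * R
    eqs = [sp.expand((sp.Matrix(q2).T * E * sp.Matrix(q1))[0]) for q1, q2 in zip(p1, p2)]
    # Each equation is affine in (x, y); consistency of the 3x3 system requires det = 0
    M = sp.Matrix([[e.coeff(x), e.coeff(y), e.subs({x: 0, y: 0})] for e in eqs])
    det = sp.expand(M.det())
    # Eliminate s against s^2 + c^2 - 1 = 0 and solve the univariate polynomial in c
    poly_c = sp.Poly(sp.resultant(det, s**2 + c**2 - 1, s), c)
    roots = np.roots([float(a) for a in poly_c.all_coeffs()])
    return roots[np.abs(roots.imag) < 1e-8].real   # back-substitute to get s, x, y
```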

1.6 Multiple Views SfM

When we talk about simultaneous localization and mapping, we obviously mean estimation over a longer period of time. The question is how we integrate additional frames into our 3-D motion estimation (localization) process.

To exploit multiple frames we introduce rank constraints [32.45]. We assume that the world coordinate system coincides with the coordinate system of the first frame, that a scene point is projected to $\mathbf{x}_i$ in the i-th frame, and that its depth with respect to the first frame is $\lambda_1$,

λ i x i = R i ( λ 1 x 1 ) + T i .
(32.28)

Taking the cross product with xi and writing it for n frames yields a homogeneous system

$$\begin{pmatrix} \hat{\mathbf{x}}_2 R_2 \mathbf{x}_1 & \hat{\mathbf{x}}_2 T_2 \\ \vdots & \vdots \\ \hat{\mathbf{x}}_n R_n \mathbf{x}_1 & \hat{\mathbf{x}}_n T_n \end{pmatrix} \begin{pmatrix} \lambda_1 \\ 1 \end{pmatrix} = 0 ,$$
(32.29)

that has the depth of a point in the first frame as the only unknown. The $3(n-1) \times 2$ multiple view matrix has to have rank one [32.46], a constraint that implies both the epipolar and the trifocal constraints. The least-squares solution for the depth can easily be derived as

$$\lambda_1 = - \frac{ \sum_{i=2}^{n} \left( \mathbf{x}_i \times T_i \right)^T \left( \mathbf{x}_i \times R_i \mathbf{x}_1 \right) }{ \sum_{i=2}^{n} \left\| \mathbf{x}_i \times R_i \mathbf{x}_1 \right\|^2 } .$$
(32.30)
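
The depth formula translates directly into code; the helper and argument names are ours.

```python
import numpy as np

def hat(v):
    return np.array([[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]])

def depth_in_first_frame(x1, xs, Rs, Ts):
    """Least-squares depth lambda_1 of (32.30): x1 is the calibrated image point
    in the first frame; xs[i], Rs[i], Ts[i] are the point and the motion of
    frame i = 2..n relative to the first frame."""
    num, den = 0.0, 0.0
    for xi, Ri, Ti in zip(xs, Rs, Ts):
        a = hat(xi) @ (Ri @ x1)        # x_i x (R_i x_1)
        b = hat(xi) @ Ti               # x_i x T_i
        num += b @ a
        den += a @ a
    return -num / den
```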

Given a depth for each point we can solve for motion by rearranging the multiple views constraint (32.29) as

$$\begin{pmatrix} \lambda_1^1 \, \mathbf{x}_1^{1\,T} \otimes \hat{\mathbf{x}}_i^1 & \hat{\mathbf{x}}_i^1 \\ \vdots & \vdots \\ \lambda_1^n \, \mathbf{x}_1^{n\,T} \otimes \hat{\mathbf{x}}_i^n & \hat{\mathbf{x}}_i^n \end{pmatrix} \begin{pmatrix} R_i^{\rm stacked} \\ T_i \end{pmatrix} = 0 ,$$
(32.31)

where $\mathbf{x}_i^n$ is the n-th image point in the i-th frame, $(R_i, T_i)$ is the motion from the first to the i-th frame, and $R_i^{\rm stacked}$ is the $9 \times 1$ vector of the stacked elements of the rotation matrix $R_i$, so that the combined unknown vector has 12 elements. Suppose that $\mathbf{k}$ is the $12 \times 1$ kernel (or closest kernel in a least-squares sense) of the $3n \times 12$ matrix on the left-hand side, obtained through the singular value decomposition, and let us call A the 3 × 3 matrix obtained from the first nine elements of $\mathbf{k}$ and $\mathbf{a}$ the vector of elements 10–12. To obtain a rotation matrix, we follow the SVD steps used in the solution of absolute orientation, as in (32.9), to find the closest orthogonal matrix to an arbitrary invertible matrix.

1.6.1 Bundle Adjustment

On top of such an approach, a bundle adjustment [32.47] minimizes the sum of all deviations between image coordinates and the backprojections of the points to be reconstructed.

$$\arg \min_{R_f , T_f , \mathbf{X}_p} \; \boldsymbol{\epsilon}^T C^{-1} \boldsymbol{\epsilon} ,$$

minimized with respect to all 6 ( F - 1 ) motions and 3 N - 1 structure unknowns, where ϵ is the vector containing all errors

$$\boldsymbol{\epsilon}_p^f = \begin{pmatrix} x_p^f - \dfrac{ R_{11}^f X_p + R_{12}^f Y_p + R_{13}^f Z_p + T_x }{ R_{31}^f X_p + R_{32}^f Y_p + R_{33}^f Z_p + T_z } \\[2mm] y_p^f - \dfrac{ R_{21}^f X_p + R_{22}^f Y_p + R_{23}^f Z_p + T_y }{ R_{31}^f X_p + R_{32}^f Y_p + R_{33}^f Z_p + T_z } \end{pmatrix}$$

and C is the error covariance matrix. We will continue with the assumption that C = I.

Call the objective function  Φ ( u ) = ϵ ( u ) T ϵ ( u ) with u the vector of unknowns. Given a starting value for the vector of unknowns u we iterate with steps  Δ u by locally fitting a quadratic function to  Φ ( u )

$$\Phi ( \mathbf{u} + \Delta \mathbf{u} ) \approx \Phi ( \mathbf{u} ) + \Delta \mathbf{u}^T \nabla \Phi ( \mathbf{u} ) + \tfrac{1}{2} \Delta \mathbf{u}^T H ( \mathbf{u} ) \Delta \mathbf{u} ,$$

where $\nabla\Phi$ is the gradient and H is the Hessian of Φ. The minimum of this local quadratic approximation is at the $\Delta\mathbf{u}$ satisfying

$$H \Delta \mathbf{u} = - \nabla \Phi ( \mathbf{u} ) .$$

If  Φ ( u ) = ϵ ( u ) T ϵ ( u ) then

$$\nabla \Phi = 2 \sum_i \epsilon_i ( \mathbf{u} ) \, \nabla \epsilon_i ( \mathbf{u} ) = 2 J ( \mathbf{u} )^T \boldsymbol{\epsilon} ,$$

where the Jacobian J consists of elements

$$J_{ij} = \frac{\partial \epsilon_i}{\partial u_j} ,$$

and the Hessian reads

$$H = 2 \sum_i \left( \nabla \epsilon_i ( \mathbf{u} ) \, \nabla \epsilon_i ( \mathbf{u} )^T + \epsilon_i ( \mathbf{u} ) \, \frac{\partial^2 \epsilon_i}{\partial \mathbf{u}^2} \right) = 2 J ( \mathbf{u} )^T J ( \mathbf{u} ) + 2 \sum_i \epsilon_i ( \mathbf{u} ) \, \frac{\partial^2 \epsilon_i}{\partial \mathbf{u}^2} \approx 2 J ( \mathbf{u} )^T J ( \mathbf{u} ) ,$$

by omitting the second-order derivative terms inside the Hessian. This yields the Gauss–Newton iteration

( J T J ) Δ u = J T ϵ ,

involving the inversion of a $(6F + 3N - 7) \times (6F + 3N - 7)$ matrix. Bundle adjustment is about the art of inverting $(J^T J)$ efficiently.

Let us split the unknown vector u into u = ( a , b ) following [32.48] obtaining

  • 6 F - 6 motion unknowns a,

  • 3 P - 1 structure unknowns b,

and we will explain this case with the example of two motion parameter blocks $a_1$ and $a_2$, corresponding to two frames, and three unknown points $b_1, b_2, b_3$.

For keeping symmetry in writing we do not deal here with the global reference and the global scale ambiguity.

The Jacobian for two frames and three points has six pairs of rows (one pair for each image projection) and 21 columns corresponding to the 12 motion and nine structure unknowns

$$J = \nabla \boldsymbol{\epsilon} ( \mathbf{a} , \mathbf{b} ) = \Biggl( \underbrace{\begin{array}{cc} A_1^1 & 0 \\ 0 & A_1^2 \\ A_2^1 & 0 \\ 0 & A_2^2 \\ A_3^1 & 0 \\ 0 & A_3^2 \end{array}}_{\text{motion}} \ \underbrace{\begin{array}{ccc} B_1^1 & 0 & 0 \\ B_1^2 & 0 & 0 \\ 0 & B_2^1 & 0 \\ 0 & B_2^2 & 0 \\ 0 & 0 & B_3^1 \\ 0 & 0 & B_3^2 \end{array}}_{\text{structure}} \Biggr) ,$$

with the A matrices (2 × 6) and the B matrices (2 × 3) being the Jacobians of the error $\epsilon_i^f$ of the projection of the i-th point in the f-th frame. We observe now a pattern emerging in $J^T J$,

$$J^T J = \begin{pmatrix} U^1 & 0 & W_1^1 & W_2^1 & W_3^1 \\ 0 & U^2 & W_1^2 & W_2^2 & W_3^2 \\ \cdot & \cdot & V_1 & 0 & 0 \\ \cdot & \cdot & 0 & V_2 & 0 \\ \cdot & \cdot & 0 & 0 & V_3 \end{pmatrix} ,$$

with the block diagonals for motion and structure separated. Let us rewrite the basic iteration ( J T J ) Δ u = J T ϵ as

$$\begin{pmatrix} U & W \\ W^T & V \end{pmatrix} \begin{pmatrix} \Delta \mathbf{a} \\ \Delta \mathbf{b} \end{pmatrix} = \begin{pmatrix} \boldsymbol{\epsilon}_a \\ \boldsymbol{\epsilon}_b \end{pmatrix} ,$$

and premultiply both sides with an upper block-triangular matrix to obtain

$$\begin{pmatrix} I & - W V^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} U & W \\ W^T & V \end{pmatrix} \begin{pmatrix} \Delta \mathbf{a} \\ \Delta \mathbf{b} \end{pmatrix} = \begin{pmatrix} I & - W V^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} \boldsymbol{\epsilon}_a \\ \boldsymbol{\epsilon}_b \end{pmatrix} .$$

We find out that motion parameters can be updated separately by inverting a  6 F × 6 F matrix

$$\left( U - W V^{-1} W^T \right) \Delta \mathbf{a} = \boldsymbol{\epsilon}_a - W V^{-1} \boldsymbol{\epsilon}_b .$$

Each 3-D point can be updated separately by inverting a 3 × 3 matrix V

$$V \Delta \mathbf{b} = \boldsymbol{\epsilon}_b - W^T \Delta \mathbf{a} .$$

It is worth mentioning that bundle adjustment, though extremely slow, captures the correlation between motion estimates and structure (3-D point) estimates, which is artificially hidden in the iterative scheme of (32.29).
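
A minimal sketch of one such Gauss–Newton step with the Schur complement is given below. It assumes that the per-observation Jacobian blocks A (2 × 6) and B (2 × 3) and the residuals have been computed elsewhere; gauge fixing, damping (Levenberg–Marquardt), robust weighting, and sparse data structures are all omitted, the names are ours, and the sign convention of the right-hand side follows the equations above.

```python
import numpy as np

def schur_gauss_newton_step(obs, n_cams, n_pts, cam_dim=6, pt_dim=3):
    """obs: list of tuples (f, p, A, B, eps) with frame index f, point index p,
    the 2x6 and 2x3 Jacobian blocks, and the 2-vector reprojection error.
    Returns the motion update delta_a and the structure update delta_b."""
    U = np.zeros((n_cams, cam_dim, cam_dim))
    V = np.zeros((n_pts, pt_dim, pt_dim))
    W = np.zeros((n_cams, n_pts, cam_dim, pt_dim))
    ea = np.zeros((n_cams, cam_dim))
    eb = np.zeros((n_pts, pt_dim))
    for f, p, A, B, eps in obs:
        U[f] += A.T @ A
        V[p] += B.T @ B
        W[f, p] += A.T @ B
        ea[f] += A.T @ eps
        eb[p] += B.T @ eps
    Vinv = np.linalg.inv(V)                      # many small 3x3 inversions
    # Reduced camera system (U - W V^-1 W^T) delta_a = ea - W V^-1 eb
    S = np.zeros((n_cams * cam_dim, n_cams * cam_dim))
    rhs = ea.copy()
    for f in range(n_cams):
        S[f*cam_dim:(f+1)*cam_dim, f*cam_dim:(f+1)*cam_dim] += U[f]
        for p in range(n_pts):
            rhs[f] -= W[f, p] @ Vinv[p] @ eb[p]
            for g in range(n_cams):
                S[f*cam_dim:(f+1)*cam_dim, g*cam_dim:(g+1)*cam_dim] -= \
                    W[f, p] @ Vinv[p] @ W[g, p].T
    delta_a = np.linalg.solve(S, rhs.reshape(-1)).reshape(n_cams, cam_dim)
    # Back-substitute each 3-D point separately: V_p delta_b_p = eb_p - (W^T delta_a)_p
    delta_b = np.zeros((n_pts, pt_dim))
    for p in range(n_pts):
        delta_b[p] = Vinv[p] @ (eb[p] - sum(W[f, p].T @ delta_a[f] for f in range(n_cams)))
    return delta_a, delta_b
```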

The largest scale motion estimation and registration of views has been performed by Teller et al. [32.49] with a decoupled computation first of relative rotations and finally of relative translations. The above multiple view SfM techniques can also be applied in a sliding window mode in time. Davison et al. [32.28] showed the first real-time recursive approach by decoupling the direction of the viewing rays from the depth unknowns. For other recursive approaches the reader is referred to the corresponding SLAM chapter.

2 3-D Vision for Grasping

In this section we will move from the basic geometry required for grasping to the main 3-D vision challenges associated with the limited knowledge we might have about the shape of the object as well as the actual selection of 3-D grasping poses.

Naturally, object grasping and manipulation is closely related to general scene understanding and to problems such as object detection, recognition, categorization, and pose estimation. Taking all of the above into account, there are very few approaches that address all of these problems in a single system. One example, reported in [32.50], addresses the problem of enabling the transfer of grasp knowledge between object categories, defined using both their physical properties and their functionality. This is a challenging problem, given that a number of objects with similar physical properties afford different tasks. An example can be a screwdriver and a carrot, which are structurally alike but only the former can be used as a tool, or a ball and an orange, where only the latter affords eating (Fig. 32.3).

Fig. 32.3 Examples of physically similar objects that afford different tasks

In relation to object grasping in particular, there are methods that assume that a full 3-D model of the object is available and concentrate solely on grasp synthesis. In addition, many of the approaches conduct experiments in a simulated environment without working with real sensory data. However, the knowledge generated in simulation can also be applied later to sensory data. Another group of approaches considers grasp synthesis on real sensory data directly, dealing with problems such as noise, occlusions, and missing data.

If the object to be grasped is known, there are approaches that store a database of grasp hypotheses, generated either in simulation or through experiments in a real setting. Most of the approaches assume that a 3-D mesh of the object is available and the challenge is then to automatically generate a set of feasible grasp hypotheses. This involves sampling the infinite space of possible hand configurations and ranking the resulting grasps according to some quality metric.

To simplify the process, a common approach is to approximate the object's shape with a constellation of primitives such as spheres, cones, cylinders, boxes, or superquadrics [32.51, 32.52, 32.53, 32.54, 32.55]. The purpose of using shape primitives is to reduce the number of candidate grasps and thus prune the search space for finding the optimal set of grasp hypotheses.

One example, shown in Fig. 32.4 and reported in [32.52], decomposes a point cloud from a stereo camera into a constellation of boxes. Grasp planning is performed directly on the boxes, which reduces the number of potential grasps. El-Khoury and Sahbani [32.56] distinguish between graspable and nongraspable parts of an object, where each part is represented by fitting a superquadric to the point cloud data. Pelossof et al. [32.57] approximate an object with a single superquadric and use a support vector machine based approach to search for the grasp that maximizes the grasp quality. Boularias et al. [32.58] model an object as a Markov random field (MRF) in which the nodes are points from the point cloud and edges are spanned between the six nearest neighbors of a point. A node in the MRF carries one of two labels: a good or a bad grasp location. Detry et al. [32.59] model the object as a constellation of local multimodal contour descriptors. The set of associated grasp hypotheses is modeled as a nonparametric density function in the space of six-dimensional (6-D) gripper poses, referred to as a bootstrap density. Papazov et al. [32.60] demonstrate 3-D object recognition and pose estimation in a grasping scenario with cluttered scenes. Weisz and Allen [32.61] propose a metric suitable for predicting grasp stability under pose uncertainty.

Fig. 32.4 Generation of grasp candidates through object shape approximation and decomposition (after [32.52])

There are several approaches that deal specifically with incomplete point clouds. Marton et al. [32.63] exploit symmetry by fitting a curve to a cross section of the point cloud. Rao et al. [32.64] concentrate on depth segmentation and sample grasp points from the surface of a segmented object using surface normals. Bohg et al. [32.65] present a related approach that reconstructs the full object shape assuming planar symmetry and generates grasps based on the global shape of the object. Bone et al. [32.66] make no prior assumption about the shape of the object and apply shape carving for generating parallel-jaw gripper grasps. Hsiao et al. [32.67] employ heuristics for generating grasp hypotheses depending on the shape of the point cloud. Recent work in [32.68] identifies regions that afford force-closure grasps by evaluating the local curvature of the objects to create an initial opposing grasp with two or three fingers, depending on the relative size of the object with respect to the hand. Richtsfeld and Vincze [32.69] use a stereo-camera setup to generate a 3-D representation of a scene with several objects and then generate various top grasps on object candidates. Maldonado et al. [32.70] use time-of-flight range data, model objects using 3-D Gaussians, and rely on finger torque information during grasping to monitor the grasp execution. Stückler et al. [32.71] generate grasp hypotheses based on eigenvectors of the object's footprints, which are generated by projecting the 3-D object point cloud onto the supporting surface. The work of [32.72] presents a system for general scene understanding used for grasp planning and execution. The system uses a bottom-up grouping approach where contour and surface structures are used as the basis for grasp planning. The work builds upon previous work presented in [32.73].

Most of the recent work concentrates on grasp generalization, either by observing human grasping or through off- and on-line learning directly on the robot. Kroemer et al. [32.74] demonstrate generalization capabilities using a pouring task scenario. The goal of the approach is to find the part of the object that is most likely to afford the demonstrated action. The learning method is based on kernel logistic regression. Herzog et al. [32.75] store a set of local templates of objects that a human is interacting with. If a local part of an object segmented online is similar to a template in the database, the associated grasp hypothesis is executed. Song et al. [32.62] approach the problem of inferring a full grasp configuration in relation to a specific task the object is intended for. As in [32.76], the joint distribution over various grasping variables is modeled as a Bayesian network. Additional variables like task, object category, and task constraints are introduced. The structure of this model is learned from a large number of grasp examples generated in a simulator and annotated with grasp quality metrics as well as suitability for a specific task. The learned quality of grasps on specific objects given a task is visualized in Fig. 32.5.

Fig. 32.5 Ranking of approach vectors on different objects given a specific task. The brighter an area, the higher the rank; the darker an area, the lower the rank (after [32.62])

3 Conclusion and Further Reading

As main additional sources of reading, we recommend the textbooks by Hartley and Zisserman [32.12], Ma et al. [32.46], Faugeras [32.77], and Faugeras and Luong [32.11]. The reader is referred to Chap. 5 for fundamentals of estimation, to Chap. 35 for sensor fusion, to Chap. 34 for visual servoing, to Chap. 31 for range sensing, to Chap. 45 for 3-D models of the world, and to Chap. 46 for SLAM.

3-D vision is a rapidly advancing field, and in this chapter we have covered only geometric approaches based on RGB cameras. Although depth sensors will become ubiquitous indoors, and might become so outdoors as well, RGB cameras remain formidable because of the higher number and larger diversity of features that can be matched and used for pose estimation and 3-D modeling. Long-range sensing can still be covered from motion with large translation, while active sensors are constrained in terms of the energy reflected from the environment.