1 Introduction

This paper focuses on the problem of how an autonomous agent can discover how different tasks condition grasping. The challenge in task-based grasping is two-fold: first, the agent needs to understand the task requirements; second, it needs to translate these to the object. Formulating the requirements explicitly is unlikely to scale beyond simple tasks and objects. A promising approach is therefore to infer the requirements from data using a learning-by-demonstration approach.

As an example, given a set of demonstrations of pouring, the agent should realize, and generalize, that the actual constraint lies in not covering the opening, together with whatever additional constraints make a grasp placed on a specific object successful. This is the fundamental challenge facing data-driven task-based grasping.

An important body of work on task-based grasping using a generative model is that of Song et al. [1, 2]. The authors build a Bayesian network over gripper, object, and task parameters, making it possible to condition on the task and synthesize task-based grasps. Even though the approach shows impressive results on synthetic data, building statistical models over all observations places significant constraints on the dimensionality of the observations. It is thus questionable how the approach will perform as the number of tasks and objects increases.

A number of papers instead apply a discriminative approach and learn distributions over local features of the object, e.g., Saxena et al. [3] and Boularis et al. [4]. This avoids the complexity of the generative model but still requires a substantial amount of labeled data, often manually annotated, which is expensive to obtain.

Rather than using features, another successful approach is to represent objects as a set of local templates. Grasping can then be learned on a per-template basis, allowing grasps to be transferred between different objects. The template-based systems of Detry et al. [5], Herzog et al. [6], Kroemer et al. [7], and Ying et al. [8] have been shown to give good results. The main drawbacks of these systems are their analytical approaches to similarity and the complexity of constructing the template structure. Our method avoids both drawbacks by learning similarity from data: it learns a representation and uses it for matching when synthesizing grasps.

Fig. 1. (a) Flow diagram of learning and grasp planning. (b) Grasps executed using learned task constraints: handle, cork, pouring, and lift up.

Selecting the features that are relevant for modeling the constraints of a task is closely linked to affordances. Stark et al. [9] detect affordance cues, functional parts of the object that the human demonstrator uses for grasping; the agent then uses the cues to plan grasps on objects. Aleotti et al. [10] present a similar idea in which objects are represented by their topology. Each demonstrated grasp is associated with one part of the topology, and the agent then formulates task constraints around the associated parts.

Instead of trying to decompose or detect functional parts of objects, our method revolves around relating previous grasping experience and grounding task constraints in sensory data, similar to our previous work [11]. The features captured from the human demonstrator can therefore be considered object-specific affordance cues that our method learns to generalize.

2 Methodology

Our aim is to generate a set of grasps on a previously unseen object that are compatible with a given task, and to order them according to relevance.

In the learning process we observe a set of objects and grasps, placed by a human demonstrator, that are suitable for achieving the task. For each grasp on an object we extract a set of features capturing global and local aspects, which we combine into a vector \(x\). The feature vector \(x\) captured from a demonstration is labeled with a task variable \(t\).

Global features describe the object as a whole, for example how well it approximates a shape primitive. Local features are computed at the grasp of the human demonstrator, for example local curvature or color. We do not collect any pose information from the demonstrator, since robot-human pose transforms are not the aim of this paper. The demonstration procedure is illustrated in Fig. 3 and the features we capture are described in the Features section.

Given a novel object and a desired task, we wish to infer, from a set of synthesized grasps \(G =\{g_i\}_{i=1}^M\), which of them are compatible with the task. For each synthesized grasp we extract a feature vector in the same manner as in the demonstration process, giving \(X_G =\{x_i\}_{i=1}^M\); this is explained further in the Grasp Synthesis section. To select valid grasps from the synthesized set we apply \(k\)-Nearest Neighbor (k-NN) classification using the learned instances.

After extracting the set of valid grasps we want to produce a ranking that reflects how similar a synthesized grasp is to the learned instances. We use a Kernel Density Estimate (KDE) over the set of observed instances to produce a likelihood for each grasp. The KDE is formed using a Gaussian kernel and the assumption that a suggested grasp, \(x^{*}\), is conditionally independent of all other grasps except for its \(k\) nearest neighbors, that is,

$$\begin{aligned} p(x^{*}|\mathbf {X}_t) = \frac{1}{k} \sum _{x \in \mathbf {X}_{k}} \; \mathcal {N}(x^{*};x,\mathbf {\Sigma }), \end{aligned}$$
(1)

where \(\mathbf {X}_{k}\) is the set of \(k\) closest neighbors to \(x^{*}\) found in \(\mathbf {X}_t\), the observed instances of the specific task, and \(\mathbf {\Sigma }\) is a diagonal matrix. To execute the grasp on the object we choose the most likely synthesized grasp.
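The following is a minimal sketch of this selection-and-ranking step, assuming NumPy/SciPy and 0/1 task labels; the names (X_G, X_t, y_t, sigma_diag) are ours, and the plain Euclidean distance used here would in practice be replaced by the learned metric introduced in the next section.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rank_grasps(X_G, X_t, y_t, k=5, sigma_diag=None):
    """Classify each synthesized grasp by k-NN over the demonstrated
    instances and score it with the KDE of Eq. (1)."""
    if sigma_diag is None:
        # diagonal Sigma; a per-dimension variance estimate is one simple choice
        sigma_diag = np.var(X_t, axis=0) + 1e-6
    dist = cdist(X_G, X_t)   # Euclidean here; replaced by the learned metric in practice
    results = []
    for i in range(len(X_G)):
        nn = np.argsort(dist[i])[:k]                  # k nearest demonstrated instances
        label = np.bincount(y_t[nn]).argmax()         # k-NN majority vote
        diff = X_t[nn] - X_G[i]
        # log of a diagonal-covariance Gaussian kernel centered at each neighbor
        log_k = -0.5 * np.sum(diff**2 / sigma_diag + np.log(2 * np.pi * sigma_diag), axis=1)
        likelihood = np.exp(log_k).mean()             # Eq. (1): average of the k kernels
        results.append((label, likelihood))
    return results
```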

2.1 Metric Learning for Feature and Transfer Selection

Both k-NN and the Gaussian kernel use a Euclidean metric to measure feature proximity. This metric has several shortcomings: as the dimension of the feature space increases, it becomes susceptible to noise and to the inclusion of irrelevant features. Furthermore, different features have different natural distance measures, and it is unclear how to weight these against each other.

Fig. 2. Figuratively, the LMNN optimization selects a set of target neighbors for each instance and pulls those neighbors towards the instance, while pushing non-class members out of the perimeter of the neighborhood.

We instead consider the problem of measuring similarity as a representation learning problem, where we want to learn a task-dependent embedding from the data such that similarity translates into proximity in the embedding. In addition, we want the embedding to reflect which features are relevant to the task.

One family of metric learning algorithms that addresses the problems specified above learns a Mahalanobis metric; that is, a positive semidefinite (PSD) matrix \(\mathbf {M}\) that reflects the information given about instance distances in the training data.

Since \(\mathbf {M}\) is PSD, the Mahalanobis metric is equivalent to applying a transform \(\mathbf {L}\) to the data points, where \(\mathbf {M}= \mathbf {L}^T\mathbf {L}\), and then computing the Euclidean distance; that is,

$$\begin{aligned} D(x,y) = (x-y)^{T} \mathbf {M} (x-y) = \parallel \mathbf {L}(x-y) \parallel ^{2}_{2}. \end{aligned}$$
(2)
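As a quick numerical illustration of this equivalence (a sketch only; the matrices below are random and purely illustrative):

```python
# Tiny numerical check of Eq. (2): the Mahalanobis distance under M = L^T L
# equals the squared Euclidean distance after transforming with L.
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=(5, 5))
M = L.T @ L                              # positive semidefinite by construction
x, y = rng.normal(size=5), rng.normal(size=5)

d_mahalanobis = (x - y) @ M @ (x - y)
d_transformed = np.linalg.norm(L @ (x - y)) ** 2
assert np.isclose(d_mahalanobis, d_transformed)
```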

The method we employ in this paper is the Large Margin Nearest Neighbor (LMNN) classification algorithm [12], which formulates the learning of \(\mathbf {M}\) as a semidefinite program (SDP). The program compresses distances between subsets of instances of the same class, called target neighborhoods, while pushing non-class members, called impostors, outside a specified margin; see Fig. 2 for an illustration. The resulting optimization problem is

$$\begin{aligned} \begin{aligned}&\mathbf{minimize }&\\&(1-\mu )\Sigma _{i,i \leadsto j}(x_i-x_j)^{T} \mathbf {M} (x_i-x_j) + \mu \Sigma _{i,i \leadsto j,l}(1-y_{il})\xi _{ijl} \\&\mathbf{subject to }&\\&(x_i-x_l)^{T} \mathbf {M}(x_i-x_l) - (x_i-x_j)^{T} \mathbf {M} (x_i-x_j) \ge 1-\xi _{ijl}&\\&\xi _{ijl} \ge 0&\\&\mathbf {M} \succeq 0&\end{aligned} \end{aligned}$$
(3)

The first term in the objective function is the Mahalanobis distance, which penalizes large distances between target neighbors. The second term contains the slack variables, which mimic the convex hinge loss. The indices \(i,j\) denote target neighbors and \(l\) impostors. The slack variables \(\xi _{ijl}\), defined over all such triplets, measure the amount by which the margin is violated in the transformed space.
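The sketch below evaluates this objective for a fixed \(\mathbf {M}\); the sets target_pairs and triplets, which would be produced by the neighborhood construction of [12], are assumed to be given.

```python
import numpy as np

def lmnn_loss(M, X, target_pairs, triplets, mu=0.5):
    """Eq. (3) for a fixed M: pull term over target neighbors plus the
    hinge (slack) term over (target neighbor, impostor) triplets."""
    pull = sum(float((X[i] - X[j]) @ M @ (X[i] - X[j])) for i, j in target_pairs)
    push = 0.0
    for i, j, l in triplets:              # j is a target neighbor of i, l an impostor
        d_ij = (X[i] - X[j]) @ M @ (X[i] - X[j])
        d_il = (X[i] - X[l]) @ M @ (X[i] - X[l])
        push += max(0.0, 1.0 + d_ij - d_il)   # value of xi_ijl at the optimum
    return (1.0 - mu) * pull + mu * push
```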

The target neighbor subsets are determined either from a priori specified similarity information or, in its absence, as the instances with minimum Euclidean distance in the original space. The cardinality of the target neighborhoods is set manually.

\(\mathbf {M}\) is learned by exploiting the fact that most target neighborhoods are homogeneous, allowing the optimization to work on the subset of constraints involving those neighborhoods. \(\mathbf {M}\) can thus be found using sub-gradients of the loss function and standard gradient-based optimization, together with a step in which \(\mathbf {M}\) is projected onto the cone of positive semidefinite matrices [12].

Learning \(\mathbf {M}\) for \(D\)-dimensional data requires estimating the \(D^2\) entries of the matrix, the parameters of the model. With few, high-dimensional data points, overfitting can occur. Since data points are expensive to obtain, we want to reduce the number of parameters. By restricting \(\mathbf {M}\) to be diagonal, we only need to learn \(D\) parameters. Instead of penalizing non-zero off-diagonal elements of \(\mathbf {M}\) in the objective function, we simply set all off-diagonal elements to zero in the update step. The optimization process thus finds the diagonal matrix that minimizes the cost function.
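The following is a rough sketch of one such update under the diagonal restriction; step-size selection and the active-constraint bookkeeping of [12] are omitted, and the projection simply zeroes off-diagonal entries and clips negative diagonal entries, which keeps \(\mathbf {M}\) diagonal and PSD.

```python
import numpy as np

def diagonal_lmnn_step(M, X, target_pairs, triplets, mu=0.5, lr=1e-3):
    """One projected sub-gradient update of the loss in Eq. (3), followed by
    the diagonal/PSD projection described in the text."""
    D = X.shape[1]
    grad = np.zeros((D, D))
    for i, j in target_pairs:                        # gradient of the pull term
        d = X[i] - X[j]
        grad += (1.0 - mu) * np.outer(d, d)
    for i, j, l in triplets:                         # gradient of active hinge terms only
        d_ij, d_il = X[i] - X[j], X[i] - X[l]
        if 1.0 + d_ij @ M @ d_ij - d_il @ M @ d_il > 0.0:
            grad += mu * (np.outer(d_ij, d_ij) - np.outer(d_il, d_il))
    M = M - lr * grad
    # projection: keep only the diagonal and clip it at zero, which is both
    # the diagonal restriction and the PSD projection for diagonal matrices
    return np.diag(np.clip(np.diag(M), 0.0, None))
```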

LMNN translates neatly into our feature representation for grasps. The diagonalization of \(\mathbf {M}\) allows us to treat the diagonal elements as feature weights, where low values indicate non-discriminative, and therefore task-irrelevant, features. The matrix found by LMNN thus implicitly performs feature selection. By analyzing the magnitudes for a specific task we can compare them with our expectations of which features are important.

2.2 Features

We are interested in exploiting the fact that objects with similar function tend to have similar form. There is also evidence that humans make inferences about the world by considering both global and local observations [13, 14], and that combining both aspects enhances inference [15].

Given the above observations, we are interested in formulating features that capture both global and local aspects of the object. Since we cannot know a priori which of them will be important for a task, it is reasonable to prefer many simple and broad features over fewer, more discriminative ones. In addition, having multiple features helps when sensors fail or give incorrect information.

We capture information from a grasp by extracting features from a bounding box around the grasp and merging them with global descriptors of the object into a 104-dimensional vector. The process is illustrated in Fig. 3. We go over the features briefly below and provide motivation where needed.

  • [Shape Primitives] - Scoring functions for how well the object resembles a sphere, a cylinder, and a cuboid.

  • [Elongatedness] - The ratio of the minor axes to the major axis.

  • [Relative Free Volume] - The volume of the part of the object enclosed by the grasp cuboid divided by the volume of the object.

  • [Relative Grasping Position] - The relative position of the grasp along the main axes of the object, that is, the position of the grasp along each axis divided by the length of that axis.

  • [Openings] - The opening feature detects openings on objects, parametrized as 3D circles, and produces: a binary value for the existence of an opening; the angle that the grasp cuboid center makes with the opening's normal; and the signed relative orthogonal distance to the opening's plane.

  • [Bag of Words Histogram over Fast Point Feature Histograms] - To encode local curvature we create a Bag of Words (BoW) model over Fast Point Feature Histograms (FPFH) [16] using a small vocabulary of 30 codewords. For the points in a recorded grasp we compute a count-normalized histogram (CNH) over the codewords; see the sketch after this list.

  • [Intensity and Gradient Filters] - We apply gradient filters of first, second, and third order over the 2D window containing the object, and also extract the intensity values. For the points in a recorded grasp we compute CNHs of the filter responses and of the intensity values.

  • [Color Quantization] - We are interested in comparing colors at the grasp point, as there are recurring color patterns in human design. We segment the image using the segmentation algorithm of [17] and encode each patch with its closest color from a chart of 15 different colors in CIELab space. Each point in the object's point cloud thus gets a quantized color assigned to it. For a grasp we compute the CNH of the quantized colors of the points captured by the grasp.

  • [Color Uniformity] - We are also interested in the uniformity of the colors, since objects usually have single-colored areas that indicate some form of function. For a grasp we compute the entropy, mean, and variance of the color-quantization CNH.
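Several of the local features above reduce to the same operation: each point inside the grasp bounding box receives a discrete label (a BoW codeword, a quantized color, a binned filter response) and the labels are histogrammed and normalized. A minimal sketch, with assumed inputs:

```python
import numpy as np

def count_normalized_histogram(labels, n_bins):
    """labels: one integer label per point inside the grasp bounding box."""
    counts = np.bincount(labels, minlength=n_bins).astype(float)
    return counts / max(counts.sum(), 1.0)   # normalize so the bins sum to 1

# e.g. 30 BoW codewords over FPFH descriptors (codeword_ids assumed given):
# cnh_fpfh = count_normalized_histogram(codeword_ids, n_bins=30)
```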

2.3 Grasp Synthesis

Generating grasp candidates on a partial point cloud of an object is difficult. Models that rely on collision detection are not applicable due to the partial representation of the object. Our method for generating grasps mimics the method of [18] but instead uses the information in the point cloud and the point cloud normals. In short, we slice the object with a plane constructed from points in the point cloud and select an approach vector that is orthogonal to the plane. The width of the gripper is computed as the minimum point spread along the two orthogonal directions spanning the plane. The height of the gripper is set to a random value. The bounding box generated by the gripper is used to extract the feature representation of the synthesized grasp.
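Below is a hedged sketch of this synthesis loop. The exact plane construction and gripper model are not specified above, so the sketch only illustrates the overall flow under explicit assumptions: the slicing plane passes through a randomly chosen point with its estimated normal, and the gripper height range is invented for illustration.

```python
import numpy as np

def synthesize_grasp(points, normals, rng, slab=0.01, h_range=(0.02, 0.08)):
    """One synthesized grasp candidate: pick a point, slice the cloud with a
    plane through it, approach orthogonally to the plane, and size the
    gripper box from the point spread near the plane."""
    i = rng.integers(len(points))
    p, n = points[i], normals[i]                     # plane through p with normal n (assumed)
    approach = -n                                    # approach vector orthogonal to the plane
    ref = np.array([0.0, 0.0, 1.0]) if abs(n[2]) < 0.9 else np.array([1.0, 0.0, 0.0])
    u = np.cross(n, ref); u /= np.linalg.norm(u)     # two directions spanning the plane
    v = np.cross(n, u)
    near = points[np.abs((points - p) @ n) < slab]   # points close to the slicing plane
    width = min(np.ptp(near @ u), np.ptp(near @ v))  # minimum point spread
    height = rng.uniform(*h_range)                   # random gripper height (range assumed)
    return dict(center=p, approach=approach, width=width, height=height)
```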

Fig. 3. Procedure for learning from demonstration.

3 Experiments

In evaluating our approach, we want to verify a number of things:

  • That our method can order its nearest neighbors better than k-NN.

  • That the ordering is reasonable.

  • That the approach can separate different groupings of grasps reasonably well.

  • That the features LMNN picks out resonate with what could be expected for the specific task.

  • That the grasps we generate are in fact useful, placed at the proper positions, and executable.

3.1 Data Collection

We collect object-grasp data using a Kinect camera, with objects placed on a flat surface in front of the robot. The robot observes the scene, segments out the object, and records features over it. A human demonstrator provides both positive and negative grasp examples for four different tasks: lift up, pouring, handle, and corks. The robot observes the grasp placement and the local features captured by a box around the hand placement. The procedure is illustrated in Fig. 3.

Fig. 4. k-NN and LMNN mean accuracy and standard deviation in classification of demonstrated grasp instances.

Fig. 5. The 5 nearest neighbors to test instances for the four tasks, for k-NN and LMNN. A 0 and a red overlay indicate an instance of a different class than the test instance. The LMNN ordering is much more sensible than the k-NN one. For example, in the corks task, k-NN chooses the cork of a cleaning fluid bottle as the nearest neighbor, while LMNN picks the cork of a similar-looking pen. In the pouring task, where we have defined handle grasps as not suitable for pouring given the robot's limited gripper, LMNN produces a much more sensible ordering than k-NN, which includes several handle grasp instances.

Fig. 6. k-NN versus LMNN accuracy when using only the k-th neighbor for classification. The shallower decline of LMNN compared to k-NN indicates that the learned transform improves consistency.

Fig. 7. Mean values and standard deviations of the diagonal of the LMNN transformation matrix, per feature. A value close to zero means that the feature is considered irrelevant for separation by the LMNN. A value close to 1 with low standard deviation indicates that the feature is always present but of low relevance for separation. In (a), the lift-up task, we expect no feature to be predominant, since lift up is more pose dependent than feature dependent; as can be seen, intensity, color quantization, and the BoW are the most relevant features. In (b), the pouring task, we would expect the opening to be the most relevant feature, and it indeed has a value of around 0.4 with low standard deviation. The BoW, gradient, and intensity features are all significant factors, which can be explained by the similarity of objects that afford pouring. In (c), the handle grasp task, we would expect shape to play a predominant role; however, as the diagram indicates, the relative grasping position, color quantization, and the opening features are also factors. In (d), the grasping of corks, color quantization is an important factor, as could be expected. Primitive shape and elongatedness are also factors, which can be explained by the fact that most items with corks are elongated and have a more cylindrical shape.

3.2 Learning

We preprocess the non-histogram features by scaling them to unit variance. To evaluate the learning, we run LMNN and k-NN 100 times, using random splits of the collected data with a ratio of 70 % training and 30 % test data. We also make sure that each split contains at least 25 % positive and 25 % negative examples. The reason for this lies in the formulation of LMNN, which allows instances some slack in fulfilling the optimization constraints. The gradient descent optimization is thus prone to letting the smaller class violate the constraints while optimizing for the bigger class. If the relevant features of the positive examples are erased by the feature learning process, the comparison to previously learned instances in the grasp synthesis phase becomes reliant on irrelevant features.
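A small sketch of this split procedure, with the 70/30 ratio and 25 % class minimum taken from above and everything else (retry limit, RNG) assumed:

```python
import numpy as np

def balanced_split(y, rng, train_frac=0.7, min_class_frac=0.25, max_tries=1000):
    """Random 70/30 split, redrawn until every class makes up at least 25 %
    of both the training and the test side."""
    n = len(y)
    n_train = int(round(train_frac * n))
    for _ in range(max_tries):
        idx = rng.permutation(n)
        train, test = idx[:n_train], idx[n_train:]
        if all(np.mean(y[part] == c) >= min_class_frac
               for part in (train, test) for c in np.unique(y)):
            return train, test
    raise RuntimeError("could not find a sufficiently balanced split")
```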

We use the 5 nearest neighbors to evaluate the classification rate. As can be seen in Fig. 4, LMNN is superior to k-NN, performing better by roughly 3–7 % for all grasp classes. The curves for class agreement as a function of neighbor order in Fig. 6 are also consistent for LMNN, while k-NN declines quite rapidly as the order increases and has much higher variance. A possible explanation for the decline in the lift-up task is that the features that explain a correct grasping position may be less distinct than those for handles or corks.

Quantitatively, LMNN outperforms k-NN, as could be expected. Less clear is whether the embedding actually orders instances in a way that is sensible to a human observer. In Fig. 5 we include a set of grasp instances and their 5 closest neighbors. As can be seen, the ordering in the LMNN embedding makes more sense. For example, in the case of the pen cork, k-NN places the cork of a cleaning fluid bottle as the nearest neighbor, while LMNN puts a similar pen first. Similar orderings can be observed in the other examples, indicating that LMNN, through the learned embedding, finds a more sensible ordering.

3.3 Feature Selection

Feature selection in our context means that the LMNN diagonal has low-to-zero values for irrelevant features and high values for relevant ones. One way of analyzing the selection is to look at the mean value and standard deviation of each diagonal element across the 100 runs. A detailed analysis is provided in Fig. 7.

3.4 Grasping

To evaluate how well the model transfers the demonstrated instances to grasping novel objects, we select a set of unseen objects for each task and generate \(10000\) positive samples using 1-NN for classification. We pick the most likely grasp as the one to perform on the object. The final grasps are illustrated in Fig. 1.

4 Conclusion

We have presented an approach to task-based grasping based on metric learning. Our approach uses demonstrations to learn a task-specific representation in which distances reflect grasp similarity according to the demonstrated semantics. This allows us to use simple distance-based heuristics to transfer grasps from objects in the training database to novel objects. We show experiments on a real robot where grasps are executed on previously unseen objects according to the semantics learned during demonstration. In future work we plan to extend the model to include information about how the human demonstrator grasped the object, such as the approach vector and the type of grasp. Including this information will give the agent the ability to understand how the human demonstrator's grasp relates to its own embodiment.