1 Introduction

Reinforcement Learning (RL) of robot sensorimotor control policies has seen great advances in recent years, demonstrated for a wide range of motor tasks such as dexterous manipulation. This has translated into higher levels of dexterity than previously possible, typically demonstrated by the ability to reorient a grasped object in-hand using complex finger movements (Akkaya et al., 2019; Chen et al., 2021; Qi et al., 2022).

However, training a sensorimotor policy is still a difficult process, particularly for hard-exploration problems where the underlying state space exhibits complex structure, such as "narrow passages" between parts of the space that are accessible or useful. Manipulation is indeed such a problem: even when starting with the object secured between the digits, a random action can easily lead to a drop, and thus to an irrecoverable state. Finger-gaiting (Ma et al., 2011) further implies transitions between different subsets of fingers used to hold the object, all while maintaining stability. This leads to difficulty in exploration during training, since random perturbations in the policy action space are unlikely to discover narrow passages in state space. Current studies address this difficulty through a variety of means: restricting the task to simple convex objects (or to non-convex objects of significantly reduced size), relying on support surfaces to reduce the chances of a drop, tracking object pose through extrinsic sensing, etc.

Fig. 1: Our method illustrated with an abstract state-space consisting of narrow stable regions (beige) between large unstable regions (black). The proposed two-stage approach uses sampling-based planning to explore the challenging state-space and leverages the information within best paths by using state data as reset distribution and action data for imitation pre-training for efficient reinforcement learning

The difficulty of exploring problems with labyrinthine state space structures is far from new in robotics. In fact, the large and highly effective family of Sampling-Based Planning (SBP) algorithms was developed to address this exact problem. By expanding a known structure towards targets randomly sampled in the state space of the problem, SBP methods can explore even very high-dimensional state spaces in ways that are probabilistically complete, or guaranteed to converge to optimal trajectories. However, SBP algorithms are traditionally designed to find trajectories rather than policies. For problems with computationally demanding dynamics, SBP cannot be used online for previously unseen start states or to quickly correct when unexpected perturbations are encountered along the way.

Broadly, this work draws on the strengths of both RL and SBP methods in order to train motor control policies for in-hand manipulation with finger gaiting. We aim to manipulate more difficult objects, including concave shapes, while securing them at all times without relying on support surfaces. Furthermore, we aim to achieve large reorientation of the grasped object with purely intrinsic (tactile and proprioceptive) sensing. To achieve that, we use a non-holonomic RRT algorithm with added constraints to find approximate trajectories that explore the useful parts of the problem state space. We then use these trajectories to train complete RL policies under the full dynamics of the problem. In particular, we use state data sampled from the tree to construct an exploratory reset distribution, and we use action data from tree paths to obtain a warm-start policy through imitation learning; Fig. 1 illustrates this approach.

This extended version of our study contains a number of additions compared to the previous conference proceedings version (Khandate et al., 2023). We introduce Imitation Pre-training, a method for leveraging actions sampled from the RRT tree in order to further bootstrap learning and show that it further improves training efficiency. We extend our applications to the more difficult in-hand manipulation task of reorientation to a desired pose, which can be either a canonical pose known at training time (for example, as needed by downstream steps in an assembly process) or arbitrary and specified as a goal at run-time. Finally, we provide a number of additional insights obtained from an extensive ablation of our approach.

Overall, the main contributions of this work include:

  • To the best of our knowledge, we are the first to show that reset distributions generated via SBP with kinematic constraints can enable more efficient training of RL control policies for dexterous in-hand manipulation.

  • We show that SBP can explore useful parts of the manipulation state space, allowing RL to later fill in the gaps between approximate trajectories by learning appropriate actions under more realistic dynamic constraints. A warm-start policy obtained by imitating the actions from sampling-based trajectories provides an additional boost in training speed.

  • The exploration boost from SBP allows us to train policies for dexterous skills not previously shown, such as in-hand manipulation of concave shapes, with only intrinsic sensing and no support surfaces. We demonstrate these skills both in simulation and on real hardware.

2 Related work

2.1 Exploration in RL

Exploration methods for general RL operate under the strict assumption that the learning agent cannot teleport between states, mimicking the constraints of the real world. Under such constraints, proposed exploration methods include using intrinsic rewards (Pathak et al., 2017; Haarnoja et al., 2018) or improving action consistency via temporally correlated noise in policy actions (Amin et al., 2021) or parameter space noise (Plappert et al., 2017).

Fortunately, in cases where the policies are primarily trained in simulation, this requirement can be relaxed, and we can use our knowledge of the relevant state space to design effective exploration strategies. A number of these methods improve exploration by injecting useful states into the reset distribution during training. Nair et al. (2017) use states from human demonstrations in a block stacking task, while Ecoffet et al. (2019, 2021) use states previously visited by the learning agent itself for problems such as Atari games and robot motion planning. Tavakoli et al. (2018) evaluate various schemes for maintaining and resetting from the buffer of visited states. However, these schemes were evaluated only on benchmark continuous control tasks (Duan et al., 2016). From a theoretical perspective, Agarwal et al. (2020) show that a favorable reset state distribution provides a means to circumvent worst-case exploration issues using sample complexity analysis of policy gradients.

2.2 Combining SBP and RL

Finding feasible trajectories through a complex state space is a well-studied motion planning problem. Of particular interest to us are sampling-based methods such as Rapidly-exploring Random Trees (RRT) (LaValle, 1998; Karaman & Frazzoli, 2010; Webb & Berg, 2013) and Probabilistic Road Maps (PRM) (Kavraki et al., 1996, 1998). These families of methods have proven highly effective and are still being expanded. Stable Sparse-RRT (SST) and its optimal variant SST* (Li et al., 2021) are examples of recent sampling-based methods for high-dimensional motion planning with physics. However, the goal of these methods is finding (kinodynamic) trajectories between known start and goal states rather than closed-loop control policies which can handle deviations from the expected states.

Several approaches have tried to combine the exploratory ability of SBP with RL, leveraging planning for global exploration while learning a local control policy via RL (Chiang et al., 2019; Francis et al., 2020; Schramm & Boularias, 2022). These methods were primarily developed for and tested on navigation tasks, where nearby state space samples are generally easy to connect by an RL agent acting as a local planner. The LeaPER algorithm (Pinto et al., 2018) also uses plans obtained by RRT as a reset state distribution and learns policies for simple non-prehensile manipulation. However, the state space for the prehensile in-hand manipulation tasks we show here is highly constrained, with small useful regions and non-holonomic transitions. Other approaches use trajectories planned by SBP as expert demonstrations. Morere et al. (2020) recommend initializing RL with a policy trained on SBP trajectories treated as expert demonstrations. Alternatively, Jurgenson and Tamar (2019) and Ha et al. (2020) use planned trajectories in the replay buffer of an off-policy RL agent for multi-arm motion planning. First, these methods require that planned trajectories also include the actions used to achieve transitions, which SBP does not always provide. Next, it is unclear how off-policy RL can be combined with the extensive physics parallelism that has been vital in the recent success of on-policy methods for learning manipulation (Allshire et al., 2021; Makoviychuk et al., 2021; Chen et al., 2021).

2.3 Dexterous manipulation

Turning specifically to the problem of dexterous manipulation, a number of methods have been used to advance the state of the art, including planning, learning, and leveraging mechanical properties of the manipulator. Leveroni and Salisbury (1996) build a map of valid grasps and use search methods to generate gaits for planar reorientation, while Han and Trinkle (1998) consider finger-gaiting of a sphere and identify the non-holonomic nature of the problem. Some methods have also considered RRT for finger-gaiting in-hand manipulation (Yashima et al., 2003; Xu et al., 2007), but only in simulation and only for a spherical object. More recently, Morgan et al. demonstrate robust finger-gaiting for object reorientation using actor-critic reinforcement learning (Morgan et al., 2021) and multi-modal motion planning (Morgan et al., 2022), both in conjunction with a compliant, highly underactuated hand designed explicitly for this task. Bhatt et al. (2022) also demonstrate robust finger-gaiting and finger-pivoting manipulation with a soft compliant hand, but these skills were hand-designed and executed in an open-loop fashion rather than autonomously learned.

Model-free RL has also led to significant progress in dexterous manipulation, starting with OpenAI's demonstration of finger-gaiting and finger-pivoting (Akkaya et al., 2019), trained in simulation and translated to real hardware. However, this approach uses extensive extrinsic sensing infeasible outside a lab setting, and relies on support surfaces such as the palm underneath the object. Khandate et al. (2022) show dexterous finger-gaiting and finger-pivoting skills using only precision fingertip grasps to enable both palm-up and palm-down operation, but only on a range of simple convex shapes and in a simulated environment.

Recently, Makoviychuk et al. (2021) showed that GPU physics can be used to accelerate learning skills similar to OpenAI's. Allshire et al. (2021) used extensive domain randomization and sim-to-real transfer to re-orient a cube, but used a table top as an external support surface. Chen et al. (2021), Chen et al. (2022), and Handa et al. (2022) demonstrated in-hand re-orientation for a wide range of objects under palm-up and palm-down orientations of the hand, with extrinsic sensing providing dense object feedback.

Sievers et al. (2022), Pitz et al. (2023) and Röstel et al. (2023) demonstrated in-hand cube reorientation to a desired pose with purely tactile feedback. Qi et al. (2022) and Yin et al. (2023) used rapid motor adaptation to achieve effective sim-to-real transfer of tactile-only in-hand manipulation skills for continuous reorientation of small cylindrical and cube-like objects. Yuan et al. (2023) and Qi et al. (2023) use both visual and tactile sensing. We also learn tactile in-hand manipulation skills; critically, in our case, the exploration ability of SBP allows learning policies for more difficult tasks, such as in-hand manipulation of non-convex and large shapes, with only intrinsic sensing. We also achieve successful and robust sim-to-real transfer without extensive domain randomization or domain adaptation, by closing the sim-to-real gap via tactile feedback.

3 Problem description

We focus on the problem of achieving dexterous in-hand manipulation while simultaneously securing the manipulated object in a precision grasp. Keeping the object stable in the grasp during manipulation is needed in cases where a support surface is not available or the skill must be performed under different directions for gravity (i.e. palm up or palm down). However, it also creates a difficult class of manipulation problems, combining movement of both the fingers and the object with a constant requirement of maintaining stability.

Formally, our goal is to obtain a policy for issuing finger motor commands to achieve a desired object transformation. The state of our system at time t is denoted by \({\varvec{x}}_t=({\varvec{q}}_t, {\varvec{p}}_t)\), where \({\varvec{q}} \in \mathcal {R}^d\) is a vector containing the positions of the hand’s d degrees of freedom (joints), and \({\varvec{p}} \in \mathcal {R}^6\) contains the position and orientation of the object with respect to the hand. An action (or command) is denoted by the vector \({\varvec{a}} \in \mathcal {R}^d\) comprising new setpoints for the position controllers running at every joint.

For parts of our approach, we assume that a model of the forward dynamics of our environment (i.e. a physics simulator) is available for planning or training. We denote this model by \({\varvec{x}}_{t+1} = F({\varvec{x}}_t, {\varvec{a}}_t)\). We will show however that our results transfer to real robots using standard sim-to-real methods.

As discussed above, we also require that the hand maintains a stable precision grasp of the manipulated object at all times. Overall, this means that our problem is characterized by a high-dimensional state space, but only small parts of this state space are accessible to us: those where the hand is holding the object in a stable precision grasp. Furthermore, the transition function of our problem is non-holonomic: the subset of fingers that are tasked with holding the object at a specific moment, as well as the object itself, must move in a concerted fashion. Conceptually, the hand-object system must evolve on the complex union of high-dimensional manifolds that form our accessible states. Still, the problem state space must be effectively explored if we are to achieve dexterous manipulation with large object reorientation and finger gaiting.

4 Sampling-based state space exploration

To effectively explore our high-dimensional state space characterized by non-holonomic transitions, we turn to the well-known Rapidly-exploring Random Trees (RRT) algorithm. We leverage our knowledge of the manipulation domain to induce tree growth along the desired manifolds in state space. In particular, we expect two conditions to be met for any state: (1) the hand is in contact with the object only via fingertips, (2) the distribution of these contacts must be such that a stable grasp is possible.

Algorithm 1: General-purpose non-holonomic RRT (G-RRT)

Assuming system dynamics F() are available and fast to evaluate, we use here the general non-holonomic version of the RRT algorithm, which is able to determine an action that moves the agent towards a desired sample in state space via random sampling. We use the same version of this algorithm as described, for example, by  King et al. (2016), which we recapitulate here in Algorithm 1 and refer to as G-RRT.

The essence of this algorithm is the while loop in line 5: it is able to grow the tree in a desired direction by sampling a number \(K_{max}\) of random actions, then using the transition function F() of our problem to evaluate which of these produces a new node that is as close as possible to a sampled target.

Our only addition to the general-purpose algorithm is the stability check in line 8: a new node gets added to the tree only if it passes a stability check. This check consists of advancing the simulation for an additional 2 s with no change in the action; if, at the end of this interval, the object has not been dropped (i.e. the height of the object is above a threshold) then the new node is deemed stable and added to the tree. Assuming a typical simulation step of 2 ms, this implies 1000 additional calls to F() for each sample.
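For concreteness, the sketch below captures the structure of G-RRT as described above. It is a minimal illustration rather than the implementation used in the paper: the callables `step` (the transition function F), `is_stable` (the 2 s stability check), `sample_state`, `sample_action`, and `distance` are assumed to be supplied externally.

```python
# Minimal sketch of G-RRT (Algorithm 1). All callables are assumed inputs:
# `step(x, a)` implements F(), `is_stable(x)` performs the 2 s stability check,
# and `sample_state` / `sample_action` / `distance` define the sampling ranges
# and the metric. Data structures are illustrative only.
def g_rrt(x_root, step, is_stable, sample_state, sample_action, distance,
          n_iters=100_000, k_max=1024):
    nodes, parents, actions = [x_root], [None], [None]
    for _ in range(n_iters):
        x_target = sample_state()                      # random target in state space
        # The node closest to the target grows (biasing growth to the periphery).
        i_near = min(range(len(nodes)), key=lambda i: distance(nodes[i], x_target))
        x_near = nodes[i_near]
        # Inner loop: try K_max random actions, keep the one ending closest to the target.
        best = None
        for _ in range(k_max):
            a = sample_action()
            x_new = step(x_near, a)                    # one call to F()
            d = distance(x_new, x_target)
            if best is None or d < best[0]:
                best = (d, x_new, a)
        _, x_new, a = best
        # Add the node only if the grasp survives the stability check.
        if is_stable(x_new):
            nodes.append(x_new)
            parents.append(i_near)
            actions.append(a)
    return nodes, parents, actions
```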

Overall, the great advantage of this algorithm lies in its simplicity and generality. The only manipulation-specific component is the aforementioned stability check. However, its performance depends on \(K_{max}\) (i.e. the number of action samples at each iteration), and each of these samples requires a call to the transition function. This cost is alleviated by the advent of highly efficient, massively parallel physics engines implementing the transition function, an important research direction complementary to our study.

We note that in previous work (Khandate et al., 2023) we used an additional condition on the state: the hand must maintain at least three fingers in contact with the object. Empirically, we noticed that this three-contact constraint had minimal impact on the results of G-RRT; thus, we discontinued it in this work.

In the same prior work, we also introduced a manipulation-specific version of the non-holonomic RRT algorithm dubbed M-RRT, which does not require the use of the transition function for stability checks. Instead, M-RRT uses manipulation kinematics alone to explore manifolds defined by the contact constraints. This version also required that the hand must maintain at least three fingers in contact with the object. While the M-RRT version has the potential to obtain significant speedups by foregoing dynamic simulation, G-RRT is appealing in its generality and simplicity. As both versions are able to effectively explore the state space, we focus this study on the G-RRT version of our algorithm and plan additional comparisons between these two versions for future work.

5 Reinforcement learning

While the exploration method we have discussed so far is capable of exploring the complex state space of in-hand manipulation and identifying approximate transitions that follow the complex manifold structure of this space, it does not provide directly usable policies. The state transitions themselves provide no mechanism to act in states that are not part of the tree, or to act under slightly different transition functions.

To generate closed-loop policies able to handle variability in the encountered states, we turn to RL algorithms. Critically, we rely on the trees generated by our sampling-based algorithms to ensure effective exploration of the state space during policy training.

Algorithm 2: R\(\times \)R (+ IPT)

5.1 Sampling-based reset distribution

One mechanism we use to transition information from the sampling-based tree to the policy training method is via the reset distribution: we select relevant paths from the planned tree and then use the nodes therein as reset states for policy training.

We note that the sampling-based trees as described here are task-agnostic. Their effectiveness lies in achieving good coverage of the state space (usually within pre-specified limits along each dimension). Once a specific task is prescribed (e.g. via a reward function), we must select states from paths through the tree that are relevant to the task. After selecting said states, we use a uniform distribution over them as a reset distribution for RL. We describe this concretely in Algorithm 2. In particular, lines 2–15 show the heuristic approach we use to select paths through the tree and compose the set of reset states, while lines 18–27 show reinforcement learning using these reset states for exploration. We note that other selection mechanisms are also possible; a promising and more general direction for future studies is to select tree branches that accumulate the highest reward.
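As a rough illustration of this selection step (Algorithm 2, lines 2–15), the sketch below backtracks root-to-leaf paths from the tree produced by the previous snippet, scores them with a task-specific heuristic (e.g. net z-axis rotation for Finger-gaiting), and resets episodes uniformly over the retained states. The scoring function and the counts kept are placeholders, not the exact values used in our experiments.

```python
# Illustrative construction of the exploratory reset distribution from a
# task-agnostic tree. `nodes`/`parents` come from the G-RRT sketch above;
# `path_score` is a task-specific heuristic. All counts are placeholders.
import numpy as np

def backtrack(i, parents):
    path = []
    while i is not None:
        path.append(i)
        i = parents[i]
    return path[::-1]                                   # root -> node

def build_reset_states(nodes, parents, path_score, n_paths=400, n_states=20_000):
    leaves = set(range(len(nodes))) - {p for p in parents if p is not None}
    paths = [backtrack(i, parents) for i in leaves]
    paths.sort(key=lambda p: path_score([nodes[i] for i in p]), reverse=True)
    states = [nodes[i] for p in paths[:n_paths] for i in p]
    return states[:n_states]

def sample_reset(reset_states, rng=np.random.default_rng()):
    # Uniform distribution over the selected tree states.
    return reset_states[rng.integers(len(reset_states))]
```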

This approach has a theoretical grounding in related work showing that derived reset distributions can be used to aid exploration. Let \(\rho \) denote the initial state distribution of the MDP to be solved with policy-gradient RL. Recent results (Agarwal et al., 2020) show that it is beneficial to compute policy gradients for policy \(\pi \) under a different initial distribution \(\mu \) if it enables sufficient exploration. If \(d^{\pi }_{\rho }\) denotes the stationary state distribution induced by policy \(\pi \) under the initial state distribution \(\rho \), then \(d^{\pi ^*}_{\rho }\) is the stationary distribution of the optimal policy \(\pi ^*\) with initial state distribution \(\rho \). Improved exploration can be achieved if \(\mu \) sufficiently covers \(d^{\pi ^*}_{\rho }\). Here, we obtain such a favorable distribution \(\mu \) from plans derived via sampling-based planning.
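To make this concrete, the quantity governing this effect in that analysis is, in rough form, the distribution mismatch coefficient

$$\begin{aligned} \left\| \frac{d^{\pi ^*}_{\rho }}{\mu } \right\| _{\infty } = \sup _{{\varvec{x}}} \frac{d^{\pi ^*}_{\rho }({\varvec{x}})}{\mu ({\varvec{x}})}, \end{aligned}$$

which appears as a multiplicative factor in the suboptimality bounds of policy-gradient methods (Agarwal et al., 2020); a reset distribution \(\mu \) that places mass wherever the optimal policy spends time keeps this coefficient, and hence the exploration burden, small.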

As this approach is compatible with online RL wherein policy rollouts are collected from a new set of states every episode, both off-policy and on-policy RL are equally feasible. However, we use on-policy learning due to its compatibility with GPU physics simulators and relative training stability.

5.2 Imitation pre-training

Algorithm 3: Imitation Pre-training (IPT)

Relying on sampling-based trajectories exclusively for the reset state distribution disregards potentially valuable information concerning actions and the connectivity between states within the exploration trees. Imitation learning, which maps states to actions via supervised learning, can be effective in using the action information embedded within these trees. However, this is not straightforward: the sampling-based trajectories cannot be treated as authentic demonstrations. Instead, they are quasi-static demonstrations composed of a sequence of stable states, a consequence of the stability constraint (Algorithm 1, line 8) we enforce during sampling-based planning. Due to this stability check, the actions are a distorted version of the desired actions under full dynamics. Thus, imitation learning, for instance with Behavioral Cloning (BC), is unlikely to yield a successful policy on its own, due to the approximate nature of the demonstrations and the inherent challenges arising from distribution shift. Nonetheless, the resulting policy can be used as a warm start for reinforcement learning.

Fig. 2: The object shapes for which we learn finger-gaiting. From left to right: the easy, medium and hard categories

Here we provide a description of this approach. First, we observe that the difference in joint angles between node transitions is a reliable approximation of the desired action, particularly its direction. Augmented by a scaling hyper-parameter, these differences can serve as action labels. For illustration, let \({\varvec{q}}_k\) and \({\varvec{q}}_{k+1}\) be the joint angles at successive states. The action label \({\varvec{a}}_k\) can be obtained by simply scaling the difference by a factor \(\beta \), i.e. \({\varvec{a}}_k = \beta ({\varvec{q}}_{k+1} - {\varvec{q}}_{k})\). In cases where obtaining such action labels is more complex, an inverse dynamics model can be of assistance. Equation (1) states this more generally:

$$\begin{aligned} {\varvec{a}}_k&= g({\varvec{x}}_{k}, {\varvec{x}}_{k+1}) \end{aligned}$$
(1)

where g is the inverse function to compute action labels between two successive states of a demonstration.
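As a concrete (if simplified) instance of Eq. (1) with the joint-difference inverse model above, the snippet below turns tree paths into a behavioral-cloning dataset; `observe` stands in for whatever observation the policy consumes (Eq. (2) or (3)) and is an assumption of this sketch.

```python
# Sketch of building imitation labels from tree paths via Eq. (1) with
# a_k = beta * (q_{k+1} - q_k). `paths` holds sequences of (q, p) states and
# `observe(q, p)` is a placeholder for the policy's observation function.
import numpy as np

def make_bc_dataset(paths, observe, beta=2.0):
    obs, act = [], []
    for path in paths:
        for (q_k, p_k), (q_next, _) in zip(path[:-1], path[1:]):
            obs.append(observe(q_k, p_k))
            act.append(beta * (np.asarray(q_next) - np.asarray(q_k)))
    return np.stack(obs), np.stack(act)
```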

We then utilize these demonstrations by learning an imitation policy to bootstrap reinforcement learning. To retain the benefits of any generalization properties achieved via imitation learning, we pre-train the critic network on rollouts of the imitation policy, to avoid washing out the warm-start policy with a randomly initialized critic (Hansen et al., 2022). This method aligns with contemporary approaches for integrating reinforcement learning with imitation learning, which go beyond simply augmenting the online replay buffer with demonstrations (Hu et al., 2023). Our main approach, supplemented with imitation pre-training, is depicted in Algorithm 3. Lines 16 and 18 show imitation pre-training on sampling-based demonstrations and the warm start of reinforcement learning.
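A minimal sketch of this warm-start procedure is given below, assuming the dataset from the previous snippet, PyTorch networks for the actor and critic, and a separately collected buffer of (state, discounted return) pairs from imitation-policy rollouts; architectures, hyper-parameters, and the rollout buffer are placeholders, not the exact setup used in the paper.

```python
# Generic warm start: behavior-clone the actor on tree-derived labels, then
# pre-train the critic on returns of imitation-policy rollouts so that RL does
# not start from a randomly initialized value function.
import torch

def warm_start(policy, critic, bc_obs, bc_act, roll_states, roll_returns,
               bc_epochs=50, critic_epochs=50, lr=3e-4):
    mse = torch.nn.functional.mse_loss
    opt_pi = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(bc_epochs):                     # 1) behavioral cloning
        opt_pi.zero_grad()
        mse(policy(bc_obs), bc_act).backward()
        opt_pi.step()
    opt_v = torch.optim.Adam(critic.parameters(), lr=lr)
    for _ in range(critic_epochs):                 # 2) critic pre-training
        opt_v.zero_grad()
        mse(critic(roll_states).squeeze(-1), roll_returns).backward()
        opt_v.step()
    return policy, critic                          # used to initialize the PPO actor and critic
```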

6 Experimental setup and tasks

6.1 Hardware

We use the robot hand shown in Fig. 1, consisting of five identical fingers. Each finger comprises a roll joint and two flexion joints for a total of 15 fully actuated position-controlled joints. For the real hardware setup, each joint is powered by a Dynamixel XM430-210T servo motor. The distal link of each finger consists of an optics-based tactile fingertip as introduced by Piacenza et al. (2020).

We test our methods on the object shapes illustrated in Fig. 2. We split these into three categories: "easy" objects (sphere, cube, cylinder), "medium" objects (cuboids with elongated aspect ratios), and "hard" objects (concave L- or U-shapes), and train object-specific policies for each object with both the proposed methods and the baseline methods discussed later. We note that in-hand manipulation of the objects in the "hard" category has not been previously demonstrated in the literature.

6.2 Exploration trees

The extensive sampling of possible actions, which is the main computational expense of G-RRT (line 5), lends itself well to parallelization. In practice, we use the IsaacGym simulator (Makoviychuk et al., 2021) to parallelize this algorithm at two levels (16 parallel evaluations of the main loop, and 1024 parallel evaluations of the action sampling loop). In general, given the advent of increasingly powerful parallel architectures for physics simulation, we expect that more general methods that are easier to parallelize might win out in the long term over more problem-specific solutions that are more sample-efficient at the individual thread level.
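The sketch below illustrates this two-level parallelization with a generic batched-simulator interface; `envs.set_states`, `envs.step`, and the broadcasting `distance` function are stand-ins for illustration, not the actual IsaacGym API. Sixteen tree expansions each evaluate 1024 candidate actions in a single batched call to the transition function.

```python
# Two-level parallel tree expansion: n_trees expansions x k_max candidate
# actions evaluated in one batched simulator call. `envs`, `sample_actions`,
# and `distance` are assumed, generic interfaces; shapes are annotated.
import numpy as np

def expand_parallel(envs, near_states, targets, sample_actions, distance,
                    n_trees=16, k_max=1024):
    states = np.repeat(near_states, k_max, axis=0)       # (n_trees*k_max, state_dim)
    actions = sample_actions(n_trees * k_max)            # (n_trees*k_max, action_dim)
    envs.set_states(states)
    next_states = envs.step(actions)                     # one batched call to F()
    next_states = next_states.reshape(n_trees, k_max, -1)
    actions = actions.reshape(n_trees, k_max, -1)
    dists = distance(next_states, targets[:, None, :])   # (n_trees, k_max)
    best = dists.argmin(axis=1)
    idx = np.arange(n_trees)
    return next_states[idx, best], actions[idx, best]    # best candidate per tree
```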

6.3 Reinforcement learning

We train our policies using Asymmetric Actor Critic PPO (Pinto et al., 2017; Schulman et al., 2017), in which the critic receives additional privileged feedback unavailable to the actor. All training is done in the IsaacGym simulator, and the critic uses object pose \(\varvec{p}\), object velocity \(\dot{\varvec{p}}\), and the net contact force on each fingertip \(\varvec{t}_1 \ldots \varvec{t}_{m}\) as feedback in addition to the feedback already provided to the policy network.

For Imitation Pre-training, we compute action labels from state transitions using \(\beta =2\) in Eq. (1). We train the critic on rollouts of the imitation policy for 2 M steps, to avoid washing out the warm-start policy with a randomly initialized critic network.

6.4 Tasks

We evaluate our method on a variety of challenging dexterous in-hand manipulation tasks.

6.4.1 Finger-gaiting

First, we focus on the task of achieving large in-hand object rotation about a desired axis. We, like others before us (Qi et al., 2022), believe this task to be representative of this general class of problems, since it requires extensive finger gaiting and object reorientation.

We chose to focus on the case where the only sensing available is hand-centric, either tactile or proprioceptive. Achieving dexterity with only such intrinsic sensing, as biological organisms clearly can, leads to skills that are robust to occlusion and lighting, and that can operate in very constrained settings. With this goal in mind, the observation available to our policy consists of tactile and proprioceptive data collected by the hand, with no global object pose information. Formally, the observation vector is

$$\begin{aligned} {\varvec{o}}_t = [{\varvec{q}}_t, {\varvec{q}}^s_t, {\varvec{c}}_t] \end{aligned}$$
(2)

where \({\varvec{q}}_t, {\varvec{q}}^s_t \in \mathcal {R}^d\) are the current positions and setpoints of the joints, and \({\varvec{c}}_t \in [0, 1]^m\) is the vector representing binary (contact / no-contact) touch feedback for each of m fingers of the hand.

In this task, where our goal is finger-gaiting for z-axis object rotation, we plan trees in which object rotation around the x- and y-axes is restricted to 0.2 radians. We then select \(2 \times 10^4\) nodes from the paths that exhibit the most rotation around the z-axis, to construct our reset distribution and for imitation pre-training. On average, each such path comprises 20–30 nodes. We recall that all tree nodes are subjected to an explicit stability check under full system dynamics before being added to the tree; we can thus use each of them as is.

Similar to our previous work (Khandate et al., 2022), we use a reward function that rewards object angular velocity about the z-axis. In addition, we include penalties for the object's translational velocity and its deviation from the initial position (Qi et al., 2022).
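A simplified version of this reward is sketched below; the weights are hypothetical placeholders and the exact shaping terms used in training are omitted.

```python
# Illustrative Finger-gaiting reward: reward z-axis object angular velocity,
# penalize translational velocity and drift from the initial object position.
# The weights w_* are placeholders, not tuned values.
import numpy as np

def finger_gaiting_reward(omega, lin_vel, pos, pos_init,
                          w_rot=1.0, w_lin=0.3, w_drift=0.3):
    return (w_rot * omega[2]
            - w_lin * np.linalg.norm(lin_vel)
            - w_drift * np.linalg.norm(pos - pos_init))
```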

6.4.2 Arbitrary reorientation

While finger-gaiting to achieve maximum rotation is a good proof-of-concept task with several applications, we now consider the canonically studied and versatile in-hand manipulation skill of reorientation to a desired pose. We continue with the requirement to achieve in-hand manipulation with only stable fingertip grasps as it enables reorientation in arbitrary orientations of the hand.

We take on the challenging task of reorienting an object from a randomly assigned initial pose to a desired arbitrary orientation, a frequently studied task in dexterous in-hand manipulation (Chen et al., 2021; Yuan et al., 2023). In addition to the proprioceptive and tactile data collected by the hand, we include the current and desired object orientations as inputs to the policy, resulting in the following observation vector:

$$\begin{aligned} {\varvec{o}}_t = [{\varvec{q}}_t, {\varvec{q}}^s_t, {\varvec{c}}_t, {\varvec{\phi }}_t, {\varvec{\phi }}_g] \end{aligned}$$
(3)

where \({\varvec{\phi }}_t, {\varvec{\phi }}_g \in \mathcal {R}^4\) are the current and desired object orientations, respectively. Our reward function is a modification of the one proposed by Chen et al. (2021), as we found that the original reward function fails in our setup. Equation (4) describes the reward function we use for this task:

$$\begin{aligned} r_t = c \cdot \max (\min (\Delta _t - \Delta _{t-1}, \epsilon ), -\epsilon ) + c_{\text {success}} \cdot \mathbb {1}[\text {success}] \end{aligned}$$
(4)

where \(\Delta _t\) is the angular distance between the current and desired object orientations, \(\epsilon > 0\) is a threshold coefficient, and \(c < 0\) is a scaling coefficient. A large positive reward, denoted \(c_{\text {success}}\), is added if the agent successfully completes the task. Task completion requires successful reorientation plus satisfaction of heuristic criteria adapted from previous work (Chen et al., 2022), listed below (a code sketch follows the list):

  1. Reorientation criterion: \(\Delta _t < \theta _{\text {thresh}}\)

  2. Joint angular velocity criterion: \(\Vert {\varvec{\dot{q}}}_t\Vert _2 < \dot{q}_{\text {thresh}}\)

  3. Object linear velocity criterion: \(\Vert {\varvec{\dot{x}}}_t\Vert _2 < \dot{x}_{\text {thresh}}\)

  4. Object angular velocity criterion: \(\Vert {\varvec{\omega }}_t\Vert _2 < \omega _{\text {thresh}}\)
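The sketch below (referenced above) illustrates Eq. (4) together with these criteria; all thresholds and coefficients are illustrative placeholders rather than tuned values.

```python
# Sketch of the Arbitrary Reorientation reward (Eq. 4) and success check.
# delta_t / delta_prev are angular distances to the goal at consecutive steps;
# all thresholds and coefficients below are placeholders.
import numpy as np

def reorientation_reward(delta_t, delta_prev, success,
                         c=-1.0, eps=0.1, c_success=250.0):
    # c < 0, so reducing the angular distance yields a positive (clipped) reward.
    r = c * np.clip(delta_t - delta_prev, -eps, eps)
    return r + (c_success if success else 0.0)

def is_success(delta_t, qdot, xdot, omega, theta_thresh=0.1,
               qdot_thresh=0.5, xdot_thresh=0.05, omega_thresh=0.5):
    return (delta_t < theta_thresh
            and np.linalg.norm(qdot) < qdot_thresh
            and np.linalg.norm(xdot) < xdot_thresh
            and np.linalg.norm(omega) < omega_thresh)
```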

Unlike the previous task, where a subset of the tree needs to be extracted to aid exploration, here the full tree (i.e., all of its nodes) can be used to assemble the buffer of exploratory reset states. To ensure complete exploration of the state space, we use G-RRT trees generated without any constraints on object orientation.

6.4.3 Go-to-root

While our method improves the sample efficiency of learning policies for arbitrary reorientation, obtaining such policies remains computationally expensive, requiring a few tens of billions of simulation training steps. Learning such policies can also be overkill: often, in practice, it is sufficient to reorient the object to a fixed canonical pose starting from an arbitrary initial pose. In a packaging line, for example, items arrive at arbitrary orientations, and the essential in-hand manipulation task is to reorient them to a canonical pose before inserting them in the packaging container.

However, learning to reorient the object to a fixed desired orientation starting from an arbitrary initial pose is still difficult as it suffers the same exploration challenges seen in previous tasks. Fortunately, our method of using exploration trees is naturally well suited for this problem. The robot state with the desired canonical orientation can itself be used as the root node while generating the exploration tree. Thus, we refer to this as the “Go-to-root” task.

In this task, we aim to learn a policy to reorient the grasped object to reach a desired canonical / root orientation. The inputs to the policy network are similar to the arbitrary reorientation task, consisting of proprioceptive, tactile, and current object pose feedback but excluding the desired goal orientation as it is fixed, resulting in the observation vector in Eq. (5). The reward function and success criteria are identical to the arbitrary reorientation task except with fixed canonical orientation \({\varvec{\phi }}_g\) as the goal.

$$\begin{aligned} {\varvec{o}}_t = [{\varvec{q}}_t, {\varvec{q}}^s_t, {\varvec{c}}_t, {\varvec{\phi }}_t] \end{aligned}$$
(5)

For imitation pre-training, we extract plans by backtracking from randomly selected nodes with large displacement from the root. This amounts to about 3 K unique observation-action pairs for the Finger-gaiting task and about 15 K unique observation-action pairs for the Go-to-root task.

6.5 Algorithms and baselines

In our experiments, we compare the following approaches:

6.5.1 Ours, R\(\times \) R

In this variant, we use the method presented in this paper, relying on exploratory reset states obtained by growing the tree via G-RRT. In all cases, we use a tree comprising \(10^5\) nodes, as informed by the ablation study in Fig. 14.

6.5.2 Ours, R\(\times \) R + IPT

In this variant of our method, we use Imitation Pre-training with labels obtained from the grown G-RRT tree, alongside using exploratory reset distribution extracted from the same tree.

6.5.3 Stable Grasp Sampler (SGS)

This baseline represents an alternative to the method presented in this paper: we use a reset distribution consisting of stable grasps generated by sampling random joint angles and object orientations. This approach has been effective in demonstrating precision in-hand manipulation of simple shapes with only intrinsic sensing (Khandate et al., 2022; Qi et al., 2022).

6.5.4 Explored Restarts (ER)

This method selects states explored by the policy itself during random exploration to use as reset states (Tavakoli et al., 2018). It is highly general, with no manipulation-specific component, and requires no additional step on top of RL training. We implement the "uniform restart" scheme, as it was shown to have superior performance on high-dimensional continuous control tasks. However, we have found it to be insufficient for the complex state space of our problem: it fails to learn a viable policy even for simple objects.

6.5.5 Fixed Initialization (FI)

For completeness, we also tried resetting from a single fixed state.

6.5.6 Fixed Initialization (FI + IPT)

To understand the effect of pre-training alone, we also tried the FI baseline with a warm-start policy obtained from Imitation Pre-training using the G-RRT tree.

6.5.7 Gravity Curriculum (GC)

Additionally, we evaluated Fixed Initialization with a gravity curriculum, linearly annealed from \(0 \frac{\text {m}}{\text {s}^2}\) to \(-9.81 \frac{\text {m}}{\text {s}^2}\) over the course of 50 M timesteps.
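As a minimal sketch, the annealing schedule amounts to the following (step counts taken from the text):

```python
# Linear gravity annealing over the first 50 M environment steps (sketch).
def gravity_at(step, total_steps=50_000_000, g_final=-9.81):
    return g_final * min(step / total_steps, 1.0)
```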

7 Results

Fig. 3: Training performance of our methods and a number of baselines for the Finger-gaiting task on the object categories shown in Fig. 2

Fig. 4: Key frames of the Finger-gaiting policies achieved with our method R\(\times \)R for representative objects in simulation

7.1 Evaluation in simulation

7.1.1 Finger-gaiting

Our training results aggregated over 3 seeds are summarized in Figs. 3 and 5. We find that our methods (R\(\times \)R, R\(\times \)R + IPT) converge within 50 M timesteps across all objects for the Finger-gaiting task, consistently outperforming baselines.

On the easy object set, we find that all the methods except the ER baseline learn to gait. However, policies that use a random grasp reset distribution (SGS) and sampling-based reset distribution (R\(\times \)R) achieve higher performance with robust gaits. We also observe that warm starting with Imitation Pre-training (R\(\times \)R + IPT) results in faster convergence. Unlike our previous studies, we find that FI also learns finger-gaiting. We attribute this to the improved implementation of our asymmetric actor-critic PPO and increased training steps made possible with the help of GPU physics.

However, as we move to more difficult objects in the medium and hard categories, the performance gap between our methods and baselines broadens. For medium difficulty objects, we find that R\(\times \)R, R\(\times \)R + IPT, and SGS again all learn to gait, but the policies learned via our methods, R\(\times \)R and R\(\times \)R + IPT, are more effective with higher returns. Again, we observe that warm starting with Imitation Pre-training (R\(\times \)R + IPT) results in faster convergence and also achieves improved final performance.

Finally, for the hardest object set, a random grasp-based reset distribution is no longer successful. Only R\(\times \)R and R\(\times \)R + IPT converge to stable and consistent gaits. Once again, R\(\times \)R + IPT improves on R\(\times \)R with respect to rate of convergence and final performance.

Interestingly, the FI + IPT baseline, which uses a warm-start policy obtained from Imitation Pre-training on the G-RRT tree, performs similarly to the FI baseline, failing on all but the easy objects. This suggests that exploratory reset distributions may be necessary to derive benefits from Imitation Pre-training.

To verify that the method scales to training policies for multiple objects, we successfully train object-agnostic policies with our R\(\times \)R approach; to achieve this, we simply use object-specific initial distributions in multi-object training. We found that the difficulty of multi-object training is dominated by the hardest object in the set. Importantly, in Sect. 7.2, we also verify that our policies can be transferred to the real hand.

Fig. 5: Training performance of our methods and a number of baselines for the Finger-gaiting task on hard objects shown in Fig. 2

Fig. 6: Training performance for the Go-to-root task with our methods and other best-performing baselines

Fig. 7: Training performance for the Arbitrary Reorientation task comparing our methods (R\(\times \)R, R\(\times \)R + IPT) with the Stable Grasp Sampler (SGS) baseline

Fig. 8: Training performance for the Go-to-root task with our methods and other best-performing baselines for objects in the hard category

Next, we present our results on learning policies to reach a desired object orientation, i.e., the Go-to-root and Arbitrary Reorientation tasks. Due to the poor performance of many baselines in the previous task, we continue only with the best-performing ones, namely the SGS and GC baselines. We drop the ER baseline, as it uniformly fails across object classes, and also drop the FI baseline, as fixed initialization is incompatible with both tasks by definition.

7.1.2 Go-to-root

We evaluate R\(\times \)R and R\(\times \)R + IPT on the Go-to-root task and compare it with SGS and GC baselines. For this task we augment the Gravity Curriculum baseline to use 20 hand-crafted grasps as the reset distribution. These grasps are constructed to appropriately cover the orientation space of the object. We do this to ensure fair comparison between methods, as learning reorientation from a fixed initial orientation to a fixed target is a significantly easier task.

Our results are summarized in Figs. 6 and 8. As shown by the curves, go-to-root is a harder task requiring between 100 and 200 M steps. However, these results indicate that our methods consistently outperform baseline methods and produce effective control policies for the task. Again we find that R\(\times \)R + IPT improves on R\(\times \)R in both convergence and final performance across all object sets.

Fig. 9: Key frames of goal-reaching tasks, Go-to-root and Arbitrary Reorientation, in simulation

7.1.3 Arbitrary reorientation

We evaluate R\(\times \)R on the Arbitrary Reorientation task. Our results in Fig. 7 plot the success rate. Note that we considered the hand in the "palm-down" orientation, as it tends to be the desired hand orientation in many applications. Our method outperforms SGS on all objects; interestingly, it significantly outperforms the SGS baseline even for objects in the easy category, with the margin increasing for objects in the medium and hard categories. Key frames of our policy executing this task are shown in Fig. 9.

Overall, a common thread of all experiments presented here is that our methods (R\(\times \)R, R\(\times \)R + IPT) enable learning a range of challenging dexterous manipulation tasks, while none of the domain-agnostic methods (ER, FI, GC) are able to learn in-hand manipulation on objects beyond the easy set.

Table 1: Manipulation performance on real hardware

7.2 Evaluation on real hand

To test the applicability of our method on real hardware, we attempted to transfer the learned Finger-gaiting policies for a subset of representative objects: cylinder, cube, cuboid & L-shape. We chose these objects to span the range from simpler to more difficult manipulation skills.

We transfer the reinforcement learning policies to real hardware via domain randomization introduced through a curriculum. The policies obtained with nominal system parameters (i.e., the parameters used during the tree generation and reinforcement learning phases) are adapted in an additional fine-tuning phase in which we introduce domain randomization of the relevant system parameters, listed next. It is important to note that the exploratory tree generated with nominal parameters remains valid under this randomization, as the parameter perturbations only mildly affect the stability of its states. Hence, it is sufficient to use states obtained from the tree generated with nominal system parameters.

First, we impose velocity and torque limits in the simulation, mirroring those of the real motors (0.6 rad/s and 0.5 N\(\cdot \)m, respectively). We found that our hardware has a significant latency of 0.05 s, which we included in the simulation. In addition, we modified the angular velocity reward to maintain a desired velocity instead of maximizing the object's angular velocity. We also randomize joint origins (\(0.1 \text { rad}\)) and the friction coefficient (1–40), and train with perturbation forces (1 N). All these changes are introduced successively via a curriculum.
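The fine-tuning curriculum can be summarized by the following config sketch; the values are those listed above, while the staging itself (and the dictionary layout) is illustrative.

```python
# Domain-randomization / fine-tuning curriculum as a config sketch. Values are
# taken from the text; the stage boundaries are illustrative only.
SIM2REAL_CURRICULUM = [
    {"joint_velocity_limit_rad_s": 0.6, "joint_torque_limit_Nm": 0.5},  # match real motors
    {"control_latency_s": 0.05},                                        # measured hardware latency
    {"reward": "track a desired object angular velocity"},              # instead of maximizing it
    {"joint_origin_noise_rad": 0.1},
    {"friction_coefficient_range": (1, 40)},
    {"perturbation_force_N": 1.0},
]
```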

Fig. 10: Key frames of the Finger-gaiting policies transferred to the real hand

Fig. 11: Key frames of a Finger-gaiting policy transferred to the hand in the “palm-down” orientation

For observation, we used the current position and setpoint from the motor controllers with no additional changes. For tactile data, we found that information from our tactile fingers is most reliable for contact forces above 1 N. We thus did not use reported contact data below this threshold and imposed a similar cutoff in simulation. Overall, we believe that a key advantage of exclusively using proprioceptive and tactile data is a smaller sim-to-real gap compared to extrinsic sensors such as cameras. We note that an ablation study illustrating the importance of touch feedback is presented in Sect. 8.

For the set of representative objects, we ran the respective policy ten consecutive times and counted the number of successful complete object revolutions achieved before a drop. In other words, five revolutions means the policy successfully rotated the object for \(1,800^{\circ }\) before dropping it. In addition, we also report the average object rotation speed observed during the trials.

The results of these trials are summarized in Table 1. Figures 10 and 11 show the keyframes of the real hand finger-gaiting we achieved with our policies. Finally, as our finger-gaiting policies do not rely upon vision feedback, our policies are robust to changes in lighting conditions. Examples of our method operating in dynamic lighting conditions can be found in the accompanying video.

We note that the starting policies used for sim-to-real transfer were trained with trees generated by a version of G-RRT that enforced the constraint that at least three fingers must remain in contact with the object. However, as previously mentioned, we found that this constraint has minimal effect on the trees generated via G-RRT. We therefore expect similar sim-to-real performance from policies trained using trees generated without the three-contact constraint; in simulation, these policies are visually indistinguishable from the policies used for sim-to-real transfer. We also note that preliminary attempts to transfer such policies for the cylinder and cube show similar performance.

8 Ablations

8.1 Sampling-based exploration

First, we conduct an ablation of G-RRT on our object set. We aim to discover how effectively the tree explores its available state space given the number of iterations through the main loop (i.e. the number of attempted tree expansions towards a random sample). As a measure of tree growth, we look at the maximum object rotation achieved around our target axis. We note that any rotation beyond approximately \(\pi /4\) radians cannot be achieved in-grasp and requires finger repositioning. Thus, we compare the maximum achieved object rotation vs. the number of expansions attempted (on a log scale). The results are shown in Fig. 12.

Fig. 12: Tree expansion performance for G-RRT. We plot the number of attempted tree expansions (i.e. iterations through the main loop, on a log scale) against the maximum object z-axis rotation achieved by any tree node so far. We plot performance for different values of \(K_{max}\), the number of random actions tested at each iteration

As expected, the performance of G-RRT improves with the number \(K_{max}\) of actions tested at each iteration. Interestingly, the algorithm performs well even with \(K_{max}=1\); this is equivalent to a tree node growing in a completely random direction, without any bias towards the intended sample. However, we note that, at each iteration, the node that grows is the closest to the state-space sample taken at the beginning of the loop. This encourages growth at the periphery of the tree and along the real constraint manifolds, and, as shown here, still leads to effective exploration.

We also found that G-RRT is sensitive to the action-scale parameter \(\alpha \). Figure 13 compares exploration, measured as the angular distance of the farthest node from the root, for varying \(\alpha \). Among the settings evaluated, \(\alpha = 0.15\) explored fastest. Interestingly, higher values of the action scale adversely affect the performance of G-RRT. Furthermore, as we discuss later, \(\alpha \) affects not only the rate of exploration but also the quality of the extracted paths.

Fig. 13: G-RRT action-scale (\(\alpha \)) ablation. \(\alpha = 0.15\) is optimal

8.2 Reinforcement learning

We now consider the ablation of training methods. We perform these ablations on the Finger-gaiting and Go-to-root tasks on a non-convex, L-shaped object.

We begin by studying the impact of the size of the tree used for extracting reset states. Figure 14a summarizes our results for learning a Finger-gaiting policy using trees of different sizes grown via G-RRT. Qualitatively, we observe that, as the tree grows larger, the top 100–400 paths sampled from the tree contain increasingly effective gaits, likely closer to the optimal policy. We see that we need a sufficiently large tree, with at least \(10^4\) nodes, to enable learning; however, training is most reliable with \(10^5\) nodes. This suggests a strong correlation between the quality of the states used for the reset distribution and the sample efficiency of learning.

Figure 14b summarizes the tree-size ablation results for learning the Go-to-root task. For a very small tree, we see near 100% success rate at training time. This is rather unsurprising as, at these small tree sizes, the nodes are still close to the root orientation resulting in a trivial reorientation task. As we expect, validation performance on paths extracted from the largest tree increases as the size of the tree generating the initial state distribution increases.

Fig. 14: The final performance of policies for the Finger-gaiting (left) and Go-to-root (right) tasks using trees of increasing sizes. Validation performance is with reset states from paths extracted from the largest tree

We also reconsider the action-scale parameter \(\alpha \). Besides affecting the rate of exploration, it also impacts the quality of the extracted paths, as can be inferred from its effect on training performance in Fig. 15. We see a similar trend: training performance again peaks around \(\alpha = 0.15\) for both Finger-gaiting and Go-to-root.

Fig. 15: Action-scale ablation for the Finger-gaiting and Go-to-root tasks with the L-shaped object

Fig. 16: Ablation of policy feedback components which highlights the importance of touch feedback

We also conducted an ablation study of policy feedback. In particular, we aimed to compare intrinsically available tactile feedback with object pose feedback that would require external sensing. Figure 16 summarizes these results, which demonstrate the importance of tactile feedback for both the Finger-gaiting and Go-to-root tasks.

In the Finger-gaiting task, we found that touch feedback is essential in the absence of object pose feedback for all moderate and hard objects. For these objects, we also saw that replacing this tactile feedback with object pose feedback results in slower learning. Similarly, in the Go-to-root task, leaving out touch feedback from policy input also results in slower learning.

These results underscore the importance of touch feedback for in-hand manipulation skills. Richer tactile feedback such as contact position, normals, and force magnitude can be expected to provide even stronger improvements; we hope to explore this in future work.

9 Discussion and conclusions

The results we have presented show that sampling-based exploration methods make it possible to achieve difficult manipulation tasks via RL. In fact, these popular and widely used classes of algorithms are highly complementary in this case. RL is effective at learning closed-loop control policies that maintain the local stability needed for manipulation, and, thanks to training on a large number of examples, are robust to variations in the encountered states. However, the standard RL exploration techniques (random perturbations in action space) are ineffective in the highly constrained state space with a complex manifold structure of manipulation tasks. Conversely, SBP methods, which rely on a fundamentally different approach to exploration, can effectively discover relevant regions of the state space.

We present multiple methods for conveying exploration information derived through SBP to RL training algorithms. In particular, explored states can be used as an effective reset distribution to enable learning. The transitions between states used during sampling-based exploration are also useful surrogates for actions, and can thus be used in an imitation pre-training stage to further boost learning. Imitation pre-training serves as an effective strategy to leverage the inherent structure accessible through sampling-based plans. This approach becomes particularly valuable in addressing the challenges of hard motor control tasks within reinforcement learning, not only by facilitating training but also by providing a favorable initialization of the actor and critic networks. Recent work in this domain extends the use of the imitation policy beyond mere initialization, demonstrating further benefits when it is incorporated throughout the training process; this is a promising avenue for achieving further performance gains in reinforcement learning for hard motor control tasks.

We use our approach to demonstrate a number of dexterous manipulation skills: achieving large in-hand reorientation about a given axis via finger-gaiting and reorientation to a desired object pose, which is either pre-set or randomized. Importantly, we demonstrate finger gaiting precision manipulation of both convex and non-convex objects, using only tactile and proprioceptive sensing. Using only these types of intrinsic sensors makes manipulation skills insensitive to occlusion, illumination, or distractors, and reduces the sim-to-real gap. We take advantage of this by demonstrating our approach both in simulation and on real hardware. We note that, while some applications naturally preclude the use of vision (e.g. extracting an object from a bag), we expect that in many real-life situations, future robotic manipulators will achieve the best performance by combining touch, proprioception, and vision.

In future work, we believe that our approach can be scaled to tackle even more demanding motor control tasks with numerous degrees of freedom, including both prehensile and non-prehensile manipulation. Bi-manual manipulation seems like a natural candidate application, particularly if it involves coordination with multi-fingered dexterous hands. We also believe our approach can be applied beyond dexterous manipulation, for example to learning challenging agile locomotion skills, with minimal changes. Towards this aim, we suggest expanding the sampled configuration space to the complete state of the robot, including first-order derivatives, and replacing the stability check with the necessary boundary constraints on the state. With these changes in place, we believe this is a feasible recipe for accelerating the learning of end-to-end policies for agile skills such as robot parkour (Zhuang et al., 2023). Finally, sampling-based exploration techniques could be integrated directly into the policy training mechanisms, removing the need for two separate stages during training. We hope to explore all these ideas in the future.