1 Introduction

Deep reinforcement learning (DRL) [3], in which neural networks are used as function approximators for reinforcement learning (RL), has been shown to be capable of solving complex control problems in several environments, including board games [27, 28], video games [4, 19, 30], simulated and real robotic manipulation [2, 9, 15] and simulated autonomous driving [12].

However, learning from a sparse reward signal, where the only reward is provided upon the completion of a task, still remains difficult. An agent may rarely or never encounter positive examples from which to learn in a sparse-reward environment. Many domains therefore provide dense reward signals [5], or practitioners may turn to reward shaping [20]. Designing dense reward functions typically requires prior domain knowledge, making this approach difficult to generalise across different environments.

Fortunately, a common scenario is goal-oriented RL, where the RL agent is tasked with solving different goals within the same environment [11, 25]. Even if each task has a sparse reward, the agent ideally generalises across goals, making the learning process easier. For example, in a robotic manipulation task, the goal during a single episode would be to achieve a specific position of a target object.

Hindsight experience replay (HER) [1] was proposed to improve the learning efficiency of goal-oriented RL agents in sparse reward settings: when past experience is replayed to train the agent, the desired goal is replaced (in “hindsight”) with the achieved goal, generating many positive experiences. In the above example, the desired target position would be overwritten with the achieved target position, and the reward would be recomputed accordingly.

Fig. 1. Overview of DTGSH. Every time a new episode is completed, its diversity is calculated, and it is stored in the episodic replay buffer. During training, m episodes are sampled according to their diversity-based priority, and then k diverse, hindsight-relabelled transitions are sampled using a k-DPP [13].

We note that HER, whilst it enabled solutions to previously unsolved tasks, can be somewhat inefficient because it samples transitions uniformly during training. In the same way that prioritised experience replay [26] significantly improved over standard experience replay in RL, several approaches have improved upon HER by using data-dependent sampling [8, 32]. HER with energy-based prioritisation (HEBP) [32] assumes semantic knowledge about the goal space and uses the energy of the target objects to sample trajectories with high energies, and then samples transitions uniformly. Curriculum-guided HER (CHER) [8] samples trajectories uniformly, and then samples transitions based on a mixture of proximity to the desired goal and the diversity of the samples; CHER adapts the weighting of these factors over time. In this work, we introduce diversity-based trajectory and goal selection with HER (DTGSH; see Fig. 1), which samples trajectories based on the diversity of the goals achieved within each trajectory, and then samples transitions based on the diversity of the resulting set. We evaluate DTGSH on five challenging robotic manipulation tasks. In extensive experiments, our proposed method converges faster and reaches higher rewards than prior work, without requiring domain knowledge [32] or tuning a curriculum [8], and is based on a single concept: determinantal point processes (DPPs) [14].

2 Background

2.1 Reinforcement Learning

RL is the study of agents interacting with their environment in order to maximise their reward, formalised using the framework of Markov decision processes (MDPs) [29]. At each timestep t, an agent receives a state \(s_{t}\) from the environment, and samples an action \(a_{t}\) from its policy \(\pi (a_{t}|s_{t})\). The action \(a_{t}\) is then executed in the environment, yielding the next state \(s_{t+1}\) and a reward \(r_{t}\). In the episodic RL setting, the objective of the agent is to maximise its expected return \(\mathbb {E}[R]\) over a finite trajectory of length T:

$$\begin{aligned} \mathbb {E}[R] = \mathbb {E}\left[ \sum _{t=1}^{T} \gamma ^{t-1}r_{t}\right] , \end{aligned}$$
(1)

where \(\gamma \in [0, 1]\) is a discount factor that exponentially downplays the influence of future rewards, reducing the variance of the return.
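
As a concrete check of Eq. (1), the return of a finite trajectory can be computed directly from its reward sequence. The following is a minimal Python sketch; the discount value is only an example, not a setting taken from this paper:

```python
def discounted_return(rewards, gamma):
    """Eq. (1): R = sum_{t=1}^{T} gamma^{t-1} * r_t for a finite trajectory."""
    return sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))

# A sparse-reward episode that only succeeds at the final step (rewards as in Eq. (2)).
print(discounted_return([-1.0, -1.0, -1.0, 0.0], gamma=0.98))
```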

2.2 Goal-Oriented Reinforcement Learning

RL can be expanded to the multi-goal setting, where the agent’s policy and the environment’s reward function \(\mathcal {R}(s_t, a_t)\) are also conditioned on a goal g [11, 25]. In this work, we focus on the goal-oriented setting and environments proposed by OpenAI [23].

In this setting, every episode comes with a desired goal g, which specifies the desired configuration of a target object in the environment (which could include the agent itself). After executing action \(a_t\) at timestep t, the agent also observes the goal achieved in the resulting state, \(g^{ac}_{t+1}\). A transition in the environment is thus denoted as \((s_t, a_t, r_t, s_{t+1}, g, g^{ac}_{t+1})\). The environment provides a sparse reward function, where a negative reward is given unless the achieved goal is within a small distance \(\epsilon \) of the desired goal:

$$\begin{aligned} \mathcal {R}\left( g, g^{ac}_{t+1}\right) := {\left\{ \begin{array}{ll} 0 & \text {if } \left\| g^{ac}_{t+1} - g\right\| \le \epsilon \\ -1 & \text {otherwise}. \end{array}\right. } \end{aligned}$$
(2)

However, in this setting, the agent is unlikely to achieve a non-negative reward through random exploration. To overcome this, HER provides successful experiences for the agent to learn from by relabelling transitions during training: the agent trains on a hindsight desired goal \(g'\), which is set to the achieved goal \(g^{ac}_{t+1}\), with \(r_t\) recomputed using the environment reward function (Eq. 2).
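
To make Eq. (2) and the relabelling step concrete, the sketch below implements the sparse reward and relabels a single stored transition; the field order, the helper names, and the threshold value are illustrative assumptions, not the authors' code:

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, eps=0.05):
    # Eq. (2): 0 if the achieved goal is within eps of the desired goal, else -1.
    return 0.0 if np.linalg.norm(achieved_goal - desired_goal) <= eps else -1.0

def relabel(transition):
    """Hindsight relabelling: replace the desired goal g with the achieved goal
    g^{ac}_{t+1} and recompute the reward under Eq. (2)."""
    s, a, r, s_next, g, g_ac = transition
    g_prime = g_ac                          # hindsight desired goal g'
    r_prime = sparse_reward(g_ac, g_prime)  # 0 here; HER's 'future' strategy instead
                                            # picks g' from a later step of the episode
    return (s, a, r_prime, s_next, g_prime, g_ac)
```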

2.3 Deep Deterministic Policy Gradient

Deep deterministic policy gradient (DDPG) [16] is an off-policy actor-critic DRL algorithm for continuous control tasks, and is used as the baseline algorithm for HER [1, 8, 32]. The actor \(\pi _{\theta }(s_{t})\) is a policy network parameterised by \(\theta \), and outputs the agent’s actions. The critic \(Q_{\eta }(s_{t}, a_{t})\) is a state-action-value function approximator parameterised by \(\eta \), and estimates the expected return following a given state-action pair. The critic is trained by minimising \({\mathcal {L}_{c}=\mathbb {E}[(Q_{\eta }(s_{t}, a_{t}) - y_{t})^{2}]}\) where \({y_{t} = r_{t} + \gamma Q_{\eta }(s_{t+1}, \pi _{\theta }(s_{t+1}))}\). The actor is trained by maximising \({\mathcal {L}_{a} = \mathbb {E}[Q_{\eta }(s_{t}, \pi _{\theta }(s_{t}))]}\), backpropagating through the critic. Further implementation details can be found in prior work [1, 16].
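
The two objectives translate directly into code; the following PyTorch sketch computes \(\mathcal{L}_c\) and \(\mathcal{L}_a\) for a batch of transitions, omitting target networks, exploration noise, and goal conditioning (in the goal-oriented setting the goal would be concatenated to the state inputs):

```python
import torch
import torch.nn.functional as F

def ddpg_losses(actor, critic, batch, gamma):
    """Critic loss L_c and (negated) actor objective L_a from Sect. 2.3.

    actor(s) -> action, critic(s, a) -> scalar Q-value; both are torch.nn.Modules.
    """
    s, a, r, s_next = batch                               # tensors of shape [B, ...]
    with torch.no_grad():
        y = r + gamma * critic(s_next, actor(s_next))     # TD target y_t
    critic_loss = F.mse_loss(critic(s, a), y)             # L_c
    actor_loss = -critic(s, actor(s)).mean()              # minimising -Q maximises L_a
    return critic_loss, actor_loss
```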

2.4 Determinantal Point Processes

A DPP [14] is a stochastic process that characterises a probability distribution over subsets of a ground set, where the probability of a subset is proportional to the determinant of an associated kernel sub-matrix. In machine learning it is often used to quantify the diversity of a subset, with applications such as video [18] and document summarisation [10].

Formally, for a discrete set of points \(\mathcal {Y}=\{x_{1}, x_{2}, \cdots , x_{N}\}\), a point process \(\mathcal {P}\) is a probability measure over all \(2^{|\mathcal {Y}|}\) subsets. \(\mathcal {P}\) is a DPP if a random subset \(\mathbf {Y}\) is sampled with probability:

$$\begin{aligned} \mathcal {P}_{L}(\mathbf {Y}=Y) = \frac{\text {det}(L_{Y})}{\sum _{Y'\subseteq \mathcal {Y}} \text {det}(L_{Y'})} = \frac{\text {det}(L_{Y})}{\text {det}(L+I)}, \end{aligned}$$
(3)

where \(Y\subseteq \mathcal {Y}\), I is the identity matrix, \(L \in \mathbb {R}^{N\times N}\) is the positive semi-definite DPP kernel matrix, and \(L_{Y}\) is the sub-matrix with rows and columns indexed by the elements of the subset Y.

The kernel matrix L can be represented as the Gram matrix \(L = X^{T}X\), where each column of X is the feature vector of an item in \(\mathcal {Y}\). The determinant, \(\text {det}(L_{Y})\), represents the squared volume spanned by the vectors \(x_{i}\in Y\). From a geometric perspective, feature vectors that are closer to being orthogonal span a larger volume and hence have a larger determinant, so the corresponding subset is more likely to be sampled: \(\mathcal {P}_{L}(\mathbf {Y}=Y) \propto \text {det}(L_{Y})\). Using orthogonality as a measure of diversity, we leverage DPPs to sample diverse trajectories and goals.
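
The geometric intuition behind Eq. (3) is easy to verify numerically: for L2-normalised feature vectors, a near-orthogonal pair receives far higher probability than a near-parallel one. A minimal numpy sketch:

```python
import numpy as np

def dpp_prob(X, subset):
    """P_L(Y = subset) from Eq. (3), with L = X^T X and one feature vector per column of X."""
    L = X.T @ X
    L_Y = L[np.ix_(subset, subset)]
    return np.linalg.det(L_Y) / np.linalg.det(L + np.eye(L.shape[0]))

X = np.array([[1.0, 0.9, 0.0],     # column 1 is nearly parallel to column 0,
              [0.0, 0.1, 0.0],     # column 2 is orthogonal to column 0
              [0.0, 0.0, 1.0]])
X = X / np.linalg.norm(X, axis=0)  # L2-normalise each column
print(dpp_prob(X, [0, 1]))         # small: little volume spanned
print(dpp_prob(X, [0, 2]))         # larger: near-orthogonal pair
```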

3 Related Work

The proposed work is built on HER [1] as a way to effectively augment goal-oriented transitions from a replay buffer: to address the problem of sparse rewards, transitions from unsuccessful trajectories are turned into successful ones. HER uses an episodic replay buffer, with uniform sampling over trajectories, and uniform sampling over transitions. However, these samples may be redundant, and many may contribute little to the successful training of the agent.

In the literature, some efforts have been made to increase the efficiency of HER by prioritising more valuable episodes/transitions. Motivated by the work-energy principle in physics, HEBP [32] assigns higher probability to trajectories in which the target object has higher energy; once the episodes are sampled, the transitions are then sampled uniformly. However, HEBP requires knowing the semantics of the goal space in order to calculate the probability, which is proportional to the sum of the target’s potential, kinetic and rotational energies.

CHER [8] dynamically controls the sampling of transitions during training based on a mixture of goal proximity and diversity. Firstly, m episodes are uniformly sampled from the episodic replay buffer, and then a minibatch of \(k < m\) transitions is sampled according to the current state of the curriculum. The curriculum initially biases sampling towards achieved goals that are close to the desired goal (requiring a distance function), and later biases sampling towards diverse goals, using a k-nearest neighbour graph and a submodular function to sample a diverse subset of goals more efficiently (using the same distance function).

Other work has expanded HER in orthogonal directions. Hindsight policy gradient [24] and episodic self-imitation learning [6] apply HER to improve the efficiency of goal-based on-policy algorithms. Dynamic HER [7] and competitive ER [17] expand HER to the dynamic goal and multi-agent settings, respectively.

The use of DPPs in RL has been more limited, with applications to modelling the value functions of sets of agents in multi-agent RL [21, 31] and, most closely related to our work, to finding diverse policies [22].

4 Methodology

We now formally describe the two main components of our method, DTGSH: 1) a diversity-based trajectory selection module that samples valuable trajectories for the subsequent goal selection stage; 2) a diversity-based goal selection module that selects transitions with diverse goal states from the previously selected trajectories. Together, these select informative transitions from a large area of the goal space, improving the agent’s ability to learn and generalise.

4.1 Diversity-Based Trajectory Selection

We propose a diversity-based prioritisation method to select valuable trajectories for efficient training. Related to HEBP’s prioritisation of high-energy trajectories [32], we hypothesise that trajectories that achieve diverse goal states \(g^{ac}_{t}\) are more valuable for training; however, unlike HEBP, we do not require knowledge of the goal space semantics.

In a robotic manipulation task, the agent needs to move a target object from its initial position, \(g^{ac}_{1}\), to the target position, g. If the agent never moves the object, despite hindsight relabelling it will not be learning information that would directly help in task completion. On the other hand, if the object moves a lot, hindsight relabelling will help the agent learn about meaningful interactions.

In our approach, DPPs are used to model the diversity of achieved goal states \(g^{ac}_{t}\) in an episode, or subsets thereof. For a single trajectory \(\mathcal {T}_i\) of length T, we divide it into several partial trajectories \(\tau _{j}\) of length b, each containing the consecutive achieved goal states \(\{g^{ac}_{t}\}_{t=n:n+b-1}\). That is, with a sliding window of length \(b = 2\), a trajectory \(\mathcal {T}_i\) can be divided into \(N_p\) partial trajectories:

$$\begin{aligned} \mathcal {T}_{i} = \{\{\underbrace{g^{ac}_{1}, g^{ac}_{2}}_{\tau _{1}}\}, \{\underbrace{g^{ac}_{2}, g^{ac}_{3}}_{\tau _{2}}\}, \{\underbrace{g^{ac}_{3}, g^{ac}_{4}}_{\tau _{3}}\}, \cdots , \{\underbrace{g^{ac}_{T-1}, g^{ac}_{T}}_{\tau _{N_{p}}}\}\}. \end{aligned}$$
(4)

The diversity \(d_{\tau _{j}}\) of each partial trajectory \(\tau _{j}\) can be computed as:

$$\begin{aligned} d_{\tau _{j}} = \text {det}(L_{\tau _{j}}), \end{aligned}$$
(5)

where \(L_{\tau _{j}}\) is the kernel matrix of partial trajectory \(\tau _{j}\):

$$\begin{aligned} L_{\tau _{j}} = M^{T}M, \end{aligned}$$
(6)

and \(M=[\hat{g}^{ac}_{n}, \hat{g}^{ac}_{n+1}, \cdots , \hat{g}^{ac}_{n+b-1}]\), where each \(\hat{g}^{ac}\) is the \(\ell _2\)-normalised version of the achieved goal \(g^{ac}\) [13]. Finally, the diversity \(d_\mathcal {T}\) of trajectory \(\mathcal {T}\) is the sum of the diversity of its \(N_p\) constituent partial trajectories:

$$\begin{aligned} d_\mathcal {T}= \sum _{j=1}^{N_{p}} d_{\tau _{j}}. \end{aligned}$$
(7)
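
Under these definitions, the diversity score of a trajectory depends only on its achieved goals. A minimal numpy sketch of Eqs. (4)–(7) follows; the function name and the small numerical epsilon are ours:

```python
import numpy as np

def trajectory_diversity(achieved_goals, b=2):
    """d_T from Eq. (7): sum of det(M^T M) over sliding windows of b consecutive
    L2-normalised achieved goals (Eqs. (4)-(6))."""
    G = np.asarray(achieved_goals, dtype=np.float64)           # shape [T, goal_dim]
    G = G / (np.linalg.norm(G, axis=1, keepdims=True) + 1e-8)  # \hat{g}^{ac}
    diversity = 0.0
    for n in range(G.shape[0] - b + 1):                        # N_p = T - b + 1 windows
        M = G[n:n + b].T                                       # columns are goal vectors
        diversity += np.linalg.det(M.T @ M)                    # Eq. (5)
    return diversity
```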

Similarly to HEBP [32], we use a non-uniform episode sampling strategy. During training, we prioritise sampling episodes proportionally to their diversity; the probability \(p(\mathcal {T}_{i})\) of sampling trajectory \(\mathcal {T}_{i}\) from a replay buffer of size \(N_{e}\) is:

$$\begin{aligned} p(\mathcal {T}_{i}) = \frac{d_{\mathcal {T}_{i}}}{\sum _{n=1}^{N_{e}} d_{\mathcal {T}_{n}}}. \end{aligned}$$
(8)
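
Given the stored diversity scores, the prioritised episode sampling of Eq. (8) reduces to a normalised categorical draw. A small sketch (sampling here is with replacement; the exact scheme is an implementation detail not specified in this section):

```python
import numpy as np

def sample_episode_indices(diversity_scores, m):
    """Draw m episode indices with probabilities p(T_i) given by Eq. (8)."""
    d = np.asarray(diversity_scores, dtype=np.float64)
    return np.random.choice(len(d), size=m, p=d / d.sum())
```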

4.2 Diversity-Based Goal Selection

In prior work [1, 32], after selecting the trajectories from the replay buffer, one transition from each selected trajectory is sampled uniformly to construct a minibatch for training. However, the modified goals \(g^{\prime }\) in the minibatch might be similar, resulting in redundant information. In order to form a minibatch with diverse goals for more efficient learning, we use k-DPPs [13] for sampling goals. Compared to the standard DPP, a k-DPP is a conditional DPP where the subset Y has a fixed size k, with the probability distribution function:

$$\begin{aligned} \mathcal {P}_{L}^{k}(\mathbf {Y}=Y) = \frac{\text {det}(L_{Y})}{\sum _{|Y^{\prime }|=k} \text {det}(L_{Y^{\prime }})}. \end{aligned}$$
(9)

k-DPPs are more appropriate for goal selection with a minibatch of fixed size k. Given \(m > k\) trajectories sampled from the replay buffer, we first uniformly sample a transition from each of the m trajectories. Finally, a k-DPP is used to sample a diverse set of transitions based on the relabelled goals \(g'\) (which, in this context, we denote as “candidate goals”). Figure 2a gives an example of uniform vs. k-DPP sampling, demonstrating the increased coverage of the latter. Figure 2b provides corresponding estimated density plots; note that the density of the k-DPP samples is actually more uniform over the support of the candidate goal distribution.
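
Exact k-DPP sampling follows Kulesza and Taskar [13] and requires an eigendecomposition of the kernel. As a simpler illustration of the same idea, the sketch below greedily maximises the determinant of the selected sub-matrix over the candidate goals; this is a heuristic approximation for exposition, not the sampler used in DTGSH:

```python
import numpy as np

def greedy_diverse_goals(candidate_goals, k):
    """Greedily pick k of the m candidate goals so that the determinant of the
    corresponding sub-matrix of the Gram kernel stays as large as possible
    (an approximation to drawing one sample from a k-DPP)."""
    G = np.asarray(candidate_goals, dtype=np.float64)
    G = G / (np.linalg.norm(G, axis=1, keepdims=True) + 1e-8)  # L2-normalise
    L = G @ G.T                                                # m x m kernel
    selected = [0]          # arbitrary first pick; all normalised goals have unit norm
    while len(selected) < k:
        best_i, best_det = None, -np.inf
        for i in range(len(G)):
            if i in selected:
                continue
            idx = selected + [i]
            det = np.linalg.det(L[np.ix_(idx, idx)])
            if det > best_det:               # candidate that adds the most volume
                best_i, best_det = i, det
        selected.append(best_i)
    return selected         # indices of the k transitions to put in the minibatch
```

In the 2D Push example of Fig. 2, this kind of selection spreads the chosen goals over the support of the candidate distribution instead of concentrating them where candidates are dense.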

Fig. 2. Visualisation of \(k=32\) goals selected from \(m=100\) candidate goals of the Push task using either uniform sampling or k-DPP sampling. The candidate goals are distributed over a 2D (xy) space. Note that k-DPP sampling (right-hand plots) results in a broader span of selected goals in xy space compared to uniform sampling (left-hand plots).

Algorithm 1. Diversity-based goal selection.
Algorithm 2. Diversity-based trajectory and goal selection with HER (DTGSH).

Algorithm 1 shows the details of the goal selection subroutine, and Algorithm 2 gives the overall algorithm for our method, DTGSH.

5 Experiments

We evaluate our proposed method, and compare it with current state-of-the-art HER-based algorithms [1, 8, 32] on challenging robotic manipulation tasks [23], pictured in Fig. 3. Furthermore, we perform ablation studies on our diversity-based trajectory and goal selection modules. Our code is based on OpenAI Baselines, and is available at: https://github.com/TianhongDai/div-hindsight.

Fig. 3. Robotic manipulation environments. (a–b) use the Fetch robot, and (c–e) use the Shadow Dexterous Hand.

5.1 Environments

The robotic manipulation environments used for training and evaluation comprise five different tasks. Two tasks use the 7-DoF Fetch robotic arm with a two-fingered parallel gripper: Push and Pick&Place, which both require the agent to move a cube to the target position. The remaining three tasks use a 24-DoF Shadow Dexterous Hand to manipulate an egg, a block and a pen, respectively. The sparse reward function is given by Eq. (2).

In the Fetch environments, the state \(s_{t}\) contains the position and velocity of the joints, and the position and rotation of the cube. Each action \(a_{t}\) is a 4-dimensional vector, with three dimensions specifying the relative position of the gripper, and the final dimension specifying the state of the gripper (i.e., open or closed). The desired goal g is the target position, and the achieved goal \(g^{ac}_{t}\) is the position of the cube. Each episode is of length \(T = 50\).

In the Shadow Dexterous Hand environments, the state \(s_{t}\) contains the position and velocity of the joints. Each action \(a_{t}\) is a 20-dimensional vector which specifies the absolute positions of the 20 non-coupled joints of the hand. The desired goal g and achieved goal \(g^{ac}_t\) specify the rotation of the object for the block and pen tasks, and the position and rotation of the object for the egg task. Each episode is of length \(T = 200\).
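
For reference, the goal-oriented interface of these environments exposes the desired and achieved goals directly in the observation dictionary. The sketch below assumes the classic `gym` robotics environments and API used by OpenAI Baselines; environment IDs and the reset/step signatures differ in newer gymnasium releases:

```python
import gym  # requires gym with the MuJoCo-based robotics environments installed

env = gym.make('FetchPush-v1')  # ID assumed; the Hand tasks use e.g. 'HandManipulateEgg-v0'
obs = env.reset()
# Goal-oriented observations are dictionaries:
#   obs['observation']   - joint positions/velocities and object pose (s_t)
#   obs['achieved_goal'] - current configuration of the object (g^{ac}_t)
#   obs['desired_goal']  - target configuration for this episode (g)
action = env.action_space.sample()          # 4-D for Fetch, 20-D for the Hand tasks
obs, reward, done, info = env.step(action)
# The sparse reward of Eq. (2) can be recomputed for relabelled goals:
reward = env.compute_reward(obs['achieved_goal'], obs['desired_goal'], info)
```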

5.2 Training Settings

We base our training setup on CHER [8]. We train all agents on minibatches of size \(k = 64\) for 50 epochs using MPI for parallelisation over 16 CPU cores; each epoch consists of 1600 (\(16 \times 100\)) episodes, with evaluation over 160 (\(16 \times 10\)) episodes at the end of each epoch. Remaining hyperparameters for the baselines are taken from the original work [1, 8, 32]. Our method, DTGSH, uses partial trajectories of length \(b = 2\) and \(m = 100\) as the number of candidate goals.
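
For convenience, the hyperparameters named in this section can be summarised in one place; values not listed here follow the respective baselines and are not reproduced in this sketch:

```python
# Hyperparameters stated in Sect. 5.2 (all other settings follow the baseline implementations).
config = dict(
    minibatch_size_k=64,          # k: transitions per training minibatch
    epochs=50,
    mpi_workers=16,               # CPU cores used for parallel rollouts
    episodes_per_epoch=1600,      # 16 x 100
    eval_episodes_per_epoch=160,  # 16 x 10
    partial_trajectory_b=2,       # sliding-window length for Eqs. (4)-(7)
    candidate_goals_m=100,        # candidate transitions before diverse goal selection
)
```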

5.3 Benchmark Results

We compare DTGSH to DDPG [16], DDPG+HER [1], DDPG+HEBP [32] and DDPG+CHER [8]. Evaluation results are given based on repeated runs with 5 different seeds; we plot the median success rate with upper and lower bounds given by the \(75^{\mathrm{th}}\) and \(25^{\mathrm{th}}\) percentiles, respectively.

Fig. 4. Success rate of DTGSH and baseline approaches.

Figure 4 and Table 1 show the performance of DDPG+DTGSH and baseline approaches on all five tasks. In the Fetch tasks, DDPG+DTGSH and DDPG+HEBP both learn significantly faster than the other methods, while in the Shadow Dexterous Hand tasks DDPG+DTGSH learns the fastest and achieves higher success rates than all other methods. In particular, DDPG cannot solve any tasks without using HER, and CHER performs worse in the Fetch tasks. We believe the results highlight the importance of sampling both diverse trajectories and goals, as in our proposed method, DTGSH.

Table 1. Final mean success rate ± standard deviation, with best results in bold.

5.4 Ablation Studies

In this section, we perform the following experiments to investigate the effectiveness of each component of DTGSH: 1) diversity-based trajectory selection with HER (DTSH) and diversity-based goal selection with HER (DGSH) are evaluated independently to assess the contribution of each stage; 2) the effect of different partial trajectory lengths b is measured; 3) the effect of different candidate goal set sizes m is measured.

Fig. 5. Success rate of HER, DTGSH, and ablations DTSH and DGSH.

Figure 5 shows the performance of DTSH and DGSH independently. DDPG+DTSH substantially outperforms DDPG+HER in all tasks, which supports the view that sampling trajectories with diverse achieved goals improves performance. Furthermore, unlike DDPG+HEBP, DTSH does not require knowing the structure of the goal space in order to calculate the energy of the target object. DDPG+DGSH achieves better performance than DDPG+HER in three environments, and is only worse in one. DGSH performs better in environments where the task is easier to solve (e.g., the Fetch tasks), and hence the selected trajectories are more likely to contain useful transitions. However, DTGSH, which combines both modules, performs the best overall.

Fig. 6. Success rate of DTGSH with different partial trajectory lengths b and different candidate goal set sizes m.

Figure 6 shows the performance of DDPG+DTGSH with different partial trajectory lengths b and different candidate goal set sizes m. In this work, we use \(b = 2\) and \(m = 100\) as the defaults. Performance degrades with \(b \gg 2\), indicating that pairwise diversity is best for learning in our method. \(m \gg 100\) does not affect performance in the Fetch environments, but degrades performance in the Shadow Dexterous Hand environments.

5.5 Time Complexity

Table 2 gives example training times of all of the HER-based algorithms. DTGSH requires an additional diversity score calculation, with cost \(\mathcal {O}(N_{p}b^3)\), at the end of every training episode, and an additional sampling cost of \(\mathcal {O}(mk^2)\) for each minibatch.

Table 2. Training time (hours:minutes:seconds) of DTGSH and baseline approaches on the Push task for 50 epochs.

6 Conclusion

In this paper, we introduced diversity-based trajectory and goal selection with hindsight experience replay (DTGSH) to improve the learning efficiency of goal-oriented RL agents in the sparse reward setting. Our method can be divided into two stages: 1) valuable trajectories are selected according to diversity-based priority, as modelled by determinantal point processes (DPPs) [14]; 2) k-DPPs [13] are leveraged to sample transitions with diverse goal states from previously selected trajectories for training. Our experiments empirically show that DTGSH achieves faster learning and higher final performance on five challenging robotic manipulation tasks, compared to previous state-of-the-art approaches [1, 8, 32]. Furthermore, unlike prior extensions of hindsight experience replay, DTGSH does not require semantic knowledge of the goal space [32], and does not require tuning a curriculum [8].