1 Introduction

Deep reinforcement learning (DRL) [3], in which neural networks are used as function approximators for reinforcement learning (RL), has been shown to be capable of solving complex control problems in several environments, including board games [27, 28], video games [4, 19, 30], simulated and real robotic manipulation [2, 9, 15] and simulated autonomous driving [12].

However, learning from a sparse reward signal, where the only reward is provided upon the completion of a task, still remains difficult. An agent may rarely or never encounter positive examples from which to learn in a sparse-reward environment. Many domains therefore provide dense reward signals [5], or practitioners may turn to reward shaping [20]. Designing dense reward functions typically requires prior domain knowledge, making this approach difficult to generalise across different environments.

Fortunately, a common scenario is goal-oriented RL, where the RL agent is tasked with solving different goals within the same environment [11, 25]. Even if each task has a sparse reward, the agent ideally generalises across goals, making the learning process easier. For example, in a robotic manipulation task, the goal during a single episode would be to achieve a specific position of a target object.

Hindsight experience replay (HER) [1] was proposed to improve the learning efficiency of goal-oriented RL agents in sparse reward settings: when past experience is replayed to train the agent, the desired goal is replaced (in “hindsight”) with the achieved goal, generating many positive experiences. In the above example, the desired target position would be overwritten with the achieved target position, and the reward would be recomputed accordingly.

Fig. 1. Overview of DTGSH. Every time a new episode is completed, its diversity is calculated, and it is stored in the episodic replay buffer. During training, m episodes are sampled according to their diversity-based priority, and then k diverse, hindsight-relabelled transitions are sampled using a k-DPP [13].

We note that HER, whilst it enabled solutions to previously unsolved tasks, can be somewhat inefficient because it samples transitions uniformly during training. In the same way that prioritised experience replay [26] significantly improved over standard experience replay in RL, several approaches have improved upon HER by using data-dependent sampling [8, 32]. HER with energy-based prioritisation (HEBP) [32] assumes semantic knowledge about the goal space and uses the energy of the target objects to sample trajectories with high energies, and then samples transitions uniformly. Curriculum-guided HER (CHER) [8] samples trajectories uniformly, and then samples transitions based on a mixture of proximity to the desired goal and the diversity of the samples; CHER adapts the weighting of these factors over time. In this work, we introduce diversity-based trajectory and goal selection with HER (DTGSH; see Fig. 1), which samples trajectories based on the diversity of the goals achieved within each trajectory, and then samples transitions based on the diversity of the resulting set. We evaluate DTGSH on five challenging robotic manipulation tasks. In extensive experiments, our proposed method converges faster and reaches higher rewards than prior work, without requiring domain knowledge [32] or tuning a curriculum [8], and is based on a single concept: determinantal point processes (DPPs) [14].

2 Background

2.1 Reinforcement Learning

RL is the study of agents interacting with their environment in order to maximise their reward, formalised using the framework of Markov decision processes (MDPs) [29]. At each timestep t, an agent receives a state \(s_{t}\) from the environment, and samples an action \(a_{t}\) from its policy \(\pi (a_{t}|s_{t})\). The action \(a_{t}\) is then executed in the environment, yielding the next state \(s_{t+1}\) and a reward \(r_{t}\). In the episodic RL setting, the objective of the agent is to maximise its expected return \(\mathbb {E}[R]\) over a finite trajectory of length T:

$$\begin{aligned} \mathbb {E}[R] = \mathbb {E}\left[ \sum _{t=1}^{T} \gamma ^{t-1}r_{t}\right] , \end{aligned}$$
(1)

where \(\gamma \in [0, 1]\) is a discount factor that exponentially downplays the influence of future rewards, reducing the variance of the return.
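
As a concrete check of Eq. (1), the return of a finite trajectory can be computed directly from its reward sequence. The following is a minimal Python sketch; the discount value is only an example, not a setting taken from this paper:

```python
def discounted_return(rewards, gamma):
    """Eq. (1): R = sum_{t=1}^{T} gamma^{t-1} * r_t for a finite trajectory."""
    return sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))

# A sparse-reward episode that only succeeds at the final step (rewards as in Eq. (2)).
print(discounted_return([-1.0, -1.0, -1.0, 0.0], gamma=0.98))
```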

2.2 Goal-Oriented Reinforcement Learning

RL can be expanded to the multi-goal setting, where the agent’s policy and the environment’s reward function \(\mathcal {R}(s_t, a_t)\) are also conditioned on a goal g [11, 25]. In this work, we focus on the goal-oriented setting and environments proposed by OpenAI [23].

In this setting, every episode comes with a desired goal g, which specifies the desired configuration of a target object in the environment (which could include the agent itself). After executing action \(a_t\) at timestep t, the agent also observes the goal achieved in the resulting state, \(g^{ac}_{t+1}\). A transition in the environment is thus denoted as \((s_t, a_t, r_t, s_{t+1}, g, g^{ac}_{t+1})\). The environment provides a sparse reward function, where a negative reward is given unless the achieved goal is within a small distance \(\epsilon \) of the desired goal:

$$\begin{aligned} \mathcal {R}\left( g, g^{ac}_{t+1}\right) := {\left\{ \begin{array}{ll} 0 & \text {if } \left\| g^{ac}_{t+1} - g\right\| \le \epsilon \\ -1 & \text {otherwise}. \end{array}\right. } \end{aligned}$$
(2)

However, in this setting, the agent is unlikely to achieve a non-negative reward through random exploration. To overcome this, HER provides successful experiences for the agent to learn from by relabelling transitions during training: the agent trains on a hindsight desired goal \(g'\), which is set to the achieved goal \(g^{ac}_{t+1}\), with \(r_t\) recomputed using the environment reward function (Eq. 2).
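
To make Eq. (2) and the relabelling step concrete, the sketch below implements the sparse reward and relabels a single stored transition; the field order, the helper names, and the threshold value are illustrative assumptions, not the authors' code:

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, eps=0.05):
    # Eq. (2): 0 if the achieved goal is within eps of the desired goal, else -1.
    return 0.0 if np.linalg.norm(achieved_goal - desired_goal) <= eps else -1.0

def relabel(transition):
    """Hindsight relabelling: replace the desired goal g with the achieved goal
    g^{ac}_{t+1} and recompute the reward under Eq. (2)."""
    s, a, r, s_next, g, g_ac = transition
    g_prime = g_ac                          # hindsight desired goal g'
    r_prime = sparse_reward(g_ac, g_prime)  # 0 here; HER's 'future' strategy instead
                                            # picks g' from a later step of the episode
    return (s, a, r_prime, s_next, g_prime, g_ac)
```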

2.3 Deep Deterministic Policy Gradient

Deep deterministic policy gradient (DDPG) [16] is an off-policy actor-critic DRL algorithm for continuous control tasks, and is used as the baseline algorithm for HER [1, 8, 32]. The actor \(\pi _{\theta }(s_{t})\) is a policy network parameterised by \(\theta \), and outputs the agent’s actions. The critic \(Q_{\eta }(s_{t}, a_{t})\) is a state-action-value function approximator parameterised by \(\eta \), and estimates the expected return following a given state-action pair. The critic is trained by minimising \({\mathcal {L}_{c}=\mathbb {E}[(Q_{\eta }(s_{t}, a_{t}) - y_{t})^{2}]}\) where \({y_{t} = r_{t} + \gamma Q_{\eta }(s_{t+1}, \pi _{\theta }(s_{t+1}))}\). The actor is trained by maximising \({\mathcal {L}_{a} = \mathbb {E}[Q_{\eta }(s_{t}, \pi _{\theta }(s_{t}))]}\), backpropagating through the critic. Further implementation details can be found in prior work [1, 16].
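
The two objectives translate directly into code; the following PyTorch sketch computes \(\mathcal{L}_c\) and \(\mathcal{L}_a\) for a batch of transitions, omitting target networks, exploration noise, and goal conditioning (in the goal-oriented setting the goal would be concatenated to the state inputs):

```python
import torch
import torch.nn.functional as F

def ddpg_losses(actor, critic, batch, gamma):
    """Critic loss L_c and (negated) actor objective L_a from Sect. 2.3.

    actor(s) -> action, critic(s, a) -> scalar Q-value; both are torch.nn.Modules.
    """
    s, a, r, s_next = batch                               # tensors of shape [B, ...]
    with torch.no_grad():
        y = r + gamma * critic(s_next, actor(s_next))     # TD target y_t
    critic_loss = F.mse_loss(critic(s, a), y)             # L_c
    actor_loss = -critic(s, actor(s)).mean()              # minimising -Q maximises L_a
    return critic_loss, actor_loss
```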

2.4 Determinantal Point Processes

A DPP [14] is a stochastic process that characterises a probability distribution over subsets of a ground set, where the probability of a subset is proportional to the determinant of an associated kernel sub-matrix. In machine learning it is often used to quantify the diversity of a subset, with applications such as video [18] and document summarisation [10].

Formally, for a discrete set of points \(\mathcal {Y}=\{x_{1}, x_{2}, \cdots , x_{N}\}\), a point process \(\mathcal {P}\) is a probability measure over all \(2^{|\mathcal {Y}|}\) subsets. \(\mathcal {P}\) is a DPP if a random subset \(\mathbf {Y}\) is sampled with probability:

$$\begin{aligned} \mathcal {P}_{L}(\mathbf {Y}=Y) = \frac{\text {det}(L_{Y})}{\sum _{Y'\subseteq \mathcal {Y}} \text {det}(L_{Y'})} = \frac{\text {det}(L_{Y})}{\text {det}(L+I)}, \end{aligned}$$
(3)

where \(Y\subseteq \mathcal {Y}\), I is the identity matrix, \(L \in \mathbb {R}^{N\times N}\) is the positive semi-definite DPP kernel matrix, and \(L_{Y}\) is the sub-matrix with rows and columns indexed by the elements of the subset Y.

The kernel matrix L can be represented as the Gram matrix \(L = X^{T}X\), where each column of X is the feature vector of an item in \(\mathcal {Y}\). The determinant, \(\text {det}(L_{Y})\), represents the squared volume spanned by the vectors \(x_{i}\in Y\). From a geometric perspective, feature vectors that are closer to being orthogonal span a larger volume and hence have a larger determinant, so the corresponding subset is more likely to be sampled: \(\mathcal {P}_{L}(\mathbf {Y}=Y) \propto \text {det}(L_{Y})\). Using orthogonality as a measure of diversity, we leverage DPPs to sample diverse trajectories and goals.
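
The geometric intuition behind Eq. (3) is easy to verify numerically: for L2-normalised feature vectors, a near-orthogonal pair receives far higher probability than a near-parallel one. A minimal numpy sketch:

```python
import numpy as np

def dpp_prob(X, subset):
    """P_L(Y = subset) from Eq. (3), with L = X^T X and one feature vector per column of X."""
    L = X.T @ X
    L_Y = L[np.ix_(subset, subset)]
    return np.linalg.det(L_Y) / np.linalg.det(L + np.eye(L.shape[0]))

X = np.array([[1.0, 0.9, 0.0],     # column 1 is nearly parallel to column 0,
              [0.0, 0.1, 0.0],     # column 2 is orthogonal to column 0
              [0.0, 0.0, 1.0]])
X = X / np.linalg.norm(X, axis=0)  # L2-normalise each column
print(dpp_prob(X, [0, 1]))         # small: little volume spanned
print(dpp_prob(X, [0, 2]))         # larger: near-orthogonal pair
```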

3 Related Work

The proposed work is built on HER [1] as a way to effectively augment goal-oriented transitions from a replay buffer: to address the problem of sparse rewards, transitions from unsuccessful trajectories are turned into successful ones. HER uses an episodic replay buffer, with uniform sampling over trajectories, and uniform sampling over transitions. However, these samples may be redundant, and many may contribute little to the successful training of the agent.

In the literature, some efforts have been made to increase the efficiency of HER by prioritising more valuable episodes/transitions. Motivated by the work-energy principle in physics, HEBP [32] assigns higher probability to trajectories in which the target object has higher energy; once the episodes are sampled, the transitions are then sampled uniformly. However, HEBP requires knowing the semantics of the goal space in order to calculate the probability, which is proportional to the sum of the target’s potential, kinetic and rotational energies.

CHER [8] dynamically controls the sampling of transitions during training based on a mixture of goal proximity and diversity. Firstly, m episodes are uniformly sampled from the episodic replay buffer, and then a minibatch of \(k < m\) transitions is sampled according to the current state of the curriculum. The curriculum initially biases sampling towards achieved goals that are close to the desired goal (requiring a distance function), and later biases sampling towards diverse goals, using a k-nearest neighbour graph and a submodular function to sample a diverse subset of goals more efficiently (using the same distance function).

Other work has expanded HER in orthogonal directions. Hindsight policy gradient [24] and episodic self-imitation learning [6] apply HER to improve the efficiency of goal-based on-policy algorithms. Dynamic HER [7] and competitive ER [17] expand HER to the dynamic goal and multi-agent settings, respectively.

The use of DPPs in RL has been more limited, with applications to modelling the value functions of sets of agents in multi-agent RL [21, 31] and, most closely related to our work, to finding diverse policies [22].

4 Methodology

We now formally describe the two main components of our method, DTGSH: 1) a diversity-based trajectory selection module that samples valuable trajectories for the subsequent goal selection stage; 2) a diversity-based goal selection module that selects transitions with diverse goal states from the previously selected trajectories. Together, these select informative transitions from a large area of the goal space, improving the agent’s ability to learn and generalise.

4.1 Diversity-Based Trajectory Selection

We propose a diversity-based prioritisation method to select valuable trajectories for efficient training. Related to HEBP’s prioritisation of high-energy trajectories [32], we hypothesise that trajectories that achieve diverse goal states \(g^{ac}_{t}\) are more valuable for training; however, unlike HEBP, we do not require knowledge of the goal space semantics.

In a robotic manipulation task, the agent needs to move a target object from its initial position, \(g^{ac}_{1}\), to the target position, g. If the agent never moves the object, despite hindsight relabelling it will not be learning information that would directly help in task completion. On the other hand, if the object moves a lot, hindsight relabelling will help the agent learn about meaningful interactions.

In our approach, DPPs are used to model the diversity of achieved goal states \(g^{ac}_{t}\) in an episode, or subsets thereof. For a single trajectory \(\mathcal {T}_i\) of length T, we divide it into several partial trajectories \(\tau _{j}\) of length b, each containing the consecutive achieved goal states \(\{g^{ac}_{t}\}_{t=n:n+b-1}\). That is, with a sliding window of length \(b = 2\), a trajectory \(\mathcal {T}_i\) can be divided into \(N_p\) partial trajectories:

$$\begin{aligned} \mathcal {T}_{i} = \{\{\underbrace{g^{ac}_{1}, g^{ac}_{2}}_{\tau _{1}}\}, \{\underbrace{g^{ac}_{2}, g^{ac}_{3}}_{\tau _{2}}\}, \{\underbrace{g^{ac}_{3}, g^{ac}_{4}}_{\tau _{3}}\}, \cdots , \{\underbrace{g^{ac}_{T-1}, g^{ac}_{T}}_{\tau _{N_{p}}}\}\}. \end{aligned}$$
(4)

The diversity \(d_{\tau _{j}}\) of each partial trajectory \(\tau _{j}\) can be computed as:

$$\begin{aligned} d_{\tau _{j}} = \text {det}(L_{\tau _{j}}), \end{aligned}$$
(5)

where \(L_{\tau _{j}}\) is the kernel matrix of partial trajectory \(\tau _{j}\):

$$\begin{aligned} L_{\tau _{j}} = M^{T}M, \end{aligned}$$
(6)

and \(M=[\hat{g}^{ac}_{n}, \hat{g}^{ac}_{n+1}, \cdots , \hat{g}^{ac}_{n+b-1}]\), where each \(\hat{g}^{ac}\) is the \(\ell _2\)-normalised version of the achieved goal \(g^{ac}\) [13]. Finally, the diversity \(d_\mathcal {T}\) of trajectory \(\mathcal {T}\) is the sum of the diversity of its \(N_p\) constituent partial trajectories:

$$\begin{aligned} d_\mathcal {T}= \sum _{j=1}^{N_{p}} d_{\tau _{j}}. \end{aligned}$$
(7)
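
Under these definitions, the diversity score of a trajectory depends only on its achieved goals. A minimal numpy sketch of Eqs. (4)–(7) follows; the function name and the small numerical epsilon are ours:

```python
import numpy as np

def trajectory_diversity(achieved_goals, b=2):
    """d_T from Eq. (7): sum of det(M^T M) over sliding windows of b consecutive
    L2-normalised achieved goals (Eqs. (4)-(6))."""
    G = np.asarray(achieved_goals, dtype=np.float64)           # shape [T, goal_dim]
    G = G / (np.linalg.norm(G, axis=1, keepdims=True) + 1e-8)  # \hat{g}^{ac}
    diversity = 0.0
    for n in range(G.shape[0] - b + 1):                        # N_p = T - b + 1 windows
        M = G[n:n + b].T                                       # columns are goal vectors
        diversity += np.linalg.det(M.T @ M)                    # Eq. (5)
    return diversity
```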

Similarly to HEBP [32], we use a non-uniform episode sampling strategy. During training, we prioritise sampling episodes proportionally to their diversity; the probability \(p(\mathcal {T}_{i})\) of sampling trajectory \(\mathcal {T}_{i}\) from a replay buffer of size \(N_{e}\) is:

$$\begin{aligned} p(\mathcal {T}_{i}) = \frac{d_{\mathcal {T}_{i}}}{\sum _{n=1}^{N_{e}} d_{\mathcal {T}_{n}}}. \end{aligned}$$
(8)
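
Given the stored diversity scores, the prioritised episode sampling of Eq. (8) reduces to a normalised categorical draw. A small sketch (sampling here is with replacement; the exact scheme is an implementation detail not specified in this section):

```python
import numpy as np

def sample_episode_indices(diversity_scores, m):
    """Draw m episode indices with probabilities p(T_i) given by Eq. (8)."""
    d = np.asarray(diversity_scores, dtype=np.float64)
    return np.random.choice(len(d), size=m, p=d / d.sum())
```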

4.2 Diversity-Based Goal Selection

In prior work [1, 32], after selecting the trajectories from the replay buffer, one transition from each selected trajectory is sampled uniformly to construct a minibatch for training. However, the modified goals \(g^{\prime }\) in the minibatch might be similar, resulting in redundant information. In order to form a minibatch with diverse goals for more efficient learning, we use k-DPPs [13] for sampling goals. Compared to the standard DPP, a k-DPP is a conditional DPP where the subset Y has a fixed size k, with the probability distribution function:

$$\begin{aligned} \mathcal {P}_{L}^{k}(\mathbf {Y}=Y) = \frac{\text {det}(L_{Y})}{\sum _{|Y^{\prime }|=k} \text {det}(L_{Y^{\prime }})}. \end{aligned}$$
(9)

k-DPPs are more appropriate for goal selection with a minibatch of fixed size k. Given \(m > k\) trajectories sampled from the replay buffer, we first uniformly sample a transition from each of the m trajectories. Finally, a k-DPP is used to sample a diverse set of transitions based on the relabelled goals \(g'\) (which, in this context, we denote as “candidate goals”). Figure 2a gives an example of uniform vs. k-DPP sampling, demonstrating the increased coverage of the latter. Figure 2b provides corresponding estimated density plots; note that the density of the k-DPP samples is actually more uniform over the support of the candidate goal distribution.
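
Exact k-DPP sampling follows Kulesza and Taskar [13] and requires an eigendecomposition of the kernel. As a simpler illustration of the same idea, the sketch below greedily maximises the determinant of the selected sub-matrix over the candidate goals; this is a heuristic approximation for exposition, not the sampler used in DTGSH:

```python
import numpy as np

def greedy_diverse_goals(candidate_goals, k):
    """Greedily pick k of the m candidate goals so that the determinant of the
    corresponding sub-matrix of the Gram kernel stays as large as possible
    (an approximation to drawing one sample from a k-DPP)."""
    G = np.asarray(candidate_goals, dtype=np.float64)
    G = G / (np.linalg.norm(G, axis=1, keepdims=True) + 1e-8)  # L2-normalise
    L = G @ G.T                                                # m x m kernel
    selected = [0]          # arbitrary first pick; all normalised goals have unit norm
    while len(selected) < k:
        best_i, best_det = None, -np.inf
        for i in range(len(G)):
            if i in selected:
                continue
            idx = selected + [i]
            det = np.linalg.det(L[np.ix_(idx, idx)])
            if det > best_det:               # candidate that adds the most volume
                best_i, best_det = i, det
        selected.append(best_i)
    return selected         # indices of the k transitions to put in the minibatch
```

In the 2D Push example of Fig. 2, this kind of selection spreads the chosen goals over the support of the candidate distribution instead of concentrating them where candidates are dense.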

Fig. 2. Visualisation of \(k=32\) goals selected from \(m=100\) candidate goals of the Push task using either uniform sampling or k-DPP sampling. The candidate goals are distributed over a 2D (xy) space. Note that k-DPP sampling (right-hand plots) results in a broader span of selected goals in xy space compared to uniform sampling (left-hand plots).

Algorithm 1. Diversity-based goal selection.
Algorithm 2. Diversity-based trajectory and goal selection with HER (DTGSH).

Algorithm 1 shows the details of the goal selection subroutine, and Algorithm 2 gives the overall algorithm for our method, DTGSH.

5 Experiments

We evaluate our proposed method, and compare it with current state-of-the-art HER-based algorithms [1, 8, 32] on challenging robotic manipulation tasks [23], pictured in Fig. 3. Furthermore, we perform ablation studies on our diversity-based trajectory and goal selection modules. Our code is based on OpenAI Baselines, and is available at: https://github.com/TianhongDai/div-hindsight.

Fig. 3. Robotic manipulation environments. (a–b) use the Fetch robot, and (c–e) use the Shadow Dexterous Hand.

5.1 Environments

The robotic manipulation environments used for training and evaluation comprise five different tasks. Two tasks use the 7-DoF Fetch robotic arm with a two-fingered parallel gripper: Push and Pick&Place, which both require the agent to move a cube to the target position. The remaining three tasks use a 24-DoF Shadow Dexterous Hand to manipulate an egg, a block and a pen, respectively. The sparse reward function is given by Eq. (2).

In the Fetch environments, the state \(s_{t}\) contains the position and velocity of the joints, and the position and rotation of the cube. Each action \(a_{t}\) is a 4-dimensional vector, with three dimensions specifying the relative position of the gripper, and the final dimension specifying the state of the gripper (i.e., open or closed). The desired goal g is the target position, and the achieved goal \(g^{ac}_{t}\) is the position of the cube. Each episode is of length \(T = 50\).

In the Shadow Dexterous Hand environments, the state \(s_{t}\) contains the position and velocity of the joints. Each action \(a_{t}\) is a 20-dimensional vector which specifies the absolute positions of the 20 non-coupled joints of the hand. The desired goal g and achieved goal \(g^{ac}_t\) specify the rotation of the object for the block and pen tasks, and the position and rotation of the object for the egg task. Each episode is of length \(T = 200\).
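
For reference, the goal-oriented interface of these environments exposes the desired and achieved goals directly in the observation dictionary. The sketch below assumes the classic `gym` robotics environments and API used by OpenAI Baselines; environment IDs and the reset/step signatures differ in newer gymnasium releases:

```python
import gym  # requires gym with the MuJoCo-based robotics environments installed

env = gym.make('FetchPush-v1')  # ID assumed; the Hand tasks use e.g. 'HandManipulateEgg-v0'
obs = env.reset()
# Goal-oriented observations are dictionaries:
#   obs['observation']   - joint positions/velocities and object pose (s_t)
#   obs['achieved_goal'] - current configuration of the object (g^{ac}_t)
#   obs['desired_goal']  - target configuration for this episode (g)
action = env.action_space.sample()          # 4-D for Fetch, 20-D for the Hand tasks
obs, reward, done, info = env.step(action)
# The sparse reward of Eq. (2) can be recomputed for relabelled goals:
reward = env.compute_reward(obs['achieved_goal'], obs['desired_goal'], info)
```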

5.2 Training Settings

We base our training setup on CHER [8]. We train all agents on minibatches of size \(k = 64\) for 50 epochs using MPI for parallelisation over 16 CPU cores; each epoch consists of 1600 (\(16 \times 100\)) episodes, with evaluation over 160 (\(16 \times 10\)) episodes at the end of each epoch. Remaining hyperparameters for the baselines are taken from the original work [1, 8, 32]. Our method, DTGSH, uses partial trajectories of length \(b = 2\) and \(m = 100\) as the number of candidate goals.
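
For convenience, the hyperparameters named in this section can be summarised in one place; values not listed here follow the respective baselines and are not reproduced in this sketch:

```python
# Hyperparameters stated in Sect. 5.2 (all other settings follow the baseline implementations).
config = dict(
    minibatch_size_k=64,          # k: transitions per training minibatch
    epochs=50,
    mpi_workers=16,               # CPU cores used for parallel rollouts
    episodes_per_epoch=1600,      # 16 x 100
    eval_episodes_per_epoch=160,  # 16 x 10
    partial_trajectory_b=2,       # sliding-window length for Eqs. (4)-(7)
    candidate_goals_m=100,        # candidate transitions before diverse goal selection
)
```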

5.3 Benchmark Results

We compare DTGSH to DDPG [16], DDPG+HER [1], DDPG+HEBP [32] and DDPG+CHER [8]. Evaluation results are given based on repeated runs with 5 different seeds; we plot the median success rate with upper and lower bounds given by the \(75^{\mathrm{th}}\) and \(25^{\mathrm{th}}\) percentiles, respectively.

Fig. 4. Success rate of DTGSH and baseline approaches.

Figure 4 and Table 1 show the performance of DDPG+DTGSH and baseline approaches on all five tasks. In the Fetch tasks, DDPG+DTGSH and DDPG+HEBP both learn significantly faster than the other methods, while in the Shadow Dexterous Hand tasks DDPG+DTGSH learns the fastest and achieves higher success rates than all other methods. In particular, DDPG cannot solve any tasks without using HER, and CHER performs worse in the Fetch tasks. We believe the results highlight the importance of sampling both diverse trajectories and goals, as in our proposed method, DTGSH.

Table 1. Final mean success rate ± standard deviation, with best results in bold.

5.4 Ablation Studies

In this section, we perform the following experiments to investigate the effectiveness of each component of DTGSH: 1) diversity-based trajectory selection with HER (DTSH) and diversity-based goal selection with HER (DGSH) are evaluated independently to assess the contribution of each stage; 2) the effect of different partial trajectory lengths b is measured; 3) the effect of different candidate goal set sizes m is measured.

Fig. 5. Success rate of HER, DTGSH, and ablations DTSH and DGSH.

Figure 5 shows the performance of DTSH and DGSH independently. DDPG+DTSH substantially outperforms DDPG+HER in all tasks, which supports the view that sampling trajectories with diverse achieved goals improves performance. Furthermore, unlike DDPG+HEBP, DTSH does not require knowing the structure of the goal space in order to calculate the energy of the target object. DDPG+DGSH achieves better performance than DDPG+HER in three environments, and is only worse in one. DGSH performs better in environments where the task is easier to solve (e.g., the Fetch tasks), and hence the selected trajectories are more likely to contain useful transitions. However, DTGSH, which combines both modules, performs the best overall.

Fig. 6. Success rate of DTGSH with different partial trajectory lengths b and different candidate goal set sizes m.

Figure 6 shows the performance of DDPG+DTGSH with different partial trajectory lengths b and different candidate goal set sizes m. In this work, we use \(b = 2\) and \(m = 100\) as the defaults. Performance degrades with \(b \gg 2\), indicating that pairwise diversity is best for learning in our method. \(m \gg 100\) does not affect performance in the Fetch environments, but degrades performance in the Shadow Dexterous Hand environments.

5.5 Time Complexity

Table 2 gives example training times of all of the HER-based algorithms. DTGSH requires an additional diversity score calculation, with cost \(\mathcal {O}(N_{p}b^3)\), at the end of every training episode, and an additional sampling cost of \(\mathcal {O}(mk^2)\) for each minibatch.

Table 2. Training time (hours:minutes:seconds) of DTGSH and baseline approaches on the Push task for 50 epochs.

6 Conclusion

In this paper, we introduced diversity-based trajectory and goal selection with hindsight experience replay (DTGSH) to improve the learning efficiency of goal-oriented RL agents in the sparse reward setting. Our method can be divided into two stages: 1) valuable trajectories are selected according to diversity-based priority, as modelled by determinantal point processes (DPPs) [14]; 2) k-DPPs [13] are leveraged to sample transitions with diverse goal states from previously selected trajectories for training. Our experiments empirically show that DTGSH achieves faster learning and higher final performance on five challenging robotic manipulation tasks, compared to previous state-of-the-art approaches [1, 8, 32]. Furthermore, unlike prior extensions of hindsight experience replay, DTGSH does not require semantic knowledge of the goal space [32], and does not require tuning a curriculum [8].