
1 Introduction

Reinforcement learning (RL) [10] is designed to predict and control an agent so that it accomplishes different kinds of tasks through interaction with the environment, guided by rewards. RL combined with deep learning [5] has proven to be an effective framework in a wide range of domains. However, many great challenges remain in Deep Reinforcement Learning (DRL), one of which is making the agent learn efficiently with sparse rewards. In environments with sparse rewards, the reward is zero in most transitions and non-zero only when the agent reaches certain special states, which makes it extremely difficult for the policy network to infer the correct behavior in long-horizon decision making. To tackle this challenge, the Universal Value Function Approximator (UVFA) [9] was proposed to sample goals from some of these special states; it extends the value function to be defined not just over states but also over goals. This is equivalent to giving the value function higher-dimensional states as parameters so that it gains extra information across different episodes. Lillicrap et al. developed Deep Deterministic Policy Gradient (DDPG) [6], which uses Gaussian noise for exploration and significantly improves performance on continuous control tasks such as manipulation and locomotion. Experience Replay (ER) [7] is a technique that stores and reuses past experiences with a replay buffer. Inspired by the above methods, Hindsight Experience Replay (HER) [1] replaces the desired goals of training trajectories with goals sampled in the replay buffer and thereby leverages the rich repository of failed experiences. With HER, an RL agent can learn to accomplish complex robotic manipulation tasks [8] that are nearly impossible to solve with general RL algorithms.

Nevertheless, the above methods based on maximizing the expected return still suffer from a problem called intrinsic stochasticity [2]. This phenomenon occurs because the return depends on internal registers and is not truly observable. In many cases the return can be regarded as a constant-valued function over states, for instance in a maze, so the optimal value of any state should also be constant after sufficient training. In other cases, however, different initial states of the environment lead to significantly different value functions, which together form value distributions; this is called parametric uncertainty [3]. Furthermore, the MDP itself does not include past rewards in the current state, so it cannot even distinguish predictions for different steps at which rewards are received; this is called MDP intrinsic randomness [3]. These two factors are the main sources of intrinsic stochasticity.

In environments with sparse rewards, intrinsic stochasticity arises mainly from parametric uncertainty. The initial goals in Multi-goal RL, as part of the environment, may be completely different from each other, and the value distributions are significantly affected by the distribution of goals. However, by working with the expected return, HER and its variants ignore the intrinsic stochasticity caused by the distribution of initial goals and mix the parameters of different value distributions during training. In principle, this may cause instability and performance degradation, especially when the number of goals is large.

Inspired by these insights, in this paper we propose a novel method called Quantile Regression Hindsight Experience Replay (QR-HER) to mitigate the intrinsic stochasticity of the training process in Multi-goal RL. Building on Quantile Regression, our key idea is to reduce the interference between different goals by selecting, from similar goals in the replay buffer, the returns appropriate for the current goal. We evaluate QR-HER on the representative OpenAI Robotics environments and find that it achieves better performance than HER and its state-of-the-art variant CHER [4]. Furthermore, we infer that the performance improvement of QR-HER is due to an enhanced policy for each individual goal.

2 Preliminary

2.1 Universal Value Function Approximators

UVFA [9] proposed utilizing the concatenation of states \(s\in \mathcal {S}\) and goals \(g\in \mathcal {G}\) as higher dimensional universal states (s, g), such that the value function approximators V(s) and Q(s, a) can be generalized as V(s, g) and Q(s, a, g). The goals can also be called goal states since in general \(\mathcal {G}\subset \mathcal {S}\).
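
As a minimal sketch of this idea (the function and variable names below are our own, not from the original implementation), the universal state is simply the concatenation of the state and goal vectors fed to the value networks:

```python
import numpy as np

def universal_state(state: np.ndarray, goal: np.ndarray) -> np.ndarray:
    """Concatenate state s and goal g into the universal state (s, g)
    consumed by V(s, g) or Q(s, a, g)."""
    return np.concatenate([state, goal], axis=-1)

# Example: a 10-dimensional state combined with a 3-dimensional goal
s, g = np.zeros(10), np.array([0.5, -0.2, 0.1])
x = universal_state(s, g)  # shape (13,)
```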

2.2 Multi-goal RL and HER

In Multi-goal RL, random exploration is unlikely to reach the goals, and even if the agent is lucky enough to reach one goal, it rarely has enough experience to reach the next. To address this challenge, [1] proposed Hindsight Experience Replay (HER), which includes two key techniques, \(reward\;shaping\) and \(goal\;relabelling\). The first technique, \(reward\;shaping\), makes the reward function dependent on a goal \(g \in G\), such that \(r_{g}: S \times A \times G \rightarrow R\). The formula is given by:

$$\begin{aligned} r_{t}=r_{g}\left( s_{t}, a_{t}, g\right) =\left\{ \begin{array}{c}{0, \text{ if } \left| s_{t}-g\right| <\delta } \\ {-1, \text{ otherwise } }\end{array}\right. \end{aligned}$$
(1)

where we can see that this trick provides many more informative virtual returns to support training. The other technique, goal relabelling, replays each trajectory with different goals sampled from the intermediate states according to specific schemes. For the transition \(\left( s_{t}\left\| g, a_{t}, r_{t}, s_{t+1}\right\| g\right) \), we store the hindsight transition \(\left( s_{t}\left\| g^{\prime }, a_{t}, r^{\prime }, s_{t+1}\right\| g^{\prime }\right) \) in the replay buffer instead.
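
The following sketch illustrates the two techniques under the assumption that goals live in the same space as the achieved states; the "future" relabelling scheme and all names here are illustrative rather than the authors' code:

```python
import numpy as np

def sparse_reward(achieved, goal, delta=0.05):
    """Eq. (1): reward 0 if the achieved state is within delta of the goal, else -1."""
    return 0.0 if np.linalg.norm(achieved - goal) < delta else -1.0

def relabel_future(trajectory, k=4, delta=0.05, rng=np.random):
    """Goal relabelling: for each transition (s_t, a_t, s_{t+1}, g), also store
    k hindsight transitions whose goal g' is a state achieved at a later step."""
    hindsight = []
    for t, (s, a, s_next, g) in enumerate(trajectory):
        for _ in range(k):
            future = rng.randint(t, len(trajectory))   # sample a later time step
            g_new = trajectory[future][2]              # its achieved state becomes the goal
            r_new = sparse_reward(s_next, g_new, delta)
            hindsight.append((s, a, r_new, s_next, g_new))
    return hindsight
```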

3 Quantile Regression Hindsight Experience Replay

3.1 Distributional Multi-Goal RL Objective

For convenience, we denote the state s by x. The Multi-goal Bellman operator is given as:

$$\begin{aligned} \mathcal {T}^{\pi } Q(x, a, g)=\mathbb {E}[R(x, a, g)]+\gamma \mathbb {E}_{P, \pi }\left[ Q\left( x^{\prime }, a^{\prime }, g\right) \right] . \end{aligned}$$
(2)

Using the above formula, the distributional Bellman operator is given as:

$$\begin{aligned} \begin{aligned} \mathcal {T}^{\pi } Z(x, a, g)&: {\mathop {=}\limits ^{D}} R(x, a, g)+\gamma Z\left( x^{\prime }, a^{\prime }, g\right) \\ x^{\prime }&\sim P(\cdot | x, a, g), a^{\prime } \sim \pi \left( \cdot | x^{\prime }, g\right) , \end{aligned}\end{aligned}$$
(3)

where Z denotes the value distribution of Q, and \(Z: {\mathop {=}\limits ^{D}}U\) denotes equality of probability laws, that is, the random variable Z is distributed according to the same law as U.
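
Concretely, with the quantile representation introduced below, Z(x, a, g) is a vector of Quant return values, and a sample-based target for Eq. (3) can be formed as in the following sketch (shapes and names are our own assumptions):

```python
import torch

def distributional_target(reward, done, z_next, gamma=0.97):
    """One-step distributional Bellman target T^pi Z = r + gamma * Z(x', a', g).

    reward, done: (batch,) tensors; z_next: (batch, Quant) quantile values of
    the next universal state-action pair under the current policy.
    """
    reward = reward.unsqueeze(-1)          # broadcast the scalar reward over quantiles
    not_done = (1.0 - done).unsqueeze(-1)  # zero out the bootstrap term at terminal states
    return reward + gamma * not_done * z_next
```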

3.2 The Wasserstein Metric

For different goals g, different value distributions Z are produced. In order to measure and minimize the gap among different value distributions, we introduce the p-Wasserstein metric, defined via the inverse CDF (Cumulative Distribution Function) \(F^{-1}\) as:

$$\begin{aligned} W_{p}(Z_{G}, Z_{G^{\prime }})=\left( \int _{0}^{1}\left| F_{Z_{G}}^{-1}(\omega )-F_{Z_{G^{\prime }}}^{-1}(\omega )\right| ^{p} d \omega \right) ^{1 / p},\end{aligned}$$
(4)

where G denotes the initial desired goals generated by the environment, as distinguished from the full set of sampled goals g, \(G \subset g\). G, rather than the goals generated by hindsight replay, is the main source of intrinsic stochasticity in Multi-goal RL. In expected-return RL, for the current state we assume that there is only one value distribution and average over its values. In quantile distributional RL, by contrast, we divide the probability space into identical small blocks. For each block, we recover the corresponding returns of the different value distributions via the inverse CDF \(F^{-1}\). When making action decisions, we then consider all the return values in the blocks and select one from this comprehensive view rather than simply averaging.
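
When both distributions are represented by Quant equally weighted quantile values, as in our parameterization, the integral in Eq. (4) reduces to an average over matched (sorted) quantiles; a minimal sketch under this assumption:

```python
import numpy as np

def wasserstein_p(z_a, z_b, p=1):
    """p-Wasserstein distance of Eq. (4) between two value distributions, each
    given as a vector of Quant equally weighted quantile values (i.e.,
    piecewise-constant inverse CDFs)."""
    z_a, z_b = np.sort(z_a), np.sort(z_b)
    return float(np.mean(np.abs(z_a - z_b) ** p) ** (1.0 / p))

# Example: two toy return distributions over 5 quantiles
print(wasserstein_p(np.array([0., 0., -1., -1., -1.]), np.zeros(5)))  # 0.6
```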

The quantile distributional parameters and the corresponding inverse CDF are not available, so we introduce a Quantile Regression network as the function to be learned from samples. The number of blocks, called Quant, is fixed, and the output of the regression network is the Z vector consisting of the returns at the quantiles. In the Bellman update, the Bellman operator continuously changes the values of the Z vector until convergence. We can therefore use the Wasserstein metric to compute a Quantile Regression loss between the Z vectors of the current state and the next state for training the network, given by:

$$\begin{aligned} \begin{array}{c} \mathcal {L}_{\mathrm {QR}}^{\tau }(\theta ):=\mathbb {E}_{\hat{Z} \sim Z}\left[ \rho _{\tau }(\hat{Z}-\theta )\right] , \text{ where } \\ \rho _{\tau }(u)=u\left( \tau -\delta _{\{u<0\}}\right) , \forall u \in \mathbb {R} \end{array}\end{aligned}$$
(5)

where \(\theta \) is the parameter fitted to the unknown inverse CDF. To minimize one step of the Bellman update, \(\int _{\tau }^{\tau ^{\prime }}\left| F^{-1}(\omega )-\theta \right| d \omega \), it can be deduced mathematically that the minimizers satisfy:

$$\begin{aligned} \left\{ \theta \in \mathbb {R} | F(\theta )=\left( \frac{\tau +\tau ^{\prime }}{2}\right) \right\} , \end{aligned}$$
(6)

Then, if \(F^{-1}\) is continuous at \(\left( \tau +\tau ^{\prime }\right) / 2\), we can take \(\theta =F^{-1}\left( \left( \tau +\tau ^{\prime }\right) / 2\right) \) as the unique minimizer.
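
As a concrete instance (our own illustration): for the N equally spaced quantile fractions \(\tau _{i}=i/N\), \(i=0,\dots ,N\), used by the regression network, the minimizers are the quantile midpoints

$$\begin{aligned} \hat{\tau }_{i}=\frac{\tau _{i-1}+\tau _{i}}{2}=\frac{2 i-1}{2 N}, \qquad \theta _{i}=F^{-1}\left( \hat{\tau }_{i}\right) , \quad i=1, \ldots , N. \end{aligned}$$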

However, the Quantile Regression loss is not smooth at zero, so we use the Huber loss \(\rho _{\tau }^{\kappa }(u)\) in place of \(\rho _{\tau }(u)\), given by:

$$\begin{aligned} \mathcal {L}_{\kappa }(u)=\left\{ \begin{array}{ll} \frac{1}{2} u^{2}, &{} \text{ if } |u| \le \kappa \\ \kappa \left( |u|-\frac{1}{2} \kappa \right) , &{} \text{ otherwise } \end{array}\right. , \rho _{\tau }^{\kappa }(u)=\left| \tau -\delta _{\{u<0\}}\right| \frac{\mathcal {L}_{\kappa }(u)}{\kappa } \end{aligned}$$
(7)
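
A PyTorch sketch of the quantile Huber loss of Eqs. (5) and (7), computed between the predicted quantile vector and a target sample of Z; the tensor shapes, midpoint fractions, and all names are our own assumptions rather than the authors' implementation:

```python
import torch

def quantile_huber_loss(z_pred, z_target, kappa=1.0):
    """Quantile Huber loss (Eqs. 5 and 7).

    z_pred:   (batch, Quant) predicted quantile values theta_i.
    z_target: (batch, Quant) target quantile values, treated as samples of Z.
    """
    quant = z_pred.shape[1]
    # Midpoint fractions tau_hat_i = (2i - 1) / (2 * Quant), the minimizers of Eq. (6)
    tau_hat = (torch.arange(quant, dtype=z_pred.dtype, device=z_pred.device) + 0.5) / quant

    # Pairwise TD errors u_ij = z_target_j - z_pred_i, shape (batch, Quant, Quant)
    u = z_target.unsqueeze(1) - z_pred.unsqueeze(2)

    huber = torch.where(u.abs() <= kappa,
                        0.5 * u.pow(2),
                        kappa * (u.abs() - 0.5 * kappa))
    # rho^kappa_tau(u) = |tau - 1{u < 0}| * L_kappa(u) / kappa
    rho = (tau_hat.view(1, -1, 1) - (u < 0).to(u.dtype)).abs() * huber / kappa
    return rho.mean()
```

In this sketch, z_target would come from the distributional Bellman target above and be detached from the computation graph before the loss is applied.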

3.3 Return Selection for Multi-goal RL

As discussed in the Introduction and Preliminaries, the different initial goals are the main cause of intrinsic stochasticity in RL with sparse rewards. When updating the parameters for the current goal, we should exclude interference from other, less relevant goals as much as possible. Hence, we propose using the Wasserstein metric to filter out the interference from value distributions of goals with low correlation, as in the following formula:

$$\begin{aligned} \begin{aligned} Z_{\theta }(x, a, g, G):=\frac{1}{N} \sum _{i=1}^{N} \sum _{G_{\epsilon }}^{}\delta _{\theta _{i}(x, a, g, G_{\epsilon })}, G_{\epsilon }\subset G,\frac{W_{p}(Z_{G}, Z_{G_{\epsilon }})}{W_{p}(Z_{G}, 0)}<\epsilon , \end{aligned}\end{aligned}$$
(8)

where we only adopt \(G_{\epsilon }\), a subset of G, for the return selection used to update the value network. Value distributions that differ significantly from that of the current initial goal are therefore not selected from the replay buffer, in a self-attention-like manner. This method is somewhat similar to knowledge distillation in deep learning.
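
A sketch of the selection rule in Eq. (8): among the initial goals whose value distributions are stored, we keep only those within a relative Wasserstein distance \(\epsilon \) of the current goal's distribution (names and data layout are illustrative):

```python
import numpy as np

def w_p(z_a, z_b, p=1):
    """p-Wasserstein distance between two equally weighted quantile vectors."""
    return float(np.mean(np.abs(np.sort(z_a) - np.sort(z_b)) ** p) ** (1.0 / p))

def select_goals(z_current, z_by_goal, epsilon=0.2, p=1):
    """Return G_epsilon of Eq. (8): the stored initial goals whose value
    distributions are close to the current goal's, relative to W_p(Z_G, 0)."""
    norm = max(w_p(z_current, np.zeros_like(z_current), p), 1e-8)  # avoid division by zero
    return [goal for goal, z in z_by_goal.items()
            if w_p(z_current, z, p) / norm < epsilon]
```

Transitions associated with goals outside \(G_{\epsilon }\) are simply not sampled when updating the value network for the current goal.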

3.4 Algorithm

Utilizing the above derivations, we propose the algorithm for Quantile Regression Multi-goal RL as Algorithm 1.

4 Experiments

4.1 Environments

We evaluate QR-HER and compare it to HER and its state-of-the-art variant on several challenging robotic manipulation tasks in the simulated MuJoCo Robotics environments [8], as Fig. 1 shows. These include two kinds of tasks, Fetch robotic arm tasks and Shadow Dexterous Hand tasks. Both kinds of tasks have sparse binary rewards and follow the Multi-goal RL framework. We choose the most challenging tasks, FetchSlide and HandManipulatePen, for our experiments.

4.2 Implementation Details

We run the experiments using PyTorch on a machine with two 14-core Intel Xeon E5-2690 v4 CPUs and four TITAN X (Pascal) GPUs. To make a fair comparison, each off-policy algorithm is implemented with identical hyper-parameters. In the experiments, one epoch consists of 500 episodes with a unique seed (one goal), and 10 percent of the episodes are used as the test set to compute the mean success rate. The seeds differ between epochs. Both the policy networks and the value networks are MLPs with three hidden layers (256, 256, 256), optimized with Adam using a critic learning rate of 0.001 and an actor learning rate of \(3 \times 10^{-4}\). The replay buffer size is \(10^{6}\) and the batch size is 64. The discount factor \(\gamma \) for the Bellman backup is 0.97 and the Polyak coefficient for target network updates is 0.95. The distributional parameter Quant is chosen from [20, 50, 100, 200, 500] and the range of the return selection parameter \(\epsilon \) is [0.1, 0.3].
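
For reference, the hyper-parameters above can be gathered into a single configuration (an illustrative summary of the values reported in this section, not the authors' code):

```python
# Hyper-parameters reported in Sect. 4.2.
CONFIG = {
    "hidden_layers": (256, 256, 256),             # MLP sizes for both actor and critic
    "critic_lr": 1e-3,
    "actor_lr": 3e-4,
    "replay_buffer_size": int(1e6),
    "batch_size": 64,
    "gamma": 0.97,                                # discount for the Bellman backup
    "polyak": 0.95,                               # target-network averaging coefficient
    "quant_candidates": [20, 50, 100, 200, 500],  # values tried for Quant
    "epsilon_range": (0.1, 0.3),                  # return-selection threshold range
    "episodes_per_epoch": 500,
    "test_fraction": 0.1,                         # fraction of episodes used for testing
}
```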

4.3 Benchmark Performance

In the benchmark experiments, a higher mean success rate represents better performance on the robotic manipulation tasks. We compare the mean success rates in Fig. 1, where the shaded area represents the standard deviation across different seeds. The raw training curves are quite unstable, so we smooth them with a filter.

Fig. 1. The OpenAI Robotics experiments for QR-HER.

The agent trained with QR-HER shows the best benchmark performance at the end of training. The value of Quant is the key hyper-parameter of QR-HER, as shown in Fig. 1: the performance at Quant = 20 is 20% higher than at Quant = 200. We conclude that the agent performs best when Quant is about half the number of goals (epochs).

4.4 Performance Analysis of Each Goal

According to our assumption, the quantile regression method with return selection should reduce the interference between different goals and thereby improve the performance on each goal. The corresponding results are shown in Table 1, from which we can infer that QR-HER improves the overall performance through optimizing the policy for each goal, as expected.

Table 1. Final success rates in HandManipulatePenRotate with different goals (seeds)

5 Summary

The main contributions of this paper are summarized as follows: (1) We raise the issue of performance instability and degradation in Multi-goal RL and attribute its cause to intrinsic stochasticity; (2) We introduce the Wasserstein metric and Quantile Regression into Multi-goal RL to derive QR-HER; (3) We show that QR-HER exceeds HER and its variants, achieving state-of-the-art performance on OpenAI Robotics; (4) We show that QR-HER improves the performance on each individual goal, which provides strong evidence for the correctness of our analysis.