
1 Introduction

Reinforcement learning (RL) [10] is designed to predict and control an agent so that it accomplishes different kinds of tasks through interaction with the environment, guided by rewards. RL combined with deep learning [5] has proven to be an effective framework in a wide range of domains. However, many great challenges remain in Deep Reinforcement Learning (DRL), one of which is making the agent learn efficiently with sparse rewards. In environments with sparse rewards, the reward is zero in most transitions and non-zero only when the agent reaches certain special states, which makes it extremely difficult for the policy network to infer the correct behavior in long-horizon decision making. To tackle this challenge, the Universal Value Function Approximator (UVFA) [9] was proposed to sample goals from some of these special states; it extends the value function to be defined not just over states but also over goals. This is equivalent to giving the value function higher-dimensional states as parameters so that it gains extra information across different episodes. Lillicrap et al. developed Deep Deterministic Policy Gradient (DDPG) [6], which uses Gaussian noise for exploration and significantly improves performance on continuous control tasks such as manipulation and locomotion. Experience Replay (ER) [7] is a technique that stores and reuses past experiences with a replay buffer. Inspired by the above methods, Hindsight Experience Replay (HER) [1] replaces the desired goals of training trajectories with goals sampled in the replay buffer and thereby leverages the rich repository of failed experiences. With HER, an RL agent can learn to accomplish complex robotic manipulation tasks [8] that are nearly impossible to solve with general RL algorithms.

Nevertheless, the above methods based on maximizing the expected return still suffer from a problem called intrinsic stochasticity [2]. This phenomenon occurs because the return depends on internal registers and is not truly observable. In many cases the return can be regarded as a constant-valued function over states, for instance in a maze, so the optimal value of any state should also be constant after sufficient training. In other cases, however, different initial states of the environment lead to significantly different value functions, which together form value distributions; this is called parametric uncertainty [3]. Furthermore, the MDP itself does not include past rewards in the current state, so it cannot even distinguish predictions for different steps at which rewards are received; this is called MDP intrinsic randomness [3]. These two factors are the main sources of intrinsic stochasticity.

In environments with sparse rewards, intrinsic stochasticity arises mainly from parametric uncertainty. The initial goals in Multi-goal RL, as part of the environment, may be completely different from each other, and the value distributions are significantly affected by the distribution of goals. However, by working with the expected return, HER and its variants ignore the intrinsic stochasticity caused by the distribution of initial goals and mix the parameters of different value distributions during training. In principle, this may cause instability and performance degradation, especially when the number of goals is large.

Inspired by these insights, in this paper we propose a novel method called Quantile Regression Hindsight Experience Replay (QR-HER) to mitigate the intrinsic stochasticity of the training process in Multi-goal RL. Building on Quantile Regression, our key idea is to reduce the interference between different goals by selecting, from similar goals in the replay buffer, the returns appropriate for the current goal. We evaluate QR-HER on the representative OpenAI Robotics environments and find that it achieves better performance than HER and its state-of-the-art variant CHER [4]. Furthermore, we infer that the performance improvement of QR-HER is due to an enhanced policy for each individual goal.

2 Preliminary

2.1 Universal Value Function Approximators

UVFA [9] proposed utilizing the concatenation of states \(s\in \mathcal {S}\) and goals \(g\in \mathcal {G}\) as higher dimensional universal states (s, g), such that the value function approximators V(s) and Q(s, a) can be generalized as V(s, g) and Q(s, a, g). The goals can also be called goal states since in general \(\mathcal {G}\subset \mathcal {S}\).
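
As a minimal sketch of this idea (the function and variable names below are our own, not from the original implementation), the universal state is simply the concatenation of the state and goal vectors fed to the value networks:

```python
import numpy as np

def universal_state(state: np.ndarray, goal: np.ndarray) -> np.ndarray:
    """Concatenate state s and goal g into the universal state (s, g)
    consumed by V(s, g) or Q(s, a, g)."""
    return np.concatenate([state, goal], axis=-1)

# Example: a 10-dimensional state combined with a 3-dimensional goal
s, g = np.zeros(10), np.array([0.5, -0.2, 0.1])
x = universal_state(s, g)  # shape (13,)
```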

2.2 Multi-goal RL and HER

In Multi-goal RL, random exploration is unlikely to reach the goals, and even if the agent is lucky enough to reach one goal, it rarely has enough experience to reach the next. To address this challenge, [1] proposed Hindsight Experience Replay (HER), which includes two key techniques, \(reward\;shaping\) and \(goal\;relabelling\). The first technique, \(reward\;shaping\), makes the reward function dependent on a goal \(g \in G\), such that \(r_{g}: S \times A \times G \rightarrow R\). The formula is given by:

$$\begin{aligned} r_{t}=r_{g}\left( s_{t}, a_{t}, g\right) =\left\{ \begin{array}{c}{0, \text{ if } \left| s_{t}-g\right| <\delta } \\ {-1, \text{ otherwise } }\end{array}\right. \end{aligned}$$
(1)

where we can see that this trick provides many more informative virtual returns to support training. The other technique, goal relabelling, replays each trajectory with different goals sampled from the intermediate states according to specific schemes. For the transition \(\left( s_{t}\left\| g, a_{t}, r_{t}, s_{t+1}\right\| g\right) \), we store the hindsight transition \(\left( s_{t}\left\| g^{\prime }, a_{t}, r^{\prime }, s_{t+1}\right\| g^{\prime }\right) \) in the replay buffer instead.
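
The following sketch illustrates the two techniques under the assumption that goals live in the same space as the achieved states; the "future" relabelling scheme and all names here are illustrative rather than the authors' code:

```python
import numpy as np

def sparse_reward(achieved, goal, delta=0.05):
    """Eq. (1): reward 0 if the achieved state is within delta of the goal, else -1."""
    return 0.0 if np.linalg.norm(achieved - goal) < delta else -1.0

def relabel_future(trajectory, k=4, delta=0.05, rng=np.random):
    """Goal relabelling: for each transition (s_t, a_t, s_{t+1}, g), also store
    k hindsight transitions whose goal g' is a state achieved at a later step."""
    hindsight = []
    for t, (s, a, s_next, g) in enumerate(trajectory):
        for _ in range(k):
            future = rng.randint(t, len(trajectory))   # sample a later time step
            g_new = trajectory[future][2]              # its achieved state becomes the goal
            r_new = sparse_reward(s_next, g_new, delta)
            hindsight.append((s, a, r_new, s_next, g_new))
    return hindsight
```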

3 Quantile Regression Hindsight Experience Replay

3.1 Distributional Multi-Goal RL Objective

For convenience, we denote the state s by x. The Multi-goal Bellman operator is given as:

$$\begin{aligned} \mathcal {T}^{\pi } Q(x, a, g)=\mathbb {E}[R(x, a, g)]+\gamma \mathbb {E}_{P, \pi }\left[ Q\left( x^{\prime }, a^{\prime }, g\right) \right] . \end{aligned}$$
(2)

Using the above formula, the distributional Bellman operator is given as:

$$\begin{aligned} \begin{aligned} \mathcal {T}^{\pi } Z(x, a, g)&: {\mathop {=}\limits ^{D}} R(x, a, g)+\gamma Z\left( x^{\prime }, a^{\prime }, g\right) \\ x^{\prime }&\sim P(\cdot | x, a, g), a^{\prime } \sim \pi \left( \cdot | x^{\prime }, g\right) , \end{aligned}\end{aligned}$$
(3)

where Z denotes the value distribution of Q, and \(Z: {\mathop {=}\limits ^{D}}U\) denotes equality of probability laws, that is, the random variable Z is distributed according to the same law as U.
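
Concretely, with the quantile representation introduced below, Z(x, a, g) is a vector of Quant return values, and a sample-based target for Eq. (3) can be formed as in the following sketch (shapes and names are our own assumptions):

```python
import torch

def distributional_target(reward, done, z_next, gamma=0.97):
    """One-step distributional Bellman target T^pi Z = r + gamma * Z(x', a', g).

    reward, done: (batch,) tensors; z_next: (batch, Quant) quantile values of
    the next universal state-action pair under the current policy.
    """
    reward = reward.unsqueeze(-1)          # broadcast the scalar reward over quantiles
    not_done = (1.0 - done).unsqueeze(-1)  # zero out the bootstrap term at terminal states
    return reward + gamma * not_done * z_next
```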

3.2 The Wasserstein Metric

For different goals g, different value distributions Z are produced. In order to measure and minimize the gap among different value distributions, we introduce the p-Wasserstein metric, defined via the inverse CDF (Cumulative Distribution Function) \(F^{-1}\) as:

$$\begin{aligned} W_{p}(Z_{G}, Z_{G^{\prime }})=\left( \int _{0}^{1}\left| F_{Z_{G}}^{-1}(\omega )-F_{Z_{G^{\prime }}}^{-1}(\omega )\right| ^{p} d \omega \right) ^{1 / p},\end{aligned}$$
(4)

where G denotes the initial desired goals generated by the environment, as distinguished from the full set of sampled goals g, \(G \subset g\). G, rather than the goals generated by hindsight replay, is the main source of intrinsic stochasticity in Multi-goal RL. In expected-return RL, for the current state we assume that there is only one value distribution and average over its values. In quantile distributional RL, by contrast, we divide the probability space into identical small blocks. For each block, we recover the corresponding returns of the different value distributions via the inverse CDF \(F^{-1}\). When making action decisions, we then consider all the return values in the blocks and select one from this comprehensive view rather than simply averaging.
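
When both distributions are represented by Quant equally weighted quantile values, as in our parameterization, the integral in Eq. (4) reduces to an average over matched (sorted) quantiles; a minimal sketch under this assumption:

```python
import numpy as np

def wasserstein_p(z_a, z_b, p=1):
    """p-Wasserstein distance of Eq. (4) between two value distributions, each
    given as a vector of Quant equally weighted quantile values (i.e.,
    piecewise-constant inverse CDFs)."""
    z_a, z_b = np.sort(z_a), np.sort(z_b)
    return float(np.mean(np.abs(z_a - z_b) ** p) ** (1.0 / p))

# Example: two toy return distributions over 5 quantiles
print(wasserstein_p(np.array([0., 0., -1., -1., -1.]), np.zeros(5)))  # 0.6
```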

The quantile distributional parameters and the corresponding inverse CDF are not available, so we introduce a Quantile Regression network as the function to be learned from samples. The number of blocks, called Quant, is fixed, and the output of the regression network is the Z vector consisting of the returns at the quantiles. In the Bellman update, the Bellman operator continuously changes the values of the Z vector until convergence. We can therefore use the Wasserstein metric to compute a Quantile Regression loss between the Z vectors of the current state and the next state for training the network, given by:

$$\begin{aligned} \begin{array}{c} \mathcal {L}_{\mathrm {QR}}^{\tau }(\theta ):=\mathbb {E}_{\hat{Z} \sim Z}\left[ \rho _{\tau }(\hat{Z}-\theta )\right] , \text{ where } \\ \rho _{\tau }(u)=u\left( \tau -\delta _{\{u<0\}}\right) , \forall u \in \mathbb {R} \end{array}\end{aligned}$$
(5)

where \(\theta \) is the parameter fitted to the unknown inverse CDF. To minimize one step of the Bellman update, \(\int _{\tau }^{\tau ^{\prime }}\left| F^{-1}(\omega )-\theta \right| d \omega \), it can be deduced mathematically that the minimizers satisfy:

$$\begin{aligned} \left\{ \theta \in \mathbb {R} | F(\theta )=\left( \frac{\tau +\tau ^{\prime }}{2}\right) \right\} , \end{aligned}$$
(6)

Then, if \(F^{-1}\) is continuous at \(\left( \tau +\tau ^{\prime }\right) / 2\), we can take \(\theta =F^{-1}\left( \left( \tau +\tau ^{\prime }\right) / 2\right) \) as the unique minimizer.
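
As a concrete instance (our own illustration): for the N equally spaced quantile fractions \(\tau _{i}=i/N\), \(i=0,\dots ,N\), used by the regression network, the minimizers are the quantile midpoints

$$\begin{aligned} \hat{\tau }_{i}=\frac{\tau _{i-1}+\tau _{i}}{2}=\frac{2 i-1}{2 N}, \qquad \theta _{i}=F^{-1}\left( \hat{\tau }_{i}\right) , \quad i=1, \ldots , N. \end{aligned}$$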

However, the Quantile Regression loss is not smooth at zero, so we use the Huber loss \(\rho _{\tau }^{\kappa }(u)\) in place of \(\rho _{\tau }(u)\), given by:

$$\begin{aligned} \mathcal {L}_{\kappa }(u)=\left\{ \begin{array}{ll} \frac{1}{2} u^{2}, &{} \text{ if } |u| \le \kappa \\ \kappa \left( |u|-\frac{1}{2} \kappa \right) , &{} \text{ otherwise } \end{array}\right. , \rho _{\tau }^{\kappa }(u)=\left| \tau -\delta _{\{u<0\}}\right| \frac{\mathcal {L}_{\kappa }(u)}{\kappa } \end{aligned}$$
(7)
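
A PyTorch sketch of the quantile Huber loss of Eqs. (5) and (7), computed between the predicted quantile vector and a target sample of Z; the tensor shapes, midpoint fractions, and all names are our own assumptions rather than the authors' implementation:

```python
import torch

def quantile_huber_loss(z_pred, z_target, kappa=1.0):
    """Quantile Huber loss (Eqs. 5 and 7).

    z_pred:   (batch, Quant) predicted quantile values theta_i.
    z_target: (batch, Quant) target quantile values, treated as samples of Z.
    """
    quant = z_pred.shape[1]
    # Midpoint fractions tau_hat_i = (2i - 1) / (2 * Quant), the minimizers of Eq. (6)
    tau_hat = (torch.arange(quant, dtype=z_pred.dtype, device=z_pred.device) + 0.5) / quant

    # Pairwise TD errors u_ij = z_target_j - z_pred_i, shape (batch, Quant, Quant)
    u = z_target.unsqueeze(1) - z_pred.unsqueeze(2)

    huber = torch.where(u.abs() <= kappa,
                        0.5 * u.pow(2),
                        kappa * (u.abs() - 0.5 * kappa))
    # rho^kappa_tau(u) = |tau - 1{u < 0}| * L_kappa(u) / kappa
    rho = (tau_hat.view(1, -1, 1) - (u < 0).to(u.dtype)).abs() * huber / kappa
    return rho.mean()
```

In this sketch, z_target would come from the distributional Bellman target above and be detached from the computation graph before the loss is applied.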

3.3 Return Selection for Multi-goal RL

As discussed in the Introduction and Preliminaries, the different initial goals are the main cause of intrinsic stochasticity in RL with sparse rewards. When updating the parameters for the current goal, we should exclude interference from other, less relevant goals as much as possible. Hence, we propose using the Wasserstein metric to filter out the interference from value distributions of goals with low correlation, as in the following formula:

$$\begin{aligned} \begin{aligned} Z_{\theta }(x, a, g, G):=\frac{1}{N} \sum _{i=1}^{N} \sum _{G_{\epsilon }}^{}\delta _{\theta _{i}(x, a, g, G_{\epsilon })}, G_{\epsilon }\subset G,\frac{W_{p}(Z_{G}, Z_{G_{\epsilon }})}{W_{p}(Z_{G}, 0)}<\epsilon , \end{aligned}\end{aligned}$$
(8)

where we only adopt \(G_{\epsilon }\), a subset of G, for the return selection used to update the value network. Value distributions that differ significantly from that of the current initial goal are therefore not selected from the replay buffer, in a self-attention-like manner. This method is somewhat similar to knowledge distillation in deep learning.
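
A sketch of the selection rule in Eq. (8): among the initial goals whose value distributions are stored, we keep only those within a relative Wasserstein distance \(\epsilon \) of the current goal's distribution (names and data layout are illustrative):

```python
import numpy as np

def w_p(z_a, z_b, p=1):
    """p-Wasserstein distance between two equally weighted quantile vectors."""
    return float(np.mean(np.abs(np.sort(z_a) - np.sort(z_b)) ** p) ** (1.0 / p))

def select_goals(z_current, z_by_goal, epsilon=0.2, p=1):
    """Return G_epsilon of Eq. (8): the stored initial goals whose value
    distributions are close to the current goal's, relative to W_p(Z_G, 0)."""
    norm = max(w_p(z_current, np.zeros_like(z_current), p), 1e-8)  # avoid division by zero
    return [goal for goal, z in z_by_goal.items()
            if w_p(z_current, z, p) / norm < epsilon]
```

Transitions associated with goals outside \(G_{\epsilon }\) are simply not sampled when updating the value network for the current goal.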

3.4 Algorithm

Utilizing the above derivations, we propose the algorithm for Quantile Regression Multi-goal RL as Algorithm 1.

4 Experiments

4.1 Environments

We evaluate QR-HER and compare it to HER and its state-of-the-art variant on several challenging robotic manipulation tasks in the simulated MuJoCo Robotics environments [8], as Fig. 1 shows. These include two kinds of tasks, Fetch robotic arm tasks and Shadow Dexterous Hand tasks. Both kinds of tasks have sparse binary rewards and follow the Multi-goal RL framework. We choose the most challenging tasks, FetchSlide and HandManipulatePen, for our experiments.

4.2 Implementation Details

We run the experiments using PyTorch on a machine with two 14-core Intel Xeon E5-2690 v4 CPUs and four TITAN X (Pascal) GPUs. To make a fair comparison, each off-policy algorithm is implemented with identical hyper-parameters. In the experiments, one epoch consists of 500 episodes with a unique seed (one goal), and 10 percent of the episodes are used as the test set to compute the mean success rate. The seeds differ between epochs. Both the policy networks and the value networks are MLPs with three hidden layers (256, 256, 256), optimized with Adam using a critic learning rate of 0.001 and an actor learning rate of \(3 \times 10^{-4}\). The replay buffer size is \(10^{6}\) and the batch size is 64. The discount factor \(\gamma \) for the Bellman backup is 0.97 and the Polyak coefficient for target network updates is 0.95. The distributional parameter Quant is chosen from [20, 50, 100, 200, 500] and the range of the return selection parameter \(\epsilon \) is [0.1, 0.3].
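
For reference, the hyper-parameters above can be gathered into a single configuration (an illustrative summary of the values reported in this section, not the authors' code):

```python
# Hyper-parameters reported in Sect. 4.2.
CONFIG = {
    "hidden_layers": (256, 256, 256),             # MLP sizes for both actor and critic
    "critic_lr": 1e-3,
    "actor_lr": 3e-4,
    "replay_buffer_size": int(1e6),
    "batch_size": 64,
    "gamma": 0.97,                                # discount for the Bellman backup
    "polyak": 0.95,                               # target-network averaging coefficient
    "quant_candidates": [20, 50, 100, 200, 500],  # values tried for Quant
    "epsilon_range": (0.1, 0.3),                  # return-selection threshold range
    "episodes_per_epoch": 500,
    "test_fraction": 0.1,                         # fraction of episodes used for testing
}
```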

4.3 Benchmark Performance

In the benchmark experiments, a higher mean success rate represents better performance on the robotic manipulation tasks. We compare the mean success rates in Fig. 1, where the shaded area represents the standard deviation across different seeds. The raw training curves are quite unstable, so we smooth them with a filter.

Fig. 1. The OpenAI Robotics experiments for QR-HER.

The agent trained with QR-HER shows the best benchmark performance at the end of training. The value of Quant is the key hyper-parameter of QR-HER, as shown in Fig. 1: the performance at Quant = 20 is 20% higher than at Quant = 200. We conclude that the agent performs best when Quant is about half the number of goals (epochs).

4.4 Performance Analysis of Each Goal

According to our assumption, the quantile regression method with return selection should reduce the interference between different goals and thereby improve the performance on each goal. The corresponding results are shown in Table 1, from which we can infer that QR-HER improves the overall performance through optimizing the policy for each goal, as expected.

Table 1. Final success rates in HandManipulatePenRotate with different goals (seeds)

5 Summary

The main contributions of this paper are summarized as follows: (1) We raise the issue of performance instability and degradation in Multi-goal RL and attribute its cause to intrinsic stochasticity; (2) We introduce the Wasserstein metric and Quantile Regression into Multi-goal RL to derive QR-HER; (3) We show that QR-HER exceeds HER and its variants, achieving state-of-the-art performance on OpenAI Robotics; (4) We show that QR-HER improves the performance on each individual goal, which provides strong evidence for the correctness of our analysis.