1 Introduction

One of the most significant breakthroughs in reinforcement learning was the development of an off-policy temporal difference (TD) control algorithm known as Q-learning, which is introduced in Chap. 2. Q-learning has been proven to converge towards the optimal solution in the tabular case or with linear function approximation. However, Q-learning is known to be unstable or even to diverge when a non-linear function approximator such as a neural network is used to represent the Q-value function (Tsitsiklis and Van Roy 1996). With the advances in training deep neural networks, deep Q-networks (DQN) (Mnih et al. 2015) addressed this issue and ignited the research of deep reinforcement learning. In this chapter, we first review the background of Q-learning. Then we introduce DQN and its variants with detailed theory and explanations. Finally, in Sect. 4.10, we demonstrate their implementation details and empirical performance on Atari games with code examples, to provide readers with a quick hands-on learning process. The complete implementation of each algorithm is available in the repository provided together with the book.

2 Background

Model-free methods provide a general way to tackle MDP-based decision-making problems, where “model” means an explicit model of the transition probability distribution and the reward function associated with the MDP. TD learning is a class of model-free methods. Recall from Sect. 2.4 that, if a perfect model of the MDP is available, one can obtain the optimal plan with dynamic programming by reusing the optimal solutions of sub-problems recursively. TD learning follows a similar idea: the values of sub-problems are estimated with bootstrapping, even though these estimates are not always optimal.

Sub-problems are represented by states in an MDP. The value \(v_\pi(s)\) of a state s under a policy π is defined as the expected return when starting from s and acting according to π:

$$\displaystyle \begin{aligned} v_\pi(s) = \mathbb{E}_\pi[R_{t} + \gamma v_\pi(S_{t+1})|S_t=s], \end{aligned} $$
(4.1)

where γ ∈ [0, 1] is the discount rate. TD learning decomposes the estimation above with bootstrapping. Given a value function \(V: \mathcal {S} \rightarrow \mathbb {R}\), the simplest version, TD(0), is the following one-step bootstrapping:

$$\displaystyle \begin{aligned} V(S_t) \leftarrow V(S_t) + \alpha [R_t + \gamma V(S_{t+1}) - V(S_t)], \end{aligned} $$
(4.2)

where \(R_t + \gamma V(S_{t+1})\) and \(R_t + \gamma V(S_{t+1}) - V(S_t)\) are known as the TD target and the TD error, respectively.
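
As a concrete illustration, a minimal tabular TD(0) update for a single transition could look like the following sketch; the value table V, the learning rate, and the terminal-state handling are illustrative assumptions rather than a prescribed implementation.

    import numpy as np

    def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
        # TD target R_t + gamma * V(S_{t+1}); the bootstrap term is zero at terminal states
        td_target = r + gamma * V[s_next] * (1.0 - float(done))
        td_error = td_target - V[s]
        V[s] += alpha * td_error  # Eq. (4.2)
        return td_error

    # V can be a simple array indexed by discrete states, e.g. for 16 states:
    V = np.zeros(16)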

The value function of a policy provides a way to evaluate how well the agent performs. To decide which action to select in a particular state, we further estimate the quality of each state-action pair, known as the Q-value:

$$\displaystyle \begin{aligned} q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1})|S_t=s, A_t=a]. \end{aligned} $$
(4.3)

The simplest way to obtain a better policy is to act greedily, \(\pi^\prime(s) = \operatorname*{\mbox{arg max}}_{a^\prime} q_\pi(s, a^\prime)\), where the improvement is guaranteed since \(q_\pi(s, \pi^\prime(s)) = \max_{a^\prime} q_\pi(s, a^\prime) \ge q_\pi(s, \pi(s))\). An alternative that keeps exploring is to act greedily most of the time but, with a small probability 𝜖, select an action uniformly at random regardless of its Q-value. This method is called 𝜖-greedy. We can calculate the Q-value of the 𝜖-greedy policy \(\pi^\prime\) under π by

$$\displaystyle \begin{aligned} \begin{array}{rcl} q_\pi(s, \pi^\prime(s)) = (1 - \epsilon) \max_{a \in \mathcal{A}} q_\pi(s, a) + \frac{\epsilon}{|\mathcal{A}|}\sum_{a \in \mathcal{A}} q_\pi(s, a). \end{array} \end{aligned} $$
(4.4)

Note that the sum of \(\frac{\pi(s, a) - \epsilon / |\mathcal{A}|}{1 - \epsilon}\) over \(a \in \mathcal{A}\) is equal to 1, and each term is non-negative when π is itself 𝜖-soft, i.e., \(\pi(s, a) \ge \epsilon / |\mathcal{A}|\). Since the maximum is not less than any such weighted average, we get

$$\displaystyle \begin{aligned} \begin{array}{rcl} \begin{aligned} q_\pi(s, \pi^\prime(s)) &\displaystyle = (1 - \epsilon) \max_{a \in \mathcal{A}} q_\pi(s, a) \sum_{a \in \mathcal{A}} \frac{\pi(s, a) - \epsilon / |\mathcal{A}|}{1 - \epsilon} + \frac{\epsilon}{|\mathcal{A}|}\sum_{a \in \mathcal{A}} q_\pi(s, a) \\ &\displaystyle \ge (1 - \epsilon) \sum_{a \in \mathcal{A}} \frac{\pi(s, a) - \epsilon / |\mathcal{A}|}{1 - \epsilon} q_\pi(s, a) \\ &\displaystyle \quad + \frac{\epsilon}{|\mathcal{A}|}\sum_{a \in \mathcal{A}} q_\pi(s, a) = q_\pi(s, \pi(s)), \end{aligned} \end{array} \end{aligned} $$
(4.5)

which tells us that the Q-value of acting with the 𝜖-greedy policy \(\pi^\prime\) is not less than that of the original policy π, i.e., the 𝜖-greedy method ensures policy improvement. We will discuss policy improvement with the Q-value function in the next section.

3 Sarsa and Q-Learning

Similar to the update rule for the value function in TD(0), it is straightforward to update the Q-value function after every transition from a non-terminal state \(S_t\):

$$\displaystyle \begin{aligned} Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_t + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)], \end{aligned} $$
(4.6)

where both \(A_t\) and \(A_{t+1}\) are selected 𝜖-greedily with respect to Q. If \(S_{t+1}\) is a terminal state, then \(Q(S_{t+1}, A_{t+1})\) is defined as zero. We can continually estimate Q for the behavior policy π and, at the same time, change π toward greediness with respect to Q. This algorithm is known as Sarsa. Note that π plays two roles in Sarsa: experience generation and policy improvement. In general, the policy used to generate behavior is called the behavior policy, and the policy that is evaluated and improved is called the target policy. An algorithm in which the behavior policy and the target policy are the same, such as Sarsa, is known as an on-policy method.
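
The following sketch shows one tabular Sarsa episode following Eq. (4.6); the gym-style environment interface and the epsilon_greedy helper are assumptions for illustration.

    import numpy as np

    def epsilon_greedy(Q, s, n_actions, epsilon=0.1):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    def sarsa_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1):
        s = env.reset()
        a = epsilon_greedy(Q, s, env.action_space.n, epsilon)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            # A_{t+1} is also chosen epsilon-greedily: behavior and target policy coincide
            a_next = epsilon_greedy(Q, s_next, env.action_space.n, epsilon)
            target = r + gamma * Q[s_next, a_next] * (1.0 - float(done))
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next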

On-policy methods are trial-and-error processes in which only the experience generated by the current policy is used for improvement. Off-policy methods remove this restriction: the experience generated by the behavior policy may be "off" (not following) the target policy, which allows past experience to be reused. Q-learning is an off-policy method. Its simplest form, one-step Q-learning, follows the update rule below:

$$\displaystyle \begin{aligned} Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_t + \gamma \max_{A_{t+1}} Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)], \end{aligned} $$
(4.7)

where \(A_t\) is sampled 𝜖-greedily with respect to Q. Note that \(A_{t+1}\) is selected greedily; in contrast to Sarsa, in Q-learning the behavior policy is still 𝜖-greedy but the target policy is greedy. One-step Q-learning uses only the current transition. An alternative way to obtain more accurate Q-values in the function approximation case is to use multi-step rewards, i.e., multi-step Q-learning. Note that multi-step Q-learning needs to account for the mismatch between the behavior policy and the target policy over the subsequent steps so that the Q-value function keeps approximating the expected return under the target policy as in Eq. (4.3). We will discuss multi-step Q-learning in Sect. 4.9.
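
For comparison, here is a sketch of the one-step Q-learning update of Eq. (4.7); the tabular Q array and hyper-parameters are again assumptions.

    import numpy as np

    def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
        # the target policy is greedy: take the max over next actions, regardless of
        # the action the epsilon-greedy behavior policy will actually execute in S_{t+1}
        target = r + gamma * np.max(Q[s_next]) * (1.0 - float(done))
        Q[s, a] += alpha * (target - Q[s, a])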

4 Why Deep Learning: Value Function Approximation

In tabular settings, the action-value function can be represented by a big two-dimensional table, i.e., one entry for each discrete state-action pair. However, this is inefficient for large-scale spaces such as raw pixel inputs, let alone continuous control tasks. Fortunately, generalization across different inputs by function approximation has been widely studied, and we can utilize this technique in value-based reinforcement learning.

Let us consider function approximation in Q-learning with some parameter θ. The approximator can be a linear model, a decision tree, or a neural network. The update rule in Eq. (4.7) is then rewritten as

$$\displaystyle \begin{aligned} \theta_{t} \leftarrow \arg\min_\theta\mathcal{L}(Q(S_t, A_t;\theta), R_t + \gamma \max_{A_{t+1}} Q(S_{t+1}, A_{t+1};\theta)), \end{aligned} $$
(4.8)

where \(\mathcal{L}\) represents the loss function, e.g., the mean squared error. One can solve the optimization problem above by collecting batches of samples, which constitutes the fitted Q iteration (Riedmiller 2005) shown in Algorithm 1, where \(S_i^\prime\) is the successor state of \(S_i\). An online stochastic variant with gradient descent is the online Q iteration presented in Algorithm 2.

Algorithm 1 Fitted Q iteration

Algorithm 2 Online Q iteration
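
Since the algorithm boxes only summarize the procedures, here is a rough sketch of one step of online Q iteration as described in the text: act for one step, then take a single gradient step on the squared TD error. The network and environment interfaces are assumptions, and the observation is assumed to be a float feature vector.

    import numpy as np
    import tensorflow as tf

    def online_q_iteration_step(env, q_net, optimizer, s, epsilon=0.1, gamma=0.99):
        # epsilon-greedy action from the current approximator
        q_values = q_net(tf.convert_to_tensor(s[None], tf.float32))[0].numpy()
        n_actions = env.action_space.n
        a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(q_values))
        s_next, r, done, _ = env.step(a)

        with tf.GradientTape() as tape:
            q_sa = q_net(tf.convert_to_tensor(s[None], tf.float32))[0, a]
            q_next = tf.reduce_max(q_net(tf.convert_to_tensor(s_next[None], tf.float32))[0])
            # the target uses the same parameters but is treated as a constant
            target = r + gamma * (1.0 - float(done)) * tf.stop_gradient(q_next)
            loss = tf.square(target - q_sa)
        grads = tape.gradient(loss, q_net.trainable_variables)
        optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
        return s_next, done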

Note that both fitted Q iteration and online Q iteration are off-policy algorithms, so past experience can be reused many times. We will discuss this topic further in the next section.

Recall that we introduced the convergence of value iteration in Sect. 2.4.2 with the Bellman optimality backup operator \(\mathcal{T}^*\). We define a new operator \(\mathcal{B}\) for function approximation by \(\mathcal{B}V = \arg\min_{V^\prime \in \varOmega} \mathcal{L}(V^\prime, V)\), where Ω is the set of value functions that the approximator can represent. The \(\arg\min\) in \(\mathcal{B}\) can be viewed as a projection of \(\mathcal{T}^*V\) onto Ω, so the backup operator with function approximation can be represented by \(\mathcal{B}\mathcal{T}^*\). Although \(\mathcal{T}^*\) is a contraction in the \(L_\infty\)-norm and \(\mathcal{B}\) is a contraction in the \(L_2\)-norm for the MSE loss, their composition \(\mathcal{B}\mathcal{T}^*\) is not a contraction in any norm. As a result, value iteration with function approximation can be unstable and may even diverge when a non-linear function approximator such as a neural network is used to represent the value function (Tsitsiklis and Van Roy 1997). We leave the discussion about the stability of training with deep neural networks to the next section.

5 DQN

In the last section, we introduced how to learn the action-value function with function approximation and the resulting instability of convergence. To achieve end-to-end decision-making in complex problems with raw pixel inputs, DQN combines Q-learning with deep learning and uses two key ideas to address the instability issue, achieving significant progress on Atari games.

The first one is known as the replay buffer, a biologically inspired mechanism termed experience replay (McClelland et al. 1995; O’Neill et al. 2010; Lin 1993). At each time step t, DQN stores the experience of the agent \((S_t, A_t, R_t, S_{t+1})\) in the replay buffer and then draws a mini-batch of samples uniformly at random from this buffer to apply the Q-learning update. The replay buffer has several advantages over fitted Q iteration. First, the experience from each step can be reused many times to learn the Q-function, which allows for greater data efficiency. Second, without a replay buffer, as in fitted Q iteration, mini-batch samples are collected consecutively, i.e., they are highly correlated, which increases the variance of the updates. Third, experience replay avoids the situation in which the samples used for training are determined by the preceding parameters, which smooths out learning and reduces oscillations or divergence of the parameters. In practice, only the last N experience tuples are stored in the replay buffer to bound the memory usage.
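
A minimal uniform replay buffer along these lines might look like the following sketch (not the exact implementation used in the book's repository).

    import random
    from collections import deque

    import numpy as np

    class ReplayBuffer:
        def __init__(self, capacity):
            # only the last `capacity` transitions are kept
            self.buffer = deque(maxlen=capacity)

        def add(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size):
            # uniform sampling (with replacement) decorrelates consecutive transitions
            batch = random.choices(self.buffer, k=batch_size)
            s, a, r, s_next, done = map(np.array, zip(*batch))
            return s, a, r, s_next, done

        def __len__(self):
            return len(self.buffer)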

The second idea aims to further improve stability when neural networks are used. Instead of the Q-network being trained, a separate network, known as the target network, is used to generate the Q-learning targets. Every C steps, the target network is synchronized with the primary Q-network, either by copying the parameters directly (hard update) or by an exponentially decaying average (soft update). The target network therefore generates the Q-learning targets with delayed, older parameters, which further reduces divergence and oscillations. For example, an update that increases \(Q(S_t, A_t)\) may also increase \(Q(S_{t+1}, a)\) for all actions a because of the similarity between \(S_t\) and \(S_{t+1}\); a training target constructed by the Q-network itself would then be overestimated.
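
With two Keras models of the same architecture, the two synchronization schemes can be sketched as follows (tau is an assumed soft-update coefficient).

    def hard_update(target_net, q_net):
        # copy the parameters directly every C steps
        target_net.set_weights(q_net.get_weights())

    def soft_update(target_net, q_net, tau=0.01):
        # exponentially decaying average of the online parameters
        new_weights = [(1.0 - tau) * wt + tau * wq
                       for wt, wq in zip(target_net.get_weights(), q_net.get_weights())]
        target_net.set_weights(new_weights)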

The effect of the two enhancements above on five Atari games is shown in Table 4.1. Agents were trained for 1e7 frames with a hyper-parameter search. Each agent was evaluated every 250,000 training frames for 135,000 validation frames, and the highest average episode score is reported.

Table 4.1 The effects of replay and separating the target Q-network

Algorithm 3 DQN

Since it is challenging to feed histories of arbitrary length as inputs to a neural network, DQN instead works on a fixed-length representation of histories produced by a function ϕ. More precisely, ϕ concatenates the current frame and the previous three frames, which is useful for tracking temporal information such as object motion. The full algorithm is presented in Algorithm 3. The raw frames are resized to 84 × 84 and converted to gray-scale, and the function ϕ stacks the 4 most recent frames as the input to the neural network. In addition, the architecture of the neural network consists of three convolutional layers and two fully connected layers with a single output for each valid action. We will discuss more training details in Sect. 4.10.2.

6 Double DQN

Double DQN is an enhancement of DQN for reducing overestimation (Van Hasselt et al. 2016). Before taking a closer look, let us first illustrate the overestimation problem in classic DQN. The Q-learning target \(R_t + \gamma\max_a Q(S_{t+1}, a)\) contains a \(\max\) operator, and Q is noisy; the noise may be caused by the environment, non-stationarity, function approximation, or other reasons. Note that the expectation of the maximum of several noises is not less than the maximum of their expectations, i.e., \(\mathbb{E}[\max(\epsilon_1, \ldots, \epsilon_n)] \ge \max(\mathbb{E}[\epsilon_1], \ldots, \mathbb{E}[\epsilon_n])\). So the next-state Q-values tend to be overestimated. Thrun and Schwartz (1993) provide further theoretical analysis and experimental results.
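
A quick numerical illustration of this bias, assuming four actions with equal true value 0 and hypothetical zero-mean Gaussian estimation noise:

    import numpy as np

    rng = np.random.default_rng(0)
    noise = rng.normal(0.0, 1.0, size=(100_000, 4))  # noisy estimates of four equal Q-values
    print(noise.max(axis=1).mean())  # roughly 1.03 > 0: the max of noisy estimates is biased upward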

Note that the training target in standard DQN can be rewritten as

$$\displaystyle \begin{aligned} R_t + \gamma \hat{Q}(S_{t+1}, \arg\max_a \hat{Q}(S_{t+1}, a;\hat{\theta});\hat{\theta}), \end{aligned} $$
(4.9)

where \(\hat{\theta}\) is used in both action selection and value evaluation. The central idea of double DQN is to decorrelate the noise in selection from the noise in evaluation by using two different networks for these two stages. The Q-network in the DQN architecture provides a natural candidate for the extra network. Recall that it is the evaluation role of the target network that contributes more to stability. As a consequence, the Q-learning target used in double DQN is

$$\displaystyle \begin{aligned} R_t + \gamma \hat{Q}(S_{t+1}, \arg\max_a{Q}(S_{t+1},a;\theta);\hat{\theta}). \end{aligned} $$
(4.10)

Following Wang et al. (2016), we measure improvement in percentage (positive or negative) in score over the better of human and baseline agent scores:

$$\displaystyle \begin{aligned} \frac{\text{Score}_{\text{Agent}} - \text{Score}_{\text{Baseline}}}{\max(\text{Score}_{\text{Baseline}}, \text{Score}_{\text{Human}}) - \text{Score}_{\text{Random}}}. \end{aligned} $$
(4.11)

The improvements over DQN are shown in Fig. 4.1.

Fig. 4.1

Improvements of double DQN (Van Hasselt et al. 2016) over DQN (Mnih et al. 2015) in Atari benchmark, using the metric described in Eq. (4.11). All scores come from Wang et al. (2016) (Table 2)

7 Dueling DQN

For some states, the different actions have little effect on the expected value, and we do not need to learn the effect of each action for such states. For example, imagine standing on a mountain and watching the sunrise. The pleasant view comforts you a lot and provides a high reward; you can simply stay there, and the Q-values of the different actions hardly matter. Decoupling the action-independent value of the state from the Q-values may therefore lead to more robust learning.

Dueling DQN proposes a new network architecture to realize this idea (Wang et al. 2016). More precisely, the Q-value can be split into a state-value part and an action-advantage part as follows:

$$\displaystyle \begin{aligned} Q^\pi(s, a) = V^\pi(s) + A^\pi(s, a) \end{aligned} $$
(4.12)

and dueling DQN separates the representations of these two parts by

$$\displaystyle \begin{aligned} Q(s, a; \theta, \theta_v, \theta_a) = V(s; \theta, \theta_v) + (A(s, a; \theta, \theta_a) - \max_{a^\prime}A(s, a^\prime; \theta, \theta_a)), \end{aligned} $$
(4.13)

where \(\theta_v\) and \(\theta_a\) are the parameters of the two streams of fully connected layers and θ represents the parameters of the convolutional layers. Note that the \(\max\) operator in Eq. (4.13) ensures identifiability: the Q-value recovers the state value and the action advantage uniquely. Otherwise, training might ignore the state-value term and let the advantage function converge toward the Q-value instead. Moreover, Wang et al. (2016) also proposed to replace the max with an average, as follows, for better stability:

$$\displaystyle \begin{aligned} Q(s, a; \theta, \theta_v, \theta_a) = V(s; \theta, \theta_v) + (A(s, a; \theta, \theta_a) - \frac{1}{|\mathcal{A}|}\sum_{a^\prime}A(s, a^\prime; \theta, \theta_a)) \end{aligned} $$
(4.14)

by which the advantage function only needs to adapt as fast as the mean advantage changes instead of pursuing the optimal advantage.

Training the dueling architecture, as with standard DQN, requires no change to the learning algorithm beyond the additional network streams. Experiments show that the dueling architecture leads to better policy evaluation in the presence of many similar-valued actions. The improvement over DQN is shown in Fig. 4.2.

Fig. 4.2

Improvements of dueling DQN (Wang et al. 2016) over DQN (Mnih et al. 2015) in Atari benchmark, using the metric described in Eq. (4.11). All scores come from Wang et al. (2016) (Table 2)

8 Prioritized Experience Replay

One remaining area for improvement over standard DQN is a better sampling strategy for experience replay. Prioritized experience replay (PER) is a technique for prioritizing experience so that important transitions are replayed more frequently (Schaul et al. 2015). The central idea of PER is to measure the importance of a transition by its TD error δ, which can be viewed as a measure of surprise. The reason this helps is that some experiences carry more information to learn from than others; giving these information-rich experiences a greater chance of being replayed makes the whole learning process faster and more efficient.

The most direct idea is to use the TD error for prioritization directly. However, this has several issues. First, sweeping over the whole memory is inefficient. In addition, it is sensitive to noise such as approximation error and stochastic rewards. Finally, greedy prioritization makes errors shrink slowly, so transitions with an initially high error may be replayed too frequently. To overcome these issues, Schaul et al. (2015) proposed to use the following sampling probability for transition i:

$$\displaystyle \begin{aligned} P(i) = \frac{p_i^\alpha}{\sum_{k}p_k^\alpha}, {} \end{aligned} $$
(4.15)

where \(p_i > 0\), known as the priority of transition i, α is an exponent hyper-parameter with α = 0 corresponding to the uniform case, and the sum in the denominator runs over the stored transitions. There are two variants of \(p_i\). The first is proportional prioritization, \(p_i = |\delta_i| + \epsilon\), where \(\delta_i\) is the TD error of transition i and 𝜖 is a small positive constant for numerical stability. The second is rank-based prioritization, \(p_i = \frac{1}{\text{rank}(i)}\), where rank(i) is the rank of transition i when the transitions are sorted by \(|\delta_i|\).

Recall that it is the random sampling from a large replay buffer that helps to decorrelate the samples, and this purely uniform sampling is abandoned once priority sampling is added, which biases the updates. It therefore makes sense to decrease the training weight of high-priority transitions. PER uses importance-sampling weights to correct this bias for transition i:

$$\displaystyle \begin{aligned} w_i = (NP(i))^{-\beta}, \end{aligned} $$
(4.16)

where N is the size of the replay buffer, P is the probability defined in Eq. (4.15), and β is a hyper-parameter annealed up to 1 during training, since unbiased updates matter most near convergence at the end of training. This weight is usually folded into the loss function to construct a weighted update.

For an efficient implementation, the cumulative distribution function of the sampling probability is approximated by a piece-wise linear function with k segments. More precisely, the priorities are stored in a query-efficient data structure called a segment tree. At run-time, PER first samples a segment and then samples uniformly among the transitions within it. The improvement over DQN is shown in Fig. 4.3.

Fig. 4.3

Improvements of prioritized experience replay (Schaul et al. 2015) with rank-based prioritization over DQN (Mnih et al. 2015) in Atari benchmark, using the metric described in Eq. (4.11). All scores come from Wang et al. (2016) (Table 2)

9 Other Improvements: Multi-Step Learning, Noisy Nets, and Distributional Reinforcement Learning

In addition to double Q-learning, the dueling architecture, and PER, Rainbow combines three more extensions to DQN and achieves significant results on the Atari domain (Hessel et al. 2018). We discuss them and their extensions in this section. The first one is multi-step learning. The n-step return allows for more accurate estimation and has been shown to lead to faster learning with a suitably tuned n (Sutton and Barto 2018). However, there may be a mismatch in action selection between the target and behavior policies within the multiple steps during off-policy learning. A systematic study of correcting this mismatch can be found in Hernandez-Garcia and Sutton (2019). Rainbow uses the truncated n-step return \(R_t^{(n)}\) from a given state \(S_t\) directly (Hessel et al. 2018; Castro et al. 2018), where \(R_t^{(n)}\) is defined by

$$\displaystyle \begin{aligned} R_t^{(n)} = \sum_{k=0}^{n-1}\gamma^{k}R_{t+k}. \end{aligned} $$
(4.17)

The Q-learning target in the multi-step variant of Q-learning is then defined by

$$\displaystyle \begin{aligned} R_t^{(n)} + \gamma^{n}\max_{a}Q(S_{t+n}, a). \end{aligned} $$
(4.18)
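
A simple way to compute this target from the last n rewards of the behavior trajectory is sketched below (terminal handling is simplified and the hyper-parameters are assumptions):

    def n_step_target(rewards, q_next, gamma=0.99):
        # rewards: [R_t, R_{t+1}, ..., R_{t+n-1}] along the behavior trajectory
        # q_next:  max_a Q(S_{t+n}, a), or 0 if the episode ended within the n steps
        n = len(rewards)
        g = sum(gamma ** k * r for k, r in enumerate(rewards))
        return g + gamma ** n * q_next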

The second one is noisy nets (Fortunato et al. 2017), an alternative exploration technique to 𝜖-greedy, especially for games requiring heavy exploration such as Montezuma’s Revenge. Noise is added to the linear layer \(\boldsymbol{y} = \boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}\) through an extra noisy stream:

$$\displaystyle \begin{aligned} \boldsymbol{y} = (\boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}) + ((\boldsymbol{W_{noisy}}\odot\epsilon_{w})\boldsymbol{x} + \boldsymbol{b_{noisy}}\odot\epsilon_{b}), \end{aligned} $$
(4.19)

where ⊙ denotes the element-wise product, both \(\boldsymbol{W}_{noisy}\) and \(\boldsymbol{b}_{noisy}\) are trainable parameters, and \(\epsilon_{w}\) and \(\epsilon_{b}\) are zero-mean random noise variables resampled during training. Experiments show that noisy nets yield substantially higher scores than several baselines for a wide range of Atari games.
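
A noisy linear layer following Eq. (4.19) can be sketched as a custom Keras layer; we use independent Gaussian noise resampled on every forward pass (the original paper also describes a factorized variant), and the initial scale sigma0 is an assumption.

    import tensorflow as tf

    class NoisyDense(tf.keras.layers.Layer):
        def __init__(self, units, sigma0=0.017, **kwargs):
            super().__init__(**kwargs)
            self.units = units
            self.sigma0 = sigma0

        def build(self, input_shape):
            in_dim = int(input_shape[-1])
            self.w = self.add_weight(name="w", shape=(in_dim, self.units),
                                     initializer="glorot_uniform")
            self.b = self.add_weight(name="b", shape=(self.units,), initializer="zeros")
            # trainable scales of the noisy stream
            self.w_noisy = self.add_weight(name="w_noisy", shape=(in_dim, self.units),
                                           initializer=tf.constant_initializer(self.sigma0))
            self.b_noisy = self.add_weight(name="b_noisy", shape=(self.units,),
                                           initializer=tf.constant_initializer(self.sigma0))

        def call(self, x):
            # Eq. (4.19): (W x + b) + ((W_noisy * eps_w) x + b_noisy * eps_b)
            eps_w = tf.random.normal(tf.shape(self.w))
            eps_b = tf.random.normal(tf.shape(self.b))
            return (tf.matmul(x, self.w) + self.b
                    + tf.matmul(x, self.w_noisy * eps_w) + self.b_noisy * eps_b)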

The last one is distributional reinforcement learning (Bellemare et al. 2017), which gives a new perspective on value estimation. Instead of considering only the expectation of the return, Bellemare et al. (2017) proposed to estimate the distribution of the return random variable \(Z^\pi\) with the distributional Bellman operator \(\mathcal{T}^\pi\):

$$\displaystyle \begin{aligned} \mathcal{T}^\pi Z = R + \gamma P^\pi Z. \end{aligned} $$
(4.20)

Figure 4.4 shows a continuous case of \(\mathcal {T}^\pi \).

Fig. 4.4

The distributional Bellman operator in the continuous case. Given the return distribution of the next state under policy π (blue curve), it is first scaled by the discount rate γ (red curve) and then shifted by the reward of the current time step (black curve)

The distributional variant of DQN used in Rainbow, known as categorical DQN (Bellemare et al. 2017), models the action-value distribution by a discrete distribution parameterized by a vector z with N elements (also known as atoms), \(z_i = V_{min} + (i - 1)\Delta z\), where \([V_{min}, V_{max}]\) is the action-value range and \(\Delta z = \frac{V_{max} - V_{min}}{N - 1}\). In practice, N is usually set to 51, so this algorithm is also called C51. The parametric model θ of C51 outputs the probabilities \(p_i(s, a) = {e^{\theta_i(s, a)}}/{\sum_j e^{\theta_j(s, a)}}\) on each atom, forming the distribution \(Z_\theta\). Note that the discrete approximation causes disjoint supports between the Bellman update \(\mathcal{T}^\pi Z\) and the parameterization \(Z_\theta\). C51 addresses this issue by projecting the target distribution \(\mathcal{T}^\pi Z_{\hat{\theta}}\) onto the support of \(Z_\theta\). More precisely, given a transition \((S_t, A_t, R_t, S_{t+1})\), the i-th component of the projected target \(\varPhi\mathcal{T}^\pi Z_{\hat{\theta}}(S_t, A_t)\) with double Q-learning is calculated by

$$\displaystyle \begin{aligned} \sum_{j=1}^{N}p_j\left(S_{t+1}, \arg\max_a \boldsymbol{z}^\intercal p(S_{t+1}, a; \theta); \hat{\theta}\right)[1 - \frac{|[R_t + \gamma z_j]_{V_{min}}^{V_{max}} - z_i|}{\Delta z}]_0^1, \end{aligned} $$
(4.21)

where \([\cdot]_a^b\) bounds its argument to the range [a, b]. Since the TD error cannot measure the difference between value distributions, C51 uses the following Kullback–Leibler divergence as the training loss:

$$\displaystyle \begin{aligned} D_{\text{KL}}(\varPhi\mathcal{T}^\pi Z_{\hat{\theta}}(S_t, A_t)||Z_{\theta}(S_t, A_t)). \end{aligned} $$
(4.22)

In addition, for prioritized experience replay the TD-error priority is replaced by the KL divergence. For the dueling architecture, the output distribution is also split into a value stream and an advantage stream, and the aggregated distribution is estimated by

$$\displaystyle \begin{aligned} p_i(s, a) = \frac{exp(V_i(s) + A_i(s, a) - \bar{A_i}(s, a))}{\sum_j{exp(V_j(s) + A_j(s, a) - \bar{A_j}(s, a))}}, \end{aligned} $$
(4.23)

where \(\bar {A_j}(s, a)\) is defined by \(\frac {1}{|\mathcal {A}|}\sum _{a^\prime }A_j(s, a^\prime )\).

The main drawback of C51 as an approach to distributional reinforcement learning is that it can only estimate values on a fixed discrete support. Dabney et al. (2018b) proposed quantile regression DQN (QR-DQN) to address this issue by estimating the quantiles of the full distribution with quantile regression. Before introducing QR-DQN, we first review quantile regression. Recall that empirical risk minimization with the absolute loss makes the prediction fit the median (the 50% quantile). More precisely, given a random variable x and its label y, for an estimation function f, the empirical mean absolute error is \(\mathcal{L}_{\text{mae}} = \mathbb{E}[|f(x) - y|]\). Then, from the following partial derivative:

$$\displaystyle \begin{aligned} \frac{\partial \mathcal{L}_{\text{mae}}}{\partial f(x)} &= \frac{\partial}{\partial f(x)} (P(f(x) > y)(f(x) - y) + P(f(x) \le y)(y - f(x))) \\ & = P(f(x)>y) - P(f(x) \le y) = 0, \end{aligned} $$
(4.24)

we can get \(F_y(f(x)) = 0.5\), where \(F_y\) is the cumulative distribution function of y, i.e., f(x) is the median of y. More generally, for a quantile τ, the quantile loss is defined as \(\mathcal{L}_{\text{quantile}}(\tau) = \mathbb{E}[\rho_\tau(y - f(x))]\) with

$$\displaystyle \begin{aligned} \rho_\tau(\alpha)= \begin{cases} \tau \alpha,& \text{if } \alpha > 0\\ (\tau - 1)\alpha,& \text{otherwise}. \end{cases} \end{aligned} $$
(4.25)

Similarly, by setting \(\frac{\partial \mathcal{L}_{\text{quantile}}}{\partial f(x)} = 0\) we can get \(F_y(f(x)) = \tau\), i.e., f(x) is the τ-quantile of the random variable y.

Specifically, QR-DQN considers N uniform quantile weights \(q_i = \frac{1}{N}\) for the value distribution. For a QR-DQN model \(\theta: \mathcal{S} \rightarrow \mathbb{R}^{N \times |\mathcal{A}|}\), during sampling the Q-value of state s and action a is the mean of the N estimates: \(Q(s, a) = \sum_{i=1}^{N} q_i\theta_i(s, a)\). During training, the greedy policy with respect to the Q-values in the next state provides \(a^* = \arg\max_{a^\prime} Q(s^\prime, a^\prime)\), and the distributional Bellman target is \(\mathcal{T}\theta_j = r + \gamma\theta_j(s^\prime, a^*)\) according to Eq. (4.20). Lemma 2 in Dabney et al. (2018b) shows that minimizing the following quantile regression objective yields the quantile estimates that minimize the 1-Wasserstein distance between the approximated value distribution and the ground truth:

$$\displaystyle \begin{aligned} \sum_{i=1}^N \mathbb{E}_j\left[\rho_{\hat{\tau}_i}(\mathcal{T}\theta_j - \theta_i(s, a))\right], \end{aligned} $$
(4.26)

where \(\hat {\tau }_i = \frac {i}{N} - \frac {1}{2N}\).
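A direct translation of Eq. (4.26) into code could look like the following sketch (the published QR-DQN additionally smooths the quantile loss with a Huber term, which is omitted here):

    import tensorflow as tf

    def quantile_regression_loss(theta_target, theta, tau):
        # theta_target: Bellman targets T theta_j, shape [batch, N]
        # theta:        current quantile estimates theta_i(s, a), shape [batch, N]
        # tau:          midpoints hat-tau_i = (i - 0.5) / N, shape [N]
        td = theta_target[:, :, None] - theta[:, None, :]            # pairwise T theta_j - theta_i
        indicator = tf.cast(td < 0.0, tf.float32)
        rho = tf.abs(tau[None, None, :] - indicator) * tf.abs(td)    # Eq. (4.25)
        # expectation over j, sum over i, mean over the batch, as in Eq. (4.26)
        return tf.reduce_mean(tf.reduce_sum(tf.reduce_mean(rho, axis=1), axis=1))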

Figure 4.5 shows a comparison of DQN, C51, and QR-DQN. There is further work on the flexibility and robustness of the parameterized distribution in distributional reinforcement learning. Readers interested in this topic can find related resources in Dabney et al. (2018a), Mavrin et al. (2019), and Yang et al. (2019).

Fig. 4.5

Comparison of DQN, C51, and QR-DQN for a state s and action a, where the arrows indicate the estimates and the number of quantiles in QR-DQN is set to 4. The DQN architecture outputs only an approximation of the actual Q-value. For distributional reinforcement learning, C51 estimates probabilities on several fixed Q-value supports, while QR-DQN estimates quantiles of the Q-value distribution

10 DQN Examples

In this section, we discuss more training details of DQN and its variants. Before that, we first demonstrate how to set up the Atari environments and how to implement some useful wrappers that make training easy and stable.

10.1 Related Gym Environment

OpenAI Gym is an open-source toolkit for developing and comparing reinforcement learning algorithms. It contains a collection of environments, as shown in Fig. 4.6. It can be installed with the Atari extension either directly from PyPI or from source.

Fig. 4.6

Sample frames of some environments in OpenAI Gym

An environment object env can be created by
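
For example, with the classic Gym API assumed throughout this chapter:

    import gym

    env_id = "BreakoutNoFrameskip-v4"
    env = gym.make(env_id)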

where env_id is a string that represents an environment. All possible env_ids are available at https://github.com/openai/gym/wiki/Table-of-environments.

There are some important methods of env:

  1. env.reset() resets the state of the environment and returns the initial observation.

  2. env.render(mode) renders the environment with the given mode. The default mode is human, which renders to the current display or terminal and returns nothing. You can specify the rgb_array mode to make env.render return numpy.ndarray objects, which is suitable for generating videos.

  3. env.step(action) runs one time step of the environment's dynamics with the given action and returns a tuple (observation, reward, done, info), where observation is the observation of the current environment, reward is the transition reward, done indicates whether the episode has ended, and info contains some auxiliary information.

  4. env.seed(seed) sets the random seed manually, which is useful for reproducibility.

Here is an example with the classic game Breakout. It runs an instance of the BreakoutNoFrameskip-v4 environment until the episode has ended. A sample frame is shown in Fig. 4.7.
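
A minimal random-play loop consistent with this description might be (the random policy is only for demonstration):

    import gym

    env = gym.make("BreakoutNoFrameskip-v4")
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()           # random action for demonstration
        obs, reward, done, info = env.step(action)   # (observation, reward, done, info)
        total_reward += reward
    print("episode reward:", total_reward)
    env.close()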

Fig. 4.7

An example frame of Breakout. There are several rows of bricks to destroy at the top of the screen. The agent controls the paddle at the bottom of the screen to angle its shots at the bricks it wants to smash with the ball. The observations are RGB images of the screen with shape (210, 160, 3)

Note that NoFrameskip means no frame skip and no action repeat, and v4 means the fourth version, which is the newest at the time of writing. We will use this environment in the following experiments.

Another useful feature of OpenAI Gym is the environment wrapper, which wraps the environment object and makes the training code more concise. Here is a time limit wrapper that limits the maximum length of each episode and is applied to Atari games by default.
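
A simplified sketch of such a wrapper is shown below; Gym ships an equivalent gym.wrappers.TimeLimit, and this version only illustrates the idea.

    import gym

    class TimeLimit(gym.Wrapper):
        def __init__(self, env, max_episode_steps):
            super().__init__(env)
            self.max_episode_steps = max_episode_steps
            self.elapsed_steps = 0

        def reset(self, **kwargs):
            self.elapsed_steps = 0
            return self.env.reset(**kwargs)

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            self.elapsed_steps += 1
            if self.elapsed_steps >= self.max_episode_steps:
                # cut the episode; record that it was truncated rather than truly terminal
                info["TimeLimit.truncated"] = not done
                done = True
            return obs, reward, done, info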

For efficient training, gym.vector.AsyncVectorEnv provides a vectorized wrapper that runs n environments in parallel; all interfaces receive and return n variables together. Furthermore, it is also possible to implement a vectorized wrapper with a buffer, whose interfaces also receive and return n variables but which maintains m > n workers in the background. This is efficient for environments in which some transitions take much longer than others.

Besides some classic control problems, Gym also provides standard interfaces to a collection of Atari 2600 games with RAM or screen images as input, using the Arcade Learning Environment (Bellemare et al. 2013). There are at most 18 different buttons in Atari 2600 games, as follows:

  1. Moving buttons: NOOP, UP, RIGHT, LEFT, DOWN, UPRIGHT, UPLEFT, DOWNRIGHT, DOWNLEFT

  2. Fire buttons: FIRE, UPFIRE, RIGHTFIRE, LEFTFIRE, DOWNFIRE, UPRIGHTFIRE, UPLEFTFIRE, DOWNRIGHTFIRE, DOWNLEFTFIRE

where NOOP means doing nothing, and FIRE may also be used to start the game. For convenience, we will refer to the buttons' names as actions in the following.

10.2 DQN

There are three more training tricks used in DQN. First, the following wrappers are applied, in order, for stable and efficient training:

  1. NoopResetEnv takes a random number of NOOP actions in the reset stage to ensure random initial states. The default maximum number of no-ops is 30. This wrapper helps the agent encounter more diverse starting situations, which makes learning more robust.

  2. MaxAndSkipEnv repeats each action 4 times for efficient training. To further denoise the observation, the returned frame is the pixel-wise maximum over the last two frames.

  3. Monitor records the raw rewards. We can also implement useful functions such as a speed tracker in this wrapper.

  4. EpisodicLifeEnv makes the end of a life equal to the end of an episode, which helps value estimation (Roderick et al. 2017).

  5. FireResetEnv takes the action FIRE on reset for environments that require it to start the game. This is prior knowledge used for a quick start of such games.

  6. WarpFrame converts the observations to 84 × 84 gray-scale images.

  7. ClipRewardEnv clips the rewards to their sign, which further improves stability by preventing any single mini-batch update from changing the parameters drastically.

  8. FrameStack stacks the last 4 frames. Recall that to capture motion information, DQN preprocesses observations by concatenating the current frame with the previous three, represented by the function ϕ; FrameStack and WarpFrame together implement ϕ. Note that we can optimize memory usage by storing common frames between observations only once, which is also known as the lazy-frame trick.

Second, to avoid exploding gradients, DQN (Mnih et al. 2015; DeepMind 2015) clips the gradient of the squared error, which is equivalent to replacing the MSE with the Huber loss (Huber 1992) with δ = 1. The Huber loss is given by

$$\displaystyle \begin{aligned} L_{\delta}(x)=\left\{ \begin{array}{rcl} &\frac{1}{2}x^2 & {|x| \le \delta}\\ &\delta(|x| - \frac{1}{2}\delta) & {\text{otherwise}}. \end{array} \right. \end{aligned} $$
(4.27)

Finally, the replay buffer samples batches of experience with replacement, and there are some warm-up steps before updating begins, for a stable start.

Note that all three tricks above are applied in all experiments in this section. Now we show how to build an agent to play Breakout. First of all, for reproducibility, we set the random seeds of the related libraries manually:
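
For example (the exact set of libraries depends on the implementation; here we assume random, numpy, TensorFlow, and the Gym environment):

    import random

    import gym
    import numpy as np
    import tensorflow as tf

    seed = 0
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

    env = gym.make("BreakoutNoFrameskip-v4")
    env.seed(seed)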

Then we build a Q-network with tf.keras.Model:
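
A sketch of the convolutional Q-network described in Sect. 4.5; the layer sizes follow Mnih et al. (2015) and are assumptions with respect to the book's repository.

    import tensorflow as tf

    class QNetwork(tf.keras.Model):
        def __init__(self, n_actions):
            super().__init__()
            # input: a stack of 4 gray-scale 84x84 frames
            self.conv1 = tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu")
            self.conv2 = tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu")
            self.conv3 = tf.keras.layers.Conv2D(64, 3, strides=1, activation="relu")
            self.flatten = tf.keras.layers.Flatten()
            self.fc = tf.keras.layers.Dense(512, activation="relu")
            self.q = tf.keras.layers.Dense(n_actions)  # one output per valid action

        def call(self, obs):
            x = tf.cast(obs, tf.float32) / 255.0
            x = self.conv3(self.conv2(self.conv1(x)))
            return self.q(self.fc(self.flatten(x)))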

The DQN object consists of the Q-network, the target Q-network, the number of time steps, and the optimizer as attributes, and it synchronizes the Q-network and the target Q-network on construction as follows:
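
A sketch of such an object, reusing the QNetwork class above; the input shape and learning rate are assumptions.

    import tensorflow as tf

    class DQN:
        def __init__(self, env, n_actions, lr=1e-4):
            self.env = env
            self.niter = 0                          # number of time steps so far
            self.qnet = QNetwork(n_actions)
            self.targetqnet = QNetwork(n_actions)
            dummy = tf.zeros((1, 84, 84, 4))        # build the weights before syncing
            self.qnet(dummy)
            self.targetqnet(dummy)
            self.targetqnet.set_weights(self.qnet.get_weights())   # synchronize
            self.optimizer = tf.keras.optimizers.Adam(lr)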

We declare an internal method to wrap the Q-network and then add a get_action method to the DQN object for 𝜖-greedy behavior:
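
A sketch of the 𝜖-greedy behavior, written here as standalone functions for brevity rather than as methods of the DQN object; the schedule parameters mirror the description below.

    import numpy as np
    import tensorflow as tf

    def epsilon(step, total_steps=int(1e7), eps_start=1.0, eps_end=0.01, fraction=0.1):
        # linear annealing over the first `fraction` of training time steps
        progress = min(1.0, step / (fraction * total_steps))
        return eps_start + progress * (eps_end - eps_start)

    def get_action(q_net, obs, step, action_space):
        # epsilon-greedy behavior with respect to the online Q-network
        if np.random.rand() < epsilon(step):
            return action_space.sample()
        q_values = q_net(tf.convert_to_tensor(obs[None], tf.float32))
        return int(tf.argmax(q_values[0]).numpy())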

where epsilon is a function that anneals 𝜖 linearly from 1.0 to 0.01 over the first 10% of training time steps. For training, we use three common interfaces, train, _train_func, and _tderror_func, shared by DQN and its variants in the following sections:
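
A sketch of the TD error and the gradient step (corresponding to _tderror_func and _train_func); the train wrapper that also synchronizes the target network is omitted here, the batch entries are assumed to be arrays or tensors of suitable shapes, and huber implements Eq. (4.27) element-wise.

    import tensorflow as tf

    def huber(x, delta=1.0):
        # element-wise Huber loss, Eq. (4.27)
        abs_x = tf.abs(x)
        quadratic = tf.minimum(abs_x, delta)
        return 0.5 * tf.square(quadratic) + delta * (abs_x - quadratic)

    def tderror_func(q_net, target_q_net, batch, n_actions, gamma=0.99):
        # TD error of a mini-batch under the standard DQN target
        s, a, r, s_next, done = batch
        r = tf.cast(r, tf.float32)
        done = tf.cast(done, tf.float32)
        q_sa = tf.reduce_sum(q_net(s) * tf.one_hot(a, n_actions), axis=1)
        q_next = tf.reduce_max(target_q_net(s_next), axis=1)
        target = r + gamma * (1.0 - done) * q_next
        return tf.stop_gradient(target) - q_sa

    def train_func(q_net, target_q_net, optimizer, batch, n_actions):
        # one gradient step on the Huber loss of the TD error
        with tf.GradientTape() as tape:
            td = tderror_func(q_net, target_q_net, batch, n_actions)
            loss = tf.reduce_mean(huber(td))
        grads = tape.gradient(loss, q_net.trainable_variables)
        optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
        return loss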

where train calls _train_func and synchronizes the Q-network and target Q-network every target_q_update_freq time steps.

Finally, we build the main training procedure:
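
A condensed sketch of the loop, tying together the pieces above; the buffer size, warm-up length, update frequencies, and batch size are assumptions rather than the exact hyper-parameters of the repository, and env is assumed to be wrapped with the preprocessing wrappers of Sect. 4.10.2.

    total_steps, warmup, train_freq, target_sync, batch_size = int(1e7), 10_000, 4, 10_000, 32
    agent = DQN(env, env.action_space.n)
    buffer = ReplayBuffer(capacity=100_000)

    obs = env.reset()
    for step in range(total_steps):
        action = get_action(agent.qnet, obs, step, env.action_space)
        next_obs, reward, done, info = env.step(action)
        buffer.add(obs, action, reward, next_obs, float(done))
        obs = env.reset() if done else next_obs
        if step >= warmup and step % train_freq == 0:
            batch = buffer.sample(batch_size)
            train_func(agent.qnet, agent.targetqnet, agent.optimizer, batch, env.action_space.n)
        if step % target_sync == 0:
            hard_update(agent.targetqnet, agent.qnet)   # from the target-network sketch above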

We run \(10^7\) time steps (\(4 \times 10^7\) frames) over three random seeds on Breakout. For better visualization, we smooth the episode rewards during training and then plot the mean and the standard deviation with the following code:
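
A sketch of the plotting step; the rewards array and its file name are hypothetical placeholders for the smoothed training curves of the three seeds.

    import matplotlib.pyplot as plt
    import numpy as np

    # hypothetical file: smoothed episode rewards with shape [n_seeds, n_points]
    rewards = np.load("breakout_dqn_rewards.npy")
    mean, std = rewards.mean(axis=0), rewards.std(axis=0)
    steps = np.arange(mean.shape[0])
    plt.plot(steps, mean, color="red")
    plt.fill_between(steps, mean - std, mean + std, color="red", alpha=0.3)
    plt.xlabel("time steps")
    plt.ylabel("episode reward")
    plt.show()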

The performance is shown as the red area in Fig. 4.8.

Fig. 4.8

Performance of DQN and its variants on Breakout

10.3 Double DQN

Double DQN can be implemented easily by using the following double Q estimation in the _tderror_func of the agent:
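
A sketch of the changed target, reusing the names from the DQN sketches above: following Eq. (4.10), the online network selects the argmax action and the target network evaluates it.

    import tensorflow as tf

    def double_q_target(q_net, target_q_net, r, s_next, done, n_actions, gamma=0.99):
        r = tf.cast(r, tf.float32)
        done = tf.cast(done, tf.float32)
        # action selection with the online network ...
        a_next = tf.argmax(q_net(s_next), axis=1)
        # ... value evaluation with the target network
        q_next = tf.reduce_sum(target_q_net(s_next) * tf.one_hot(a_next, n_actions), axis=1)
        return r + gamma * (1.0 - done) * q_next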

We also run \(10^7\) time steps over three random seeds on Breakout. The performance is shown as the green area in Fig. 4.8.

10.4 Dueling DQN

The dueling architecture only changes the Q-network, which can be implemented by
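
A sketch of a dueling Q-network following Eq. (4.14); the convolutional body matches the QNetwork sketch above, and the stream sizes are assumptions.

    import tensorflow as tf

    class DuelingQNetwork(tf.keras.Model):
        def __init__(self, n_actions):
            super().__init__()
            self.conv1 = tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu")
            self.conv2 = tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu")
            self.conv3 = tf.keras.layers.Conv2D(64, 3, strides=1, activation="relu")
            self.flatten = tf.keras.layers.Flatten()
            # two streams: state value V(s) and advantages A(s, a)
            self.value_fc = tf.keras.layers.Dense(512, activation="relu")
            self.value = tf.keras.layers.Dense(1)
            self.adv_fc = tf.keras.layers.Dense(512, activation="relu")
            self.adv = tf.keras.layers.Dense(n_actions)

        def call(self, obs):
            x = tf.cast(obs, tf.float32) / 255.0
            x = self.flatten(self.conv3(self.conv2(self.conv1(x))))
            v = self.value(self.value_fc(x))
            a = self.adv(self.adv_fc(x))
            # Eq. (4.14): subtract the mean advantage for identifiability
            return v + a - tf.reduce_mean(a, axis=1, keepdims=True)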

We also run \(10^7\) time steps over three random seeds on Breakout. The performance is shown as the cyan area in Fig. 4.8.

10.5 Prioritized Experience Replay

There are three main changes in PER compared with standard DQN. First, the replay buffer maintains two segment trees, one with a min operator and one with a sum operator, to calculate the minimum priority and the sum of priorities efficiently. More precisely, the attribute _it_sum is the segment tree object for the sum operation, with two interfaces: sum, which returns the sum of the elements in a given range, and find_prefixsum_idx, which finds the highest index i such that the sum of the smallest i elements is less than the input value.

Second, instead of sampling uniformly, the proportional sampling strategy is as follows:
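
A sketch of proportional index sampling through the sum segment tree, using only the two interfaces described above; the inclusive range convention of sum is an assumption modeled on common implementations.

    import numpy as np

    def sample_proportional_idxs(it_sum, buffer_len, batch_size):
        # draw a random prefix-sum mass inside each of batch_size equal segments and
        # map it back to a transition index through the sum segment tree
        total = it_sum.sum(0, buffer_len - 1)
        idxs = []
        for i in range(batch_size):
            mass = np.random.uniform(i, i + 1) * total / batch_size
            idxs.append(it_sum.find_prefixsum_idx(mass))
        return idxs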

Third, unlike the standard replay buffer, PER must also return the indexes and normalized weights of the sampled experiences. The weights are used to weight the Huber loss, and the indexes are used to update the priorities. The sampling step is modified to
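
A simplified numpy sketch of the modified sampling step, computing the probabilities and importance weights directly (the actual implementation obtains them through the segment trees for efficiency):

    import numpy as np

    def sample_with_weights(priorities, batch_size, alpha=0.6, beta=0.4):
        # priorities: |delta_i| + eps for every stored transition
        p = priorities ** alpha
        probs = p / p.sum()                                    # Eq. (4.15)
        idxs = np.random.choice(len(priorities), batch_size, p=probs)
        weights = (len(priorities) * probs[idxs]) ** (-beta)   # Eq. (4.16)
        weights /= weights.max()                               # normalize for stability
        return idxs, weights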

and the _train_func is modified to
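
A sketch of the weighted update, reusing tderror_func and huber from the DQN sketch in Sect. 4.10.2; the small constant added to the new priorities is an assumption for numerical stability.

    import tensorflow as tf

    def per_train_func(q_net, target_q_net, optimizer, batch, weights, n_actions):
        weights = tf.cast(weights, tf.float32)
        with tf.GradientTape() as tape:
            td = tderror_func(q_net, target_q_net, batch, n_actions)
            # importance-sampling weights are folded into the per-sample Huber loss
            loss = tf.reduce_mean(weights * huber(td))
        grads = tape.gradient(loss, q_net.trainable_variables)
        optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
        new_priorities = tf.abs(td).numpy() + 1e-6   # used to update the sampled priorities
        return loss, new_priorities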

We also run \(10^7\) time steps over three random seeds on Breakout. The performance is shown as the magenta area in Fig. 4.8.

10.6 Distributional DQN

Distributional reinforcement learning estimates the distribution of the Q-value. In this section, we show how to implement one of these techniques, C51, to obtain a distributional DQN. In the game Breakout the rewards are all positive, so we replace the value range [−10, 10] used in Bellemare et al. (2017) with [−1, 19], where −1 allows for some approximation error. To implement C51, first of all, the Q-network outputs 51 estimates for each action, which can be implemented by adding more output units to the last fully connected layer. Then, instead of the TD error, the KL divergence between the target Q distribution and the estimated distribution is used as the loss:
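
A sketch of the projected target of Eq. (4.21) and the cross-entropy form of the KL loss in Eq. (4.22); the value range follows the text above, and the slow Python loop is kept only for clarity.

    import numpy as np
    import tensorflow as tf

    N_ATOMS, V_MIN, V_MAX = 51, -1.0, 19.0
    z = np.linspace(V_MIN, V_MAX, N_ATOMS)             # atom supports z_i
    delta_z = (V_MAX - V_MIN) / (N_ATOMS - 1)

    def project_target(p_next, r, done, gamma=0.99):
        # p_next: probabilities of the selected next action, shape [batch, N_ATOMS]
        # returns the projected target distribution Phi T Z, shape [batch, N_ATOMS]
        batch = p_next.shape[0]
        m = np.zeros((batch, N_ATOMS))
        tz = np.clip(r[:, None] + gamma * (1.0 - done[:, None]) * z[None, :], V_MIN, V_MAX)
        b = (tz - V_MIN) / delta_z                     # fractional atom index
        lower, upper = np.floor(b).astype(int), np.ceil(b).astype(int)
        for i in range(batch):                         # distribute mass to neighboring atoms
            for j in range(N_ATOMS):
                if lower[i, j] == upper[i, j]:
                    m[i, lower[i, j]] += p_next[i, j]
                else:
                    m[i, lower[i, j]] += p_next[i, j] * (upper[i, j] - b[i, j])
                    m[i, upper[i, j]] += p_next[i, j] * (b[i, j] - lower[i, j])
        return m

    def c51_loss(logits, target_m):
        # cross-entropy between the projected target and the prediction; equal to the
        # KL term of Eq. (4.22) up to a constant that does not depend on theta
        log_p = tf.nn.log_softmax(logits, axis=1)
        return -tf.reduce_mean(tf.reduce_sum(tf.cast(target_m, tf.float32) * log_p, axis=1))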

We also run \(10^7\) time steps over three random seeds on Breakout. The performance is shown as the blue area in Fig. 4.8.