
Low-Cost Model-Free Deep Reinforcement Learning on Continuous Control

Conference paper in Intelligent Computing (SAI 2023)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 739)

Abstract

Training reinforcement learning (RL) agents is difficult in most cases because they are often misguided by the reward signal, entering invalid states and behaving in unwanted ways while exploring and interacting with the environment. In many settings, a cost must be considered alongside performance during RL training, and an additional cost signal from the environment can help steer agents towards the desired actions, maximizing the expected return at low cost. In most practical situations, the state or action space is large or continuous, which makes model-free training harder when the low-cost requirement is imposed as an overly tight constraint. Therefore, for continuous Markov Decision Processes, we present a formulation that optimizes the given objective function at low cost while allowing free exploration, which avoids performance degradation. In this paper, we propose a novel approach that replaces the objective and cost with surrogate functions and applies a state-dependent multiplier to provide a highly efficient tradeoff between them.


Notes

  1. https://github.com/openai/gym/tree/master/gym/envs/classic_control.

References

  1. Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 22–31. JMLR.org (2017)

  2. Altman, E.: Constrained Markov decision processes with total cost criteria: Lagrangian approach and dual linear program. Math. Methods Oper. Res. 48(3), 387–417 (1998)

  3. Altman, E.: Constrained Markov Decision Processes, vol. 7. CRC Press, Boca Raton (1999)

  4. Bohez, S., Abdolmaleki, A., Neunert, M., Buchli, J., Heess, N., Hadsell, R.: Value constrained model-free continuous control. arXiv preprint arXiv:1902.04623 (2019)

  5. Chow, Y., Nachum, O., Duenez-Guzman, E., Ghavamzadeh, M.: A Lyapunov-based approach to safe reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 8103–8112 (2018)

  6. Chow, Y., Nachum, O., Faust, A., Ghavamzadeh, M., Duenez-Guzman, E.: Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031 (2019)

  7. Dalal, G., Dvijotham, K., Vecerik, M., Hester, T., Paduraru, C., Tassa, Y.: Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757 (2018)

  8. Gattami, A.: Reinforcement learning for multi-objective and constrained Markov decision processes. arXiv preprint arXiv:1901.08978 (2019)

  9. Geibel, P., Wysotzki, F.: Risk-sensitive reinforcement learning applied to control under constraints. J. Artif. Intell. Res. 24, 81–108 (2005)

  10. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, pp. 1861–1870 (2018)

  11. Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems, pp. 1008–1014 (2000)

  12. Lee, J.D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M.I., Recht, B.: First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406 (2017)

  13. Lillicrap, T.P., et al.: Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)

  14. Liu, T., Zhou, R., Kalathil, D., Kumar, P., Tian, C.: Learning policies with zero or bounded constraint violation for constrained MDPs. Adv. Neural Inf. Process. Syst. 34, 17183–17193 (2021)

  15. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)

  16. Perkins, T.J., Barto, A.G.: Lyapunov design for safe reinforcement learning. J. Mach. Learn. Res. 3(December), 803–832 (2002)

  17. Satija, H., Amortila, P., Pineau, J.: Constrained Markov decision processes via backward value functions. In: International Conference on Machine Learning, pp. 8502–8511. PMLR (2020)

  18. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

  19. Silver, D., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017)

  20. Singh, S., Jaakkola, T., Littman, M.L.: Convergence results for single-step on-policy reinforcement-learning algorithms. Mach. Learn. 38(3), 287–308 (2000)

  21. Singh, S., Jaakkola, T., Littman, M.L.: Convergence results for single-step on-policy reinforcement-learning algorithms. Mach. Learn. 38(3), 287–308 (2000)

  22. Tessler, C., Mankowitz, D.J., Mannor, S.: Reward constrained policy optimization. arXiv preprint arXiv:1805.11074 (2018)

  23. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: Thirtieth AAAI Conference on Artificial Intelligence (2016)

  24. Wachi, A., Sui, Y.: Safe reinforcement learning in constrained Markov decision processes. In: International Conference on Machine Learning, pp. 9797–9806. PMLR (2020)

  25. Wiering, M., Van Otterlo, M.: Reinforcement learning. Adapt. Learn. Optim. 12, 3 (2012)

Author information


Correspondence to Huihui Zhang.


Appendix

Proof of Eq. (5)

Proof

For \(s\in \chi \),

$$\begin{aligned}&C_{\pi }(s,a)=\mathbb {E}_{p^{\pi }(h|s_0,a_0)}\left[ \sum _{t=0}^{\infty }\gamma ^t c(s_t,a_t)|s_0=s,a_0=a\right] \nonumber \\&=\mathbb {E}_{p(s'|s,a)}\{c(s,a)+\mathbb {E}_{p^{\pi }(h_1^{\infty }|s_1)}[\sum _{t=1}^{\infty }\gamma ^t c(s_t,a_t)|s_1=s']\} \nonumber \\&=\mathbb {E}_{p(s'|s,a)}\{c(s,a)+\gamma \mathbb {E}_{p^{\pi }(h|s_0,a_0)}[\sum _{t=0}^{\infty }\gamma ^t c(s_t,a_t)|s_0=s']\} \nonumber \\&=c(s,a)+\gamma \mathbb {E}_{p(s'|s,a)}\left[ C_{\pi }(s')\right] , \end{aligned}$$
(15)

where

$$\begin{aligned} p^{\pi }(h_1^{\infty }|s_1)=\prod _{t=1}^{\infty }\pi (a_t|s_t)p(s_{t+1}|s_t,a_t). \end{aligned}$$
(16)

\(\square \)

Proof of Theorem 1

Proof

This proof is based on Lemma 1 of [20], which is restated below for convenience.

Lemma 1 of [20]: Consider a stochastic process \((\alpha _t,\varDelta _t,F_t)\), \(t\ge 0\), where \(\alpha _t,\varDelta _t,F_t:X\rightarrow \mathbb {R}\), which satisfies the equations

$$\begin{aligned} \varDelta _{t+1}(x)=(1-\alpha _t(x))\varDelta _t(x)+\alpha _t(x)F_t(x),\quad x\in X,t=0,1,2,\cdots \end{aligned}$$
(17)

Let \(P_t\) be a sequence of increasing \(\sigma \)-fields such that \(\alpha _0, \varDelta _0\) are \(P_0\)-measurable and \(\alpha _t,\varDelta _t\) and \(F_{t-1}\) are \(P_t\)-measurable, \(t=1,2,\cdots \). Assume that the following hold:

  1. the set of possible states X is fixed;

  2. \(0\le \alpha _t(x)\le 1, \sum _t\alpha _t(x)=\infty , \sum _t\alpha _t^2(x)<\infty \ w.p.1\);

  3. \(\Vert \mathbb {E}\{F_t(\cdot )|P_t\}\Vert \le \gamma \Vert \varDelta _t\Vert +c_t\), where \(\gamma \in [0,1)\) and \(c_t\) converges to zero w.p.1;

  4. \(Var\{F_t(x)|P_t\}\le K(1+\Vert \varDelta _t\Vert )^2\), where K is some constant.

Then \(\varDelta _t\) converges to zero with probability one (w.p.1).

Within the scope of this paper, the MDP state space is fixed, satisfying condition 1 in Lemma 1 of [20], and condition 2 of the Lemma holds with a proper choice of learning rate. According to [21], even the commonly used constant learning rate can make algorithms converge in distribution. The reward and cost critics are trained by minimizing the following mean squared temporal-difference errors over a batch of N sampled transitions:

$$\begin{aligned} L(\omega )=\frac{1}{N}\sum _{n=1}^N(r_n+\gamma Q(s_n',a_n'|\omega ')-Q(s_n,a_n|\omega ))^2 , \end{aligned}$$
(18)
$$\begin{aligned} L(\varOmega )=\frac{1}{N}\sum _{n=1}^N(c_n+\gamma C(s_n',a_n'|\varOmega ')-C(s_n,a_n|\varOmega ))^2 . \end{aligned}$$
(19)

We apply Lemma 1 of [20] with \(P_t=\{Q(\cdot ,\cdot |\omega _0),C(\cdot ,\cdot |\varOmega _0),s_0,a_0,r_1,s_1,\cdots ,s_t,a_t\}\). Following the update rule for optimizing Eq. (18) and Eq. (19) and using the current policy to produce the action \(a_{t+1}=\mu (s_{t+1}|\theta )\), we have

$$\begin{aligned} Q(s_t,a_t|\omega _{t+1})&=(1-\alpha _t)Q(s_t,a_t|\omega _t)+\alpha _t \left[ r_t+\gamma Q(s_{t+1},a_{t+1}|\omega _t)\right] , \end{aligned}$$
(20)
$$\begin{aligned} C(s_t,a_t|\varOmega _{t+1})&=(1-\alpha _t)C(s_t,a_t|\varOmega _t)+\alpha _t \left[ c_t+\gamma C(s_{t+1},a_{t+1}|\varOmega _t)\right] . \end{aligned}$$
(21)

Define the penalized action-value function as

$$\begin{aligned} \hat{Q}(s,a|\omega ,\varOmega ,\lambda )=Q(s,a|\omega )-\varLambda (s|\lambda )(C(s,a|\varOmega )-d_0). \end{aligned}$$
(22)

Using the definition in Eq. (22) under the setting of our proposed algorithm, we denote \(\varDelta _t=\hat{Q}(\cdot ,\cdot |\omega _t,\varOmega _t,\lambda )-\hat{Q}^{\star }(\cdot ,\cdot )\), which is the difference between the penalized action-value function and the optimal penalized action-value function. Then we have

$$\begin{aligned}&\varDelta _{t+1}(s_t,a_t)=\hat{Q}(s_t,a_t|\omega _{t+1},\varOmega _{t+1},\lambda )-\hat{Q}^{\star }(s_t,a_t) \nonumber \\ =\,&Q(s_t,a_t|\omega _{t+1})-\varLambda (s|\lambda )(C(s_t,a_t|\varOmega _{t+1})-d_0)-\hat{Q}^{\star }(s_t,a_t) \nonumber \\ =\,&(1-\alpha _t)\left[ \hat{Q}(s_t,a_t|\omega _t,\varOmega _t,\lambda )-\hat{Q}^{\star }(s_t,a_t)\right] +\alpha _t F_t \nonumber \\ =\,&(1-\alpha _t)\varDelta _t(s_t,a_t)+\alpha _t F_t(s_t,a_t), \end{aligned}$$
(23)

where the third equality is due to the substitution of Eq. (20) and Eq. (21), and

$$\begin{aligned} F_t(s_t,a_t)=r_t+\gamma \hat{Q}(s_{t+1},a_{t+1}|\omega _t,\varOmega _t,\lambda )-\hat{Q}^{\star }(s_t,a_t). \end{aligned}$$
(24)

Since the reward is bounded within the scope of this paper, the action-values are also bounded, so condition 4 in Lemma 1 of [20] holds. According to the proof of Theorem 2 in [20], we have \(\Vert \mathbb {E}\left[ F_t(s_t,a_t)|P_t\right] \Vert \le \gamma \Vert \varDelta _t\Vert \), which satisfies condition 3 in Lemma 1 of [20].

Finally, it can be concluded that \(\hat{Q}(\cdot ,\cdot |\omega _t,\varOmega _t,\lambda )\) converges to \(\hat{Q}^{\star }(\cdot ,\cdot )\) with probability 1. \(\square \)
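To make these update rules concrete, the following is a minimal PyTorch-style sketch of the critic losses in Eq. (18) and Eq. (19) and the penalized action-value of Eq. (22). It is only an illustration under assumed names (q_net and c_net for the reward and cost critics, q_target, c_target and actor_target for their target networks), not the authors' implementation.

import torch
import torch.nn.functional as F

def critic_losses(q_net, c_net, q_target, c_target, actor_target, batch, gamma):
    # batch holds tensors of states s, actions a, rewards r, costs c, next states s2.
    s, a, r, c, s2 = batch
    with torch.no_grad():
        a2 = actor_target(s2)                  # a' = mu(s'|theta')
        y_q = r + gamma * q_target(s2, a2)     # bootstrap target for Eq. (18)
        y_c = c + gamma * c_target(s2, a2)     # bootstrap target for Eq. (19)
    loss_q = F.mse_loss(q_net(s, a), y_q)      # L(omega), Eq. (18)
    loss_c = F.mse_loss(c_net(s, a), y_c)      # L(Omega), Eq. (19)
    return loss_q, loss_c

def penalized_q(q_net, c_net, multiplier, s, a, d0):
    # Eq. (22): Q_hat(s, a) = Q(s, a) - Lambda(s) * (C(s, a) - d0)
    return q_net(s, a) - multiplier(s) * (c_net(s, a) - d0)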

Proof of Theorem 2

Proof

We rewrite (12) and (13) in expected forms as

$$\begin{aligned} J(\theta )=\mathbb {E}_s\left[ Q(s,\mu (s|\theta )|\omega )-\varLambda (s|\lambda ')\delta _1(s|\theta )\right] , \end{aligned}$$
$$\begin{aligned} J(\lambda )=\mathbb {E}_s\left[ Q(s,\mu (s|\theta ')|\omega ')-\varLambda (s|\lambda )\delta _2(s|\theta ')\right] . \end{aligned}$$

We consider the worst case, starting from the i-th iteration, in which the expected cost term is nonnegative, i.e., \(\mathbb {E}_s\left[ \delta _2(s|\theta _i')\right] \ge 0\). Then the expected weighted cost term \(\mathbb {E}_s\left[ \varLambda (s|\lambda _i)\delta _2(s|\theta _i')\right] \ge 0\), since the multiplier is nonnegative as well. By the maximization of (13), \(\varLambda (s|\lambda _i)\) is unbounded above with respect to \(\lambda _i\) when the expected weighted cost term is nonnegative, which means \(\exists \lambda _{i+1}\) such that \(J(\theta )\le 0\), \(\forall \theta \). Here we make the simplifying assumption that \(\lambda _{i+1}'=\lambda _{i+1}\), which can be realized by setting the soft update parameter \(\tau _{\lambda }\) to 1.

Under this circumstance, for the maximization of \(J(\theta _{i+1})\), there must be \(\mathbb {E}_s\left[ \varLambda (s|\lambda _{i+1}')\delta _1(s|\theta _{i+1})\right] \le 0\). With the assumption that \(\theta _{i+1}'=\theta _{i+1}\), we can conclude \(\max _s{\varLambda (s|\lambda _{i+2})}\mathbb {E}_s\left[ \delta _2(s|\theta _{i+1}')\right] \le \mathbb {E}_s\left[ \varLambda (s|\lambda _{i+2})\delta _2(s|\theta _{i+1}')\right] \le 0\), i.e., \(\mathbb {E}_s\left[ \delta _2(s|\theta _{i+1}')\right] \le 0\), which means the constraint requirement is recovered. \(\square \)

Hyperparameters

Table 1 lists the common hyperparameters shared by all experiments and their respective settings.

Table 1. List of Hyperparameters.

Experiment Details

1.1 Continuous Maze

The continuous maze filled with obstacles is suitable for this evaluation. During training, the agent may travel through the obstacles in order to reach the goal. However, by receiving cost signals as warnings from the environment, the agent can asymptotically learn to keep a safe distance from the obstacles. The continuous maze environment has a continuous state-action space and is shown in Fig. 6(a) and Fig. 6(b). At every step, the agent moves in any direction with a fixed step size to any reachable position in the maze, even traveling across the barriers represented by the gray grids. The wall is drawn as the dark solid line on the edge of the maze.

The task of the agent is to move from the starting point to the goal (colored yellow) without hitting any barrier or wall. During the task, the agent receives a reward of \(-1000\) if it hits or passes through a barrier and \(-50\) if it hits the wall. A reward of 100 is given if the agent reaches the goal before the timeout. In all other free areas, the reward is the negative distance from the agent to the goal. At every step where the received reward is nonpositive, the agent also incurs a cost of \(\frac{1}{1+d}\), where d is the minimum distance from the agent to any barrier or wall.
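For illustration, the reward and cost shaping described above can be sketched in Python as follows. The helpers hit_barrier, hit_wall, at_goal, distance_to_goal and min_distance_to_obstacle are hypothetical placeholders standing in for the environment's geometry checks, not functions from the actual environment code.

def maze_reward_and_cost(pos):
    # Reward shaping as described above.
    if hit_barrier(pos):
        reward = -1000.0                     # hitting or passing through a barrier
    elif hit_wall(pos):
        reward = -50.0                       # hitting the outer wall
    elif at_goal(pos):
        reward = 100.0                       # reaching the goal before the timeout
    else:
        reward = -distance_to_goal(pos)      # negative distance to the goal elsewhere

    # Cost shaping: only steps with nonpositive reward incur a cost.
    if reward <= 0.0:
        d = min_distance_to_obstacle(pos)    # distance to the nearest barrier or wall
        cost = 1.0 / (1.0 + d)
    else:
        cost = 0.0
    return reward, cost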

Fig. 6. (a) Continuous Maze with 1 Array of Barriers (Size 2); (b) Continuous Maze with 2 Arrays of Barriers (Size 3); (c) Cartpole Environment.

1.2 Classical Control Environment

Among the classic control environments, the cartpole environment is illustrated in Fig. 6(c). We add the absolute value of the cart's position to the reward to encourage the cart to move farther rather than remaining at the origin for stability. The setting of this experiment is as follows. At every step, the cart moves left or right, and a reward of 1 plus a value positively proportional to the position and velocity of the cart is given if the deviation angle of the pole is less than \(24^\circ \). Otherwise, the position of the cart and the angle of the pole are reset for the next episode. The goal of the experiment is to keep the environment from resetting until the timeout, which is set to 1000 steps.

Overall, the cost in each experiment is set proportional to the absolute value of the action, which determines the torque applied by the agent. Accordingly, the influence of the action is excluded from the reward shaping. The settings of rewards and costs are listed in Table 2.
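As a rough illustration of the cartpole shaping and the action-proportional cost described above, the sketch below assumes a gym-style observation of cart position, cart velocity, pole angle and pole angular velocity; the coefficients K_POS, K_VEL and K_COST are hypothetical and not taken from the paper.

import math

ANGLE_LIMIT = 24.0 * math.pi / 180.0         # pole deviation threshold (24 degrees)
K_POS, K_VEL, K_COST = 1.0, 1.0, 1.0         # illustrative shaping coefficients

def cartpole_reward_cost_done(obs, action):
    x, x_dot, theta, theta_dot = obs
    done = abs(theta) >= ANGLE_LIMIT         # the environment resets beyond the limit
    if done:
        reward = 0.0
    else:
        # Base reward of 1 plus terms that reward moving away from the origin.
        reward = 1.0 + K_POS * abs(x) + K_VEL * abs(x_dot)
    cost = K_COST * abs(action)              # cost proportional to the applied torque
    return reward, cost, done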

Table 2. Reward Shaping of Continuous Control Experiments.
Fig. 7. Architecture of Deterministic Networks.

Network Architecture

We construct the critic network as a fully-connected MLP with two hidden layers. Its input is the concatenated state and action, and its output is a scalar Q-value. The first hidden layer is activated by the ReLU function. The actor network is configured similarly to the critic network, except that its input is the state and its output is multiplied by the action supremum after a tanh nonlinearity. The multiplier network \(\varLambda \) is constructed similarly to the actor network, except that the tanh nonlinearity is replaced by clipping \(\varLambda \) to \([0, \infty )\). The network architectures are shown in Fig. 7.
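Under the description above (two hidden layers, ReLU on the first hidden layer, a tanh output scaled by the action supremum for the actor, and clipping to \([0, \infty )\) for the multiplier), a minimal PyTorch sketch of the three networks might look as follows. The hidden sizes are illustrative placeholders, not the values used in the experiments.

import torch
import torch.nn as nn

HIDDEN = (400, 300)   # illustrative hidden sizes

class Critic(nn.Module):
    # Q(s, a | omega) or C(s, a | Omega): input is the concatenated state and action,
    # output is a scalar value.
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim + action_dim, HIDDEN[0])
        self.fc2 = nn.Linear(HIDDEN[0], HIDDEN[1])
        self.out = nn.Linear(HIDDEN[1], 1)

    def forward(self, state, action):
        x = torch.relu(self.fc1(torch.cat([state, action], dim=-1)))  # ReLU on first hidden layer
        return self.out(self.fc2(x))

class Actor(nn.Module):
    # mu(s | theta): tanh output scaled by the action supremum.
    def __init__(self, state_dim, action_dim, action_max):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, HIDDEN[0])
        self.fc2 = nn.Linear(HIDDEN[0], HIDDEN[1])
        self.out = nn.Linear(HIDDEN[1], action_dim)
        self.action_max = action_max

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        return self.action_max * torch.tanh(self.out(self.fc2(x)))

class Multiplier(nn.Module):
    # Lambda(s | lambda): like the actor, but the output is clipped to [0, inf)
    # instead of passing through tanh.
    def __init__(self, state_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, HIDDEN[0])
        self.fc2 = nn.Linear(HIDDEN[0], HIDDEN[1])
        self.out = nn.Linear(HIDDEN[1], 1)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        return torch.clamp(self.out(self.fc2(x)), min=0.0)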


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG


Cite this paper

Zhang, H., Han, X., Cheng, Y., Yan, C. (2023). Low-Cost Model-Free Deep Reinforcement Learning on Continuous Control. In: Arai, K. (eds) Intelligent Computing. SAI 2023. Lecture Notes in Networks and Systems, vol 739. Springer, Cham. https://doi.org/10.1007/978-3-031-37963-5_50
