
Low-Cost Model-Free Deep Reinforcement Learning on Continuous Control

Conference paper in Intelligent Computing (SAI 2023)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 739)

Abstract

Training reinforcement learning (RL) agents is difficult in most cases because they are often misguided by the reward signal, entering invalid states and behaving in unwanted ways while exploring and interacting with the environment. In many settings, a cost must be considered alongside performance during RL training, and an additional cost signal from the environment can help steer agents towards the desired actions, maximizing the expected return at low cost. In most practical situations, the state or action space is large or continuous, which makes model-free training harder when the low-cost requirement is imposed as an overly tight constraint. Therefore, for continuous Markov Decision Processes, we present a formulation that optimizes the given objective function at low cost while allowing free exploration, which avoids performance degradation. In this paper, we propose a novel approach that replaces the objective and cost with surrogate functions and applies a state-dependent multiplier to provide a highly efficient tradeoff between them.


Notes

  1. https://github.com/openai/gym/tree/master/gym/envs/classic_control.

References

  1. Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 22–31. JMLR.org (2017)

  2. Altman, E.: Constrained Markov decision processes with total cost criteria: Lagrangian approach and dual linear program. Math. Methods Oper. Res. 48(3), 387–417 (1998)

  3. Altman, E.: Constrained Markov Decision Processes, vol. 7. CRC Press, Boca Raton (1999)

  4. Bohez, S., Abdolmaleki, A., Neunert, M., Buchli, J., Heess, N., Hadsell, R.: Value constrained model-free continuous control. arXiv preprint arXiv:1902.04623 (2019)

  5. Chow, Y., Nachum, O., Duenez-Guzman, E., Ghavamzadeh, M.: A Lyapunov-based approach to safe reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 8103–8112 (2018)

  6. Chow, Y., Nachum, O., Faust, A., Ghavamzadeh, M., Duenez-Guzman, E.: Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031 (2019)

  7. Dalal, G., Dvijotham, K., Vecerik, M., Hester, T., Paduraru, C., Tassa, Y.: Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757 (2018)

  8. Gattami, A.: Reinforcement learning for multi-objective and constrained Markov decision processes. arXiv preprint arXiv:1901.08978 (2019)

  9. Geibel, P., Wysotzki, F.: Risk-sensitive reinforcement learning applied to control under constraints. J. Artif. Intell. Res. 24, 81–108 (2005)

  10. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, pp. 1861–1870 (2018)

  11. Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems, pp. 1008–1014 (2000)

  12. Lee, J.D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M.I., Recht, B.: First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406 (2017)

  13. Lillicrap, T.P., et al.: Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)

  14. Liu, T., Zhou, R., Kalathil, D., Kumar, P., Tian, C.: Learning policies with zero or bounded constraint violation for constrained MDPs. Adv. Neural Inf. Process. Syst. 34, 17183–17193 (2021)

  15. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)

  16. Perkins, T.J., Barto, A.G.: Lyapunov design for safe reinforcement learning. J. Mach. Learn. Res. 3(December), 803–832 (2002)

  17. Satija, H., Amortila, P., Pineau, J.: Constrained Markov decision processes via backward value functions. In: International Conference on Machine Learning, pp. 8502–8511. PMLR (2020)

  18. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

  19. Silver, D., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017)

  20. Singh, S., Jaakkola, T., Littman, M.L.: Convergence results for single-step on-policy reinforcement-learning algorithms. Mach. Learn. 38(3), 287–308 (2000)

  21. Singh, S., Jaakkola, T., Littman, M.L.: Convergence results for single-step on-policy reinforcement-learning algorithms. Mach. Learn. 38(3), 287–308 (2000)

  22. Tessler, C., Mankowitz, D.J., Mannor, S.: Reward constrained policy optimization. arXiv preprint arXiv:1805.11074 (2018)

  23. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: Thirtieth AAAI Conference on Artificial Intelligence (2016)

  24. Wachi, A., Sui, Y.: Safe reinforcement learning in constrained Markov decision processes. In: International Conference on Machine Learning, pp. 9797–9806. PMLR (2020)

  25. Wiering, M., Van Otterlo, M.: Reinforcement learning. Adapt. Learn. Optim. 12, 3 (2012)

Author information


Correspondence to Huihui Zhang.


Appendix

Proof of Eq. (5)

Proof

For \(s\in \chi \),

$$\begin{aligned}&C_{\pi }(s,a)=\mathbb {E}_{p^{\pi }(h|s_0,a_0)}\left[ \sum _{t=0}^{\infty }\gamma ^t c(s_t,a_t)|s_0=s,a_0=a\right] \nonumber \\&=\mathbb {E}_{p(s'|s,a)}\{c(s,a)+\mathbb {E}_{p^{\pi }(h_1^{\infty }|s_1)}[\sum _{t=1}^{\infty }\gamma ^t c(s_t,a_t)|s_1=s']\} \nonumber \\&=\mathbb {E}_{p(s'|s,a)}\{c(s,a)+\gamma \mathbb {E}_{p^{\pi }(h|s_0,a_0)}[\sum _{t=0}^{\infty }\gamma ^t c(s_t,a_t)|s_0=s']\} \nonumber \\&=c(s,a)+\gamma \mathbb {E}_{p(s'|s,a)}\left[ C_{\pi }(s')\right] , \end{aligned}$$
(15)

where

$$\begin{aligned} p^{\pi }(h_1^{\infty }|s_1)=\prod _{t=1}^{\infty }\pi (a_t|s_t)p(s_{t+1}|s_t,a_t). \end{aligned}$$
(16)

\(\square \)

Proof of Theorem 1

Proof

This proof is based on Lemma 1 of [20], which is restated below for convenience.

Lemma 1 of [20]: Consider a stochastic process \((\alpha _t,\varDelta _t,F_t)\), \(t\ge 0\), where \(\alpha _t,\varDelta _t,F_t:X\rightarrow \mathbb {R}\), which satisfies the equations

$$\begin{aligned} \varDelta _{t+1}(x)=(1-\alpha _t(x))\varDelta _t(x)+\alpha _t(x)F_t(x),\quad x\in X,t=0,1,2,\cdots \end{aligned}$$
(17)

Let \(P_t\) be a sequence of increasing \(\sigma \)-fields such that \(\alpha _0, \varDelta _0\) are \(P_0\)-measurable and \(\alpha _t,\varDelta _t\) and \(F_{t-1}\) are \(P_t\)-measurable, \(t=1,2,\cdots \). Assume that the following hold:

  1. the set of possible states X is fixed;

  2. \(0\le \alpha _t(x)\le 1, \sum _t\alpha _t(x)=\infty , \sum _t\alpha _t^2(x)<\infty \ w.p.1\);

  3. \(\Vert \mathbb {E}\{F_t(\cdot )|P_t\}\Vert \le \gamma \Vert \varDelta _t\Vert +c_t\), where \(\gamma \in [0,1)\) and \(c_t\) converges to zero w.p.1;

  4. \(Var\{F_t(x)|P_t\}\le K(1+\Vert \varDelta _t\Vert )^2\), where K is some constant.

Then \(\varDelta _t\) converges to zero with probability one (w.p.1).

Within the scope of this paper, the MDP state space is fixed, satisfying condition 1 in Lemma 1 of [20], and condition 2 of the Lemma holds with a proper choice of learning rate. According to [21], even the commonly used constant learning rate can make algorithms converge in distribution. The reward and cost critics are trained by minimizing the following mean squared temporal-difference errors over a batch of N sampled transitions:

$$\begin{aligned} L(\omega )=\frac{1}{N}\sum _{n=1}^N(r_n+\gamma Q(s_n',a_n'|\omega ')-Q(s_n,a_n|\omega ))^2 , \end{aligned}$$
(18)
$$\begin{aligned} L(\varOmega )=\frac{1}{N}\sum _{n=1}^N(c_n+\gamma C(s_n',a_n'|\varOmega ')-C(s_n,a_n|\varOmega ))^2 . \end{aligned}$$
(19)

We apply Lemma 1 of [20] with \(P_t=\{Q(\cdot ,\cdot |\omega _0),C(\cdot ,\cdot |\varOmega _0),s_0,a_0,r_1,s_1,\cdots ,s_t,a_t\}\). Following the update rule for optimizing Eq. (18) and Eq. (19) and using the current policy to produce the action \(a_{t+1}=\mu (s_{t+1}|\theta )\), we have

$$\begin{aligned} Q(s_t,a_t|\omega _{t+1})&=(1-\alpha _t)Q(s_t,a_t|\omega _t)+\alpha _t \left[ r_t+\gamma Q(s_{t+1},a_{t+1}|\omega _t)\right] , \end{aligned}$$
(20)
$$\begin{aligned} C(s_t,a_t|\varOmega _{t+1})&=(1-\alpha _t)C(s_t,a_t|\varOmega _t)+\alpha _t \left[ c_t+\gamma C(s_{t+1},a_{t+1}|\varOmega _t)\right] . \end{aligned}$$
(21)

Define the penalized action-value function as

$$\begin{aligned} \hat{Q}(s,a|\omega ,\varOmega ,\lambda )=Q(s,a|\omega )-\varLambda (s|\lambda )(C(s,a|\varOmega )-d_0). \end{aligned}$$
(22)

Using the definition in Eq. (22) under the setting of our proposed algorithm, we denote \(\varDelta _t=\hat{Q}(\cdot ,\cdot |\omega _t,\varOmega _t,\lambda )-\hat{Q}^{\star }(\cdot ,\cdot )\), which is the difference between the penalized action-value function and the optimal penalized action-value function. Then we have

$$\begin{aligned}&\varDelta _{t+1}(s_t,a_t)=\hat{Q}(s_t,a_t|\omega _{t+1},\varOmega _{t+1},\lambda )-\hat{Q}^{\star }(s_t,a_t) \nonumber \\ =\,&Q(s_t,a_t|\omega _{t+1})-\varLambda (s|\lambda )(C(s_t,a_t|\varOmega _{t+1})-d_0)-\hat{Q}^{\star }(s_t,a_t) \nonumber \\ =\,&(1-\alpha _t)\left[ \hat{Q}(s_t,a_t|\omega _t,\varOmega _t,\lambda )-\hat{Q}^{\star }(s_t,a_t)\right] +\alpha _t F_t \nonumber \\ =\,&(1-\alpha _t)\varDelta _t(s_t,a_t)+\alpha _t F_t(s_t,a_t), \end{aligned}$$
(23)

where the third equality is due to the substitution of Eq. (20) and Eq. (21), and

$$\begin{aligned} F_t(s_t,a_t)=r_t+\gamma \hat{Q}(s_{t+1},a_{t+1}|\omega _t,\varOmega _t,\lambda )-\hat{Q}^{\star }(s_t,a_t). \end{aligned}$$
(24)

Since the reward is bounded within the scope of this paper, the action-values are also bounded, so condition 4 in Lemma 1 of [20] holds. According to the proof of Theorem 2 in [20], we have \(\Vert \mathbb {E}\left[ F_t(s_t,a_t)|P_t\right] \Vert \le \gamma \Vert \varDelta _t\Vert \), which satisfies condition 3 in Lemma 1 of [20].

Finally, it can be concluded that \(\hat{Q}(\cdot ,\cdot |\omega _t,\varOmega _t,\lambda )\) converges to \(\hat{Q}^{\star }(\cdot ,\cdot )\) with probability 1. \(\square \)
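To make these update rules concrete, the following is a minimal PyTorch-style sketch of the critic losses in Eq. (18) and Eq. (19) and the penalized action-value of Eq. (22). It is only an illustration under assumed names (q_net and c_net for the reward and cost critics, q_target, c_target and actor_target for their target networks), not the authors' implementation.

import torch
import torch.nn.functional as F

def critic_losses(q_net, c_net, q_target, c_target, actor_target, batch, gamma):
    # batch holds tensors of states s, actions a, rewards r, costs c, next states s2.
    s, a, r, c, s2 = batch
    with torch.no_grad():
        a2 = actor_target(s2)                  # a' = mu(s'|theta')
        y_q = r + gamma * q_target(s2, a2)     # bootstrap target for Eq. (18)
        y_c = c + gamma * c_target(s2, a2)     # bootstrap target for Eq. (19)
    loss_q = F.mse_loss(q_net(s, a), y_q)      # L(omega), Eq. (18)
    loss_c = F.mse_loss(c_net(s, a), y_c)      # L(Omega), Eq. (19)
    return loss_q, loss_c

def penalized_q(q_net, c_net, multiplier, s, a, d0):
    # Eq. (22): Q_hat(s, a) = Q(s, a) - Lambda(s) * (C(s, a) - d0)
    return q_net(s, a) - multiplier(s) * (c_net(s, a) - d0)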

Proof of Theorem 2

Proof

We rewrite (12) and (13) in expected forms as

$$\begin{aligned} J(\theta )=\mathbb {E}_s\left[ Q(s,\mu (s|\theta )|\omega )-\varLambda (s|\lambda ')\delta _1(s|\theta )\right] , \end{aligned}$$
$$\begin{aligned} J(\lambda )=\mathbb {E}_s\left[ Q(s,\mu (s|\theta ')|\omega ')-\varLambda (s|\lambda )\delta _2(s|\theta ')\right] . \end{aligned}$$

We consider the worst case, starting from the i-th iteration, in which the expected cost term is nonnegative, i.e., \(\mathbb {E}_s\left[ \delta _2(s|\theta _i')\right] \ge 0\). Then the expected weighted cost term \(\mathbb {E}_s\left[ \varLambda (s|\lambda _i)\delta _2(s|\theta _i')\right] \ge 0\), since the multiplier is nonnegative as well. By the maximization of (13), \(\varLambda (s|\lambda _i)\) is unbounded above with respect to \(\lambda _i\) when the expected weighted cost term is nonnegative, which means \(\exists \lambda _{i+1}\) such that \(J(\theta )\le 0\), \(\forall \theta \). Here we make the simplifying assumption that \(\lambda _{i+1}'=\lambda _{i+1}\), which can be realized by setting the soft update parameter \(\tau _{\lambda }\) to 1.

Under this circumstance, for the maximization of \(J(\theta _{i+1})\), there must be \(\mathbb {E}_s\left[ \varLambda (s|\lambda _{i+1}')\delta _1(s|\theta _{i+1})\right] \le 0\). With the assumption that \(\theta _{i+1}'=\theta _{i+1}\), we can conclude \(\max _s{\varLambda (s|\lambda _{i+2})}\mathbb {E}_s\left[ \delta _2(s|\theta _{i+1}')\right] \le \mathbb {E}_s\left[ \varLambda (s|\lambda _{i+2})\delta _2(s|\theta _{i+1}')\right] \le 0\), i.e., \(\mathbb {E}_s\left[ \delta _2(s|\theta _{i+1}')\right] \le 0\), which means the constraint requirement is recovered. \(\square \)

Hyperparameters

Table 1 lists the common hyperparameters shared by all experiments and their respective settings.

Table 1. List of Hyperparameters.

Experiment Details

1.1 Continuous Maze

The continuous maze filled with obstacles is suitable for this evaluation. During training, the agent may travel through the obstacles in order to reach the goal. However, by receiving cost signals as warnings from the environment, the agent can asymptotically learn to keep a safe distance from the obstacles. The continuous maze environment has a continuous state-action space and is shown in Fig. 6(a) and Fig. 6(b). At every step, the agent moves in any direction with a fixed step size to any reachable position in the maze, even traveling across the barriers represented by the gray grids. The wall is drawn as the dark solid line on the edge of the maze.

The task of the agent is to move from the starting point to the goal (colored yellow) without hitting any barrier or wall. During the task, the agent receives a reward of \(-1000\) if it hits or passes through a barrier and \(-50\) if it hits the wall. A reward of 100 is given if the agent reaches the goal before the timeout. In all other free areas, the reward is the negative distance from the agent to the goal. At every step where the received reward is nonpositive, the agent also incurs a cost of \(\frac{1}{1+d}\), where d is the minimum distance from the agent to any barrier or wall.
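For illustration, the reward and cost shaping described above can be sketched in Python as follows. The helpers hit_barrier, hit_wall, at_goal, distance_to_goal and min_distance_to_obstacle are hypothetical placeholders standing in for the environment's geometry checks, not functions from the actual environment code.

def maze_reward_and_cost(pos):
    # Reward shaping as described above.
    if hit_barrier(pos):
        reward = -1000.0                     # hitting or passing through a barrier
    elif hit_wall(pos):
        reward = -50.0                       # hitting the outer wall
    elif at_goal(pos):
        reward = 100.0                       # reaching the goal before the timeout
    else:
        reward = -distance_to_goal(pos)      # negative distance to the goal elsewhere

    # Cost shaping: only steps with nonpositive reward incur a cost.
    if reward <= 0.0:
        d = min_distance_to_obstacle(pos)    # distance to the nearest barrier or wall
        cost = 1.0 / (1.0 + d)
    else:
        cost = 0.0
    return reward, cost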

Fig. 6. (a) Continuous Maze with 1 Array of Barriers (Size 2); (b) Continuous Maze with 2 Arrays of Barriers (Size 3); (c) Cartpole Environment.

1.2 Classical Control Environment

Among the classic control environments, the cartpole environment is illustrated in Fig. 6(c). We add the absolute value of the cart's position to the reward to encourage the cart to move farther rather than remaining at the origin for stability. The setting of this experiment is as follows. At every step, the cart moves left or right, and a reward of 1 plus a value positively proportional to the position and velocity of the cart is given if the deviation angle of the pole is less than \(24^\circ \). Otherwise, the position of the cart and the angle of the pole are reset for the next episode. The goal of the experiment is to keep the environment from resetting until the timeout, which is set to 1000 steps.

Overall, the cost in each experiment is set proportional to the absolute value of the action, which determines the torque applied by the agent. Accordingly, the influence of the action is excluded from the reward shaping. The settings of rewards and costs are listed in Table 2.
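As a rough illustration of the cartpole shaping and the action-proportional cost described above, the sketch below assumes a gym-style observation of cart position, cart velocity, pole angle and pole angular velocity; the coefficients K_POS, K_VEL and K_COST are hypothetical and not taken from the paper.

import math

ANGLE_LIMIT = 24.0 * math.pi / 180.0         # pole deviation threshold (24 degrees)
K_POS, K_VEL, K_COST = 1.0, 1.0, 1.0         # illustrative shaping coefficients

def cartpole_reward_cost_done(obs, action):
    x, x_dot, theta, theta_dot = obs
    done = abs(theta) >= ANGLE_LIMIT         # the environment resets beyond the limit
    if done:
        reward = 0.0
    else:
        # Base reward of 1 plus terms that reward moving away from the origin.
        reward = 1.0 + K_POS * abs(x) + K_VEL * abs(x_dot)
    cost = K_COST * abs(action)              # cost proportional to the applied torque
    return reward, cost, done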

Table 2. Reward Shaping of Continuous Control Experiments.
Fig. 7. Architecture of Deterministic Networks.

Network Architecture

We construct the critic network as a fully-connected MLP with two hidden layers. Its input is the concatenated state and action, and its output is a scalar Q-value. The first hidden layer is activated by the ReLU function. The actor network is configured similarly to the critic network, except that its input is the state and its output is multiplied by the action supremum after a tanh nonlinearity. The multiplier network \(\varLambda \) is constructed similarly to the actor network, except that the tanh nonlinearity is replaced by clipping \(\varLambda \) to \([0, \infty )\). The network architectures are shown in Fig. 7.
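Under the description above (two hidden layers, ReLU on the first hidden layer, a tanh output scaled by the action supremum for the actor, and clipping to \([0, \infty )\) for the multiplier), a minimal PyTorch sketch of the three networks might look as follows. The hidden sizes are illustrative placeholders, not the values used in the experiments.

import torch
import torch.nn as nn

HIDDEN = (400, 300)   # illustrative hidden sizes

class Critic(nn.Module):
    # Q(s, a | omega) or C(s, a | Omega): input is the concatenated state and action,
    # output is a scalar value.
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim + action_dim, HIDDEN[0])
        self.fc2 = nn.Linear(HIDDEN[0], HIDDEN[1])
        self.out = nn.Linear(HIDDEN[1], 1)

    def forward(self, state, action):
        x = torch.relu(self.fc1(torch.cat([state, action], dim=-1)))  # ReLU on first hidden layer
        return self.out(self.fc2(x))

class Actor(nn.Module):
    # mu(s | theta): tanh output scaled by the action supremum.
    def __init__(self, state_dim, action_dim, action_max):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, HIDDEN[0])
        self.fc2 = nn.Linear(HIDDEN[0], HIDDEN[1])
        self.out = nn.Linear(HIDDEN[1], action_dim)
        self.action_max = action_max

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        return self.action_max * torch.tanh(self.out(self.fc2(x)))

class Multiplier(nn.Module):
    # Lambda(s | lambda): like the actor, but the output is clipped to [0, inf)
    # instead of passing through tanh.
    def __init__(self, state_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, HIDDEN[0])
        self.fc2 = nn.Linear(HIDDEN[0], HIDDEN[1])
        self.out = nn.Linear(HIDDEN[1], 1)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        return torch.clamp(self.out(self.fc2(x)), min=0.0)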


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG


Cite this paper

Zhang, H., Han, X., Cheng, Y., Yan, C. (2023). Low-Cost Model-Free Deep Reinforcement Learning on Continuous Control. In: Arai, K. (eds) Intelligent Computing. SAI 2023. Lecture Notes in Networks and Systems, vol 739. Springer, Cham. https://doi.org/10.1007/978-3-031-37963-5_50
