Abstract
Training reinforcement learning (RL) agents is difficult in many settings because the reward signal alone can misguide them: while exploring and interacting with the environment, agents may enter invalid states and behave in unwanted ways. In such cases, cost must be considered alongside performance during training, and an additional cost signal from the environment can steer agents toward the desired actions, maximizing the expected return at low cost. In most practical situations, however, the state or action space is large or continuous, and imposing tight constraints for the sake of low cost makes model-free training harder. We therefore introduce, for continuous Markov decision processes, a formulation that optimizes the given objective function at low cost while allowing free exploration, avoiding performance degradation. In this paper, we propose a novel approach that replaces the objective and cost with surrogate functions and applies a state-dependent multiplier to trade off efficiently between them.
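In symbols, one plausible reading of the penalized objective (the precise surrogate definitions appear in the paper body) is

\(\max _{\theta }\ \mathbb {E}_s\left[ Q\big (s,\mu (s|\theta )\big )-\varLambda (s|\lambda )\,C\big (s,\mu (s|\theta )\big )\right] ,\quad \varLambda (s|\lambda )\ge 0,\)

where \(Q\) and \(C\) are the learned reward and cost critics, \(\mu (\cdot |\theta )\) is the policy, and the state-dependent multiplier \(\varLambda (s|\lambda )\) is trained in alternation to penalize states that violate the cost constraint.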
References
Achiam, J., Held, D., Tamar, A., Abbeel, P.: Constrained policy optimization. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 22–31. JMLR.org (2017)
Altman, E.: Constrained Markov decision processes with total cost criteria: Lagrangian approach and dual linear program. Math. Methods Oper. Res. 48(3), 387–417 (1998)
Altman, E.: Constrained Markov Decision Processes, vol. 7. CRC Press, Boca Raton (1999)
Bohez, S., Abdolmaleki, A., Neunert, M., Buchli, J., Heess, N., Hadsell, R.: Value constrained model-free continuous control. arXiv preprint arXiv:1902.04623 (2019)
Chow, Y., Nachum, O., Duenez-Guzman, E., Ghavamzadeh, M.: A Lyapunov-based approach to safe reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 8103–8112 (2018)
Chow, Y., Nachum, O., Faust, A., Ghavamzadeh, M., Duenez-Guzman, E.: Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031 (2019)
Dalal, G., Dvijotham, K., Vecerik, M., Hester, T., Paduraru, C., Tassa, Y.: Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757 (2018)
Gattami, A.: Reinforcement learning for multi-objective and constrained Markov decision processes. arXiv preprint arXiv:1901.08978 (2019)
Geibel, P., Wysotzki, F.: Risk-sensitive reinforcement learning applied to control under constraints. J. Artif. Intell. Res. 24, 81–108 (2005)
Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, pp. 1861–1870 (2018)
Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems, pp. 1008–1014 (2000)
Lee, J.D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M.I., Recht, B.: First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406 (2017)
Lillicrap, T.P., et al.: Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)
Liu, T., Zhou, R., Kalathil, D., Kumar, P., Tian, C.: Learning policies with zero or bounded constraint violation for constrained MDPs. Adv. Neural Inf. Process. Syst. 34, 17183–17193 (2021)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
Perkins, T.J., Barto, A.G.: Lyapunov design for safe reinforcement learning. J. Mach. Learn. Res. 3(December), 803–832 (2002)
Satija, H., Amortila, P., Pineau, J.: Constrained Markov decision processes via backward value functions. In: International Conference on Machine Learning, pp. 8502–8511. PMLR (2020)
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
Silver, D., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017)
Singh, S., Jaakkola, T., Littman, M.L., Szepesvári, C.: Convergence results for single-step on-policy reinforcement-learning algorithms. Mach. Learn. 38(3), 287–308 (2000)
Tessler, C., Mankowitz, D.J., Mannor, S.: Reward constrained policy optimization. arXiv preprint arXiv:1805.11074 (2018)
Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: Thirtieth AAAI Conference on Artificial Intelligence (2016)
Wachi, A., Sui, Y.: Safe reinforcement learning in constrained Markov decision processes. In: International Conference on Machine Learning, pp. 9797–9806. PMLR (2020)
Wiering, M., Van Otterlo, M.: Reinforcement learning. Adapt. Learn. Optim. 12, 3 (2012)
Appendix
Proof of Eq. (5)
Proof
For \(s\in \chi \),
where
\(\square \)
Proof of Theorem 1
Proof
This proof is based on Lemma 1 of [20], which is restated below for convenience.
Lemma 1 of [20]: Consider a stochastic process \((\alpha _t,\varDelta _t,F_t)\), \(t\ge 0\), where \(\alpha _t,\varDelta _t,F_t:X\rightarrow \mathbb {R}\), which satisfies the equation

\(\varDelta _{t+1}(x)=\left( 1-\alpha _t(x)\right) \varDelta _t(x)+\alpha _t(x)F_t(x),\quad x\in X,\ t\ge 0.\)
Let \(P_t\) be a sequence of increasing \(\sigma \)-fields such that \(\alpha _0, \varDelta _0\) are \(P_0\)-measurable and \(\alpha _t,\varDelta _t\) and \(F_{t-1}\) are \(P_t\)-measurable, \(t=1,2,\cdots \). Assume that the following hold:

1. the set of possible states X is fixed;
2. \(0\le \alpha _t(x)\le 1\), \(\sum _t\alpha _t(x)=\infty \), \(\sum _t\alpha _t^2(x)<\infty \) w.p.1;
3. \(\Vert \mathbb {E}\{F_t(\cdot )|P_t\}\Vert \le \gamma \Vert \varDelta _t\Vert +c_t\), where \(\gamma \in [0,1)\) and \(c_t\) converges to zero w.p.1;
4. \(\mathrm {Var}\{F_t(x)|P_t\}\le K(1+\Vert \varDelta _t\Vert )^2\), where K is some constant.

Then \(\varDelta _t\) converges to zero with probability one (w.p.1).
Within the scope of this paper, the MDP state space is fixed, satisfying condition 1 in Lemma 1 of [20], and condition 2 holds under a proper choice of learning rate. According to [21], even the commonly used constant learning rate can make algorithms converge in distribution.
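For instance (a standard example, not specific to this paper), the Robbins-Monro schedule \(\alpha _t=\frac{1}{t+1}\) satisfies condition 2, since

\(\sum _{t\ge 0}\frac{1}{t+1}=\infty \quad \text {while}\quad \sum _{t\ge 0}\frac{1}{(t+1)^2}=\frac{\pi ^2}{6}<\infty .\)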
We apply Lemma 1 of [20] with \(P_t=\{Q(\cdot ,\cdot |\omega _0),C(\cdot ,\cdot |\varOmega _0),s_0,a_0,r_1,s_1,\cdots ,s_t,a_t\}\). Following the update rule for optimizing Eq. (18) and Eq. (19) and using the current policy to produce the action \(a_{t+1}=\mu (s_{t+1}|\theta )\), we have
Define the penalized action-value function as
Using the definition of Eq. (22) under the setting of our proposed algorithm, we denote \(\varDelta _t=\hat{Q}(\cdot ,\cdot |\omega _t,\varOmega _t,\lambda )-\hat{Q}^{\star }(\cdot ,\cdot )\), the difference between the penalized action-value function and the optimal penalized action-value function. Then we have
where the third equality follows from substituting Eq. (20) and Eq. (21), and
Since the reward is bounded within the scope of this paper, the action-values are also bounded, so condition 4 in Lemma 1 of [20] holds. Following the proof of Theorem 2 in [20], we have \(\Vert \mathbb {E}\left[ F_t(s_t,a_t)|P_t\right] \Vert \le \gamma \Vert \varDelta _t\Vert \), which satisfies condition 3 in Lemma 1 of [20].
Finally, it can be concluded that \(\hat{Q}(\cdot ,\cdot |\omega _t,\varOmega _t,\lambda )\) converges to \(\hat{Q}^{\star }(\cdot ,\cdot )\) with probability 1. \(\square \)
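To make the construction in Theorem 1 concrete, the following is a minimal tabular sketch, assuming the penalized action value of Eq. (22) takes the form \(\hat{Q}=Q-\lambda C\) with a fixed scalar multiplier \(\lambda \); the environment interface returning a separate cost signal is a hypothetical stand-in, and function approximation and the state-dependent multiplier are omitted:

```python
import numpy as np

def penalized_q_learning(env, n_states, n_actions, lam=1.0,
                         gamma=0.99, episodes=500):
    """Tabular sketch: learn reward critic Q and cost critic C jointly,
    acting greedily with respect to the penalized value Q - lam * C."""
    Q = np.zeros((n_states, n_actions))        # reward action-value estimate
    C = np.zeros((n_states, n_actions))        # cost action-value estimate
    visits = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = int(np.argmax(Q[s] - lam * C[s]))    # greedy on penalized value
            s2, r, c, done = env.step(a)             # hypothetical cost-returning env
            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]               # satisfies condition 2 above
            a2 = int(np.argmax(Q[s2] - lam * C[s2]))
            Q[s, a] += alpha * (r + gamma * (not done) * Q[s2, a2] - Q[s, a])
            C[s, a] += alpha * (c + gamma * (not done) * C[s2, a2] - C[s, a])
            s = s2
    return Q, C
```

Here the estimate \(Q-\lambda C\) plays the role of \(\hat{Q}(\cdot ,\cdot |\omega _t,\varOmega _t,\lambda )\) in the proof.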
Proof of Theorem 2
Proof
We rewrite (12) and (13) in expected form as
We assume the worst case starting from the i-th iteration, in which the expected cost term is nonnegative, i.e., \(\mathbb {E}_s\left[ \delta _2(s|\theta _i')\right] \ge 0\). Then the expected weighted cost term satisfies \(\mathbb {E}_s\left[ \varLambda (s|\lambda _i)\delta _2(s|\theta _i')\right] \ge 0\), since the multiplier is nonnegative as well. By the maximization of (13), \(\varLambda (s|\lambda _i)\) is unbounded above with respect to \(\lambda _i\) when the expected weighted cost term is nonnegative, which means there exists \(\lambda _{i+1}\) such that \(J(\theta )\le 0\) for all \(\theta \). Here we make the simplifying assumption \(\lambda _{i+1}'=\lambda _{i+1}\), which can be realized by setting the soft update parameter \(\tau _{\lambda }\) to 1.
Under this circumstance, maximizing \(J(\theta _{i+1})\) requires \(\mathbb {E}_s\left[ \varLambda (s|\lambda _{i+1}')\delta _1(s|\theta _{i+1})\right] \le 0\). With the premise \(\theta _{i+1}'=\theta _{i+1}\), we conclude \(\max _s{\varLambda (s|\lambda _{i+2})}\,\mathbb {E}_s\left[ \delta _2(s|\theta _{i+1}')\right] \le \mathbb {E}_s\left[ \varLambda (s|\lambda _{i+2})\delta _2(s|\theta _{i+1}')\right] \le 0\), i.e., \(\mathbb {E}_s\left[ \delta _2(s|\theta _{i+1}')\right] \le 0\), which means the constraint requirement is recovered. \(\square \)
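The soft update of the multiplier parameters referred to above is standard Polyak averaging; a minimal PyTorch sketch (the function name and network handles are ours), where setting \(\tau _{\lambda }=1\) recovers the hard update \(\lambda _{i+1}'=\lambda _{i+1}\) assumed in the proof:

```python
import torch

def soft_update(target_net, online_net, tau):
    """Polyak averaging of parameters; tau = 1 gives the hard update
    lambda'_{i+1} = lambda_{i+1} assumed in the proof."""
    with torch.no_grad():
        for tp, op in zip(target_net.parameters(), online_net.parameters()):
            tp.mul_(1.0 - tau).add_(tau * op)
```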
Hyperparameters
Table 1 lists the common hyperparameters shared by all experiments and their respective settings.
Experiment Details
1.1 Continuous Maze
The continuous maze filled with obstacles is well suited to this evaluation. During training, the agent may travel through obstacles in order to reach the goal; however, by receiving cost signals from the environment as warnings, it can asymptotically learn to keep a safe distance from them. The continuous maze environment has a continuous state-action space and is shown in Fig. 6(a) and Fig. 6(b). At every step, the agent moves with a fixed step size in any direction to any reachable position in the maze, even traveling across the barriers represented by the gray grids. The wall is drawn as the dark solid line on the edge of the maze.
The agent's task is to move from the starting point to the goal, colored yellow, without hitting any barrier or wall. During the task, the agent receives a reward of \(-1000\) if it hits or passes through a barrier and \(-50\) if it hits the wall. A reward of 100 is given if the agent reaches the goal before the timeout. In the remaining blank areas, the reward is the negative distance from the agent to the goal. At every step where the received reward is nonpositive, the agent incurs a cost of \(\frac{1}{1+d}\), where d is the agent's minimum distance to any barrier or wall.
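For concreteness, the maze reward and cost described above could be computed as in the following sketch; the maze geometry helpers (hit_barrier, hit_wall, at_goal, distance, min_obstacle_distance) are hypothetical stand-ins:

```python
def maze_reward_cost(pos, goal, maze):
    """Reward and cost rule of the continuous maze described above."""
    if maze.hit_barrier(pos):                 # hit or crossed a gray barrier
        reward = -1000.0
    elif maze.hit_wall(pos):                  # hit the outer wall
        reward = -50.0
    elif maze.at_goal(pos, goal):             # reached the yellow goal
        reward = 100.0
    else:                                     # blank area: negative goal distance
        reward = -maze.distance(pos, goal)

    cost = 0.0
    if reward <= 0:                           # nonpositive reward: warning cost
        d = maze.min_obstacle_distance(pos)   # distance to nearest barrier/wall
        cost = 1.0 / (1.0 + d)
    return reward, cost
```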
1.2 Classical Control Environment
Among these, the cartpole environment is illustrated in Fig. 6(c). We add the absolute value of the cart's distance from the origin to the reward to encourage the cart to move farther rather than staying at the origin for stability. The experiment is set up as follows. At every step, as the cart moves left or right, a reward of 1 plus a value positively proportional to the cart's position and velocity is given if the pole's deviation angle is less than \(24^\circ \); otherwise, the cart position and pole angle are reset for the next episode. The goal of the experiment is to keep the environment from resetting until the timeout, which is set to 1000 steps.
Overall, the cost in each experiment is set proportional to the absolute value of the action, which determines the torque applied by the agent; accordingly, the influence of the action is excluded from the reward shaping. The reward and cost settings are listed in Table 2 (see the sketch below).
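A sketch of the shaped cartpole reward and the action-proportional cost; the coefficients k_x, k_v, k_a are hypothetical placeholders (the actual settings would be those of Table 2):

```python
import math

def cartpole_reward_cost(x, x_dot, theta, action, k_x=0.1, k_v=0.1, k_a=0.1):
    """Shaped reward: survival bonus plus position/velocity terms while the
    pole stays within 24 degrees; cost proportional to the action magnitude."""
    if abs(theta) < math.radians(24.0):
        reward = 1.0 + k_x * abs(x) + k_v * abs(x_dot)
        done = False
    else:
        reward = 0.0
        done = True          # cart position and pole angle reset for next episode
    cost = k_a * abs(action)     # torque cost, excluded from the reward shaping
    return reward, cost, done
```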
Network Architecture
We construct the critic network as a fully connected MLP with two hidden layers. The input is the concatenation of state and action, and the output is a scalar Q-value. The ReLU function activates the first hidden layer. The actor network is set up similarly to the critic network, except that its input is the state alone and its output is multiplied by the action supremum after a tanh nonlinearity. The multiplier network \(\varLambda \) is constructed like the actor network, except that the tanh nonlinearity is replaced by clipping \(\varLambda \) to \([0, \infty )\). The network architectures are shown in Fig. 7.
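A minimal PyTorch sketch of the three networks as described; the hidden width is an assumption, and Fig. 7 gives the actual architecture:

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q(s, a): two hidden layers, ReLU on the first, scalar output."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

class Actor(nn.Module):
    """mu(s): critic-like trunk, state-only input, tanh output scaled
    by the action supremum."""
    def __init__(self, state_dim, action_dim, action_sup, hidden=256):
        super().__init__()
        self.action_sup = action_sup
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s):
        return self.action_sup * torch.tanh(self.net(s))

class Multiplier(nn.Module):
    """Lambda(s): actor-like, but clipped to [0, inf) instead of tanh."""
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return torch.clamp(self.net(s), min=0.0)
```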