Abstract
Reinforcement learning (RL) concerns the problem of a learning agent interacting with its environment to achieve a goal. Instead of being given examples of desired behavior, the learning agent must discover by trial and error how to behave in order to get the most reward. The environment is a Markov decision process (MDP) with state set \( \mathcal{S} \) and action set \( \mathcal{A} \). The agent and the environment interact in a sequence of discrete time steps, \( t = 0, 1, 2, \ldots \). The state and action at one time step, \( s_t \in \mathcal{S} \) and \( a_t \in \mathcal{A} \), determine the probability distribution of the state at the next time step, \( s_{t+1} \in \mathcal{S} \), and, jointly, the distribution of the next reward, \( r_{t+1} \in \Re \). The agent's objective is to choose each action \( a_t \) to maximize the subsequent return:

\[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots, \]
where the discount rate, \( 0 \le \gamma \le 1 \), determines the relative weighting of immediate and delayed rewards. In some environments, the interaction consists of a sequence of episodes, each starting in a given state and ending upon arrival in a terminal state, which terminates the series above. In other cases the interaction is continual, without interruption, and the sum may have an infinite number of terms (in which case we usually assume \( \gamma < 1 \)). Infinite-horizon cases with \( \gamma = 1 \) are also possible, though less common (e.g., see Mahadevan, 1996).
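To make the interaction protocol concrete, the following is a minimal sketch of the agent-environment loop and the discounted return it accumulates. The two-state MDP, the random policy, and the value \( \gamma = 0.9 \) are hypothetical illustrations, not taken from the paper.

```python
import random

GAMMA = 0.9  # discount rate, 0 <= gamma <= 1 (hypothetical choice)

def env_step(state, action):
    """Sample (next_state, reward) from a toy two-state MDP.

    The dynamics here are invented for illustration: 'stay' keeps the
    state and yields no reward; 'move' flips the state and yields
    reward 1.0 with probability 0.5.
    """
    if action == "stay":
        return state, 0.0
    next_state = 1 - state
    reward = 1.0 if random.random() < 0.5 else 0.0
    return next_state, reward

def policy(state):
    """A deliberately naive random policy over the two actions."""
    return random.choice(["stay", "move"])

def run_episode(horizon=100):
    """Interact for `horizon` steps and accumulate the return
    r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...
    """
    state = 0
    ret, discount = 0.0, 1.0
    for t in range(horizon):
        action = policy(state)
        state, reward = env_step(state, action)
        ret += discount * reward
        discount *= GAMMA
    return ret

if __name__ == "__main__":
    print("sample discounted return:", run_episode())
```

With \( \gamma < 1 \), the geometric weighting keeps the return bounded even as the horizon grows, which is why the continual (non-episodic) case in the abstract usually assumes a discount rate strictly below one.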
References
Baird, L.C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann, San Francisco.
Bertsekas, D.P., and Tsitsiklis, J.N. (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.
Crites, R.H., and Barto, A.G. (1996). Improving elevator performance using reinforcement learning. In Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pp. 1017–1023. MIT Press, Cambridge, MA.
Kearns, M., Mansour, Y., Ng, A.Y. (in prep.). Sparse sampling methods for planning and learning in large and partially observable Markov decision processes.
Loch, J., and Singh, S. (1998). Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes. In Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco.
Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22:159–196.
Moore, A.W., and Atkeson, C.G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13:103–130.
Singh, S.P. (1993). Learning to Solve Markovian Decision Processes. Ph.D. thesis, University of Massachusetts, Amherst. Appeared as CMPSCI Technical Report 93-77.
Singh, S.P., and Bertsekas, D. (1997). Reinforcement learning for dynamic channel allocation in cellular telephone systems. In Advances in Neural Information Processing Systems: Proceedings of the 1996 Conference, pp. 974–980. MIT Press, Cambridge, MA.
Singh, S., and Dayan, P. (1998). Analytical mean squared error curves for temporal difference learning. Machine Learning.
Singh, S.P., and Sutton, R.S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123–158.
Sutton, R.S. (1984). Temporal Credit Assignment in Reinforcement Learning. Ph.D. thesis, University of Massachusetts, Amherst.
Sutton, R.S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pp. 1038–1044. MIT Press, Cambridge, MA.
Sutton, R.S., and Barto, A.G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
Tesauro, G.J. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38:58–68.
Tsitsiklis, J.N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690.
Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.
Watkins, C.J.C.H., and Dayan, P. (1992). Q-learning. Machine Learning, 8:279–292.
© 1999 Springer-Verlag Berlin Heidelberg
Cite this paper
Sutton, R.S. (1999). Open Theoretical Questions in Reinforcement Learning. In: Fischer, P., Simon, H.U. (eds) Computational Learning Theory. EuroCOLT 1999. Lecture Notes in Computer Science, vol 1572. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49097-3_2
DOI: https://doi.org/10.1007/3-540-49097-3_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65701-9
Online ISBN: 978-3-540-49097-5