Abstract
This paper gives a compact, self-contained tutorial survey of reinforcement learning, a tool that is increasingly finding application in the development of intelligent dynamic systems. Research on reinforcement learning during the past decade has led to the development of a variety of useful algorithms. This paper surveys the literature and presents the algorithms in a cohesive framework.
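Among the algorithms the survey covers is Watkins' Q-learning (Watkins 1989; Watkins & Dayan 1992). As a flavour of the methods surveyed, the sketch below implements tabular Q-learning on a toy chain-world; the environment, parameter values, and function names are illustrative assumptions, not taken from the paper.

```python
import random

def q_learning_chain(n_states=4, episodes=500, alpha=0.5,
                     gamma=0.9, epsilon=0.3, seed=0):
    """Tabular Q-learning (Watkins 1989) on a toy chain MDP.

    States 0..n_states-1; moving 'right' from the last non-terminal
    state reaches the goal (reward 1); every other step has reward 0.
    The chain environment and all parameters are illustrative only.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    # Q[s][a] for non-terminal states; action 0 is 'left', 1 is 'right'.
    Q = [[0.0, 0.0] for _ in range(goal)]

    def step(s, a):
        s2 = s + 1 if a == 1 else max(s - 1, 0)
        return s2, (1.0 if s2 == goal else 0.0)

    for _ in range(episodes):
        s = 0
        for _ in range(10 * n_states):          # cap episode length
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2, r = step(s, a)
            # one-step Q-learning (temporal-difference) update
            target = r if s2 == goal else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            if s2 == goal:
                break
            s = s2
    return Q

Q = q_learning_chain()
# Greedy values should approach gamma**(goal - 1 - s) along the chain.
print([round(max(q), 3) for q in Q])
```

With the discount factor gamma = 0.9, the learned state values decay geometrically with distance from the goal, and the greedy policy moves right in every state.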
References
Albus J S 1975 A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Trans. ASME, J. Dyn. Syst., Meas., Contr. 97: 220–227
Anderson C W 1986 Learning and problem solving with multilayer connectionist systems. Ph D thesis, University of Massachusetts, Amherst, MA
Anderson C W 1987 Strategy learning with multilayer connectionist representations. Technical report TR87-509.3, GTE Laboratories, Inc., Waltham, MA
Anderson C W 1989 Learning to control an inverted pendulum using neural networks. IEEE Contr. Syst. Mag.: 31–37
Anderson C W 1993 Q-learning with hidden-unit restarting. In Advances in neural information processing systems 5 (eds) S J Hanson, J D Cowan, C L Giles (San Mateo, CA: Morgan Kaufmann) pp 81–88
Bachrach J R 1991 A connectionist learning control architecture for navigation. In Advances in neural information processing systems 3 (eds) R P Lippmann, J E Moody, D S Touretzky (San Mateo, CA: Morgan Kaufmann) pp 457–463
Bachrach J R 1992 Connectionist modeling and control of finite state environments. Ph D thesis, University of Massachusetts, Amherst, MA
Barto A G 1985 Learning by statistical cooperation of self-interested neuron-like computing elements. Human Neurobiology 4: 229–256
Barto A G 1986 Game-theoretic cooperativity in networks of self-interested units. In Neural networks for computing (ed.) J S Denker (New York: American Institute of Physics) pp 41–46
Barto A G 1992 Reinforcement learning and adaptive critic methods. In Handbook of intelligent control: Neural, fuzzy, and adaptive approaches (eds) D A White, D A Sofge (New York: Van Nostrand Reinhold) pp 469–491
Barto A G, Anandan P 1985 Pattern recognizing stochastic learning automata. IEEE Trans. Syst., Man Cybern. 15: 360–375
Barto A G, Anandan P, Anderson C W 1985 Cooperativity in networks of pattern recognizing stochastic learning automata. In Proceedings of the Fourth Yale Workshop on Applications of Adaptive Systems Theory, New Haven, CT
Barto A G, Bradtke S J, Singh S P 1992 Real-time learning and control using asynchronous dynamic programming. Technical Report COINS 91-57, University of Massachusetts, Amherst, MA
Barto A G, Jordan M I 1987 Gradient following without back-propagation in layered networks. In Proceedings of the IEEE First Annual Conference on Neural Networks (eds) M Caudill, C Butler (New York: IEEE) pp II629–II636
Barto A G, Singh S P 1991 On the computational economics of reinforcement learning. In Connectionist Models: Proceedings of the 1990 Summer School (eds) D S Touretzky, J L Elman, T J Sejnowski, G E Hinton (San Mateo, CA: Morgan Kaufmann) pp 35–44
Barto A G, Sutton R S 1981 Landmark learning: an illustration of associative search. Biol. Cybern. 42: 1–8
Barto A G, Sutton R S 1982 Simulation of anticipatory responses in classical conditioning by a neuron-like adaptive element. Behav. Brain Res. 4: 221–235
Barto A G, Sutton R S, Anderson C W 1983 Neuronlike elements that can solve difficult learning control problems. IEEE Trans. Syst., Man Cybern. 13: 835–846
Barto A G, Sutton R S, Brouwer P S 1981 Associative search network: a reinforcement learning associative memory. Biol. Cybern. 40: 201–211
Barto A G, Sutton R S, Watkins C J C H 1990 Learning and sequential decision making. In Learning and computational neuroscience: Foundations of adaptive networks (eds) M Gabriel, J Moore (Cambridge, MA: MIT Press) pp 539–602
Bellman R E, Dreyfus S E 1962 Applied dynamic programming. RAND Corporation
Bertsekas D P 1982 Distributed dynamic programming. IEEE Trans. Autom. Contr. 27: 610–616
Bertsekas D P 1989 Dynamic programming: Deterministic and stochastic models (Englewood Cliffs, NJ: Prentice-Hall)
Bertsekas D P, Tsitsiklis J N 1989 Parallel and distributed computation: Numerical methods (Englewood Cliffs, NJ: Prentice-Hall)
Boyen J 1992 Modular neural networks for learning context-dependent game strategies. Masters thesis, Computer Speech and Language Processing, University of Cambridge, Cambridge, England
Bradtke S J 1993 Reinforcement learning applied to linear quadratic regulation. In Advances in neural information processing systems 5 (eds) S J Hanson, J D Cowan, C L Giles (San Mateo, CA: Morgan Kaufmann) pp 295–302
Bradtke S J 1994 Incremental dynamic programming for on-line adaptive optimal control. CMPSCI Technical Report 94-62, University of Massachusetts, Amherst, MA
Brody C 1992 Fast learning with predictive forward models. In Advances in neural information processing systems 4 (eds) J E Moody, S J Hanson, R P Lippmann (San Mateo, CA: Morgan Kaufmann) pp 563–570
Brooks R A 1986 Achieving artificial intelligence through building robots. Technical Report, AI Memo 899, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge, MA
Buckland K M, Lawrence P D 1994 Transition point dynamic programming. In Advances in neural information processing systems 6 (eds) J D Cowan, G Tesauro, J Alspector (San Francisco, CA: Morgan Kaufmann) pp 639–646
Chapman D 1991 Vision, instruction, and action (Cambridge, MA: MIT Press)
Chapman D, Kaelbling L P 1991 Input generalization in delayed reinforcement learning: an algorithm and performance comparisons. In Proceedings of the 1991 International Joint Conference on Artificial Intelligence
Chrisman L 1992 Planning for closed-loop execution using partially observable Markovian decision processes. In Proceedings of AAAI
Dayan P 1991a Navigating through temporal difference. In Advances in neural information processing systems 3 (eds) R P Lippmann, J E Moody, D S Touretzky (San Mateo, CA: Morgan Kaufmann) pp 464–470
Dayan P 1991b Reinforcing connectionism: Learning the statistical way. Ph D thesis, University of Edinburgh, Edinburgh
Dayan P, Hinton G E 1993 Feudal reinforcement learning. In Advances in neural information processing systems 5 (eds) S J Hanson, J D Cowan, C L Giles (San Mateo, CA: Morgan Kaufmann) pp 271–278
Dayan P, Sejnowski T J 1993 TD(λ) converges with probability 1. Technical Report, CNL, The Salk Institute, San Diego, CA
Dean T L, Wellman M P 1991 Planning and control (San Mateo, CA: Morgan Kaufmann)
Gullapalli V 1990 A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks 3: 671–692
Gullapalli V 1992a Reinforcement learning and its application to control. Technical Report COINS 92-10, Ph D thesis, University of Massachusetts, Amherst, MA
Gullapalli V 1992b A comparison of supervised and reinforcement learning methods on a reinforcement learning task. In Proceedings of the 1991 IEEE Symposium on Intelligent Control, Arlington, VA
Gullapalli V, Barto A G 1994 Convergence of indirect adaptive asynchronous value iteration algorithms. In Advances in neural information processing systems 6 (eds) J D Cowan, G Tesauro, J Alspector (San Francisco, CA: Morgan Kaufmann) pp 695–702
Gullapalli V, Franklin J A, Benbrahim H 1994 Acquiring robot skills via reinforcement learning. IEEE Contr. Syst. Mag.: 13–24
Hertz J A, Krogh A S, Palmer R G 1991 Introduction to the theory of neural computation (Reading, MA: Addison-Wesley)
Jaakkola T, Jordan M I, Singh S P 1994 Convergence of stochastic iterative dynamic programming algorithms. In Advances in neural information processing systems 6 (eds) J D Cowan, G Tesauro, J Alspector (San Francisco, CA: Morgan Kaufmann) pp 703–710
Jacobs R A, Jordan M I, Nowlan S J, Hinton G E 1991 Adaptive mixtures of local experts. Neural Comput. 3: 79–87
Jordan M I, Jacobs R A 1990 Learning to control an unstable system with forward modeling. In Advances in neural information processing systems 2 (ed.) D S Touretzky (San Mateo, CA: Morgan Kaufmann)
Jordan M I, Rumelhart D E 1990 Forward models: Supervised learning with a distal teacher. Center for Cognitive Science, Occasional Paper #40, Massachusetts Institute of Technology, Cambridge, MA
Kaelbling L P 1990 Learning in embedded systems. Technical Report TR-90-04, Ph D thesis, Department of Computer Science, Stanford University, Stanford, CA
Kaelbling L P 1991 Learning in embedded systems (Cambridge, MA: MIT Press)
Klopf A H 1972 Brain function and adaptive systems — a heterostatic theory. Technical report AFCRL-72-0164, Air Force Cambridge Research Laboratories, Bedford, MA
Klopf A H 1982 The hedonistic neuron: A theory of memory, learning and intelligence (Washington, DC: Hemisphere)
Klopf A H 1988 A neuronal model of classical conditioning. Psychobiology 16: 85–125
Korf R E 1990 Real-time heuristic search. Artif. Intell. 42: 189–211
Kumar P R 1985 A survey of some results in stochastic adaptive control. SIAM J. Contr. Optim. 23: 329–380
Lin C S, Kim H 1991 CMAC-based adaptive critic self-learning control. IEEE Trans. Neural Networks 2: 530–533
Lin L J 1991a Programming robots using reinforcement learning and teaching. In Proceedings of the Ninth National Conference on Artificial Intelligence (Cambridge, MA: MIT Press) pp 781–786
Lin L J 1991b Self-improvement based on reinforcement learning, planning and teaching. In Machine Learning: Proceedings of the Eighth International Workshop (eds) L A Birnbaum, G C Collins (San Mateo, CA: Morgan Kaufmann) pp 323–327
Lin L J 1991c Self-improving reactive agents: Case studies of reinforcement learning frameworks. In From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behaviour (Cambridge, MA: MIT Press) pp 297–305
Lin L J 1992 Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learning 8: 293–321
Lin L J 1993 Hierarchical learning of robot skills by reinforcement. In Proceedings of the 1993 International Conference on Neural Networks pp 181–186
Linden A 1993 On discontinuous Q-functions in reinforcement learning. Available via anonymous ftp from archive.cis.ohio-state.edu in directory /pub/neuroprose
Maes P, Brooks R 1990 Learning to coordinate behaviour. In Proceedings of the Eighth National Conference on Artificial Intelligence (San Mateo, CA: Morgan Kaufmann) pp 796–802
Magriel P 1976 Backgammon (New York: Times Books)
Mahadevan S, Connell J 1991 Scaling reinforcement learning to robotics by exploiting the subsumption architecture. In Machine Learning: Proceedings of the Eighth International Workshop (eds) L A Birnbaum, G C Collins (San Mateo, CA: Morgan Kaufmann) pp 328–332
Mazzoni P, Andersen R A, Jordan M I 1990 A_R-P learning applied to a network model of cortical area 7a. In Proceedings of the 1990 International Joint Conference on Neural Networks 2: 373–379
Michie D, Chambers R A 1968 BOXES: An experiment in adaptive control. In Machine intelligence 2 (eds) E Dale, D Michie (New York: Oliver and Boyd) pp 137–152
Millan J D R, Torras C 1992 A reinforcement connectionist approach to robot path finding in non-maze-like environments. Mach. Learning 8: 363–395
Minsky M L 1954 Theory of neural-analog reinforcement systems and application to the brain-model problem. Ph D thesis, Princeton University, Princeton, NJ
Minsky M L 1961 Steps towards artificial intelligence. Proceedings of the Institute of Radio Engineers 49: 8–30 (Reprinted 1963 in Computers and thought (eds) E A Feigenbaum, J Feldman (New York: McGraw-Hill) pp 406–450)
Moore A W 1990 Efficient memory-based learning for robot control. Ph D thesis, University of Cambridge, Cambridge, UK
Moore A W 1991 Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state-spaces. In Machine Learning: Proceedings of the Eighth International Workshop (eds) L A Birnbaum, G C Collins (San Mateo, CA: Morgan Kaufmann) pp 328–332
Moore A W, Atkeson C G 1993 Memory-based reinforcement learning: Efficient computation with prioritized sweeping. In Advances in neural information processing systems 5 (eds) S J Hanson, J D Cowan, C L Giles (San Mateo, CA: Morgan Kaufmann) pp 263–270
Mozer M C, Bachrach J 1990a Discovering the structure of a reactive environment by exploration. In Advances in neural information processing systems 2 (ed.) D S Touretzky (San Mateo, CA: Morgan Kaufmann) pp 439–446
Mozer M C, Bachrach J 1990b Discovering the structure of a reactive environment by exploration. Neural Comput. 2: 447–457
Narendra K, Thathachar M A L 1989 Learning automata: An introduction (Englewood Cliffs, NJ: Prentice-Hall)
Peng J, Williams R J 1993 Efficient learning and planning within the Dyna framework. In Proceedings of the 1993 International Joint Conference on Neural Networks pp 168–174
Platt J C 1991 Learning by combining memorization and gradient descent. In Advances in neural information processing systems 3 (eds) R P Lippmann, J E Moody, D S Touretzky (San Mateo, CA: Morgan Kaufmann) pp 714–720
Rosen B E, Goodwin J M, Vidal J J 1991 Adaptive range coding. In Advances in neural information processing systems 3 (eds) R P Lippmann, J E Moody, D S Touretzky (San Mateo, CA: Morgan Kaufmann) pp 486–494
Ross S 1983 Introduction to stochastic dynamic programming (New York: Academic Press)
Rummery G A, Niranjan M 1994 On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, University of Cambridge, Cambridge, England
Samuel A L 1959 Some studies in machine learning using the game of checkers. IBM J. Res. Dev.: 210–229 (Reprinted 1963 in Computers and thought (eds) E A Feigenbaum, J Feldman (New York: McGraw-Hill))
Samuel A L 1967 Some studies in machine learning using the game of checkers. II. Recent progress. IBM J. Res. Dev.: 601–617
Selfridge O, Sutton R S, Barto A G 1985 Training and tracking in robotics. In Proceedings of the Ninth International Joint Conference on Artificial Intelligence (ed.) A Joshi (San Mateo, CA: Morgan Kaufmann) pp 670–672
Shepansky J F, Macy S A 1987 Teaching artificial neural systems to drive: Manual training techniques for autonomous systems. In Proceedings of the First Annual International Conference on Neural Networks, San Diego, CA
Singh S P 1991 Transfer of learning across compositions of sequential tasks. In Machine Learning: Proceedings of the Eighth International Workshop (eds) L A Birnbaum, G C Collins (San Mateo, CA: Morgan Kaufmann) pp 348–352
Singh S P 1992a Reinforcement learning with a hierarchy of abstract models. In Proceedings of the Tenth National Conference on Artificial Intelligence, San Jose, CA
Singh S P 1992b On the efficient learning of multiple sequential tasks. In Advances in neural information processing systems 4 (eds) J E Moody, S J Hanson, R P Lippmann (San Mateo, CA: Morgan Kaufmann) pp 251–258
Singh S P 1992c Scaling reinforcement learning algorithms by learning variable temporal resolution models. In Proceedings of the Ninth International Machine Learning Conference
Singh S P 1992d Transfer of learning by composing solutions of elemental sequential tasks. Mach. Learning 8: 323–339
Singh S P, Barto A G, Grupen R, Connolly C 1994 Robust reinforcement learning in motion planning. In Advances in neural information processing systems 6 (eds) J D Cowan, G Tesauro, J Alspector (San Francisco, CA: Morgan Kaufmann) pp 655–662
Singh S P, Yee R C 1993 An upper bound on the loss from approximate optimal-value functions. Technical Report, University of Massachusetts, Amherst, MA
Sutton R S 1984 Temporal credit assignment in reinforcement learning. Ph D thesis, University of Massachusetts, Amherst, MA
Sutton R S 1988 Learning to predict by the methods of temporal differences. Mach. Learning 3: 9–44
Sutton R S 1990 Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning (San Mateo, CA: Morgan Kaufmann) pp 216–224
Sutton R S 1991a Planning by incremental dynamic programming. In Machine Learning: Proceedings of the Eighth International Workshop (eds) L A Birnbaum, G C Collins (San Mateo, CA: Morgan Kaufmann) pp 353–357
Sutton R S 1991b Integrated modeling and control based on reinforcement learning and dynamic programming. In Advances in neural information processing systems 3 (eds) R P Lippmann, J E Moody, D S Touretzky (San Mateo, CA: Morgan Kaufmann) pp 471–478
Sutton R S, Barto A G 1981 Toward a modern theory of adaptive networks: Expectation and prediction. Psychol. Rev. 88: 135–170
Sutton R S, Barto A G 1987 A temporal-difference model of classical conditioning. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society (Hillsdale, NJ: Erlbaum)
Sutton R S, Barto A G 1990 Time-derivative models of Pavlovian reinforcement. In Learning and computational neuroscience: Foundations of adaptive networks (eds) M Gabriel, J Moore (Cambridge, MA: MIT Press) pp 497–537
Sutton R S, Singh S P 1994 On step-size and bias in TD-learning. In Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, Yale University, pp 91–96
Sutton R S, Barto A G, Williams R J 1991 Reinforcement learning is direct adaptive optimal control. In Proceedings of the American Control Conference, Boston, MA, pp 2143–2146
Tan M 1991 Learning a cost-sensitive internal representation for reinforcement learning. In Machine Learning: Proceedings of the Eighth International Workshop (eds) L A Birnbaum, G C Collins (San Mateo, CA: Morgan Kaufmann) pp 358–362
Tesauro G J 1992 Practical issues in temporal difference learning. Mach. Learning 8: 257–278
Tham C K, Prager R W 1994 A modular Q-learning architecture for manipulator task decomposition. In Machine Learning: Proceedings of the Eleventh International Conference (eds) W W Cohen, H Hirsh (Princeton, NJ: Morgan Kaufmann) (Available via gopher from Dept. of Eng., University of Cambridge, Cambridge, England)
Thrun S B 1992 Efficient exploration in reinforcement learning. Technical report CMU-CS-92-102, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
Thrun S B 1993 Exploration and model building in mobile robot domains. In Proceedings of the 1993 International Conference on Neural Networks
Thrun S B, Muller K 1992 Active exploration in dynamic environments. In Advances in neural information processing systems 4 (eds) J E Moody, S J Hanson, R P Lippmann (San Mateo, CA: Morgan Kaufmann)
Thrun S B, Schwartz A 1993 Issues in using function approximation for reinforcement learning. In Proceedings of the Fourth Connectionist Models Summer School (Hillsdale, NJ: Lawrence Erlbaum)
Tsitsiklis J N 1993 Asynchronous stochastic approximation and Q-learning. Technical Report LIDS-P-2172, Laboratory for Information and Decision Systems, MIT, Cambridge, MA
Utgoff P E, Clouse J A 1991 Two kinds of training information for evaluation function learning. In Proceedings of the Ninth National Conference on Artificial Intelligence (San Mateo, CA: Morgan Kaufmann) pp 596–600
Watkins C J C H 1989 Learning from delayed rewards. Ph D thesis, Cambridge University, Cambridge, England
Watkins C J C H, Dayan P 1992 Technical note: Q-learning. Mach. Learning 8: 279–292
Werbos P J 1987 Building and understanding adaptive systems: a statistical/numerical approach to factory automation and brain research. IEEE Trans. Syst., Man Cybern.
Werbos P J 1988 Generalization of backpropagation with application to a recurrent gas market model. Neural Networks 1: 339–356
Werbos P J 1989 Neural networks for control and system identification. In Proceedings of the 28th Conference on Decision and Control, Tampa, FL, pp 260–265
Werbos P J 1990a Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks 3: 179–189
Werbos P J 1990b A menu of designs for reinforcement learning over time. In Neural networks for control (eds) W T Miller, R S Sutton, P J Werbos (Cambridge, MA: MIT Press) pp 67–95
Werbos P J 1992 Approximate dynamic programming for real-time control and neural modeling. In Handbook of intelligent control: Neural, fuzzy, and adaptive approaches (eds) D A White, D A Sofge (New York: Van Nostrand Reinhold) pp 493–525
Whitehead S D 1991a A complexity analysis of cooperative mechanisms in reinforcement learning. In Proceedings of the Ninth National Conference on Artificial Intelligence (Cambridge, MA: MIT Press) pp 607–613
Whitehead S D 1991b Complexity and cooperation in Q-learning. In Machine Learning: Proceedings of the Eighth International Workshop (eds) L A Birnbaum, G C Collins (San Mateo, CA: Morgan Kaufmann) pp 363–367
Whitehead S D, Ballard D H 1990 Active perception and reinforcement learning. Neural Comput. 2: 409–419
Williams R J 1986 Reinforcement learning in connectionist networks: a mathematical analysis. Technical report ICS 8605, Institute for Cognitive Science, University of California at San Diego, La Jolla, CA
Williams R J 1987 Reinforcement-learning connectionist systems. Technical report NU-CCS-87-3, College of Computer Science, Northeastern University, Boston, MA
Williams R J, Baird L C III 1990 A mathematical analysis of actor-critic architectures for learning optimal controls through incremental dynamic programming. In Proceedings of the Sixth Yale Workshop on Adaptive and Learning Systems, New Haven, CT, pp 96–101
Sathiya Keerthi, S., Ravindran, B. A tutorial survey of reinforcement learning. Sadhana 19, 851–889 (1994). https://doi.org/10.1007/BF02743935