
High-accuracy model-based reinforcement learning, a survey

Artificial Intelligence Review


Deep reinforcement learning has shown remarkable success in the past few years. Highly complex sequential decision-making problems from game playing and robotics have been solved with deep model-free methods. Unfortunately, the sample complexity of model-free methods is often high. Model-based reinforcement learning, in contrast, can reduce the number of environment samples by learning an explicit internal model of the environment dynamics. However, achieving good model accuracy in high-dimensional problems is challenging. In recent years, a diverse landscape of model-based methods has been introduced to improve model accuracy, using approaches such as probabilistic inference, model-predictive control, latent models, and end-to-end learning and planning. Some of these methods succeed in achieving high accuracy at low sample complexity in typical benchmark applications. In this paper, we survey these methods; we explain how they work and what their strengths and weaknesses are. We conclude with a research agenda for future work to make the methods more robust and applicable to a wider range of applications.





  1. A dataset is static. In reinforcement learning the choice of actions may depend on the rewards that are returned during the learning process, giving rise to a dynamic, potentially unstable, learning process.




We thank the members of the Leiden Reinforcement Learning Group, and especially Thomas Moerland, Mike Huisman, Matthias Müller-Brockhausen, Zhao Yang, Erman Acar, and Andreas Sauter for many discussions and insights. We thank the anonymous reviewers for their valuable insights, which improved the paper greatly.

Author information



Corresponding author

Correspondence to Aske Plaat.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Plaat, A., Kosters, W. & Preuss, M. High-accuracy model-based reinforcement learning, a survey. Artif Intell Rev 56, 9541–9573 (2023).


