Trading Utility and Uncertainty: Applying the Value of Information to Resolve the Exploration–Exploitation Dilemma in Reinforcement Learning

  • Chapter
Handbook of Reinforcement Learning and Control

Part of the book series: Studies in Systems, Decision and Control (SSDC, volume 325)

Abstract

A fundamental problem in reinforcement learning is the exploration–exploitation dilemma: a search problem in which an agent must sufficiently investigate the possible action choices while exploiting those that work well in certain contexts. Few exploration mechanisms, however, provide expected performance guarantees for a given amount of search. Here, we show that this dilemma can be addressed, and the expected agent performance quantified, by optimizing Stratonovich’s value of information. The value of information is an information-theoretic criterion that specifies the greatest increase in rewards over the worst case, subject to a given amount of uncertainty. In the context of reinforcement learning, uncertainty is quantified by a constrained mutual dependence between random variables. When the mutual dependence between the random variables goes to zero, an agent tends to exploit its acquired knowledge about the environment; little to no improvement in policy performance is obtained in this case. As the mutual dependence increases, a greater amount of exploration is permitted and the policy can converge to the globally best action-selection strategy. Optimizing the value of information yields action-selection update strategies that, in the limit, are theoretically guaranteed to uncover the optimal policy for a given amount of mutual dependence. We show that, in a finite number of episodes, the value of information yields policies that outperform conventional exploration mechanisms in both single-state and multi-state, multi-action environment abstractions based on Markov decision processes.
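
The chapter body sits behind the access wall, but the mechanism the abstract describes can be illustrated compactly. In the authors' related work [6, 8], optimizing the value of information subject to a bound on the mutual dependence I(S; A) between states and actions leads to alternating, Blahut-Arimoto-style updates in which the policy takes the form p(a|s) ∝ p(a) exp(βQ(s, a)) and the action marginal p(a) is re-estimated under the state distribution; the multiplier β enforces the mutual-information constraint. The Python sketch below is a reconstruction under that assumption, not the chapter's own implementation: the function names, the inverse temperature beta, and the toy numbers are all illustrative.

```python
import numpy as np

def value_of_information_policy(q, p_s, beta, n_iters=200, tol=1e-10):
    """Alternating (Blahut-Arimoto-style) solution of
    max_{p(a|s)} E[Q(S, A)] - (1/beta) * I(S; A):
    the policy p(a|s) is proportional to p(a) * exp(beta * Q(s, a)),
    and the action marginal p(a) is re-estimated under p(s)."""
    _, n_actions = q.shape
    p_a = np.full(n_actions, 1.0 / n_actions)       # initial action marginal
    for _ in range(n_iters):
        logits = beta * q + np.log(p_a)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        policy = np.exp(logits)
        policy /= policy.sum(axis=1, keepdims=True)  # normalize each row
        p_a_new = p_s @ policy                       # marginal over actions
        if np.abs(p_a_new - p_a).max() < tol:
            p_a = p_a_new
            break
        p_a = p_a_new
    return policy, p_a

def mutual_information(policy, p_s, p_a):
    """I(S; A) in nats for the joint distribution p(s) * p(a|s)."""
    ratio = np.where(policy > 0.0, policy / p_a, 1.0)
    return float(np.sum(p_s[:, None] * policy * np.log(ratio)))

# Toy two-state, three-action problem (values are made up for illustration).
q = np.array([[1.0, 0.2, 0.0],
              [0.1, 0.9, 0.3]])
p_s = np.array([0.5, 0.5])
for beta in (0.1, 1.0, 10.0):
    policy, p_a = value_of_information_policy(q, p_s, beta)
    print(f"beta={beta:5.1f}  I(S;A)={mutual_information(policy, p_s, p_a):.4f}")
    print(policy.round(3))
```

Sweeping beta reproduces the trade-off described above: for small beta the solution collapses to the state-independent marginal p(a), so I(S; A) tends to zero and the agent merely exploits its prior, while for large beta the policy becomes state-specific and approaches the greedy action-selection strategy at the cost of a larger mutual dependence.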

José C. Príncipe is the director of the Computational NeuroEngineering Laboratory (CNEL) at the University of Florida.


References

  1. Stratonovich, R.L.: On value of information. Izvestiya of USSR Acad. Sci. Tech. Cybern. 5(1), 3–12 (1965)

  2. Stratonovich, R.L., Grishanin, B.A.: Value of information when an estimated random variable is hidden. Izvestiya of USSR Acad. Sci. Tech. Cybern. 6(1), 3–15 (1966)

  3. von Neumann, J., Morgenstern, O.: Theory of Games and Economic Behavior. Princeton University Press, Princeton (1953)

  4. Belavkin, R.V.: Asymmetry of risk and value of information. In: Vogiatzis, C., Walteros, J., Pardalos, P. (eds.) Dynamics of Information Systems, pp. 1–20. Springer, New York (2014)

  5. Sledge, I.J., Príncipe, J.C.: Balancing exploration and exploitation in reinforcement learning using a value of information criterion. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017, pp. 1–5 (2017). https://doi.org/10.1109/ICASSP.2017.7952670

  6. Sledge, I.J., Príncipe, J.C.: Quantifying the trade-offs between policy exploration and exploitation in Markov decision processes. Entropy 20(3), 155(1–34) (2018). https://dx.doi.org/10.3390/e20030155

  7. Sledge, I.J., Príncipe, J.C.: Analysis of agent expertise in Ms. Pac-Man using value-of-information-based policies. IEEE Trans. Games (2018). (Accepted, in press). https://dx.doi.org/10.1109/TG.2018.2808201

  8. Sledge, I.J., Emigh, M.S., Príncipe, J.C.: Guided policy exploration for Markov decision processes using an uncertainty-based value-of-information criterion. IEEE Trans. Neural Netw. Learn. Syst. 26(9), 1–19 (2018). https://dx.doi.org/10.1109/TNNLS.2018.2812709

  9. Sledge, I.J., Príncipe, J.C.: Partitioning relational matrices of similarities or dissimilarities using the value of information. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, Canada, 15–20 April 2018, pp. 1–5 (2018). https://arxiv.org/abs/1710.10381

  10. Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: a survey. J. Artif. Intell. Res. 4(1), 237–285 (1996). http://dx.doi.org/10.1613/jair.301

  11. Salganicoff, M., Ungar, L.H.: Active exploration and learning in real-valued spaces using multi-armed bandit allocation indices. In: Proceedings of the International Conference on Machine Learning (ICML), Tahoe City, CA, USA, 9–12 July 1995, pp. 480–487 (1995). http://dx.doi.org/10.1016/B978-1-55860-377-6.50066-9

  12. Auer, P.: Using confidence bounds for exploration-exploitation trade-offs. J. Mach. Learn. Res. 3(1), 397–422 (2002)

  13. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multi-armed bandit problem. SIAM J. Comput. 32(1), 48–77 (2002). http://dx.doi.org/10.1137/S0097539701398375

  14. Strehl, A.L., Mesterharm, C., Littman, M.L., Hirsh, H.: Experience-efficient learning in associative bandit problems. In: Proceedings of the International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, 25–29 June 2006, pp. 889–896 (2006). http://dx.doi.org/10.1145/1143844.1143956

  15. Madani, O., Lizotte, S.J., Greiner, R.: The budgeted multi-armed bandit problem. In: Proceedings of the Conference on Learning Theory (COLT), New Brunswick, NJ, USA, 12–15 July 2004, pp. 643–645 (2004). http://dx.doi.org/10.1007/978-3-540-27819-1

  16. Kleinberg, R.D.: Nearly tight bounds for the continuum-armed bandit problem. In: Saul, L.K., Weiss, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems (NIPS), pp. 697–704. MIT Press, Cambridge, MA, USA (2008)

  17. Wang, Y., Audibert, J., Munos, R.: Algorithms for infinitely many-armed bandits. In: Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems (NIPS), pp. 1729–1736. MIT Press, Cambridge, MA, USA (2008)

  18. Bubeck, S., Munos, R., Stoltz, G.: Pure exploration in finitely-armed and continuous-armed bandits. Theor. Comput. Sci. 412(19), 1876–1902 (2011). http://dx.doi.org/10.1016/j.tcs.2010.12.059

  19. Vermorel, J., Mohri, M.: Multi-armed bandit algorithms and empirical evaluation. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) Machine Learning: ECML, pp. 437–448. Springer-Verlag, New York City, NY, USA (2005)

  20. Even-Dar, E., Mannor, S., Mansour, Y.: PAC bounds for multi-armed bandit and Markov decision processes. In: Proceedings of the Conference on Learning Theory (COLT), Sydney, Australia, 8–10 July 2002, pp. 255–270 (2002). http://dx.doi.org/10.1007/3-540-45435-7

  21. Mannor, S., Tsitsiklis, J.N.: The sample complexity of exploration in the multi-armed bandit problem. J. Mach. Learn. Res. 5, 623–648 (2004)

  22. Langford, J., Zhang, T.: The epoch-greedy algorithm for multi-armed bandits. In: Platt, J.C., Koller, D., Singer, Y., Roweis, S.T. (eds.) Advances in Neural Information Processing Systems (NIPS), pp. 817–824. MIT Press, Cambridge, MA, USA (2008)

  23. Srinivas, N., Krause, A., Seeger, M., Kakade, S.M.: Gaussian process optimization in the bandit setting: no regret and experimental design. In: Proceedings of the International Conference on Machine Learning (ICML), Haifa, Israel, 21–24 June 2010, pp. 1015–1022 (2010)

  24. Krause, A., Ong, C.S.: Contextual Gaussian process bandit optimization. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems (NIPS), pp. 2447–2455. MIT Press, Cambridge, MA, USA (2011)

  25. Beygelzimer, A., Langford, J., Li, L., Reyzin, L., Schapire, R.E.: Contextual bandit algorithms with supervised learning guarantees. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Ft. Lauderdale, FL, USA, 20–22 April 2011, pp. 19–26 (2011)

  26. Cesa-Bianchi, N., Fischer, P.: Finite-time regret bounds for the multi-armed bandit problem. In: Proceedings of the International Conference on Machine Learning (ICML), Helsinki, Finland, pp. 100–108 (1998)

  27. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multi-armed bandit problem. Mach. Learn. 47(2), 235–256 (2002). http://dx.doi.org/10.1023/A:1013689704352

  28. McMahan, H.B., Streeter, M.: Tight bounds for multi-armed bandits with expert advice. In: Proceedings of the Conference on Learning Theory (COLT), Montreal, Canada, 18–21 June 2009, pp. 1–10 (2009)

  29. Bubeck, S., Cesa-Bianchi, N.: Regret analysis of stochastic and non-stochastic multi-armed bandit problems. Found. Trends Mach. Learn. 5(1), 1–122 (2012). http://dx.doi.org/10.1561/2200000024

  30. Cesa-Bianchi, N., Gentile, C., Neu, G., Lugosi, G.: Boltzmann exploration done right. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems (NIPS), pp. 6287–6296. MIT Press, Cambridge, MA, USA (2017)

  31. Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1), 4–22 (1985). http://dx.doi.org/10.1016/0196-8858(85)90002-8

  32. Auer, P., Ortner, R.: UCB revisited: improved regret bounds for the stochastic multi-armed bandit problem. Periodica Math. Hung. 61(1–2), 55–65 (2010). http://dx.doi.org/10.1007/s10998-010-3055-6

  33. Audibert, J.Y., Munos, R., Szepesvári, C.: Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theor. Comput. Sci. 410(19), 1876–1902 (2009). http://dx.doi.org/10.1016/j.tcs.2009.01.016

  34. Kuss, M., Rasmussen, C.E.: Gaussian processes in reinforcement learning. In: Thrun, S., Saul, L.K., Schölkopf, B. (eds.) Advances in Neural Information Processing Systems (NIPS), pp. 751–758. MIT Press, Cambridge, MA, USA (2003)

  35. Engel, Y., Mannor, S., Meir, R.: Bayes meets Bellman: the Gaussian process approach to temporal difference learning. In: Proceedings of the International Conference on Machine Learning (ICML), Washington, D.C., USA, 21–24 August 2003, pp. 154–161 (2003)

  36. Engel, Y., Mannor, S., Meir, R.: Reinforcement learning with Gaussian processes. In: Proceedings of the International Conference on Machine Learning (ICML), Bonn, Germany, 7–11 August 2005, pp. 201–208 (2005). http://dx.doi.org/10.1145/1102351.1102377

  37. Ghavamzadeh, M., Mannor, S., Pineau, J., Tamar, A.: Bayesian reinforcement learning: a survey. Found. Trends Mach. Learn. 8(5–6), 359–483 (2015). http://dx.doi.org/10.1561/2200000049

  38. Silver, E.A.: Markovian decision processes with uncertain transition probabilities and rewards. Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, MA, USA (1963)

  39. Martin, J.J.: Bayesian Decision Problems and Markov Chains. Wiley, New York City, NY, USA (1967)

  40. Cozzolino, J.M., Gonzalez-Zubieta, R., Miller, R.L.: Markovian decision processes with uncertain transition probabilities. Massachusetts Institute of Technology, Cambridge, MA, USA, Operations Research Center Technical Report 11 (1965)

  41. Satia, J.K.: Markovian decision processes with uncertain transition matrices and/or probabilistic observation of states. Ph.D. dissertation, Stanford University, Stanford, CA, USA (1963)

  42. Satia, J.K., Lave Jr., R.E.: Markovian decision processes with uncertain transition probabilities. Oper. Res. 21(3), 728–740 (1973). http://dx.doi.org/10.1287/opre.21.3.728

  43. Guez, A., Silver, D., Dayan, P.: Efficient Bayes-adaptive reinforcement learning using sample-based search. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems (NIPS), pp. 1025–1033. MIT Press, Cambridge, MA, USA (2012)

  44. Dearden, R., Friedman, N., Andre, D.: Model based Bayesian exploration. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), Stockholm, Sweden, 30 July–1 August 1999, pp. 150–159 (1999)

  45. Poupart, P., Vlassis, N., Hoey, J., Regan, K.: An analytic solution to discrete Bayesian reinforcement learning. In: Proceedings of the International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, 25–29 June 2006, pp. 697–704 (2006). http://dx.doi.org/10.1145/1143844.1143932

  46. Kolter, J.Z., Ng, A.Y.: Near-Bayesian exploration in polynomial time. In: Proceedings of the International Conference on Machine Learning (ICML), Montreal, Canada, 14–18 June 2009, pp. 513–520 (2009). http://dx.doi.org/10.1145/1553374.1553441

  47. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA (1998)

  48. Whitehead, S.D.: Complexity and cooperation in Q-learning. In: Proceedings of the International Conference on Machine Learning (ICML), Evanston, IL, USA, 20–25 June 1991, pp. 363–367 (1991). http://dx.doi.org/10.1016/B978-1-55860-200-7.50075-1

  49. Hernandez-Gardiol, N., Mahadevan, S.: Hierarchical memory-based reinforcement learning. In: Leen, T.K., Dietterich, T.G., Tresp, V. (eds.) Advances in Neural Information Processing Systems (NIPS), pp. 1047–1053. MIT Press, Cambridge, MA, USA (2000)

  50. Faulkner, R., Precup, D.: Dyna planning using a feature based generative model. In: Lee, H., Ranzato, M., Bengio, Y., Hinton, G.E., LeCun, Y., Ng, A.Y. (eds.) Advances in Neural Information Processing Systems (NIPS) Workshop, pp. 10–19. MIT Press, Cambridge, MA, USA (2010)

  51. Kulkarni, T.D., Narasimhan, K., Saeedi, A., Tenenbaum, J.B.: Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems (NIPS), pp. 1047–1053. MIT Press, Cambridge, MA, USA (2016)

  52. Singh, S.P., Jaakkola, T., Littman, M.L., Szepesvári, C.: Convergence results for single-step on-policy reinforcement-learning algorithms. Mach. Learn. 38(3), 287–308 (2000). http://dx.doi.org/10.1023/A:1007678930559

  53. Watkins, C.J.C.H.: Learning from delayed rewards. Ph.D. dissertation, King’s College, Cambridge University, Cambridge, UK (1989)

  54. Barto, A.G., Bradtke, S.J., Singh, S.P.: Learning to act using real-time dynamic programming. Artif. Intell. 72(1–2), 81–138 (1995). http://dx.doi.org/10.1016/0004-3702(94)00011-O

  55. Sutton, R.S.: Integrated architecture for learning, planning, and reacting based on approximating dynamic programming. In: Proceedings of the International Conference on Machine Learning (ICML), Austin, TX, USA, 21–23 June 1990, pp. 216–224 (1990). http://dx.doi.org/10.1016/B978-1-55860-141-3.50030-4

  56. Cybenko, G., Gray, R., Moizumi, K.: Q-learning: a tutorial and extensions. In: Ellacott, S.W., Mason, J.C., Anderson, I.J. (eds.) Mathematics of Neural Networks, pp. 24–33. Springer-Verlag, New York City (1997)

Author information

Correspondence to Isaac J. Sledge.

Copyright information

© 2021 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Sledge, I.J., Príncipe, J.C. (2021). Trading Utility and Uncertainty: Applying the Value of Information to Resolve the Exploration–Exploitation Dilemma in Reinforcement Learning. In: Vamvoudakis, K.G., Wan, Y., Lewis, F.L., Cansever, D. (eds) Handbook of Reinforcement Learning and Control. Studies in Systems, Decision and Control, vol 325. Springer, Cham. https://doi.org/10.1007/978-3-030-60990-0_19
