Abstract
A fundamental problem in reinforcement learning is the exploration–exploitation dilemma: a search problem that entails both sufficiently investigating the possible action choices and exploiting those that work well in certain contexts. Few exploration mechanisms, however, provide expected performance guarantees for a given amount of search. Here, we show that this dilemma can be addressed, and the expected agent performance quantified, by optimizing Stratonovich’s value of information. The value of information is an information-theoretic criterion that specifies the greatest increase in rewards, over the worst case, subject to a bounded amount of uncertainty. In the context of reinforcement learning, uncertainty is quantified by a constrained mutual dependence between random variables. When the mutual dependence between the random variables goes to zero, agents tend to exploit their acquired knowledge about the environment; little to no improvement in policy performance is obtained in this case. As the mutual dependence increases, a greater amount of exploration is permitted and the policy can converge to the globally best action-selection strategy. Optimizing the value of information yields action-selection update strategies that, in the limit, are theoretically guaranteed to uncover the optimal policy for a given amount of mutual dependence. We show that, in a finite number of episodes, the value of information yields policies that outperform conventional exploration mechanisms for both single-state and multi-state, multi-action environment abstractions based on Markov decision processes.
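The chapter develops the criterion formally; as a rough illustration of the idea sketched in the abstract, the value-of-information policy can be read as maximizing expected returns subject to a mutual-information constraint, \(\max_{p(a \mid s)} \mathbb{E}_{p(s)p(a \mid s)}[Q(s,a)]\) such that \(I(S;A) \le \epsilon\). The following minimal Python sketch solves this constrained problem with Blahut–Arimoto-style alternating updates, where an inverse-temperature parameter plays the role of the constraint's Lagrange multiplier. The function and variable names (`voi_policy`, `beta`, `p_s`) are illustrative, not taken from the chapter:

```python
import numpy as np

def voi_policy(Q, p_s, beta, n_iters=200, tol=1e-10):
    """Sketch of a value-of-information action-selection rule.

    Alternates Blahut-Arimoto-style updates: the conditional policy
    p(a|s) is a Boltzmann distribution re-weighted by the action
    marginal p(a), and p(a) is re-estimated from the conditionals.
    `beta` trades off the mutual-information constraint: beta -> 0
    yields a near-uniform (exploratory) policy, while large beta
    approaches the greedy (exploitative) policy.

    Q    : (n_states, n_actions) array of action-value estimates
    p_s  : (n_states,) state-visitation distribution
    """
    n_states, n_actions = Q.shape
    p_a = np.full(n_actions, 1.0 / n_actions)        # action marginal
    for _ in range(n_iters):
        # p(a|s) proportional to p(a) * exp(beta * Q(s,a))
        logits = beta * Q + np.log(p_a)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p_a_given_s = np.exp(logits)
        p_a_given_s /= p_a_given_s.sum(axis=1, keepdims=True)
        # p(a) = sum_s p(s) p(a|s)
        p_a_new = p_s @ p_a_given_s
        if np.max(np.abs(p_a_new - p_a)) < tol:
            p_a = p_a_new
            break
        p_a = p_a_new
    return p_a_given_s

# Example: two states, three actions
Q = np.array([[1.0, 0.5, 0.1],
              [0.2, 0.9, 0.4]])
p_s = np.array([0.6, 0.4])
print(voi_policy(Q, p_s, beta=5.0))   # near-greedy policy
print(voi_policy(Q, p_s, beta=0.1))   # near-uniform exploration
```

Sweeping `beta` from small to large values traces out the exploration–exploitation trade-off that the abstract describes: low mutual dependence between states and actions corresponds to broad exploration, high mutual dependence to exploitation of the value estimates.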
José C. Príncipe is the director of the Computational NeuroEngineering Laboratory (CNEL) at the University of Florida.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this chapter
Sledge, I.J., Príncipe, J.C. (2021). Trading Utility and Uncertainty: Applying the Value of Information to Resolve the Exploration–Exploitation Dilemma in Reinforcement Learning. In: Vamvoudakis, K.G., Wan, Y., Lewis, F.L., Cansever, D. (eds) Handbook of Reinforcement Learning and Control. Studies in Systems, Decision and Control, vol 325. Springer, Cham. https://doi.org/10.1007/978-3-030-60990-0_19
DOI: https://doi.org/10.1007/978-3-030-60990-0_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60989-4
Online ISBN: 978-3-030-60990-0
eBook Packages: Intelligent Technologies and Robotics (R0)