Abstract
A fundamental problem in reinforcement learning is the exploration–exploitation dilemma: a search problem that entails both sufficiently investigating the possible action choices and exploiting those that work well in certain contexts. Few exploration mechanisms, however, provide expected performance guarantees for a given amount of search. Here, we show that this dilemma can be addressed, and the expected agent performance quantified, by optimizing Stratonovich’s value of information. The value of information is an information-theoretic criterion that specifies the greatest increase in rewards, over the worst case, subject to a bounded amount of uncertainty. In the context of reinforcement learning, uncertainty is quantified by a constrained mutual dependence between random variables. When the mutual dependence between the random variables goes to zero, agents tend to exploit their acquired knowledge about the environment; little to no improvement in policy performance is obtained in this case. As the mutual dependence increases, a greater amount of exploration is permitted and the policy can converge to the globally best action-selection strategy. Optimizing the value of information yields action-selection update strategies that, in the limit, are theoretically guaranteed to uncover the optimal policy for a given amount of mutual dependence. We show that, in a finite number of episodes, the value of information yields policies that outperform conventional exploration mechanisms for both single-state and multi-state, multi-action environment abstractions based on Markov decision processes.
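The chapter develops the criterion formally; as a rough illustration of the idea sketched in the abstract, the value-of-information policy can be read as maximizing expected returns subject to a mutual-information constraint, \(\max_{p(a \mid s)} \mathbb{E}_{p(s)p(a \mid s)}[Q(s,a)]\) such that \(I(S;A) \le \epsilon\). The following minimal Python sketch solves this constrained problem with Blahut–Arimoto-style alternating updates, where an inverse-temperature parameter plays the role of the constraint's Lagrange multiplier. The function and variable names (`voi_policy`, `beta`, `p_s`) are illustrative, not taken from the chapter:

```python
import numpy as np

def voi_policy(Q, p_s, beta, n_iters=200, tol=1e-10):
    """Sketch of a value-of-information action-selection rule.

    Alternates Blahut-Arimoto-style updates: the conditional policy
    p(a|s) is a Boltzmann distribution re-weighted by the action
    marginal p(a), and p(a) is re-estimated from the conditionals.
    `beta` trades off the mutual-information constraint: beta -> 0
    yields a near-uniform (exploratory) policy, while large beta
    approaches the greedy (exploitative) policy.

    Q    : (n_states, n_actions) array of action-value estimates
    p_s  : (n_states,) state-visitation distribution
    """
    n_states, n_actions = Q.shape
    p_a = np.full(n_actions, 1.0 / n_actions)        # action marginal
    for _ in range(n_iters):
        # p(a|s) proportional to p(a) * exp(beta * Q(s,a))
        logits = beta * Q + np.log(p_a)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p_a_given_s = np.exp(logits)
        p_a_given_s /= p_a_given_s.sum(axis=1, keepdims=True)
        # p(a) = sum_s p(s) p(a|s)
        p_a_new = p_s @ p_a_given_s
        if np.max(np.abs(p_a_new - p_a)) < tol:
            p_a = p_a_new
            break
        p_a = p_a_new
    return p_a_given_s

# Example: two states, three actions
Q = np.array([[1.0, 0.5, 0.1],
              [0.2, 0.9, 0.4]])
p_s = np.array([0.6, 0.4])
print(voi_policy(Q, p_s, beta=5.0))   # near-greedy policy
print(voi_policy(Q, p_s, beta=0.1))   # near-uniform exploration
```

Sweeping `beta` from small to large values traces out the exploration–exploitation trade-off that the abstract describes: low mutual dependence between states and actions corresponds to broad exploration, high mutual dependence to exploitation of the value estimates.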
José C. Príncipe is the director of the Computational NeuroEngineering Laboratory (CNEL) at the University of Florida.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this chapter
Sledge, I.J., Príncipe, J.C. (2021). Trading Utility and Uncertainty: Applying the Value of Information to Resolve the Exploration–Exploitation Dilemma in Reinforcement Learning. In: Vamvoudakis, K.G., Wan, Y., Lewis, F.L., Cansever, D. (eds) Handbook of Reinforcement Learning and Control. Studies in Systems, Decision and Control, vol 325. Springer, Cham. https://doi.org/10.1007/978-3-030-60990-0_19
DOI: https://doi.org/10.1007/978-3-030-60990-0_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60989-4
Online ISBN: 978-3-030-60990-0
eBook Packages: Intelligent Technologies and Robotics (R0)