Abstract
Partially observable Markov decision processes (POMDPs) are a useful model for decision-making under partial observability and stochastic actions. Partially Observable Monte-Carlo Planning (POMCP) is an online algorithm that decides on the next action to perform using a Monte-Carlo tree search approach, based on the UCT algorithm for fully observable Markov decision processes. POMCP develops an action-observation tree and, at the leaves, uses a rollout policy to provide a value estimate for each leaf. As such, POMCP depends heavily on the rollout policy to compute good estimates, and hence to identify good actions. Thus, many practitioners who use POMCP must create strong, domain-specific heuristics. In this paper, we model POMDPs as stochastic contingent planning problems. This allows us to leverage domain-independent heuristics developed in the planning community. We suggest two heuristics: the first is based on the well-known \(h_{add}\) heuristic from classical planning, and the second is computed in belief space, taking the value of information into account.
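The abstract describes the two ingredients of POMCP that the paper's heuristics target: UCB1-based action selection in the search tree (inherited from UCT) and rollout-based value estimates at the leaves. The following is a minimal, hypothetical sketch of those two ingredients only; the `step` simulator interface and the `(value, visits)` node bookkeeping are illustrative assumptions, not the authors' implementation.

```python
import math


def ucb1_select(children, c=1.0):
    """Pick the action maximizing the UCB1 score used by UCT/POMCP.

    `children` maps each action to a (value_estimate, visit_count) pair
    (a hypothetical bookkeeping shape).
    """
    total = sum(n for _, n in children.values())

    def score(item):
        q, n = item[1]
        if n == 0:
            return float("inf")  # try unvisited actions first
        return q + c * math.sqrt(math.log(total) / n)

    return max(children.items(), key=score)[0]


def rollout_value(state, step, policy, gamma=0.95, depth=20):
    """Estimate a leaf's value by simulating a rollout policy.

    `step(state, action) -> (next_state, reward)` is an assumed
    black-box simulator interface, as in POMCP's generative model.
    """
    value, discount = 0.0, 1.0
    for _ in range(depth):
        action = policy(state)
        state, reward = step(state, action)
        value += discount * reward
        discount *= gamma
    return value
```

The quality of `policy` here is exactly the dependence the paper addresses: a random rollout policy yields noisy leaf estimates, which is why domain-specific (or, as proposed here, domain-independent planning) heuristics matter.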
Funding
Open access funding provided by Ben-Gurion University.
Contributions
O.B.: ideas, implementation, experiments. G.S.: ideas, implementation, writing.
Ethics declarations
Competing Interests
The authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Blumenthal, O., Shani, G. Domain independent heuristics for online stochastic contingent planning. Ann Math Artif Intell (2024). https://doi.org/10.1007/s10472-024-09947-5