1 Introduction

Planning under uncertainty is a common challenge in many robotic applications, such as visual tracking [1] and autonomous navigation [2, 3]. In general, it is difficult for a robot to operate accurately because of the uncertainty that originates from sensor noise, imperfect control, and changing environments.

Recently, the partially observable Markov decision process (POMDP) [4] has provided a principled mathematical framework for robot decision-making and planning under uncertainty. As a result, POMDPs have been applied to a wide range of challenging tasks, e.g., autonomous driving [5], grasping [6], robot rescue [7], and intelligent tutoring systems [8]. However, POMDP planning is computationally intractable for two main reasons. The size of the belief space grows exponentially with the number of states, which is known as the “curse of dimensionality”. Moreover, the number of action-observation histories grows exponentially with the planning horizon, which is the “curse of history” [9]. Together, these two curses cause an exponential growth in computational complexity, especially for large-scale POMDP planning.

In addition to the computational difficulties, POMDP planning is challenging because of the uncertainty itself. Various attempts have been made to deal with this uncertainty effectively, including Bayesian networks [10], convolutional neural networks [11], and α-vector policy graphs [12]. However, Bayesian networks and convolutional neural networks both require training their parameters, and the trained parameters are not guaranteed to be optimal. Meanwhile, when the number of α-vectors is large, the policy graph grows exponentially and becomes difficult to interpret. Kocsis et al. [13] introduced Monte Carlo tree search (MCTS), a search framework that reduces uncertainty through repeated simulation. Silver et al. [14] then introduced partially observable Monte Carlo planning (POMCP), which handles real-time uncertainty in large tasks by combining MCTS with the partially observable upper confidence tree (PO-UCT) algorithm [15]. However, this approach still faces challenges, such as being misguided by the upper confidence bound (UCB) heuristic of the UCT algorithm and converging in an overly greedy manner.

To further improve the performance of planning under uncertainty, Somani et al. [16] constructed a sparse belief tree through a series of searches. In the belief tree, a new child node is expanded based on the node’s upper bound, which is estimated by performing multiple simulations with the particles of the current node. However, two difficulties arise in the belief tree search stage. The first is that the current optimal action branch may be selected wrongly with high probability because the information of the node is not considered sufficiently. A reasonable representation of the child node’s information helps the exploration of the belief tree: the information of a node depends not only on its upper bound but also on its lower bound. To construct the optimal belief tree, the combination of the upper and lower bounds is therefore used as the exploration term, representing the node’s information more comprehensively and improving the efficiency of the search. The second difficulty is that the uncertainty of a node fluctuates slightly because its initial upper and lower bounds are prone to change. To relieve these fluctuations, several researchers have considered adjusting the exploration term. Bougie et al. [17] encouraged high-level exploration by introducing hyperparameters that balance fast and slow rewards in the exploration term. Chen et al. [18] improved value function approximation by decreasing the exploration discount factor. However, in these methods the discount factor is a constant that does not account for the depth of the node. Therefore, a depth function is introduced here as a dynamic discount factor to adjust the exploration term.

In this paper, the combination of the upper and lower bounds of a node is used to address the incomplete representation of the node’s information. In addition, a depth function is introduced as a discount to relieve the fluctuation of the uncertainty. The resulting discounted upper and lower bounds are incorporated into the DESPOT algorithm to construct the belief tree, as shown in Fig. 1. The proposed online planning method, named DESPOT-DULB, improves the search strategy of the standard DESPOT by introducing the combined upper and lower bounds (ULB) into the forward exploration for action selection and the discount factor into the forward exploration for both action and observation selection. As a benefit, the uncertainty is reduced and favorable performance, such as higher rewards and higher efficiency, can be attained. The main contributions of this paper are as follows:

  1. In the belief tree expansion stage, the information of a belief node is represented by combining the node’s upper and lower bounds, and the belief tree is expanded based on this information. Compared with using only the upper bound or only the lower bound, the proposed representation makes the search for the optimal action more efficient.

  2. Because the uncertainty of a belief node fluctuates slightly, a depth function is proposed as a discount factor to adjust the gap between the upper and lower bounds of the node and ensure a reasonable reduction in uncertainty. As the uncertainty decreases, the performance of the proposed method improves further.

  3. The proposed search criteria come with theoretical guarantees, and the convergence of the reconstructed belief tree is demonstrated. Meanwhile, the experimental results show that DESPOT-DULB yields performance improvements on the tasks of interest.

Fig. 1

Online POMDP planning performs a forward search on a tree. [A, B] and [C, D] denote the sums of the upper bounds and lower bounds of all sibling child nodes after executing the corresponding action. λ denotes a discount factor. DESPOT-DULB combines the upper bound (A or C) and the lower bound (B or D) to represent a node’s information comprehensively when constructing the belief tree

The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 provides the background on the POMDP model. Section 4 describes the online POMDP algorithm based on the discounted upper and lower bounds, together with the related theoretical analysis of DESPOT-DULB; Appendices 1 and 2 provide the detailed proofs. Section 5 presents the experimental results on the performance of DESPOT-DULB compared with standard POMDP benchmarks. Section 6 concludes the paper.

2 Related work

Generally, two kinds of approximate POMDP planning methods are adopted in current research: offline planning [19,20,21] and online planning [14, 16, 22]. These methods have been applied in many fields, but challenges remain; for example, efficiency is low and performance is poor for large-scale spaces. Although offline planning has made great progress [20, 21], it is difficult to scale to large POMDPs as the number of states increases. Online planning scales to large POMDPs, but it is slower than offline planning because a search must be executed at each step. Online and offline planning have also been combined to further improve planning performance [23].

In previous work [24], the upper bound was usually preferred over the lower bound as a heuristic for exploration. Many algorithms, such as HSVI and SARSOP, handle POMDP tasks effectively by relying on the upper bound as a heuristic. Meanwhile, the state-of-the-art online algorithms DESPOT [16] and POMCP [14] have been widely used in robotics, e.g., vision planning [25] and contact manipulation [26]. Both POMCP and DESPOT adopt the idea of the upper confidence bound (UCB) to search for the optimal action when constructing the search tree. POMCP uses a single particle to perform multiple simulations that estimate the information of a leaf node and guide the search for the optimal action. POMCP++ [27] further improves performance by using a set of particles rather than a single particle to obtain more accurate information. Hierarchical POMCP [28] solves large POMDP problems by exploiting hierarchical models. However, the search in these approaches is easily misguided and overly greedy. In contrast, DESPOT [29] demonstrates strong performance on large POMDP problems by using the initialized upper bound of the leaf node as a heuristic to search for optimal actions. Nevertheless, several difficulties still degrade the performance of the DESPOT algorithm. First, incomplete node information affects the quality of the constructed belief tree. Second, the fluctuation of node uncertainty affects the convergence efficiency of the constructed belief tree.

To improve the performance of the DESPOT algorithm, researchers have explored several directions. DESPOT-α [30] changes the search method of the belief tree based on the α-vector to deal with particle divergence. IS-DESPOT [31] introduces importance sampling to improve the performance of DESPOT under certain conditions; however, the sampling only helps when dealing with important low-probability events. Hyp-DESPOT [32] further reduces the planning time by integrating the CPU and GPU: it generates a parallel DESPOT tree by using multiple CPU cores to traverse independent paths and a GPU to execute parallel Monte Carlo simulations at the leaf nodes of the search tree. However, all of the above approaches construct the belief tree without considering node information sufficiently, which may result in fluctuations of the uncertainty. Meanwhile, since searching based on the lower bound is conducive to obtaining the optimal policy, an appealing idea is to switch between the upper and lower bounds as the heuristic. LB-DESPOT [33] designs the action heuristic by probabilistically selecting either the upper bound or the lower bound of the node. Nevertheless, the switching conditions of the heuristic are difficult to set, and the convergence efficiency of the belief tree does not improve.

In short, this paper proposes DESPOT-DULB to improve search performance and convergence efficiency. DESPOT-DULB inherits the following components of DESPOT [29]:

  1. the empirical value Vπ(b) of a policy π, i.e., the average total discounted reward obtained by simulating the policy under each scenario,

  2. the regularized objective function, which aims to overcome overfitting,

  3. the regularized weighted discounted utility (RWDU) function ν(b), which is used to choose the optimal policy.

Accordingly, DESPOT-DULB uses the same method as DESPOT to initialize the upper and lower bounds of a node. The main difference is that, in the forward search stage, DESPOT-DULB uses the discounted upper and lower bounds as the heuristic to search for the optimal action rather than searching based only on the upper bound of the node. Meanwhile, a depth function is used to reduce the uncertainty by adjusting the gap between the upper bound and the lower bound of the node. The theoretical analysis and simulation results verify the favorable performance of the proposed method.

3 Background

Uncertainty originates from noisy sensors, changing environments, and imperfect control, and it poses significant challenges for motion planning. To plan effectively under these uncertainties, the POMDP model is generally introduced; it reduces the uncertainty by updating beliefs according to the received information. Formally, a POMDP is defined as a tuple (S, A, Z, T, O, R), where S is a set of states, A is a set of agent actions, and Z is a set of observations. R(s, a) is the immediate reward obtained on executing action a in state s. When action a is executed in state s, the probability of reaching the next state s′ is given by the state transition function T(s, a, s′) = p(s′ | s, a). In addition, the probability of observing z in state s′, reached by performing action a, is given by the observation function O(s′, a, z) = p(z | s′, a).

A POMDP agent generally cannot observe the true state because of the randomness and unpredictability of the environment. However, the agent continuously receives observations that provide partial information about the state. Thus, the agent maintains a belief, which is a probability distribution over S. The agent starts with an initial belief b0. At time t, the agent updates the belief according to Bayes’ rule, incorporating information from the action at taken and the resulting observation zt:

$$ {b}_t\left({s}^{\prime}\right)=\eta O\left({s}^{\prime },{a}_t,{z}_t\right)\sum \limits_{s\in S}T\left(s,{a}_t,{s}^{\prime}\right){b}_{t-1}(s) $$
(1)

where η is a normalizing constant. The belief bt = τ(bt−1, at, zt) = τ(τ(bt−2, at−1, zt−1), at, zt) = ⋯ = τ(⋯τ(τ(b0, a1, z1), a2, z2), ⋯, at, zt) is a sufficient statistic that contains all the information from the history of actions and observations (a1, z1, a2, z2, ⋯, at, zt).
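
As an illustration, the Bayes update of Eq. (1) can be written as a short routine. This is a minimal sketch assuming a tabular (dictionary-based) belief and callable transition and observation models; it is not code from any particular POMDP library.

```python
def belief_update(belief, action, observation, T, O, states):
    """Bayes update of Eq. (1): b_t(s') = eta * O(s', a, z) * sum_s T(s, a, s') * b_{t-1}(s).

    belief : dict mapping each state to its probability
    T      : callable T(s, a, s_next) returning a transition probability
    O      : callable O(s_next, a, z) returning an observation probability
    """
    unnormalized = {}
    for s_next in states:
        prior = sum(T(s, action, s_next) * belief[s] for s in states)
        unnormalized[s_next] = O(s_next, action, observation) * prior
    eta = sum(unnormalized.values())  # normalizing constant
    if eta == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: p / eta for s, p in unnormalized.items()}
```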

A policy π : B → A is a mapping from the belief space B to the action space A; it prescribes an action π(b) ∈ A at each belief b ∈ B. The ultimate goal of POMDP planning is to choose a policy π that maximizes the value function Vπ(b), i.e., the expected total discounted reward:

$$ {V}_{\pi }(b)=E\left(\sum \limits_{t=0}^{\infty }{\gamma}^tR\left({s}_t,\pi \left({b}_t\right)\right)|{b}_0=b\right) $$
(2)

where b0 is the initial belief. The constant γ ∈ [0, 1) is a discount factor, expressing preferences for immediate rewards over future ones.

In online POMDP planning, the agent starts with an initial belief and then repeats the following process. At each time step, (1) the agent searches for an optimal action a at the current belief b; (2) the agent executes a and receives a new observation z; (3) the agent updates the belief using Eq. (1). To search for an optimal action a, a valid approach is to construct a belief tree (Fig. 1), treating the current belief b0 as the belief at the root node of the tree. The agent performs a forward search on the tree for a policy π that maximizes the value Vπ(b0) at node b0 and sets a = π(b0). Each node of the tree represents a belief; each node branches into |A| action edges, and each action branches into |Z| observation edges. If a node and its child node are represented by beliefs b and b′, respectively, then b′ = τ(b, a, z) for some a ∈ A and z ∈ Z.
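
The search-act-observe-update cycle described above can be summarized as the following loop; `env`, `plan_action`, and `belief_update_fn` are hypothetical placeholders for the simulator, the tree-search planner, and the update of Eq. (1), respectively.

```python
def run_online_planning(env, plan_action, belief_update_fn, b0, max_steps=100):
    """Generic online POMDP loop: search at the current belief, act, observe, update."""
    belief = b0
    total_reward = 0.0
    for _ in range(max_steps):
        action = plan_action(belief)                  # (1) forward search on a tree rooted at `belief`
        observation, reward, done = env.step(action)  # (2) execute the action and observe
        total_reward += reward
        belief = belief_update_fn(belief, action, observation)  # (3) Bayes update, Eq. (1)
        if done:
            break
    return total_reward
```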

To find a near-optimal policy, the tree is truncated at a maximum depth D, and the agent searches for the optimal policy on the truncated tree. For each leaf node, an estimated lower bound on its optimal value is obtained by simulating a default policy, which is usually a random or heuristic policy. At an internal node b, Bellman’s principle of optimality gives the maximum value:

$$ {V}^{\ast }(b)=\underset{a\in A}{\max}\left\{\sum \limits_{s\in S}b(s)R\left(s,a\right)+\gamma \sum \limits_{z\in Z}p\left(z|b,a\right){V}^{\ast}\left(\tau \left(b,a,z\right)\right)\right\} $$
(3)

which takes the maximum over all action branches and the expected value over all observation branches. The algorithm traverses the belief tree from the leaf nodes to the root node and recursively calculates the maximum value of each node using Eq. (3). Then, the agent executes the best action at the root node b0.

4 DESPOT-DULB

In this section, the combination of the initial upper and lower bounds of a node is used to represent the node’s information effectively during action selection. To further ensure a reasonable reduction in uncertainty, a depth function is proposed for the forward search in both the action and observation selection stages. Then, the upper and lower bounds of the nodes on the search path are adjusted slightly during the backup phase. The specific description of the proposed method is as follows.

Algorithm 1 DESPOT-DULB (pseudocode)

4.1 Online planning

Algorithm 1 shows the overall framework and pseudocode of the DESPOT-DULB algorithm. In particular, the BUILD DESPOT-DULB function provides a high-level sketch of the belief tree construction. The specific process is as follows. 1) The root node b0 is initialized by sampling K random scenarios (line 13). 2) The upper and lower bounds of the root node b0 are initialized (lines 14–15). 3) The algorithm conducts a series of explorations to expand the belief tree D and to reduce the gap between the upper bound μ(b0) and the lower bound l(b0) at the root node b0 of D. 4) In each exploration, the optimal action is chosen using Eq. (4) and the observation using Eq. (6) to expand the belief tree (line 18). 5) The algorithm traces the path back to the root and performs a backup of the upper and lower bounds at each node along the way using Bellman’s principle (line 19). 6) The explorations continue until the gap between the bounds μ(b0) and l(b0) at the root node reaches the target value ε0, i.e., until μ(b0) − l(b0) < ε0 is satisfied (line 17).

A reasonable belief tree D is constructed by combining the upper and lower bounds of a node in Step 4) to improve the action selection. Meanwhile, the depth function is introduced in Step 4) to adjust the gap between the upper and lower bounds of the node, which ameliorates both the action and the observation selection. The upper and lower bounds of the nodes on the path are then slightly adjusted during Step 5). The forward search and backup are repeated until a terminal condition is met, such as the gap at the root node reaching the target value or the planning time running out.
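
A minimal sketch of the construction loop in Steps 1)–6) is given below. The helper names (`sample_scenarios`, `explore`, `backup`, and so on) are hypothetical stand-ins for the corresponding routines of Algorithm 1, not the authors’ actual implementation.

```python
import time

def build_despot_dulb(b0_belief, K, epsilon0, time_budget, helpers):
    """Anytime construction following Steps 1)-6): sample K scenarios at the root,
    initialize its bounds, then alternate forward exploration and backup until the
    root gap mu(b0) - l(b0) falls below epsilon0 or the time budget is exhausted."""
    root = helpers.make_root(b0_belief, helpers.sample_scenarios(b0_belief, K))  # Step 1)
    root.mu, root.l = helpers.initial_bounds(root)                               # Step 2)
    start = time.time()
    while root.mu - root.l > epsilon0 and time.time() - start < time_budget:     # Steps 3), 6)
        leaf = helpers.explore(root)   # Step 4): follow a* (Eq. 4) and z* (Eq. 6) downward
        helpers.backup(leaf, root)     # Step 5): Bellman backup of the bounds along the path
    return helpers.best_action(root)   # execute the action with the maximum lower bound
```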

  1) Forward exploration: To construct the belief tree, the optimal action must be searched for within a finite time. In current research, two kinds of approaches are generally adopted for action selection. One is dynamic programming, which requires constructing a complete belief tree before looking for the optimal action. The other is a forward search algorithm, which avoids constructing the complete belief tree in advance. For large-scale POMDPs, constructing the complete belief tree is impractical. To scale up to large POMDPs, the belief tree is therefore constructed incrementally under the guidance of a heuristic. During the heuristic search, an upper bound μ(b) and a lower bound l(b) on the optimal RWDU are maintained at each node b of D, that is, l(b) ≤ ν(b) ≤ μ(b). An upper bound U(b) and a lower bound L0(b) on the empirical value are computed so that \( {L}_0(b)\le {\hat{V}}^{\ast }(b)\le U(b) \). In particular, \( {L}_0(b)={\hat{V}}_{\pi_0}(b) \) when performing a default policy π0 at node b. Let ε(b) = μ(b) − l(b) denote the gap between the upper and lower RWDU bounds at node b. The goal of each forward search is to reduce the gap ε(b0) at the root node; the ultimate goal of forward exploration is to find an action sequence that makes this gap converge to zero.

To search for the optimal action branch, the upper and lower bounds of a node are both used to represent the node’s information. Although combining the upper and lower bounds of the node can make the gap converge to a small value, the gap still fluctuates slightly. A depth function β is therefore introduced as a discount to further improve the search. Starting from the root node b0, along each node b of the search path, the optimal action branch is selected according to the discounted upper and lower bound information (μ(b′) + ω ⋅ l(b′))/β:

$$ {a}^{\ast }=\underset{a\in A}{\mathrm{argmax}}\,d\left(b,a\right)=\underset{a\in A}{\mathrm{argmax}}\left\{\rho \left(b,a\right)+\left(\sum \limits_{z\in {Z}_{b,a}}\mu \left({b}^{\prime}\right)+\omega \sum \limits_{z\in {Z}_{b,a}}l\left({b}^{\prime}\right)\right)/\beta \right\} $$
(4)

where b′ = τ(b, a, z) is the child node of b obtained by following action branch a and observation branch z at b, and 0 ≤ ω ≤ 1 is a proportion factor.

Because the gap fluctuates slightly, a depth function β is introduced in the forward search. β is a discount factor determined by the depth of the node and is defined as follows:

$$ \beta ={\kappa}^{\varDelta (b)} $$
(5)

where κ > 1 is a constant. Δ(b) is the depth of node b.
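
A minimal sketch of the action selection in Eqs. (4)–(5) follows, assuming each node stores its depth and its RWDU bounds `mu` and `l`, and that the regularized immediate reward ρ(b, a) and the children b′ = τ(b, a, z) are provided by hypothetical helpers `rho` and `children`.

```python
def depth_discount(depth, kappa):
    """Eq. (5): beta = kappa ** Delta(b), with kappa > 1."""
    return kappa ** depth

def select_action(node, actions, rho, children, omega, kappa):
    """Eq. (4): a* = argmax_a { rho(b, a) + (sum_z mu(b') + omega * sum_z l(b')) / beta }."""
    beta = depth_discount(node.depth, kappa)
    best_action, best_score = None, float("-inf")
    for a in actions:
        kids = children(node, a)              # child nodes b' = tau(b, a, z) for z in Z_{b,a}
        upper_sum = sum(c.mu for c in kids)
        lower_sum = sum(c.l for c in kids)
        score = rho(node, a) + (upper_sum + omega * lower_sum) / beta
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```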

After executing action a, the observation branch z is chosen by maximizing the excess uncertainty E(b′) of the child node b′. Because the updated upper and lower bounds change, the gap of the node may exhibit slight fluctuations; the depth function β is introduced into E(b′) to ensure a reasonable reduction in uncertainty.

$$ {z}^{\ast }=\underset{z\in {Z}_{b,a}}{\mathrm{argmax}}\,E\left({b}^{\prime}\right)=\underset{z\in {Z}_{b,a}}{\mathrm{argmax}}\left\{\beta \varepsilon \left({b}^{\prime}\right)-\frac{\mid {\varPhi}_{b^{\prime }}\mid }{K}\xi \varepsilon \left({b}_0\right)\right\} $$
(6)

where b0 is the root node and ξ is a constant. Intuitively, the excess uncertainty E(b′) measures the difference between a multiple of the current gap at b′ and the “expected” gap at b′.
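
The observation selection of Eq. (6) can be sketched analogously. Here the depth discount is evaluated at the child’s depth, which is one plausible reading of the definition, and `child.scenarios` stands for the scenario set Φ_{b′}; both are assumptions of this sketch.

```python
def select_observation(node, action, children, root_gap, K, xi, kappa):
    """Eq. (6): among the children b' = tau(b, a, z), pick the one maximizing the
    discounted excess uncertainty  beta * eps(b') - (|Phi_b'| / K) * xi * eps(b0)."""
    best_child, best_excess = None, float("-inf")
    for child in children(node, action):
        beta = kappa ** child.depth          # depth discount, evaluated at the child (assumption)
        gap = child.mu - child.l             # eps(b') = mu(b') - l(b')
        excess = beta * gap - (len(child.scenarios) / K) * xi * root_gap
        if excess > best_excess:
            best_child, best_excess = child, excess
    return best_child, best_excess
```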

A leaf node b is expanded by creating a child node b′ for each action branch a ∈ A and each observation encountered under a scenario ϕ ∈ Φb. For each new child node b′, the bounds μ0(b′), l0(b′), U0(b′) and L0(b′) are initialized. The RWDU bounds μ0(b′) and l0(b′) can be expressed in terms of the empirical value bounds U0(b′) and L0(b′), respectively; accurate empirical value bounds therefore yield accurate RWDU bounds, which help represent the node’s information and reduce the uncertainty when expanding the belief tree. Applying the default policy π0 at node b′ and using the definition of the RWDU function, we have

$$ {l}_0\left({b}^{\prime}\right)={\nu}_{\pi_0}\left({b}^{\prime}\right)=\frac{\mid {\varPhi}_{b^{\prime }}\mid }{K}{\gamma}^{\varDelta \left({b}^{\prime}\right)}{L}_0\left({b}^{\prime}\right) $$
(7)
$$ {\mu}_0\left({b}^{\prime}\right)=\max \left\{{l}_0\left({b}^{\prime}\right),\frac{\mid {\varPhi}_{b^{\prime }}\mid }{K}{\gamma}^{\varDelta \left({b}^{\prime}\right)}{U}_0\left({b}^{\prime}\right)-\lambda \right\} $$
(8)

The initial empirical upper bound U0 can be constructed by several methods, such as the uniform bound and the hindsight optimization bound. The simplest choice is the uninformed bound

$$ {U}_0(b)=\frac{R_{\mathrm{max}}}{1-\gamma } $$
(9)

This bound is slack. Hindsight optimization (HO) [34] provides a principled method to construct an upper bound; however, HO may be expensive to compute when the state space is large. Alternatively, an approximate hindsight optimization bound [29] is calculated by assuming that the states are fully observed, converting the POMDP into the corresponding MDP, and solving for its optimal value function VMDP:

$$ {U}_0(b)=\frac{1}{\mid {\varPhi}_b\mid}\sum \limits_{\phi \in {\varPhi}_b}{V}_{MDP}\left({s}_{\phi}\right) $$
(10)

In addition to the upper bound, the initial lower bound L0(b) at node b is obtained from a default policy π0 by simulating π0 for a finite number of steps under each scenario and averaging the total discounted reward. A default policy is usually a random policy or a fixed-action policy. Constructing appropriate initial upper and lower bounds effectively improves the performance of the proposed method.
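
For illustration, the initial bounds of Eqs. (9)–(10) and a default-policy lower bound L0(b) can be sketched as follows; `V_MDP`, `simulate`, and the scenario objects are hypothetical interfaces assumed only for this sketch.

```python
def uninformed_upper_bound(R_max, gamma):
    """Eq. (9): U_0(b) = R_max / (1 - gamma)."""
    return R_max / (1.0 - gamma)

def mdp_upper_bound(scenarios, V_MDP):
    """Eq. (10): average fully observed MDP value over the sampled scenarios."""
    return sum(V_MDP(phi.state) for phi in scenarios) / len(scenarios)

def default_policy_lower_bound(scenarios, simulate, default_policy, gamma, steps):
    """L_0(b): average total discounted reward of the default policy over the scenarios."""
    total = 0.0
    for phi in scenarios:
        state, value, discount = phi.state, 0.0, 1.0
        for _ in range(steps):
            action = default_policy(state)
            state, reward, done = simulate(state, action, phi)  # the scenario fixes the randomness
            value += discount * reward
            discount *= gamma
            if done:
                break
        total += value
    return total / len(scenarios)
```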

  2) Termination of exploration: To construct a reasonable belief tree within a finite time, the exploration terminates at a node b′ under the following conditions. First, Δ(b′) > D, i.e., the search depth reaches the maximum depth of the tree. Second, E(b′) < 0, meaning that the expected gap at b′ has been reached and further exploration may be unprofitable. Last, node b′ is blocked by an ancestor node b, i.e., the number of sampled scenarios at b′ is insufficient:

$$ \frac{\mid {\varPhi}_{b^{\prime }}\mid }{K}{\gamma}^{\varDelta \left({b}^{\prime}\right)}\left(U\left({b}^{\prime}\right)-{L}_0\left({b}^{\prime}\right)\right)\le \lambda l\left({b}^{\prime },b\right) $$
(11)

where l(b′, b) is the number of nodes on the path from b to b′.
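
These termination conditions can be checked as sketched below; the way the path length is counted in the blocking test of Eq. (11) is an assumption of this sketch rather than a detail fixed by the text, as is the node interface (`mu`, `l`, `U`, `L0`, `scenarios`, `parent`).

```python
def should_terminate(node, root, D, K, xi, kappa, gamma, lam):
    """Stop exploring at node b' if its depth exceeds D, its excess uncertainty is
    negative, or it is blocked by some ancestor b in the sense of Eq. (11)."""
    if node.depth > D:
        return True
    beta = kappa ** node.depth
    excess = beta * (node.mu - node.l) - (len(node.scenarios) / K) * xi * (root.mu - root.l)
    if excess < 0:
        return True
    # Blocking test, Eq. (11): weighted empirical-bound gap at b' versus lambda times the
    # number of nodes on the path from an ancestor b to b' (path counting is an assumption).
    weighted_gap = (len(node.scenarios) / K) * (gamma ** node.depth) * (node.U - node.L0)
    path_nodes, ancestor = 2, node.parent
    while ancestor is not None:
        if weighted_gap <= lam * path_nodes:
            return True
        path_nodes += 1
        ancestor = ancestor.parent
    return False
```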

  3) Backup: When the exploration terminates, the path from the leaf node to the root node is obtained. To keep the bounds of the nodes on this path reasonable, the backed-up upper and lower bounds are adjusted slightly with respect to the initial bounds of each node. The upper and lower bounds of each node b on the path are then backed up using Bellman’s principle:

$$ \mu (b)=\max \left\{{l}_0(b),\ \underset{a\in A}{\max}\left\{\rho \left(b,a\right)+\left(\sum \limits_{z\in {Z}_{b,a}}\mu \left({b}^{\prime}\right)+{\omega}_1\sum \limits_{z\in {Z}_{b,a}}l\left({b}^{\prime}\right)\right)/{\beta}_1\right\}\right\} $$
(12)
$$ l(b)=\max \left\{{l}_0(b),\ \underset{a\in A}{\max}\left\{\rho \left(b,a\right)+\left(\sum \limits_{z\in {Z}_{b,a}}l\left({b}^{\prime}\right)+{\omega}_1\sum \limits_{z\in {Z}_{b,a}}\mu \left({b}^{\prime}\right)\right)/{\beta}_1\right\}\right\} $$
(13)
$$ U(b)=\underset{a\in A}{\max}\left\{\frac{1}{\mid {\varPhi}_b\mid}\sum \limits_{\phi \in {\varPhi}_b}R\left({s}_{\phi },a\right)+\gamma \sum \limits_{z\in {Z}_{b,a}}\frac{\mid {\varPhi}_{b^{\prime }}\mid }{\mid {\varPhi}_b\mid }U\left({b}^{\prime}\right)\right\} $$
(14)

where b′ is a child of b with b′ = τ(b, a, z), and 0 ≤ ω1 ≤ 1 is a proportion factor. β1 = κ1^Δ(b) is defined analogously to β, where κ1 > 1 is a constant. β1 and ω1 are usually set to small values; otherwise, the lower bound at the root node is only approximate for each action.

After the nodes on the path have been backed up, the proposed method returns the lower bound of the root node, while a bound obtained from the default policy is also maintained. The optimal action is the action corresponding to the maximum lower bound, and the agent then executes this action.
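
A sketch of the backup of Eqs. (12)–(14) at a single internal node is given below, under the same hypothetical node interface as before (`node.reward(phi, a)` standing for the scenario reward R(s_ϕ, a)); after all nodes on the path are backed up, the action with the maximum lower bound at the root is executed.

```python
def backup_node(node, actions, rho, children, omega1, kappa1, gamma):
    """Back up the RWDU bounds of Eqs. (12)-(13) and the empirical upper bound of Eq. (14)
    at an internal node b, assuming each child b' already stores mu, l, U and its scenarios."""
    beta1 = kappa1 ** node.depth
    best_mu = best_l = best_U = float("-inf")
    for a in actions:
        kids = children(node, a)
        mu_sum = sum(c.mu for c in kids)
        l_sum = sum(c.l for c in kids)
        best_mu = max(best_mu, rho(node, a) + (mu_sum + omega1 * l_sum) / beta1)  # Eq. (12)
        best_l = max(best_l, rho(node, a) + (l_sum + omega1 * mu_sum) / beta1)    # Eq. (13)
        mean_reward = sum(node.reward(phi, a) for phi in node.scenarios) / len(node.scenarios)
        best_U = max(best_U, mean_reward + gamma * sum(
            (len(c.scenarios) / len(node.scenarios)) * c.U for c in kids))        # Eq. (14)
    node.mu = max(node.l0, best_mu)
    node.l = max(node.l0, best_l)
    node.U = best_U
```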

4.2 Analysis

Dynamic programming constructs the full DESPOT-DULB tree D. The anytime forward search algorithm instead constructs a DESPOT-DULB incrementally and terminates with a partial tree D′, which is a subtree of D. The main purpose of the analysis is to show that the optimal regularized policy \( \hat{\pi} \) derived from D′ converges to the optimal regularized policy derived from D.

Theorem 1 justifies the choices of the action branch and observation branch in the anytime algorithm: it states that the excess uncertainty at a node b is bounded by the sum of the excess uncertainties of its child nodes. DESPOT-DULB thus reduces the excess uncertainty greedily by iteratively searching for the action branch and observation branch with the greatest excess uncertainty, which establishes Eqs. (4) and (6) as the selection criteria for the action and observation branches, respectively. Theorem 2 proves the convergence of DESPOT-DULB: as the gap ε(b0) decreases, the computed policy converges to the optimal policy.

Theorem 1

For any DESPOT-DULB node b, if E(b) > 0 and \( {a}^{\ast }=\underset{a\in A}{\mathrm{argmax}}d\left(b,a\right) \), then

$$ E(b)\le \sum \limits_{z\in {Z}_{b,{a}^{\ast }}}E\left({b}^{\prime}\right) $$
(15)

where b′ = τ(b, a∗, z) is a child of b. The detailed proof is provided in Appendix 1.

Theorem 2

For every node b of DESPOT-DULB, we assume that the initial upper bound U0(b) is δ-approximate:

$$ {U}_0(b)\ge {\hat{V}}^{\ast }(b)-\delta $$
(16)

Suppose that Tmax is bounded and that the anytime DESPOT-DULB algorithm terminates with a partial tree D′ that has gap ε(b0) between the upper and lower bounds at the root b0. The optimal regularized policy \( \hat{\pi} \) derived from D′ satisfies

$$ {\nu}_{\hat{\pi}}\left({b}_0\right)\ge {\nu}^{\ast}\left({b}_0\right)-\varepsilon \left({b}_0\right)-\delta $$
(17)

where ν∗(b0) is the value of an optimal regularized policy derived from the full DESPOT-DULB tree D at b0.

As Tmax grows, the uncertainty ε(b0) decreases incrementally. The analysis shows that the performance of \( \hat{\pi} \) approaches that of an optimal regularized policy as the running time increases; moreover, the approximation error of the initial upper bound affects the final result by at most δ. The detailed proof is provided in Appendix 2.

5 Results and analysis

This section presents comparative and analytical studies of different POMDP algorithms on standard POMDP benchmarks. The goals are to verify that (1) the discounted upper and lower bounds help search for the optimal action efficiently, and (2) the proposed depth function can adjust the gap to ensure a reasonable reduction in uncertainty. The proposed DESPOT-DULB algorithm is evaluated in simulation on the following tasks: a) Tag, b) Laser Tag, and c) Pocman. To verify (1), DESPOT-DULB is compared with the state-of-the-art online algorithms POMCP, DESPOT and LB-DESPOT [33]. To verify (2), DESPOT-ULB, a variant without the depth discount, is also included. Simulation settings similar to those in [29] are used. Considering that small parameters cannot attain optimal performance and large parameters lead to over-idealization of the algorithm, reasonable ranges are given for the parameters: ω ∈ (0, 0.4) and κ ∈ (1.0, 1.1). Through multi-round trial-and-error testing, the initial value of each parameter is selected to guarantee favorable performance. For Tag, the parameters are set as ω = 0.20, ω1 = 0.02, β1 = 1.0 and κ = 1.02. For Laser Tag, three tasks of different sizes are designed, with parameters (1) ω = 0.20, ω1 = 0.02, β1 = 1.0, κ = 1.05; (2) ω = 0.25, ω1 = 0.025, β1 = 1.0, κ = 1.05; and (3) ω = 0.3, ω1 = 0.03, β1 = 1.0, κ = 1.05. For Pocman, the parameters are set as ω = 0.003, ω1 = 0.002, β1 = 1.0 and κ = 1.012. All experiments were conducted on a laptop computer with an Intel(R) Core(TM) i5-6300HQ CPU (4 cores at 2.30 GHz) and 12 GB of main memory, running Ubuntu 18.04. All the algorithms were given exactly 1 second per step to choose an action.

The sizes of the test domains range from small to extremely large. The simulation results are shown in Table 1. Overall, the offline method SARSOP performs well on the small-scale domain but cannot scale up. DESPOT-DULB achieves a certain performance improvement over DESPOT, POMCP and LB-DESPOT on suitable tasks, especially tasks with large observation spaces such as Laser Tag. Figure 3 shows the average running time ratio of DESPOT, DESPOT-DULB and LB-DESPOT on the different tasks. Figure 4 presents the change in the gap between the upper and lower bounds of the optimal node on the current search tree for DESPOT, DESPOT-ULB and DESPOT-DULB. Figure 5 shows the effect of the parameters ω and κ on the performance of DESPOT-DULB, where R denotes the corresponding reward obtained by running the algorithm 500 times; reasonable parameters can effectively improve the performance of the algorithm. The details of each domain are described below.

Table 1 Performance comparison

5.1 Tag

Tag is a standard POMDP benchmark in which a robot and a target move in a grid with 29 possible positions (Fig. 2a). The goal of the robot is to find a target that deliberately runs away. The Tag environment is specified as follows. Initially, both the robot and the target start at random positions. The robot knows its own position but cannot observe the position of the target, except when the robot and the target occupy the same cell. At each step, the robot chooses one of five actions, staying in place or moving in one of the four adjacent directions, and pays −1 per step. Finally, the robot attempts to tag the target and is rewarded with +10 if the attempt succeeds and penalized with −10 otherwise.

Fig. 2

Three test domains. (a) Tag. A robot chases an unobserved target that runs away. (b) Laser Tag. A robot chases a target in a 7 × 11 grid environment populated with obstacles. The robot is equipped with a laser range finder for self-localization. (c) Pocman. The original Pacman game

Table 1 shows the average total discounted rewards obtained by the different POMDP algorithms. On this small domain, SARSOP achieves the best result, while the proposed DESPOT-DULB remains strongly competitive in Tag compared with the other online algorithms. Figure 3 shows that DESPOT-DULB requires less planning time than DESPOT and LB-DESPOT in Tag. Figure 4a shows that DESPOT-DULB attains a smaller gap between the upper and lower bounds than DESPOT and DESPOT-ULB; in particular, the discounting in DESPOT-DULB suppresses the sharp rise of the gap observed for DESPOT-ULB, which does not use discounting. Combined with Theorem 2, the computed policy is therefore closer to the optimal policy than that of DESPOT for a given δ. Figure 5a shows the node’s gap curve of DESPOT-DULB and the corresponding reward when the proportion factor ω is 0.1, 0.2, 0.25 and 0.3. As ω increases, the gap curve fluctuates more strongly and the corresponding reward decreases gradually. Reducing ω is one way to shrink the gap, but DESPOT-DULB still needs a reasonable proportion factor to represent the node’s information. Figure 5d presents the node’s gap curve and the corresponding reward of DESPOT-DULB when the parameter κ is 1.0, 1.02 and 1.04. As κ increases, the gap converges smoothly to a small value, but the obtained reward first increases and then decreases; choosing a suitable κ is thus a good way to reduce the gap and obtain the maximum reward. Based on the above analysis, the parameters ω = 0.20 and κ = 1.02 are chosen to obtain the maximum reward and low uncertainty for DESPOT-DULB.

Fig. 3

Comparison of the average running time ratio of the algorithms in different tasks

Fig. 4

The gap between the upper and lower bounds of the optimal node on the current search tree. (a) Tag. (b) Laser Tag. (c) Pocman. The blue, green and red lines represent the DESPOT, DESPOT-ULB and DESPOT-DULB algorithms, respectively.

Fig. 5

Analysis of the parameters ω and κ. (a), (b) and (c) correspond to varying ω for Tag, Laser Tag and Pocman, respectively. (d), (e) and (f) correspond to varying κ for Tag, Laser Tag and Pocman, respectively. R denotes the average total discounted reward over 500 simulations of each task under the corresponding parameters.

5.2 Laser tag

Laser Tag is an extended version of Tag with a large observation space. In Laser Tag, the robot moves in an n × m rectangular grid with o randomly placed obstacles. Three scenes of different sizes are used: a 6 × 8 grid with 6 randomly placed obstacles, a 7 × 11 grid with 8 randomly placed obstacles (Fig. 2b), and a 9 × 12 grid with 12 randomly placed obstacles. The settings of the robot and target are the same as in Tag, except that the robot does not know its own position and is initially distributed uniformly over the grid. To localize itself, the robot is equipped with a laser range finder that measures distances in eight directions. The side length of each cell is 1. The laser reading in each direction is generated from a normal distribution centered at the true distance from the robot to the nearest obstacle in that direction, with a standard deviation of 2.5, and the readings are rounded to the nearest integers. Hence, an observation comprises a set of eight integers, and the total number of observations is approximately 3.5 × 10^5, 1.5 × 10^6 and 4.1 × 10^6 for the three scenes, respectively.

With such a large observation space, the SARSOP algorithm cannot run successfully. Table 1 shows that DESPOT-DULB attains substantially better average discounted rewards than DESPOT, POMCP, LB-DESPOT and DESPOT-ULB. Figure 3 shows that DESPOT-DULB requires less planning time than DESPOT and LB-DESPOT for the different sizes of Laser Tag. Figure 4b shows that DESPOT-DULB yields a smaller gap between the upper and lower bounds of the current node than DESPOT at the same depth; similar to Tag, the obtained policy is therefore closer to the optimal policy than that of DESPOT for a given δ. Figure 5b shows the node’s gap curve and the obtained reward of DESPOT-DULB for the 7 × 11 Laser Tag grid when the proportion factor ω is 0.1, 0.25 and 0.4, and Fig. 5e shows the corresponding curves when the parameter κ is 1.02, 1.05 and 1.08. As ω or κ increases, the gap of the node drops significantly at the same depth, while the corresponding reward first increases and then decreases; moderate values of ω and κ are therefore beneficial for reducing the gap while obtaining the maximum reward. Based on these considerations, the parameters ω and κ are selected as 0.25 and 1.05, respectively.

5.3 Pocman

Pocman [14] is a partially observable variant of the popular video game Pacman (Fig. 2c). In Pocman, an agent and four ghosts move in a 17 × 19 maze populated with food pellets. The agent obtains a reward of +10 for eating a food pellet and pays −1 for each move. If the agent is caught by a ghost, the game terminates with a penalty of −100. There are also four power pills in the maze; within 15 steps of eating a power pill, the agent can eat a ghost and receive a reward of +25. A ghost chases the agent if the agent is within a Manhattan distance of 5, and runs away while the agent is under the effect of a power pill. The agent does not know the exact positions of the ghosts; it only receives information on whether (1) it sees a ghost in each of the cardinal directions, (2) it hears a ghost within a Manhattan distance of 2, (3) it feels a wall in each of the four cardinal directions, and (4) it smells food pellets in the adjacent or diagonally adjacent cells. Pocman has a very large state space of approximately 10^56 states.

On this large-scale domain, Table 1 shows that DESPOT-DULB achieves a slight improvement in average reward over DESPOT and DESPOT-ULB, while the SARSOP algorithm cannot run successfully. Figure 3 shows that DESPOT-DULB, LB-DESPOT and DESPOT take almost the same amount of time because the agent must execute the same number of steps in each simulation. Figure 4c shows that DESPOT-DULB converges to a smaller gap at a lower depth than DESPOT and DESPOT-ULB; lower uncertainty helps determine the state of the agent, so the agent can obtain a near-optimal policy. Figure 5c shows the node’s gap curve and the corresponding reward of DESPOT-DULB when the proportion factor ω is 0.01, 0.03 and 0.05. A smaller uncertainty can be obtained by increasing ω, but if the proportion factor is too large, the maximum reward cannot be obtained; after comprehensive consideration, ω = 0.03 is chosen for testing. Figure 5f shows the node’s gap curve and the corresponding reward of DESPOT-DULB when the parameter κ is 1.0, 1.012 and 1.03. As κ increases, the gap converges to a small value, but the final depth is large, resulting in a long planning time. Accordingly, the parameters ω = 0.03 and κ = 1.012 are chosen to obtain the maximum reward with low uncertainty.

6 Conclusion and future work

This paper has proposed the online POMDP algorithm DESPOT-DULB, motivated by the observation that a node’s upper bound alone cannot adequately represent all of the node’s information. DESPOT-DULB combines the upper and lower bounds of the node and introduces a depth function based on the current depth. The combined upper and lower bounds represent the node’s information and improve the quality of the forward search, while the depth function adjusts the gap between the upper and lower bounds of the node to ensure a reasonable reduction in uncertainty. Based on computer simulation comparisons on standard POMDP benchmarks, both the simulations and the theoretical analysis show that DESPOT-DULB not only retains DESPOT’s desirable properties for online planning but also improves the quality of the policy and the efficiency of the search.

There are two potential directions for extending this work. First, an appropriate heuristic, such as the information entropy, could be used to quantify the amount of information carried by the upper and lower bounds of a node; the node’s more accurate information could then be used to search for the optimal node and obtain the maximum reward. Second, a reasonable approach could be devised to obtain tighter upper and lower bounds, which would further improve the planning efficiency.