1 Introduction

To pursue its assigned objectives in the face of dynamic and unpredictable circumstances, an agent needs the autonomy to flexibly adopt different actions as needed. But such autonomy risks making the agent more unpredictable to other agents it is cooperating with. Specifically, to be trustworthy, the agent’s exercise of its autonomy must avoid violating social commitments that it has made to others. This balance, between satisfying others’ expectations by employing autonomy to improve individual effectiveness, versus satisfying others’ expectations by tempering autonomy to adhere to social commitments, is the focus of this article.

Most multiagent systems research into social commitments, and trust in agents to meet them, investigates constructs and protocols by which agents can express commitments, test for commitment satisfaction, and develop models of reputation and trust. Through such mechanisms, self-interested agents (or their developers) gain incentive for meeting commitments, as finding gullible agents to exploit by reneging on commitments becomes harder. Such a perspective assumes that agents (or their developers) know how to satisfy commitments and earn the trust of others, and need to be incentivized to do so. This perspective is sensible, for example, in transactional contexts, where goods, services, and money are exchanged.

Our emphasis in this article is different. We assume that agents are cooperative, so incentivizing is not the issue. Instead, we do not assume that it is clear-cut how autonomous agents should satisfy commitments in the inherently uncertain contexts where autonomy is needed. An example in human systems is the commitment a surgeon makes when treating a patient, where the motives of the surgeon and patient are aligned, but the planned surgical procedure might not work, or the surgeon might discover during the surgery an overriding problem to address instead. The patient is counting on the surgeon to autonomously respond to emergent circumstances, within the bounds of standards of practice/care. Even though the outcome of the surgery might not be what everyone expected, the surgeon still could be deemed to have merited the trust of the patient in fulfilling the commitment made.

For artificial autonomous agents, we seek to distill insights from such human settings into computationally operational terms, by defining a clear semantics for what it means to achieve a commitment despite inherent uncertainty, and by formulating decision-making algorithms for an agent that provably adhere to those semantics. This article provides one set of answers to these questions, though others might be possible. The key ideas behind our answers include:

  • A recognition that, in uncertain settings, commitments can only be made to what actions an agent will attempt, rather than to what outcomes will result from its actions.

  • The importance of committing to acting within a space of possible actions/plans, rather than to a specific single action/plan, so an agent can retain latitude for exercising autonomy.

  • The ability to succinctly characterize a space of possible actions/plans in terms of a probabilistic commitment to reach one of a set of possible states.

  • A family of algorithms for provably and flexibly achieving probabilistic commitments that allow different tradeoffs between computational effort exerted and the optimality of autonomous behavior.

We described this general area of research and our early intuitions about how to address it in a workshop paper that was later included in a published collection of best workshop papers [11]. Select technical details have appeared in an unpublished workshop paper [42]. In a conference paper [43], we proposed similar commitment semantics, under Bayesian reward uncertainty, to those in this article, but described solution techniques that included a less robust iterative method that can violate the prescriptive commitment semantics in subtle ways. The improved iterative method in this article (Sect. 4.4) supersedes the previous flawed one, providing provable guarantees on fulfilling the prescriptive commitment semantics. Versions of the improved methods in this article, but applied to a non-Bayesian setting, have appeared in a conference paper [44]. In contrast, this article goes into detail about (previously unpublished) Bayesian versions of the improved solution methods, formally proving their adherence to our semantics. It also collects together content that had previously appeared only in non-archival forms. Finally, beyond any of our work that has appeared elsewhere, this article additionally provides more details and intuitions, and a more extensive empirical evaluation.

In the sections that follow, after briefly summarizing past work on commitments in agent systems (Sect. 2), we describe the decision-theoretic foundations of our work and the formal semantics of a probabilistic commitment (Sect. 3). From there, we describe a family of decision-making algorithms that make different computation/optimality tradeoffs, with theoretical results proving their adherence to our commitment semantics (Sect. 4), along with strategies to solve the problems more efficiently (Sect. 5). Our empirical results highlight the effectiveness of our principled approach and the tradeoffs among the different algorithms (Sect. 6). Finally, the article finishes with a summary of the contributions, and of directions for future work (Sect. 7).

2 Related work

A comprehensive overview of research into using computational methods to characterize and operationalize social commitments in terms of formal (temporal and modal) logic has appeared [32], and is based on literature in this field (e.g., [2, 5,6,7, 19, 31]). These formulations support important objectives like the provable pursuit of mutually agreed-upon goals, and codifying conventions and protocols for managing uncertainty (e.g., [16, 38, 40]). As an example of a convention, an agent that determines that it will not keep a commitment might be obligated to inform dependent agents [16].

There has been substantial work in the field on formal methods for developing protocols, with provable properties, for agents who are modeling and communicating about commitments. The focus is on the lifecycle of a commitment, from its initial proposed creation, to the mutual agreement to adopt it, to determining whether it has been fulfilled, to whether it is time to abandon it. Over the lifecycle, it is important that interacting agents engage in a communication protocol that ensures their beliefs about the status of a shared commitment are aligned. Günay et al. [12, 13] have developed a formal language that provides for non-deterministic elements in agents' beliefs, along with a model-checking algorithm for analyzing agents' compliance with commitments. Sultan et al. [35] develop another model-checking technique for verifying social commitments specified using a modal logical language for uncertain settings. To improve alignment when an agent suspects another might not be conscientious about providing updates, Pereira et al. [24] have developed an approach where an agent can use observations of another's actions to infer when that other agent has abandoned its commitment, based on whether its current actions fit any (known) plan to fulfill the commitment. Their work assumes a commitment semantics that requires an agent who has adopted a commitment to single-mindedly pursue that commitment. In general, this semantics is overly restrictive, because an agent might need to interleave actions in the pursuit of multiple commitments, along with its own local goals, at any given time.

The above acknowledges that the agents might not achieve outcomes they have committed to, and focuses on improving awareness across agents as to the status of all their shared commitments. But if agents can arbitrarily and unilaterally decide to drop a commitment, the commitment loses predictive value for coordination between the agents. Some of the logical formulations above (e.g., [16]) enumerate conditions where an agent is allowed to abandon its local component of a mutual goal, where in general these conditions are either: (1) when the agent believes it is impossible to achieve its local component; (2) when the agent believes the mutual goal is not worth pursuing anymore; or (3) when the agent believes one or more of the other agents participating in the mutual goal have abandoned their local components of it. These conditions are logically reasonable, but fail to impose a prescriptive semantics for the agent to use in making local decisions. For example, to satisfy the first condition, is an agent never allowed to take an action that has even a small chance of rendering its local component unachievable? What if all of its actions have such a chance? For the second condition, if an agent can unilaterally drop a commitment whenever its preferred goal changes, then has it really committed in the first place?

To make an agent more predictable, a commitment can be paired with conditions under which it is guaranteed to hold [1, 27, 32, 37]. In transactional settings, for example, an agent could commit to providing a good or service on the condition that it first receives payment. However, if conditions can be over anything, then they can make commitments worthless because a commitment might be conditioned simply on no better option coming along. Sandholm and Lesser [28] recognized the general impracticality of enumerating all the conditions that might affect commitment adherence, and, even if the conditions could be specified, of verifying that they hold in a distributed setting. Their solution was a contracting framework where a decommitment penalty is associated with each commitment, so as to accommodate uncertainty but discourage frivolous decommitment. However, even though the recipient of a commitment will know it will be compensated if the commitment is abandoned, it generally will be unable to know how likely that is, since it cannot look inside the provider of the commitment to discern how likely it is that its actions to achieve the commitment will fail, or that it will decide that other goals should take priority.

Therefore, an alternative to a decommitment penalty is for the commitment provider to summarize the likelihood that its commitment's various conditions will jointly hold (e.g., a factory's suppliers will meet deadlines, its workers will not strike, its shippers will fulfill orders, etc.) into a summary probability. Hence, a probabilistic commitment [4, 39, 41] is a form of conditional commitment where the details of the conditions have been replaced by an estimate of the probability that they will hold. Xuan and Lesser [41] have explained how probabilistic commitments improve joint planning by allowing agents to find policies that are responsive to possible contingencies, including even unlikely ones, and to compute appropriate alternative courses of action as the probabilities for commitments being met change. A more myopic (to be more tractable) variation of this approach was developed for the DARPA Coordinators program [18], where instead of anticipating ways that probabilities might change, the recipient would revise its plans only when the commitment provider would send an updated probability of the commitment being satisfied. These prior approaches, however, only treat commitment probabilities as predictions about how the provider's plan will affect recipients. In contrast, our goal is that probabilistic commitments not only provide such predictive information to the recipient, but also impose prescriptive semantics on the provider to influence its behavior into a good faith effort towards making those predictions come true.

Our work, summarized in this article, is the first to develop prescriptive commitment semantics under decision-theoretic model uncertainty, along with algorithms that operationalize this semantics for faithful commitment pursuit. The model uncertainty that we consider is a form of partial observability, and thus the algorithms we develop can be viewed as extensions of existing techniques for solving (unconstrained) partially-observable Markov decision problems [15, 17, 33]. Our commitment semantics imposes additional constraints on the original planning problem, and we develop algorithms that exactly meet the commitment constraints under partial observability. Existing work has developed methods for constrained decision-theoretic planning without model uncertainty [3], or has solved the constraints only approximately [25, 29]. Others have also developed planning approaches for given commitments formulated using formal logic, which mainly rely on techniques of heuristic search (e.g., [21, 22, 36]). These approaches usually amount to enumerating courses of action in search of conditions that ensure the feasibility of the commitments. For example, Meneguzzi et al. [21] develop a depth-first search algorithm to generate realizable enactments of the commitment. These logic-based planning techniques deal with the provider's uncertainty about the outcomes of its actions, while we also consider the provider's uncertainty over the rewards and dynamics of its environment.

3 Problem formulation

3.1 Markov decision processes

We first provide background on Markov Decision Processes (MDPs), which form the basis of the decision-theoretic setting we adopt for the commitment provider. An MDP is formally defined by the tuple \(M = ({\mathcal {S}}, {\mathcal {A}}, P, R, s_0, H)\) where \({\mathcal {S}}\) is the finite state space, \({\mathcal {A}}\) is the finite action space, \(P:{\mathcal {S}}\times {\mathcal {A}}\rightarrow {\varDelta }({\mathcal {S}})\) (\({\varDelta }({\mathcal {S}})\) denotes the set of all probability distributions over \({\mathcal {S}}\)) is the transition function, \(R:{\mathcal {S}}\times {\mathcal {A}}\rightarrow {\mathbb {R}}\) is the reward function, H is the finite horizon, and \(s_0\) is the initial state. The state space is partitioned into disjoint sets by the time step, \({\mathcal {S}} = \bigcup _{t=0}^{H}{\mathcal {S}}_t\). By taking action \(a_t\in {\mathcal {A}}\) in state \(s_t\in {\mathcal {S}}_t\), the environment generates a reward \(r_{t+1} = R(s_t,a_t)\) and transits to a new state \(s_{t+1}\in {\mathcal {S}}_{t+1}\) according to transition function P, i.e. \(s_{t+1}\sim P(\cdot |s_t,a_t)\) meaning that \(s_{t+1}\) is stochastically drawn from the transition function. Given a policy \(\pi :{\mathcal {S}} \rightarrow {\varDelta }({\mathcal {A}})\) and starting in the initial state, a random sequence of transitions \(\langle s_0,a_0,r_1,s_1,\ldots ,s_{H-1},a_{H-1},r_{H},s_{H} \rangle\) is generated, which records the entire history up to the time horizon. The value function of \(\pi\) is \(V^\pi (s) = E[\sum _{t'=t+1}^{H}r_{t'}|\pi , s_t=s]\) where t is such that \(s\in {\mathcal {S}}_t\). The optimal policy for M, denoted as \(\pi ^*\), maximizes \(V^\pi\) for all \(s\in {\mathcal {S}}\).
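Because time is encoded in the state (each \(s\in {\mathcal {S}}_t\) belongs to exactly one time step), the optimal policy can also be computed by backward induction from the horizon. The sketch below is a minimal illustration of this idea, not taken from the article; the dictionary representation of P and R is an assumption made purely for exposition, and the LP-based method the article actually builds on is summarized further below.

```python
from collections import defaultdict

def backward_induction(S_by_t, A, P, R, H):
    """Finite-horizon backward induction for a tabular MDP M = (S, A, P, R, s0, H).

    S_by_t[t] : list of states in S_t
    P[s][a]   : dict mapping next state s' -> probability P(s'|s, a)
    R[s][a]   : immediate reward R(s, a)
    Returns the optimal value function V and a deterministic optimal policy.
    """
    V = defaultdict(float)                 # V(s) = 0 for terminal states at t = H
    policy = {}
    for t in reversed(range(H)):           # t = H-1, ..., 0
        for s in S_by_t[t]:
            q = {a: R[s][a] + sum(p * V[s2] for s2, p in P[s][a].items())
                 for a in A}
            policy[s] = max(q, key=q.get)  # greedy action
            V[s] = q[policy[s]]
    return V, policy
```

Filled with the edge probabilities of Fig. 1, such a routine would reproduce the optimal policy described in the next paragraph.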

We give an illustrative example of an MDP shown in Fig. 1. There are four locations for each time step, labeled as A, B, C, and D. The time horizon is \(H=10\). Thus, the state space consists of 44 states, and we let \({\mathcal {S}}_t= \{\mathrm{A}_t, \mathrm{B}_t, \mathrm{C}_t, \mathrm{D}_t\}\). The initial state is \(\mathrm{A}_0\). There are two actions, \(\alpha\) and \(\beta\), and the transition function is shown in the annotations of the edges in Fig. 1. Taking any action in D yields a reward of \(+10\), and taking action \(\alpha\) in the other three locations yields a reward of \(-1\). The rewards of all other state-action pairs are zero. We now describe the optimal policy for this MDP. In locations A, C, and D, the transition probabilities of both actions are identical, and therefore the optimal policy takes action \(\beta\) since it yields reward no smaller than action \(\alpha\). In location B, if it is not the final action then the optimal policy takes action \(\alpha\) to transit to D, which yields reward (\(+10\)) that outweighs the negative reward of taking action \(\alpha\) (\(-1\)); otherwise, the optimal policy takes action \(\beta\) in B as the final action.

Fig. 1

There are four locations for each time step labeled as A, B, C, and D, with A being the initial location. The time horizon is \(H=10\). Thus, the state space consists of 44 states. There are two actions, \(\alpha\) and \(\beta\), and the transition function is shown in the annotations of the edges. Taking any action in D yields a reward of \(+10\), and taking action \(\alpha\) in the other three locations yields a reward of \(-1\). The rewards of all other state-action pairs are zero

There are several methods for solving an MDP for its optimal policy. We here summarize one based on tabular linear programming (LP) [26], which is also the computational strategy we adopt for developing algorithms for the provider. For MDP M, each policy \(\pi\) has a corresponding occupancy measure \(x^\pi\) for the expected number of times action a will be taken in state s over the time horizon H, starting in initial state \(s_0\):

$$\begin{aligned} x^\pi (s,a) = E\left[ \sum \limits _{t=0}^{H-1}1_{\{s_t=s,a_t=a\}}|s_0,\pi \right] , \end{aligned}$$

where \(1_E\) is the indicator function that takes value one if event E occurs and zero otherwise. We will use shorthand notation x in place of \(x^\pi\) when policy \(\pi\) is clear from the context. Policy \(\pi\) can be recovered from its occupancy measure via

$$\begin{aligned} \pi (a|s)=\frac{x(s,a)}{\sum \limits _{a'}x(s,a')}. \end{aligned}$$

Figure 2 is the linear program that solves an MDP M. It introduces the occupancy measure as decision variables, and the policy is constructed from the program’s optimal solution. Constraints (1b) and (1c) guarantee that x is a valid occupancy measure, where \(\delta (s',s_0)\) is the Kronecker delta that returns 1 when \(s'=s_0\) and 0 otherwise. The expected cumulative reward can be expressed using x in the objective function (1a).

Fig. 2

The linear program for solving an MDP M
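As a concrete illustration of the occupancy-measure formulation, the following sketch assembles and solves such an LP with SciPy. It is not taken from the article; the dictionary representation of P and R and the helper name are assumptions for exposition, and the constraints mirror (1b) and (1c) with the objective (1a) negated because linprog minimizes.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(states, actions, P, R, s0):
    """Occupancy-measure LP for a finite-horizon MDP (cf. Fig. 2).

    states  : non-terminal states, i.e. those in S_0, ..., S_{H-1}
    P[s][a] : dict next-state -> probability P(s'|s, a)
    R[s][a] : reward R(s, a)
    """
    idx = {(s, a): i for i, (s, a) in
           enumerate((s, a) for s in states for a in actions)}
    n = len(idx)

    # Objective (1a): maximize sum_{s,a} R(s,a) x(s,a); linprog minimizes, so negate.
    c = np.zeros(n)
    for (s, a), i in idx.items():
        c[i] = -R[s][a]

    # Flow constraints (1b): sum_a x(s',a) - sum_{s,a} P(s'|s,a) x(s,a) = delta(s', s0).
    A_eq = np.zeros((len(states), n))
    b_eq = np.zeros(len(states))
    for j, s_prime in enumerate(states):
        b_eq[j] = 1.0 if s_prime == s0 else 0.0
        for a in actions:
            A_eq[j, idx[(s_prime, a)]] += 1.0
        for (s, a), i in idx.items():
            A_eq[j, i] -= P[s][a].get(s_prime, 0.0)

    # Non-negativity of x is constraint (1c).
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    x = {sa: res.x[i] for sa, i in idx.items()}

    # Recover the policy: pi(a|s) = x(s,a) / sum_{a'} x(s,a').
    policy = {}
    for s in states:
        total = sum(x[(s, a)] for a in actions)
        if total > 0:
            policy[s] = {a: x[(s, a)] / total for a in actions}
    return x, policy
```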

3.2 The decision-theoretic setting

We consider the setting in which the provider’s true sequential decision-making problem is one out of K possible MDPs drawn from a known prior distribution, where all MDPs share identical state and action spaces but possibly different transition and reward functions, and the state and the reward are fully observable during execution. Formally, the environment is defined by the tuple \({\mathcal {E}}=\langle {\mathcal {S}}, {\mathcal {A}}, \{P_k, R_k\}_{k=1}^{K}, s_0, \mu _0, H \rangle\). We assume that the state space \({\mathcal {S}}\), the action space \({\mathcal {A}}\), and the time horizon H are finite. Although our commitment semantics can be straightforwardly extended beyond this assumption, the algorithms that we develop in this article to operationalize the semantics are built upon well-established tabular MDP methods that assume finite state and action spaces. Implementing our commitment semantics in infinite state and action spaces is a possible direction for future research. The MDP that the agent is in is drawn from the known prior distribution \(\mu _0\). If the agent is in MDP k, then by taking action \(a_t\in {\mathcal {A}}\) in state \(s_t\in {\mathcal {S}}_t\), it receives a reward \(R_k(s_t,a_t)\) and the environment transits to \(s_{t+1}\sim P_k(\cdot |s_t,a_t)\). It will be convenient if we let capitalized \(S_t\) be a random variable indicating the state at time step t whose specific realization is denoted \(s_t\), and capitalized \(A_t\) be a random variable indicating the action being taken at time step t, whose specific realization is denoted \(a_t\). Because the true MDP is partially observable, we consider history-dependent stochastic policies that map the history up to time step t,

$$\begin{aligned} h_t=\langle s_0,a_0,r_1,s_1,\ldots ,s_{t-1},a_{t-1},r_{t},s_{t} \rangle , \end{aligned}$$

to a probability distribution over the next action. Specifically, we use \(\pi (a|h)\) to denote the probability of choosing action a given history h when following policy \(\pi\). During execution, the agent can use the information provided by the history so far to update the posterior distribution over the MDP it is actually facing. The posterior is a sufficient statistic that succinctly summarizes the knowledge the agent has gained about the uncertain environment from interacting with it up to that point.
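The posterior update itself is a standard Bayes step over the K candidate MDPs. The following is a minimal sketch (not from the article), assuming the same dictionary representation of the transition and reward functions as above; it updates \(\mu\) after observing one transition \((s, a, r, s')\).

```python
def posterior_update(mu, s, a, r, s_next, Ps, Rs):
    """Bayes update of the posterior over the K candidate MDPs.

    mu     : current posterior, a list of K probabilities
    Ps, Rs : the K transition and reward functions; Ps[k][s][a] is a dict s' -> prob
    """
    weights = [
        mu[k] * Ps[k][s][a].get(s_next, 0.0) * (1.0 if Rs[k][s][a] == r else 0.0)
        for k in range(len(mu))
    ]
    z = sum(weights)
    # If z == 0 the observation was impossible under every candidate; keep mu unchanged.
    return [w / z for w in weights] if z > 0 else list(mu)
```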

3.3 Prescriptive probabilistic commitment semantics

For the agents operating in the uncertain environment described in Sect. 3.2, Definition 1 formally gives our definition of a probabilistic commitment.

Definition 1

A probabilistic commitment is formally defined as a tuple

$$\begin{aligned} c=\langle {\varPhi }, T, \rho \rangle , \end{aligned}$$
(2)

where \({\varPhi }\subseteq {\mathcal {S}}\) is the commitment state space, \(T(\le H)\) is the commitment time, and \(\rho\) is the commitment probability.

At a minimum, a probabilistic commitment in this form represents a prediction about how likely the state of the world will be an element of \({\varPhi }\) at time T, based on whatever policy the commitment provider is following. As summarized in Sect. 2, however, this kind of predictive commitment semantics can fail to match commonsense notions of what making a commitment means, since it does not in any way impede a provider from unilaterally changing its commitment whenever it chooses to alter its policy.

We therefore define a prescriptive semantics for a probabilistic commitment:

Definition 2

The prescriptive probabilistic commitment semantics for probabilistic commitment c requires that a commitment provider is constrained to follow a policy \(\pi\), such that

$$\begin{aligned} \mathop {\mathrm{Pr}}\limits _{k\sim \mu _0} ( S_T \in {\varPhi }| S_0=s_0,k;\pi ) \ge \rho . \end{aligned}$$
(3)

By Eq. (3), our prescriptive probabilistic commitment semantics is clear: knowing that it is facing an MDP drawn from the prior distribution \(\mu _0\) over possible MDPs in the environment (\(k\sim \mu _0\)), the provider is constrained to follow a (in general history-dependent) policy, such that, starting at the initial state \(s_0\), the probability of reaching a state in the commitment state space \(S_T \in {\varPhi }\) at the commitment time T is at least the commitment probability \(\rho\). Unlike a predictive semantics, where the probability \(\rho\) of reaching a state in \({\varPhi }\) at time T depends on the provider’s choice of policy \(\pi\), the prescriptive semantics turns the dependency around: the provider’s choice of policy \(\pi\) depends on the committed probability \(\rho\) of reaching a state in \({\varPhi }\) at time T.
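For intuition, the constraint in Eq. (3) can be checked empirically for any concrete policy by simulation. The snippet below is an illustrative sketch only; sample_mdp and rollout are hypothetical helpers standing in for sampling \(k\sim \mu _0\) and for executing \(\pi\) in MDP k.

```python
def estimate_commitment_prob(sample_mdp, rollout, Phi, T, s0, n_trials=10000):
    """Monte Carlo estimate of Pr_{k ~ mu_0}(S_T in Phi | S_0 = s0; pi), Eq. (3).

    sample_mdp()      : draws an MDP index k from the prior mu_0
    rollout(k, s0, T) : simulates the provider's policy pi in MDP k and returns S_T
    """
    hits = sum(rollout(sample_mdp(), s0, T) in Phi for _ in range(n_trials))
    return hits / n_trials

# A provider could sanity-check a candidate policy before committing, e.g. verify
#   estimate_commitment_prob(sample_mdp, rollout, Phi, T, s0) >= rho
# up to sampling error; the algorithms in Sect. 4 instead enforce Eq. (3) exactly.
```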

Our commitment semantics generalizes the classical (logic-based) prescriptive semantics (Sect. 2) to settings where uncertainty precludes finding (even conditional) plans that provably reach states in the commitment state space \({\varPhi }\). Probabilistic commitments account for, and probabilistically quantify, the possibility that actions might have irreversible outcomes from which the commitment state space is unreachable. To satisfy our semantics, a provider should only agree to a commitment if it can formulate a policy with a sufficiently low probability (\(< (1-\rho )\)) of such outcomes. If the provider then faithfully follows such a policy for the agreed-upon commitment, then by definition (Eq. 3) it has satisfied its commitment. Hence, a crucial consequence of our prescriptive probabilistic commitment semantics is that, now, meeting a commitment is entirely under the agent’s control because satisfying a commitment only requires that the agent follow its policy in the states it finds itself in, rather than ensuring that specific states are guaranteed to be reached.Footnote 1

Let \({\varPi }_{c}\) be the set of all policies respecting the semantics of commitment c. We say that commitment c is feasible if and only if \({\varPi }_{c}\) is not empty. Let

$$\begin{aligned} V^{\pi }_{\mu _0}(s_0)= E_{k\sim \mu _0}\left[ \sum \limits _{t=0}^{H-1}R_k(S_t,A_t)|S_0=s_0;k,\pi \right] \end{aligned}$$

be the expected cumulative reward starting at state \(s_0\) under policy \(\pi\) if the true MDP is drawn from \(\mu _0\). We are interested in finding a policy that maximizes the expected cumulative reward while respecting the semantics of a given feasible probabilistic commitment, which is formally formulated as the following problem:

$$\begin{aligned} \mathop {\mathrm{arg\,max}}\limits _{\pi \in {\varPi }_c} V^{\pi }_{\mu _0}(s_0). \end{aligned}$$
(4)

Solving problem (4) involves two main challenges. First, it is non-trivial to characterize \({\varPi }_c\) in a computationally efficient manner that eases the policy optimization step. Second, under Bayesian uncertainty, finding the optimal policy (even without the constraint prescribed by the commitment semantics) requires planning with evolving posterior distributions. This imposes additional computational difficulty, since the number of posteriors grows exponentially with the time horizon. We propose methods in Sect. 4 that address these challenges.

4 Methods

In this section, we describe several methods for constructing policies with different tradeoffs between solution quality and computational cost, where all the constructed policies are guaranteed to be in \({\varPi }_c\) and thus to respect the semantics of a given commitment c. To achieve high expected cumulative reward, the agent has to plan not only with fully observable states but also with the most recent knowledge about the true MDP it is in. Our first method, Commitment Constrained Full Lookahead (CCFL), finds the optimal policy in set \({\varPi }_{c}\) by generating beforehand all possible posterior distributions over possible MDPs up to the finite time horizon. As a downside, since the number of posterior distributions generally grows exponentially with the time horizon, planning with all of them can make CCFL computationally infeasible. To address this, our Commitment Constrained Lookahead (CCL) method generalizes CCFL by taking as input an integer parameter, L, that specifies the number of time steps of posterior lookahead. Our Commitment Constrained No-Lookahead (CCNL) method can be treated as the special case of CCL in which \(L=0\), so that actions are chosen based only on the initial conditions, ignoring posterior distributions. A small L often saves substantial computation compared to full lookahead, but by being more myopic decreases the expected cumulative reward. To partially mitigate this shortcoming of CCL (at the cost of a more modest increase in computation), we have created an iterative version of it, called Commitment Constrained Iterative Lookahead (CCIL), that reapplies the CCL method in the midst of execution, where the posterior lookahead of successive applications of CCL reaches closer to the time horizon.

4.1 Commitment constrained full lookahead

During execution, the agent can use the knowledge provided by the history so far to infer which MDP is more/less likely to be the true MDP it is facing. Formally, one can summarize current history h into a belief, \(b:=\langle s, \mu \rangle\), where s is the agent’s current physical state, and \(\mu\) is the posterior distribution over all possible MDPs given h. We use \(B_t\) to denote the random variable indicating the belief at time step t, and \(b_t\) to denote the belief given history \(h_t\). The agent can find the optimal history-dependent policy by planning in the belief MDP defined as the tuple \(\langle {\mathcal {B}}, {\mathcal {A}}, b_0, {\tilde{P}}, {\tilde{R}} \rangle\), where \({\mathcal {B}}\) is the set of all beliefs reachable from initial belief \(b_0=\langle s_0, \mu _0 \rangle\), which is finite because every possible true MDP k is finite and the time horizon is finite. \({\tilde{P}}\) and \({\tilde{R}}\) are belief transition and reward functions, respectively. Specifically, if we let \(b|(a,r,s')\) be the belief after taking action a in belief state b, receiving reward r and transiting to state \(s'\), then the probability of transiting to any belief \(b'\in {\mathcal {B}}\) after taking action a in belief state b can be expressed as

$$\begin{aligned} {\tilde{P}}(b'|b,a)=\sum \limits _{\{r,s': b|(a,r,s')=b'\}}\Pr (r,s'|b,a), \end{aligned}$$

where \(\Pr (r,s'|b,a)\) is the probability of receiving reward r and transiting to state \(s'\) after taking action a in belief b and can be expressed using \(\{P_k, R_k\}_{k=1}^{K}\) as

$$\begin{aligned} \Pr (r,s'|b,a) =\Pr (r,s'|\langle s,\mu \rangle ,a) =\sum \limits _{k=1}^{K}\mu _kP_k(s'|s,a)1_{\{r=R_k(s,a)\}}. \end{aligned}$$

In words, given any belief \(b'\in {\mathcal {B}}\), \({\tilde{P}}(b'|b,a)\) sums up probabilities over transitions \((r,s')\) which update the belief to \(b'\). Similarly, the belief reward function can be defined as

$$\begin{aligned} {\tilde{R}}(b,a)={\tilde{R}}(\langle s,\mu \rangle ,a)=\sum \limits _{k=1}^{K}\mu _kR_{k}(s, a). \end{aligned}$$
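The two equations above can be turned directly into code that enumerates belief-MDP transitions. The sketch below is an illustration (not from the article), reusing the dictionary representation of \(\{P_k, R_k\}\) assumed in Sect. 3.1 and representing a belief as a pair of a state and a tuple of posterior probabilities.

```python
from collections import defaultdict

def belief_step(b, a, Ps, Rs):
    """Successor beliefs and reward for taking action a in belief b = (s, mu).

    Returns a dict mapping b' = (s', mu') to the belief transition probability,
    together with the belief reward for (b, a).
    """
    s, mu = b
    K = len(mu)
    R_tilde = sum(mu[k] * Rs[k][s][a] for k in range(K))        # belief reward
    successors = defaultdict(float)
    next_states = {sp for k in range(K) for sp in Ps[k][s][a]}
    rewards = {Rs[k][s][a] for k in range(K)}
    for s_next in next_states:
        for r in rewards:
            # Pr(r, s' | b, a) = sum_k mu_k P_k(s'|s,a) 1{r = R_k(s,a)}
            w = [mu[k] * Ps[k][s][a].get(s_next, 0.0) * (Rs[k][s][a] == r)
                 for k in range(K)]
            p = sum(w)
            if p > 0:
                mu_next = tuple(wk / p for wk in w)             # Bayes update
                successors[(s_next, mu_next)] += p              # b|(a, r, s') = b'
    return dict(successors), R_tilde
```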

Our Commitment Constrained Full Lookahead (CCFL) method finds an optimal policy in \({\varPi }_c\) among all belief-based policies, i.e., policies that choose actions as a function of the current belief, while respecting the commitment semantics. Note that since a belief is a function of the history, a belief-based policy also gives action probabilities as a function of the history. For MDP k, each policy \(\pi\) has a corresponding occupancy measure \(y_k^\pi\) for the expected number of times action a will be taken in belief-state b over the time horizon H:

$$\begin{aligned} y_k^\pi (b,a) = E\left[ \sum \limits _{t=0}^{H-1}1_{\{B_t=b,A_t=a\}}|B_0=b_0;k,\pi \right] . \end{aligned}$$

We will use shorthand notation \(y_k\) in place of \(y_k^\pi\) when policy \(\pi\) is clear from the context. If \(\pi\) is a belief-based policy, it can be recovered from its belief-action occupancy measure in any MDP k via

$$\begin{aligned} \pi (a|b)=\frac{y_k(b,a)}{\sum \nolimits _{a'}y_k(b,a')}. \end{aligned}$$
(5)

CCFL solves the mathematical program shown in Fig. 3, which introduces as decision variables the belief-action occupancy measure for all possible MDPs, and constructs the policy via Eq. (5) using the program’s optimal solution. The CCFL program is a straightforward adaptation of the linear program in Fig. 2 that solves an MDP. Constraints (6b) and (6c), which are the counterparts of constraints (1b) and (1c) in Fig. 2, guarantee that y is a valid occupancy measure with the initial belief being \(b_0\) and the transition function being \({\tilde{P}}\). The expected cumulative reward is expressed using y in the objective function (6a), which is the counterpart of objective (1a). The commitment semantics of Eq. (3) imposes an additional constraint (6d), which ensures the resulting policy is in \({\varPi }_c\).

Fig. 3

CCFL program
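For concreteness, one natural way to write the commitment constraint (6d), shown here only as an illustration and under the assumption that \(T<H\) so that an action is taken in every time-T belief, sums the occupancy of all time-T beliefs whose state component lies in \({\varPhi }\):

$$\begin{aligned} \sum \limits _{b=\langle s,\mu \rangle \in {\mathcal {B}}:\, s\in {\varPhi }\cap {\mathcal {S}}_T}\ \sum \limits _{a} y(b,a) \ \ge \ \rho , \end{aligned}$$

which equals \(\Pr _{k\sim \mu _0}(S_T\in {\varPhi })\) because, with time encoded in the state, \(\sum _a y(b,a)\) is exactly the probability of being in belief b at time T.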

Because the belief is a sufficient statistic (i.e. it provides as much information for predicting the future as the history does), the CCFL program is feasible if the commitment is feasible, and the policy constructed by CCFL is optimal among all history-dependent policies respecting the commitment semantics, as formally stated in Theorem 1.

Theorem 1

If commitment c is feasible, meaning \({\varPi }_c\ne \emptyset\), then the CCFL program in Fig. 3 is also feasible. Let \(y^*\) be an optimal solution to the CCFL program. The policy constructed via Eq. (5) using \(y^{*}\) is optimal with respect to the problem in Eq. (4).

The proofs of theorems in this article are presented in the “Appendix”.

4.2 Commitment constrained no-lookahead

Planning with all possible posterior distributions makes CCFL computationally infeasible, as confirmed by our empirical results in Sect. 6. To counter this, we now consider policies that ignore this posterior knowledge and only depend on the current state to choose actions. We refer to them as Markov policies and let \({\varPi }_0\) be the set of all Markov policies. If commitment c is feasible for Markov policies, i.e., \({\varPi }_c \cap {\varPi }_0 \ne \emptyset\), our Commitment Constrained No-Lookahead (CCNL) method will find an optimal Markov policy that maximizes expected cumulative reward respecting the commitment semantics, which is a solution to the following problem:

$$\begin{aligned} \mathop {\mathrm{arg\,max}}\limits _{\pi \in {\varPi }_c \cap {\varPi }_0} V^{\pi }_{\mu _0}(s_0). \end{aligned}$$
(7)

Note that \({\varPi }_0\) is a subset of all history-dependent policies. When, as would generally be the case, \({\varPi }_0\) is a much smaller policy set, the computational cost of CCNL would be much less than that of CCFL, but the solution policy of CCNL is only an approximation of the optimal commitment semantics-respecting policy yielded by CCFL, as will be confirmed empirically in Sect. 6.

Similar to the belief-action occupancy measure, for MDP k, any policy \(\pi\) has a corresponding occupancy measure \(x_k^\pi\) of state-action pairs:

$$\begin{aligned} x_k^\pi (s,a) = E\left[ \sum \limits _{t=0}^{H-1}1_{\{S_t=s,A_t=a\}}|S_0=s_0;k,\pi \right] . \end{aligned}$$

We will use shorthand notation \(x_k\) in place of \(x_k^\pi\) when policy \(\pi\) is clear from the context. If \(\pi\) is a Markov policy, it can be recovered from its state-action occupancy measure in any MDP k via

$$\begin{aligned} \pi (a|s)=\frac{x_k(s,a)}{\sum \nolimits _{a'}x_k(s,a')}. \end{aligned}$$
(8)

CCNL constructs the policy by solving the mathematical program shown in Fig. 4. It introduces as decision variables the state-action occupancy measure for all possible MDPs. Constraints (9b) and (9c), as counterparts of constraints (1b) and (1c), guarantee that \(x_k\) is a valid occupancy measure with the initial state being \(s_0\) and the transition function being \(P_k\). The commitment semantics of Eq. (3) is explicitly expressed in constraint (9e). The expected cumulative reward is expressed using x in the objective function (9a), where \(\mu _{0,k}\) is the probability that the true MDP is k according to \(\mu _0\). The corresponding Markov policy can be derived via Eq. (8). Unlike CCFL, the CCNL program is no longer a straightforward adaptation of the linear program in Fig. 2 because a challenging problem here is to ensure that these K sets of occupancy measures all derive the same Markov policy. To this end, we use constraint (9d) to enforce alignment across all K sets of occupancy measures. The constraints in Fig. 4 are feasible if and only if \({\varPi }_c \cap {\varPi }_0 \ne \emptyset\).

Fig. 4

CCNL program
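Analogously to the CCFL case, one plausible rendering of the commitment constraint (9e), again shown only as an illustration under the assumption \(T<H\), weights each MDP's state occupancy at time T by its prior probability:

$$\begin{aligned} \sum \limits _{k=1}^{K}\mu _{0,k}\sum \limits _{s\in {\varPhi }\cap {\mathcal {S}}_T}\sum \limits _{a} x_k(s,a) \ \ge \ \rho . \end{aligned}$$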

4.3 Commitment constrained lookahead

CCFL pre-plans for every possible revision to the agent’s posterior knowledge about the true MDP it might be in, which guarantees optimality but possibly at a huge computational cost. At the other extreme, CCNL only considers Markov policies that ignore this evolving posterior knowledge. Here we consider the general case where the agent plans its first \(L\in [0,H]\) actions as a function of the evolving belief, and thereafter plans actions based on the evolving state but with the belief (including both the state and the posterior distribution) the agent was in at time L. We refer to this parameter, L, as the belief-update lookahead boundary, which tells the planner how far beyond the current time to look ahead about states and posterior distributions. The resulting L-updates policy takes the form:

$$\begin{aligned} \pi (a|h_t) = {\left\{ \begin{array}{ll} \pi (a|b_t) &{} t < L\\ \pi (a|s_t, b_L) &{} t \ge L \end{array}\right. } \end{aligned}$$

where \(b_t\) is the belief consistent with \(h_t\), and \(b_L\) is the belief consistent with \(h_L\) when \(t\ge L\). Note that a 0-updates policy is the same as a Markov policy, and an H-updates policy is a fully belief-based policy. Therefore, the belief-update lookahead boundary L defines a continuum between CCNL and CCFL.

Given a specific value of L, let \({\varPi }_L\) be the set of all L-updates policies. If commitment c is feasible for belief-update lookahead boundary L, i.e., \({\varPi }_c \cap {\varPi }_L \ne \emptyset\), our Commitment Constrained Lookahead (CCL) method will find an optimal L-updates policy that maximizes expected cumulative reward respecting the commitment semantics, which is a solution to the following problem:

$$\begin{aligned} \mathop {\mathrm{arg\,max}}\limits _{\pi \in {\varPi }_c \cap {\varPi }_L} V^{\pi }_{\mu _0}(s_0). \end{aligned}$$
(10)

CCL constructs the policy by solving the mathematical program shown in Fig. 5, which is a novel and carefully crafted combination of the techniques in CCFL and CCNL. The program introduces as decision variables y and x, where y is the belief-action occupancy measure (as defined for CCFL) for those beliefs reachable within the first L time steps of the plan, and x comprises the state-action occupancy measures (as defined for CCNL) for the remaining time steps to the horizon. We use \({\mathcal {B}}_l^b\) to denote the set of reachable beliefs after executing exactly l actions from belief b, and \({\mathcal {B}}_{\le l}^b=\bigcup _{l'=0}^{l}{\mathcal {B}}_{l'}^b\) to denote the set of reachable beliefs from b by executing at most l actions. Because time is a state feature, \({\mathcal {B}}_l^b\) and \({\mathcal {B}}_{l'}^b\) are disjoint if \(l\ne l'\). CCL generates beforehand all reachable beliefs from initial belief \(b_{(t=)0}\) within L actions, \({\mathcal {B}}_{\le L}^{b_{(t=)0}}\). The belief-action and state-action measures enable us to express the expected cumulative reward very conveniently in the objective (12a), where the first term sums up the reward of the first L time steps, and the second term the remaining time steps to the horizon. The occupancy measures also enable us to express the commitment semantics conveniently: if the commitment time falls within the lookahead (i.e., \(T<L\)), the commitment semantics can be expressed in terms of the belief-action occupancy measure via constraint (12h); otherwise (\(T\ge L\)), the commitment constraint is expressed in terms of the state-action occupancy measures via constraint (12i). Constraints (12b) and (12c) on y are the counterparts of (6b) and (6c) in the CCFL program of Fig. 3. Similarly, constraints (12e), (12f), and (12g) on x are the counterparts of (9b), (9c), and (9d) in the CCNL program of Fig. 4, which means the CCL program is considerably more sophisticated than the original linear program of Fig. 2. These constraints are feasible if and only if \({\varPi }_c \cap {\varPi }_L \ne \emptyset\). Any L-updates policy \(\pi _L\) that respects the commitment semantics can be derived from a feasible solution to the program in Fig. 5 via:

$$\begin{aligned} \pi _{L} (a|h_t) = {\left\{ \begin{array}{ll} \pi _{L} (a|b_t) = \dfrac{y(b_t,a)}{\sum \nolimits _{a'}y(b_t,a')} &{} t < L\\ \pi _{L} (a|s_t, b_L) = \dfrac{x_{b_L,k}(s_t,a) }{ \sum \nolimits _{a'}x_{b_L,k}(s_t,a')} &{} t \ge L \end{array}\right. } . \end{aligned}$$
(11)
Fig. 5

CCL program

Theorem 2 states that CCL using belief-update lookahead boundary L finds an optimal policy in \({\varPi }_c \cap {\varPi }_L\).

Theorem 2

If \({\varPi }_c \cap {\varPi }_L \ne \emptyset\) holds for commitment c, then the program in Fig. 5 is feasible. Let \(x^*, y^*\) be its optimal solution; then the policy derived via Eq. (11) with \(x^*, y^*\) is the optimal policy in \({\varPi }_c \cap {\varPi }_L\).

Intuitively, a belief-update lookahead boundary greater than zero enables the agent to plan actions not only based on the states it will visit, but also based on how its actions can provide information to improve its posteriors about what its true MDP is. Sacrifices in short-term reward may ultimately improve long-term performance. Theorem 3 says the expected cumulative reward of the policy derived by CCL using any \(L>0\) is lower bounded by that of the policy derived by CCNL. This is because, by definition, for any L and any Markov policy, there exists an L-updates policy that behaves exactly the same as the Markov policy, i.e. \({\varPi }_0 \subseteq {\varPi }_L\).

Theorem 3

If \({\varPi }_c \cap {\varPi }_0 \ne \emptyset\) holds for commitment c, then for any integer \(L\in [0, H]\) the CCL program in Fig. 5 is feasible, and we have

$$\begin{aligned} V^{\pi _L^*}_{\mu _0}(s_0) \ge V^{\pi _{0}^*}_{\mu _0}(s_0) \end{aligned}$$

where \(\pi _L^*\) and \(\pi _{0}^*\) are the policies derived by CCL using belief-update lookahead boundary L and zero, respectively.

However, one has to be careful in using deeper boundaries because the performance of CCL is guaranteed to be monotonically non-decreasing in L only when MDPs vary solely in reward functions, but this monotonicity cannot be guaranteed in general, as stated in Theorems 4 and 5.

Theorem 4

If MDPs vary in reward functions and not in transition dynamics, i.e. \(\forall k, k', P_k=P_{k'}\), and \({\varPi }_c \cap {\varPi }_L \ne \emptyset\) for boundary L, then for any \(L'>L\) we have \({\varPi }_c \cap {\varPi }_{L'} \ne \emptyset\), and

$$\begin{aligned} V^{\pi _L^*}_{\mu _0}(s_0) \le V^{\pi _{L'}^*}_{\mu _0}(s_0) \end{aligned}$$

where \(\pi _L^*\) and \(\pi _{L'}^*\) are the policies derived by CCL using boundaries L and \(L'\), respectively.

Theorem 5

There exists an environment, a commitment c, and boundaries \(0<L<L'<H\) satisfying \({\varPi }_c \cap {\varPi }_L \ne \emptyset\) and \({\varPi }_c \cap {\varPi }_{L'} \ne \emptyset\), such that

$$\begin{aligned} V^{\pi _L^*}_{\mu _0}(s_0) > V^{\pi _{L'}^*}_{\mu _0}(s_0) \end{aligned}$$

where \(\pi _L^*\) and \(\pi _{L'}^*\) are the policies derived by CCL using belief-update lookahead boundaries L and \(L'\), respectively.

These theoretical results provide some insights when choosing L. If the transition dynamics do not vary across MDPs, as suggested by Theorem 4, \({\varPi }_L\) is monotonically increasing in L. One should use the largest affordable L because a larger L is likely to include more policies in \({\varPi }_c\) and improve the value. A commitment that is infeasible for a smaller L could be feasible for a larger L. In general, though, the transition dynamics can vary across MDPs, and \({\varPi }_L\) is not guaranteed to be monotonically increasing in L. One should use CCFL if it is affordable. CCFL considers all policies in \({\varPi }_c\) if it is non-empty and therefore it yields optimal value. When CCFL is not affordable, then as suggested by Theorem 3 we can check the feasibility of a commitment with CCNL because a commitment feasible to CCNL (i.e. \({\varPi }_c \cap {\varPi }_0 \ne \emptyset\)) is also feasible for any L. For our empirical results in Sect. 6, we experiment with several candidate values of L. Our experience suggests that L can best be chosen with problem-specific knowledge.

4.4 Commitment constrained iterative lookahead

At each time step during execution, the agent observes the state transition that occurs and the reward it receives, and uses them to update its posterior \(\mu\) over the true MDP it is in. One might think it would be a good idea for the agent to construct and follow an updated policy from its current state, substituting its updated belief state for the initial belief. However, the agent cannot shift from one policy to another without considering its commitment. Clearly, if the agent can find a plan that achieves the original commitment probability conditioned on the current belief, then shifting to such a plan will certainly respect the commitment semantics. Observation 1 shows, however, that such re-planning is not always feasible.

Observation 1

There exists an environment, a feasible commitment c, a policy \(\pi \in {\varPi }_c\), and a history \(h_t\) induced by \(\pi\), such that

$$\begin{aligned} \forall \pi ' \quad \mathop {\mathrm{Pr}}\limits _{k\sim \mu _t} ( S_T \in {\varPhi }| S_t=s_t,k;\pi ') < \rho , \end{aligned}$$

where \(\langle s_t,\mu _t \rangle\) is the belief consistent with \(h_t\).

The example shown in Fig. 6 verifies Observation 1. Starting in state A, the agent can feasibly commit to reaching the absorbing state D at time step 2 with probability at least 0.8. If the agent stochastically reaches state C at time step 1, however, there is no plan that reaches state D from C with probability at least 0.8.

Fig. 6

This example is adapted from Fig. 1. There are two possible reward functions \(R_1\) and \(R_2\) shown above with 50–50 prior. In both reward functions, the reward only depends on the action. There are two actions, \(\alpha\) and \(\beta\), and the transition dynamics is shown in the annotations of the edges. Starting in A, the agent commits to reaching the absorbing location D at time step two with probability at least 0.8. If the agent happens to be in C at time step one, there is no plan that reaches D from C with probability at least 0.8 (verifying Observation 1). Even though re-planning from C does not yield a plan that leads to D with probability 0.8, the new plan will nonetheless yield more reward because at time step one we will know which reward function applies and can therefore choose the more rewarding action in C

Our Commitment Constrained Iterative Lookahead (CCIL) method instead updates the commitment probability in a way that guarantees feasible re-planning, and iteratively applies CCL with that updated commitment probability during execution. The idea is that, when re-planning, respecting the commitment semantics does not require meeting the original probabilistic commitment, but instead to fulfill the commitment probability that had originally been associated with the physical-state history traversed so far.Footnote 2 Here we formally describe CCIL’s first iterative application of CCL after having executed one or more actions. Suppose the agent now has belief \(b_t=\langle s_t,\mu _t \rangle\) at time step \(t\le L\) after following policy \(\pi ^*_L\) derived from the initial optimal solution to the CCL program with belief-update lookahead boundary L. Now the agent re-plans from \(s_t\) using its updated posterior \(b_t\), with the commitment probability that its previous policy \(\pi ^*_L\) ascribed to meeting the commitment if state \(s_t\) were reached:

$$\begin{aligned} \rho _t = \mathop {\mathrm{Pr}}\limits _{k\sim \mu _t}( S_T \in {\varPhi }| S_t=s_t,k;\pi ^*_L) . \end{aligned}$$
(13)

Specifically, the agent constructs and follows a new L-updates policy, beginning from the current belief, by reusing the CCL program in Fig. 5 with the following modifications:

  1. Start from current belief \(b_t=\langle s_t, \mu _t\rangle\) instead of \(b_0 = \langle s_0, \mu _0\rangle\).

  2. Let \(L \leftarrow \min (L, H-t)\) to ensure that the lookahead from the current time step is bounded by the time horizon, i.e. \(t+L \le H\).

  3. If the agent has not reached the commitment time, i.e. \(t<T\), plan with the updated commitment probability by replacing \(\rho\) with \(\rho _t\) calculated as in Eq. (13) in constraint (12h) if \(T<t+L\) or in constraint (12i) if \(T\ge t+L\); otherwise, discard constraints (12h) and (12i) (e.g., let \(\rho _t=0\)).

Revisiting the example in Fig. 6, the initial policy could meet the commitment probability (0.8) by committing to take action \(\alpha\) with probability 1 if B is reached at time 1, and otherwise the agent is unconstrained. After taking action \(\alpha\) (or \(\beta\)) at time 0, then at time 1 the agent is either in B or C, and from the reward it just received knows the true reward function. Using CCIL, the agent re-plans. If it is in B, then since the original policy attributed probability 1 to meeting the commitment down this path, its new policy is constrained to take action \(\alpha\) (whatever the true reward is), and afterwards take the better action. If it is in C, the updated commitment probability is zero (the original policy did not count at all on possibly meeting the commitment down this path), so the new policy can optimize reward without constraints.

In principle, the agent can iteratively apply the above procedure at any time during execution. We will evaluate empirically a version of CCIL that takes as input a pair of integers, \((L, I)\), such that it iteratively uses L as the belief-update lookahead boundary to update the policy every \(I\le L\) steps. This procedure is outlined in Algorithm 1, and Theorem 6 proves that it respects our commitment semantics.

Theorem 6

Let \(\pi _{IL}\) be the history-dependent policy defined as in Algorithm 1. We have \(\pi _{IL}\in {\varPi }_c\).

Algorithm 1 Commitment Constrained Iterative Lookahead (CCIL)
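The following is a compact sketch of the CCIL loop of Algorithm 1, written here for illustration only. The solve_ccl helper, the env interface, and the policy methods are all assumptions standing in for the CCL program of Fig. 5, the environment of Sect. 3.2, and Eqs. (11) and (13), respectively.

```python
def ccil(env, solve_ccl, Phi, T, rho, L, I):
    """Sketch of Commitment Constrained Iterative Lookahead (Algorithm 1).

    solve_ccl(b, lookahead, rho, Phi, T) : hypothetical helper that builds and
        solves the CCL program of Fig. 5 from belief b with the given lookahead
        and required commitment probability, returning a policy object.
    """
    b, t = env.initial_belief(), 0                  # b = (s_0, mu_0)
    policy = solve_ccl(b, min(L, env.H), rho, Phi, T)
    while t < env.H:
        for _ in range(min(I, env.H - t)):          # execute I steps of the plan
            a = policy.act(b, t)
            b = env.step(b, a)                      # observe s', r; Bayes-update mu
            t += 1
        if t >= env.H:
            break
        if t < T:
            # Eq. (13): the probability the *previous* policy ascribed to meeting
            # the commitment along the state history traversed so far.
            rho = policy.commitment_prob(b, t, Phi, T)
        else:
            rho = 0.0                               # commitment time has passed
        policy = solve_ccl(b, min(L, env.H - t), rho, Phi, T)
    return b
```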

5 Dealing with the quadratic equality constraint

The CCFL program in Fig. 3 is a linear program straightforwardly adapted from the program in Fig. 2 and thus can be solved by standard linear programming algorithms. The CCL program in Fig. 5, however, is not a straightforward adaptation of Fig. 2 because it introduces a quadratic equality constraint (12g) to ensure that the action selection rules derived from occupancy measures in all possible MDPs are identical. Similarly, the CCNL program in Fig. 4 also introduces such a quadratic equality constraint (9d). These quadratic constraints make the mathematical programs non-convex and hard to solve. In practice, many math-programming solvers are unable to handle programs with quadratic equality constraints (e.g., [8, 14]). Although some solvers can deal with such programs (e.g., [20, 23]), they often need to take as input a feasible solution as the starting point, but finding an initial feasible solution by itself might be difficult, and the final solutions are usually sensitive to starting points. Here we introduce two variant formulations of the CCL program in Fig. 5 that avoid quadratic equality constraints.

Deterministic CCL The policy derived from the program in Fig. 5 via Eq. (11) is in general stochastic. To enforce deterministic policies, Dolgov and Durfee [9, 10] introduced binary indicators in the linear programs for solving MDPs. Inspired by their work, we propose a novel formulation that avoids quadratic equality constraints by introducing binary indicators that force the action selection to be deterministic after belief-update lookahead boundary L. Specifically, we introduce indicators \({\varDelta }\) as additional decision variables into the CCL program in Fig. 5 with the following constraints replacing the quadratic equality constraint (12g):

$$\begin{aligned}&\forall b_L \in {\mathcal {B}}_{L}^{b_0},s,a\quad {\varDelta }_{b_L}(s,a)\in \{0,1\}; \\&\forall b_L \in {\mathcal {B}}_{L}^{b_0},s\quad \sum \limits _{a}{\varDelta }_{b_L}(s,a)\le 1; \\&\forall b_L \in {\mathcal {B}}_{L}^{b_0},k,s,a\quad x_{b_L,k}(s,a)\le {\varDelta }_{b_L}(s,a). \end{aligned}$$

This reformulation yields a Mixed Integer Linear Program (MILP), a well-studied class of problems with many available solvers (e.g., [8, 14, 20, 23]). Any feasible solution with the above constraints replacing constraint (12g) of the program in Fig. 5 yields a policy with deterministic action selection at time steps after belief-update lookahead boundary L via Eq. (11), which can be alternatively expressed using the indicator variables:

$$\begin{aligned} \pi _{L} (a|h_t) = {\left\{ \begin{array}{ll} \pi _{L} (a|b_t) = \dfrac{y(b_t,a)}{\sum \nolimits _{a'}y(b_t,a')} &{} t < L\\ \pi _{L} (a|s_t, b_L) = 1_{\{{\varDelta }_{b_L}(s_t,a)=1\}} &{} t \ge L \end{array}\right. }. \end{aligned}$$
(14)
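As a sketch of how the indicator constraints could be posed in an off-the-shelf MILP solver, the fragment below uses PuLP in Python. It is illustrative only: it shows just the constraints that replace (12g) (the flow, objective, and commitment constraints of Fig. 5 are assumed to be added elsewhere), and the names beliefs_L, states, actions, K, and the variable dictionary x are assumptions.

```python
import pulp

def add_indicator_constraints(prob, beliefs_L, states, actions, K, x):
    """Add the binary-indicator constraints that replace constraint (12g).

    prob            : a pulp.LpProblem holding the rest of the CCL program
    x[(bL, k, s, a)]: the (already created) occupancy-measure variables
    """
    delta = {}
    for i, bL in enumerate(beliefs_L):
        for j, s in enumerate(states):
            row = {a: pulp.LpVariable(f"delta_{i}_{j}_{n}", cat="Binary")
                   for n, a in enumerate(actions)}
            delta[(bL, s)] = row
            # At most one action may be selected in state s for the plan rooted at b_L.
            prob += pulp.lpSum(row.values()) <= 1
            for a in actions:
                for k in range(K):
                    # Occupancy can be positive only for the selected action; since
                    # time is part of the state, x(s,a) <= 1, so the indicator itself
                    # serves as the big-M bound.
                    prob += x[(bL, k, s, a)] <= row[a]
    return delta
```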
Fig. 7

CCL program in the reward uncertainty only case, i.e. \(\forall k, k'~~P=P_k=P_{k'}\)

Reward uncertainty only Quadratic equality constraint (12g) can be avoided when the transition dynamics does not vary across possible MDPs, i.e. \(\forall k, k', P_k=P_{k'}\).Footnote 3 In this case, for the action selection at time step \(t\in [H-L,H]\), without loss of optimality, the agent needs only to plan for the Bayes-optimal Markov policy w.r.t. the mean reward \(R_{\mu _L}\) according to the belief it ended up in at time step L:

$$\begin{aligned} R_{\mu _L}(s,a)=\sum \limits _k\mu _{L,k}R_k(s,a) \end{aligned}$$

The resulting mathematical program is shown in Fig. 7. The main difference from the original CCL program in Fig. 5 is that it only introduces one occupancy measure \(x_{b_L}\) for each reachable belief \(b_L\) at time step L, instead of K sets of occupancy measures \(\{x_{b_L,k}\}_{k=1}^{K}\) in the original CCL program. The derived policy can be expressed via:

$$\begin{aligned} \pi _{L} (a|h_t) = {\left\{ \begin{array}{ll} \pi _{L} (a|b_t) = \dfrac{y(b_t,a)}{\sum \nolimits _{a'}y(b_t,a')} &{} t < L\\ \pi _{L} (a|s_t, b_L) = \dfrac{x_{b_L}(s_t,a) }{ \sum \nolimits _{a'}x_{b_L}(s_t,a')} &{} t \ge L \end{array}\right. } \end{aligned}$$

6 Empirical results

As we summarized in Sect. 2, our work is the first to define a prescriptive semantics for probabilistic commitments, and to develop algorithms that respect these semantics. Hence, in the empirical studies that follow, we predominantly focus on developing a deeper understanding of the strengths and limitations of different flavors of our algorithms. However, in an effort to illustrate empirically the difference between our approach and prior work, in our first study in the illustrative Windy L-Maze domain (Sect. 6.1), we compare to the closest related work we could identify: a non-prescriptive semantics for probabilistic commitments, and a prescriptive semantics for non-probabilistic commitments. We show how our prescriptive probabilistic commitment semantics allows agents to outperform both of these alternatives, because with it agents can balance selfish and unselfish behavior.

In Sect. 6.2, we use a small Food-or-Fire domain to show how our CCL performs in an environment with both transition and reward uncertainty, and under various choices of belief-update lookahead boundary. In the subsequent two domains of RockSample (Sect. 6.3) and Change Detection (Sect. 6.4), the number of possible posterior distributions can grow so quickly with the time horizon that CCFL becomes computationally infeasible. In RockSample, we show how the iterative version of CCL, CCIL, is able to improve performance over CCL with modest additional computational cost. In Change Detection, we perform a detailed case study on the effects of the belief-update lookahead boundary and how it should be chosen with domain-specific knowledge, along with results reconfirming the improvement of CCIL over CCL.

6.1 Windy L-maze

The purpose of the experiments in this domain is to illustrate how our prescriptive probabilistic commitment semantics can improve multi-agent planning compared to alternative semantics. The domain consists of an L-maze occupied by a commitment provider and a recipient, as shown in Fig. 8.

Fig. 8

Windy L-maze. The provider starts in the cell labeled a and can only move in the vertical corridor, and the recipient starts in the cell labeled b and can only move in the horizontal corridor. It is admissible that both agents occupy the cell labeled c at the same time step. The table on the right specifies the reward functions, where d is the distance, measured by number of cells, between cell c and the provider/the recipient. For the provider, there are three possible reward functions \(\{R^p_k\}_{k=1}^{3}\). The recipient’s reward, \(R^r\) (bottom row), is known for certain

The provider starts in the cell labeled a and can only move in the vertical corridor, and the recipient starts in the cell labeled b and can only move in the horizontal corridor. It is admissible for both agents to occupy the cell labeled c at the same time step. Let \(d^p\) and \(d^r\) be the distances, measured in number of cells, from cell c to the provider and to the recipient, respectively. For the provider, there are three possible reward functions of \(d^p\), \(\{R^p_k\}_{k=1}^{3}\), with a uniform prior:

$$\begin{aligned} \text {for }&d^p=4, R^p_1(d^p)=R^p_2(d^p)=R^p_3(d^p)=0.1 \\ \text {for }&d^p<4, R^p_1(d^p)=0, R^p_2(d^p)=-R^p_3(d^p)=d^p-4 \end{aligned}$$

The recipient’s reward, \(R^r\), is known as a function of \(d^r\): \(R^r(d^r)=0.1\) if \(d^r=3\); \(R^r(d^r)=3\) if \(d^r=0\); \(R^r(d^r)=0\) for other values of \(d^r\). The provider can move up, down, or stay in the current cell, and its moves succeed with probability one. The recipient can move left, right, or stay in the current cell. Initially, a door located in cell c is open with a strong wind blowing in such that the recipient’s moves to the left only succeed with probability 0.1, and its other moves succeed with probability one. By occupying cell c, the provider can permanently close the door, in which case the wind stops and all the recipient’s moves succeed with probability one. The two agents aim to maximize the joint expected reward up to the time horizon \(H=10\).

Because the recipient will get a significantly larger reward in cell c than in cell b, it is beneficial for the recipient if the provider could move to cell c to close the door. However, under reward functions \(R^p_1\) and \(R^p_2\), traveling down the corridor to cell c will yield less reward for the provider than staying in the starting cell a. Therefore, effective coordination between the two agents is crucial to achieving high expected joint reward, where (as we shall see) the uncertain rewards of the provider make an “all-or-nothing” commitment suboptimal compared to a probabilistic commitment.

We compare the following three commitment semantics:

  • Non-Prescriptive Probabilistic Semantics In this case, a probabilistic commitment only represents a prediction of the provider’s behavior [18, 41], rather than a prescription for how it will act. The provider computes and follows its history-dependent policy maximizing just its own local reward. It informs the recipient of the probability, \({\underline{\rho }}\), that the door will be closed at time step \(T\ge 4\) under the provider’s policy, and the recipient then computes and follows its own locally-optimal policy with respect to \({\underline{\rho }}\) by standard methods of solving MDPs. We refer to this semantics as selfish and no-commitment because the provider makes no effort to consider the preferences of the recipient when computing and executing its policy.

  • Prescriptive Non-Probabilistic Semantics This semantics is the logic-based semantics alluded to in work on detecting commitment abandonment [24], where a commitment provider will drop all else and single-mindedly pursue a commitment. In this case, the provider computes and follows its history-dependent policy that achieves the highest probability, \({\overline{\rho }}\), of closing the door at the earliest possible time step, which is \(T=4\). The recipient uses \({\overline{\rho }}\) to compute and follow its optimal policy assuming maximum help from the provider. We refer to this semantics as unselfish and full-commitment because the provider prioritizes satisfying the preferences of the recipient over its own rewards.

  • Prescriptive Probabilistic Commitment This is the semantics we advocate in this article. The provider makes a probabilistic commitment: it commits to closing the door at time step \(T=4\) with at least probability \(\rho\). It uses the CCFL algorithm to compute and follow its locally-optimal policy that respects the commitment semantics. The recipient trusts this commitment, and computes and follows its optimal policy assuming the door will be closed at time step \(T\ge 4\) with probability \(\rho\) (a minimal sketch of this recipient-side planning problem appears after this list).
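All three semantics rely on the recipient planning against a probability that the door will be closed. The sketch below computes the recipient's optimal finite-horizon value when it is told only that the door will be closed at time step \(T=4\) with probability \(\rho\). It is an illustrative reconstruction under our own modeling assumptions (per-step rewards, a corridor spanning distances 0–3 from cell c, and the door-closing event treated as an exogenous coin flip at \(t=4\)); it is not the authors' implementation.

```python
# Illustrative sketch (our assumptions): the recipient plans against a commitment
# "door closed at time step 4 with probability rho", modeling the door-closing
# event as an exogenous coin flip at t = 4.  Distances 0..3 from cell c.

from functools import lru_cache

H, T_COMMIT = 10, 4
P_LEFT_WINDY = 0.1        # success probability of a left move while the door is open

def recipient_reward(d_r):
    return {0: 3.0, 3: 0.1}.get(d_r, 0.0)

def recipient_value(rho):
    @lru_cache(maxsize=None)
    def V(t, d_r, closed):
        if t == H:
            return 0.0

        def after(nd, nclosed):
            # with probability rho the door closes as time step T_COMMIT begins
            if not nclosed and t + 1 == T_COMMIT:
                return rho * V(t + 1, nd, True) + (1 - rho) * V(t + 1, nd, False)
            return V(t + 1, nd, nclosed)

        best = after(d_r, closed)                                    # stay
        if d_r < 3:                                                  # right, away from c
            best = max(best, after(d_r + 1, closed))
        if d_r > 0:                                                  # left, toward c
            p = 1.0 if closed else P_LEFT_WINDY
            best = max(best, p * after(d_r - 1, closed) + (1 - p) * after(d_r, closed))
        return recipient_reward(d_r) + best

    return V(0, 3, False)      # recipient starts in cell b, at distance 3 from c

print(recipient_value(0.7))    # expected value if the recipient trusts rho = 0.7
```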

Table 1 Evaluation of non-prescriptive semantics, prescriptive non-probabilistic semantics, and prescriptive probabilistic commitment on the windy L-maze domain

The performance of each of the three semantics (with a few choices of \(\rho\) for our prescriptive probabilistic semantics) is shown in Table 1. Notice that even when the provider is acting entirely selfishly (the non-prescriptive probabilistic case), it predicts that it will nevertheless close the door with probability \({\underline{\rho }}=1/3\). This is because its optimal policy is to move down the corridor one step, observe the reward signal to learn exactly which reward function is the true one, and then either go immediately back to a or, with probability 1/3 (upon learning that the reward function is \(R^p_3\)), continue on to c. Following the prescriptive non-probabilistic semantics, the unselfish provider will follow a policy guaranteed to close the door (\({\overline{\rho }}=1.0\)), because its moves succeed with certainty. With the prescriptive probabilistic commitment semantics, the agents can choose a probability of closing the door \(\rho \in [0,1]\) that balances selfishness and unselfishness in the provider to attain a higher joint reward. As \(\rho\) increases, the provider’s value monotonically decreases and the recipient’s value monotonically increases. As shown in Table 1, both \(\rho =0.6\) and \(\rho =0.8\) achieve higher joint reward than \({\underline{\rho }}\) and \({\overline{\rho }}\), and \(\rho =0.7\) is better still.

These results confirm that our semantics for probabilistic commitments, coupled with algorithms for agent decision-making that respect these semantics, can lead to better joint performance than treating commitments either as inflexible logical constraints on the provider’s plan (such that it must provably satisfy the commitment) or as non-binding predictions about the likelihood the agent’s plan will happen to satisfy the commitment. Our semantics enable agents to strike a compromise between these extremes.

6.2 Food-or-Fire

The purpose of the experiment in this domain is twofold: 1) it illustrates that CCL can construct policies respecting the semantics of a given probabilistic commitment in an environment with both transition and reward uncertainty, and 2) it is small enough that we can show the effect of the belief-update lookahead boundary by experimenting with every possible choice of the boundary from zero to the time horizon.

The environment is a simple two-by-three grid maze with \(K=3\) possible scenarios, as shown in Fig. 9, where solid black lines indicate impassable walls. The prior over the three scenarios is uniform. In the “empty” scenario, the agent can move freely in four directions within the maze, and no reward signal occurs. In the “food” scenario, there are two sections of impassable wall, and food associated with a reward of +1 exists in the mid-left cell between the walls. The “fire” scenario is the same as the “food” scenario except that the food is replaced with fire associated with a reward of −1. The agent, starting in the bottom-left cell, commits to reaching the top-left cell (Exit) at the time horizon, i.e. \(T=H\), with at least probability \(\rho\). The agent can fully observe its current location but can only detect a wall by trying (and failing) to move between two adjacent cells.
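Because the wall and reward signals in this domain are deterministic given the scenario, the agent's belief update over the three scenarios reduces to eliminating scenarios that are inconsistent with what it has observed. The following sketch is our own illustrative code, with hypothetical observation labels that are not from the article.

```python
# Sketch of the scenario-belief update in Food-or-Fire (our own illustrative code;
# the observation labels below are hypothetical names we introduce for clarity).

SCENARIOS = ("empty", "food", "fire")

def update_belief(belief, observation):
    """belief: dict scenario -> probability.  observation: one of 'bumped_wall',
    'no_wall_where_expected', 'saw_food', 'saw_fire', or None (nothing informative)."""
    likelihood = {
        "bumped_wall":           {"empty": 0.0, "food": 1.0, "fire": 1.0},
        "no_wall_where_expected": {"empty": 1.0, "food": 0.0, "fire": 0.0},
        "saw_food":              {"empty": 0.0, "food": 1.0, "fire": 0.0},
        "saw_fire":              {"empty": 0.0, "food": 0.0, "fire": 1.0},
    }.get(observation, {s: 1.0 for s in SCENARIOS})
    post = {s: belief[s] * likelihood[s] for s in SCENARIOS}
    z = sum(post.values())       # positive as long as observations are consistent
    return {s: p / z for s, p in post.items()}

belief = {s: 1.0 / 3 for s in SCENARIOS}        # uniform prior
belief = update_belief(belief, "saw_food")      # e.g., food signal in the mid-left cell
print(belief)                                   # all mass on the "food" scenario
```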

Fig. 9 Food-or-fire. Left: the “empty” scenario. Middle: the “food” scenario. Right: the “fire” scenario

Because the transition dynamics vary across the three scenarios, we only implemented the deterministic CCL described in Sect. 5. Figure 10 plots the expected cumulative reward against all possible belief-update lookahead boundaries using deterministic CCL under various choices of T and \(\rho\). According to Theorem 5, monotonic performance in the belief-update lookahead boundary L cannot be guaranteed, but it turns out that the expected cumulative reward using deterministic CCL is monotonically non-decreasing with L for all choices of T and \(\rho\) we tried. Thus, anecdotally, it is not hard to find cases in which a larger L yields higher value, even though by Theorem 5 this is not guaranteed. Moreover, when L increases from two to three, we observe that the expected cumulative reward increases significantly for most choices of T and \(\rho\). This is because a belief-update lookahead boundary L of three is just sufficient to identify which scenario the agent is actually facing, by moving to the mid-left cell using three actions and reasoning about the observed reward signal of food, fire, or neither. Not surprisingly, with lower commitment probabilities, the agent is able to achieve higher expected reward. An interesting observation is that, compared with \(\rho =0.8\), the expected cumulative reward is more like a step function at \(L=3\) for \(\rho =0.5\) and \(\rho =1.0\). When \(\rho =1.0\), the agent has to reach the Exit at time T in all three scenarios, so it suffices to determine the optimal behavior as soon as the agent figures out, at time step three, which scenario it is facing. When \(\rho =0.5\), the agent would certainly reach the Exit in the “empty” and “fire” scenarios. With the uniform prior, these two scenarios already contribute a \(2/3\ge \rho =0.5\) probability of fulfilling the commitment, and therefore in the “food” scenario the agent would stay in the cell with food for the +1 reward and never exit. To achieve this behavior when \(\rho =0.5\), it suffices to use \(L=3\). For \(\rho =0.8\), the situation is more complicated in the sense that the agent also needs to reach the Exit with some positive probability in the “food” scenario, and our results show that, with deterministic CCL, using L larger than three further improves the value.
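To make the probability bookkeeping for the \(\rho =0.5\) case explicit, note that a plan that reaches the Exit at time T in the “empty” and “fire” scenarios, and stays with the food otherwise, already satisfies the commitment under the uniform prior:

$$\begin{aligned} \Pr (\text {reach Exit at time } T) \;\ge \; \Pr (\text {empty}) + \Pr (\text {fire}) \;=\; \tfrac{1}{3} + \tfrac{1}{3} \;=\; \tfrac{2}{3} \;\ge \; \rho = 0.5. \end{aligned}$$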

Fig. 10 Expected cumulative reward in the Food-or-Fire domain as a function of the commitment and the belief-update lookahead boundary

6.3 RockSample

The Food-or-Fire domain is small enough that we can afford belief-update lookahead boundaries all the way up to the time horizon. In this RockSample domain and the following Change Detection domain, the number of posterior distributions grows so quickly with the time horizon that CCFL becomes computationally infeasible. Our results show that using the iterative version of CCL, CCIL, can improve performance significantly with moderate additional computational cost.

RockSample [34] is a classic POMDP problem that models a rover exploring an unknown environment. In an instance of RockSample (n, s), the rover can move in an \(n\times n\) grid containing s rocks; when n and s become large, a large belief-update lookahead boundary becomes computationally infeasible. The locations of the rocks are known. Only some of the rocks have scientific value and are of type Good; the others are of type Bad. Each rock's type is independently and uniformly random. The task is to determine which rocks are valuable, approach and sample the valuable rocks, and leave the map by moving off its right-hand edge. At each time step, the rover can select from \(s+5\) actions: {North, East, South, West, Sample, \(Check_1,\ldots ,Check_s\)}. Each \(Check_i\) action directs the rover’s sensor to rock i, returning a noisy observation from {Good, Bad}. The noise in the observation received by executing each \(Check_i\) action is determined by the Manhattan distance between the rover and the rock being checked: the probability of receiving a correct observation is 0.9, 0.7, and 0.5 when the Manhattan distance is 0, 1, and at least 2, respectively. In an instance of RockSample (n, s), the s rocks have \(2^s\) possible combinations of type assignments. We treat these as \(K=2^s\) possible MDPs that differ only in reward, and solve the program in Fig. 7 to construct CCL and CCIL policies. During execution, the observations from \(Check_i\) actions are model-informative, suggesting which MDP is more likely.
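As a concrete illustration of this sensor model and of the belief over the \(2^s\) type assignments, the following sketch performs one Bayesian update after a \(Check_i\) observation. It is our own code, not the benchmark's reference implementation, and the variable names are ours.

```python
# Sketch of the RockSample sensor model and the Bayesian belief over rock types
# (our illustrative reconstruction; names are not from the authors' code).

from itertools import product

def p_correct(manhattan_distance):
    """Accuracy of a Check_i observation as a function of distance to rock i."""
    return {0: 0.9, 1: 0.7}.get(min(manhattan_distance, 2), 0.5)

def update_rock_belief(belief, rock_i, observation, distance):
    """belief: dict mapping a type assignment (tuple of 'Good'/'Bad' per rock)
    to its probability; observation: the noisy reading returned for rock_i."""
    acc = p_correct(distance)
    post = {}
    for assignment, prob in belief.items():
        likelihood = acc if assignment[rock_i] == observation else 1.0 - acc
        post[assignment] = prob * likelihood
    z = sum(post.values())
    return {a: p / z for a, p in post.items()}

s = 2                                                     # RockSample(2, 2)
prior = {a: 1.0 / 2**s for a in product(("Good", "Bad"), repeat=s)}
belief = update_rock_belief(prior, rock_i=0, observation="Good", distance=1)
print(belief)                                             # rock 0 now more likely Good
```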

Fig. 11 RockSample instances: a RockSample (2,2), b RockSample (4,4)

In the original RockSample problem, the rover chooses actions to execute until it moves off the map and receives a positive reward. We adapted it to incorporate the probabilistic commitment: the rover does not receive any reward by moving off the map, but it has to move off the map by the time horizon, i.e. \(T=H\), with at least the commitment probability \(\rho\). We scale the reward to the range \([-1, 1]\): the rover receives a reward of 1.0 for sampling a rock of type Good, a reward of −1.0 for sampling a rock of type Bad, and no reward for re-sampling the same rock.

We evaluated CCL and CCIL on instances of RockSample (2, 2) and RockSample (4, 4) (Fig. 11). Table 2 contains the expected reward and run time in RockSample (2, 2) for commitment time \(T=10\) and commitment probability \(\rho =1.0\) with various choices of L and I. The run time for CCIL is the sum of the CPU times of its iterations. Note that because 1) the rover can obtain fairly accurate observations since it is always close to the rocks, 2) the types of rocks are uniformly random, and 3) a time horizon of 10 is large enough, the optimal behavior collects one Good rock in expectation, yielding an expected cumulative reward close to 1.0. For CCL, the results in Table 2 indicate that a larger belief-update lookahead boundary indeed improves the expected reward, but the computation time also increases dramatically. We can see that CCIL achieves comparable expected reward with much less computation time than CCL. Although CCIL (\(L=3, I=1\)), CCIL (\(L=4, I=4\)), and CCL (\(L=8\)) all achieve near-optimal expected reward, CCIL (\(L=3, I=1\)) and CCIL (\(L=4, I=4\)) use much less computation time than CCL (\(L=8\)).

Table 2 Results on RockSample (2,2), \(|{\mathcal {S}}|=177,|{\mathcal {A}}|=7, |{\mathcal {O}}|=4\) with \(T=10\), \(\rho =1.0\)
Table 3 Results on RockSample (4, 4), \(|{\mathcal {S}}|=4097, |{\mathcal {A}}|=9, |{\mathcal {O}}|=8\), with \(T=15\), \(\rho =1.0\)

Table 3 contains the results in RockSample (4, 4) for commitment time \(T=15\) and probability \(\rho =1.0\). With \(T=15\), there is just enough time for the rover to correctly detect three rocks, sample the good ones, and move off the map. Since a rock is Good with probability 0.5, the expected cumulative reward of the optimal behavior is close to 1.5. For RockSample (4, 4), we can see that CCL can only scale to relatively small belief-update lookahead boundaries. The computation time grows dramatically, and we run out of memory when \(L=5\). CCL achieves an expected cumulative reward of 0.9 for \(L=4\), which means that a larger L is needed to find the near-optimal behavior. CCIL performs much better than CCL because it iteratively re-plans during execution. The performance of CCIL (\(L=1, I=1\)) is between that of CCL (\(L=3\)) and CCL (\(L=4\)). CCIL (\(L=2, I=2\)), CCIL (\(L=2, I=1\)), and CCIL (\(L=3, I=3\)) all achieve behavior with expected cumulative reward close to 1.3, which CCL cannot achieve using a moderate amount of computation time. These three choices of (L, I) achieve comparable expected reward (no statistically significant difference), with CCIL (\(L=2, I=2\)) being the fastest because its iterative lookahead is less frequent than that of CCIL (\(L=2, I=1\)) and shallower than that of CCIL (\(L=3, I=3\)).

6.4 Change Detection

In Change Detection, we perform a detailed case study on the effects of the belief-update lookahead boundary, where time horizon H is short enough so that we can experiment with every \(L\le H\) for CCL. We also experiment with a larger H for which CCFL is computationally infeasible, to develop further intuitions about balancing lookahead with iteration to achieve good performance with reasonable computation.

Change Detection is a classic constrained POMDP problem [30]. The agent can only partially observe the environment, and at some point the environment will transition into a state where the agent should sound the alarm. The agent aims to minimize the delay in alerting (sounding the alarm) after the transition, while keeping the probability of a false alarm below a given threshold, which is referred to as the false alarm (F.A.) tolerance. Formally, the state space and action space are \({\mathcal {S}}\)={PreChange, PostChange, PostAlarm, FalseAlarm} and \({\mathcal {A}}\)={NoAlarm, Alarm}, respectively. The environment starts in PreChange, and transitions to PostChange at a random time step if the agent has not yet performed action Alarm. Specifically, the problem has a geometric change time parameter \(\eta\): at every time step, if the state is still PreChange, it transitions to PostChange with probability \(\eta\). Once the agent performs action Alarm, the state transitions to PostAlarm from PostChange with a positive reward, or to FalseAlarm from PreChange with no reward. The commitment is to avoid reaching FalseAlarm with at least a given probability. To encourage early detection, the agent receives a reward of +1.0 if it executes action Alarm immediately after the transition to PostChange, with the reward discounted for each subsequent time step of delay. The states are not fully observable. Instead, the agent makes an observation o every time step from the observation space \({\mathcal {O}}\), suggesting whether the environment has changed. The probability of making a specific observation is determined by probability mass functions \(f_0, f_1:{\mathcal {O}}\mapsto [0,1]\) when the environment is in PreChange and PostChange, respectively. In our experiments, the observation space has size \(|{\mathcal {O}}|=3\). The reward discount factor is set to \(\gamma =0.8\). The PreChange and PostChange observation distributions are

$$\begin{aligned} f_0(o_1)=0.6, f_0(o_2)=0.3, f_0(o_3)=0.1,\\ f_1(o_1)=0.2, f_1(o_2)=0.4, f_1(o_3)=0.4. \end{aligned}$$

Parameter \(\eta\) provides the agent with the prior distribution of the change time. After making observations, the agent can use Bayes’ rule to calculate the posterior distributions.
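For concreteness, the sketch below (our own illustrative code) performs this Bayesian update for the posterior probability that the change has already occurred, under the assumption that the PreChange-to-PostChange transition can happen just before each observation is generated (one of several reasonable timing conventions). It tracks only this summary probability; the belief used in the article, described next, additionally retains the distribution over possible change times.

```python
# Sketch of a change-time posterior update in Change Detection (our own code;
# timing convention: the change may occur just before each observation).

eta = 0.1                                    # geometric change-time parameter
f0 = {"o1": 0.6, "o2": 0.3, "o3": 0.1}       # PreChange observation distribution
f1 = {"o1": 0.2, "o2": 0.4, "o3": 0.4}       # PostChange observation distribution

def update_changed_prob(p_changed, obs):
    """p_changed: current probability that the change has already happened.
    Returns the posterior after one more time step and observation obs."""
    q = p_changed + (1.0 - p_changed) * eta   # predict: change may occur this step
    num = q * f1[obs]                         # likelihood of obs if changed
    den = num + (1.0 - q) * f0[obs]           # ... plus likelihood if not changed
    return num / den

p = 0.0                                       # environment starts in PreChange
for obs in ["o1", "o3", "o3", "o2"]:          # hypothetical observation stream
    p = update_changed_prob(p, obs)
    print(round(p, 3))
```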

We consider the finite-horizon decision problem, with the commitment time \(T=H\) equal to the time horizon, and define the state of the Change Detection problem as \(s=\langle t, Alarmed\rangle\), where Alarmed is a Boolean that is true if the agent executed action Alarm at any time step before t, and false otherwise. The current time step t and the Boolean Alarmed are both fully observable to the agent. We define the belief as \(b=\langle s, \mu \rangle\), where state s is augmented by a probability mass function \(\mu\) that gives the probability of every possible change time up to the horizon.

Figure 12 contains the results of experimenting with CCL on a Change Detection instance with horizon \(H=T=10\), where CCFL is computationally feasible. We experimented with two choices of the geometric change time parameter, \(\eta =0.1, 0.2\), and four choices of the false alarm (F.A.) tolerance. When the F.A. tolerance is 0.0, the agent is forbidden to execute the Alarm action if there is any possibility of a false alarm, and therefore the expected cumulative reward is 0 for any choice of the belief-update lookahead boundary L. Otherwise, the expected cumulative reward is monotonically increasing with L. Moreover, choosing a large L is most helpful when the geometric change time parameter \(\eta\) is small (Fig. 12, left). For \(\eta =0.1\) (Fig. 12, left), the expected reward rises anywhere from about 3-fold (for tolerance = 0.2) to 7-fold (for tolerance = 0.05), while for \(\eta =0.2\) (Fig. 12, right) it rises anywhere from about 1.5-fold (for tolerance = 0.2) to 3.5-fold (for tolerance = 0.05). So for the same tolerance, lookahead makes roughly twice the impact when \(\eta =0.1\) as when \(\eta =0.2\). A small \(\eta\) suggests that the change is more likely to happen later, and therefore a larger L is more likely to anticipate it. For both choices of \(\eta\), as the lookahead L increases, the relative increase in expected reward is smaller when the F.A. tolerance is larger. This is because a larger tolerance inherently earns more reward regardless of lookahead, and hence there is less reward for lookahead to recoup. These results suggest that, more generally, the value of L should be chosen based at least upon: (1) how far into the future the most meaningful changes to the belief state will occur (as captured by \(\eta\) in this case), (2) how sensitive the agent’s reward is to making a more informed decision (as captured by the F.A. tolerance in this case), and (3) how dramatically computation costs rise with farther lookahead (where in this case the branching factor of 2 (change or no change) is fairly low).

Fig. 12 Results of CCL on Change Detection with \(T=10, \gamma =0.8\)

Table 4 Results on Change Detection with F.A. tolerance of 0.2, \(T=50, \eta =0.04, \gamma =0.8\)

We have also experimented with a larger horizon, \(H=T=50\), for which CCFL is not computationally affordable. The geometric change time parameter is \(\eta =0.04\). As we just saw, a low value like this makes the change more likely to happen later and thus calls for farther lookahead. The F.A. tolerance is set to 0.2. Table 4 contains the expected reward and run time for CCL and CCIL with various choices of L, and of I when applicable. The run time of CCL grows dramatically with L. The expected reward, though, grows relatively slowly, because these lookaheads are still very short for such a small \(\eta\), which demands large lookahead. This can be inferred from Fig. 12 (left), where \(\eta =0.1\) is larger than 0.04 and we still see a steep increase in reward at \(L=H/2\). Nevertheless, there is still a 3-fold increase in reward as we increase L for CCL until the computation budget is reached. For CCIL, we experiment with \(L=2,4,6\) and \(I=1,L/2,L\). Unsurprisingly, with more frequent iterative lookahead (smaller I), both the expected reward and the run time increase. CCIL (\(L=4,I=1\)) achieves reward that is higher than any CCL within the computation budget. Both CCIL (\(L=6,I=1\)) and CCIL (\(L=6,I=3\)) double the reward of CCL (\(L=10\)), the largest L within the computation budget, yet use much less computation. These results again verify the effectiveness of the iterative lookahead strategy in CCIL. Recall that, in RockSample, setting \(I=L\) achieves significantly larger reward than CCL with the same L. However, in Change Detection, \(I=L\) achieves no higher reward than CCL for the values of L we consider. We conjecture that this is because the belief changes every time step in Change Detection, and perhaps in a way that is critical for the agent’s future decisions, making frequent iterative lookahead necessary, whereas in RockSample it might take several steps before the belief changes (i.e., until a \(Check_i\) action is taken). From the results of \(L=4,6\) and \(I=1,L/2\), we observe that with larger L, the agent can use larger I without sacrificing too much reward. Overall, CCIL (\(L=4,I=1\)) and CCIL (\(L=6,I=3\)) achieve the best compromise across a wide range of tradeoffs between solution quality and computational cost.

7 Conclusion

In this article, we have argued in favor of an operational semantics we defined for a commitment provider that is operating under model uncertainty. Our semantics is based on what a commitment provider can control: its own actions. Specifically, we considered a decision-theoretic setting where the agent is making sequential decisions in one out of several MDPs with a known prior. Fulfilling a commitment corresponds to pursuing a course of action, beginning at the time the commitment was made and evaluated over the known prior, that has sufficient likelihood of achieving the intended state at the time prescribed by the commitment. Under this semantics, the agent fulfills its commitment by following a commitment-constrained policy even if, due to bad luck, the desired outcome is not realized. Based on this semantics, we developed Commitment Constrained Lookahead (CCL), a novel algorithm parameterized by the belief-update lookahead boundary, that constructs semantics-respecting policies offline for the provider. We empirically compared our new semantics, operationalized in CCL, with prior logical and predictive semantics to illustrate where and why our semantics is superior. We also analytically and empirically investigated the impact of the belief-update lookahead boundary, which makes an explicit tradeoff between the computational cost and the performance of the computed policy. We further extended CCL to Commitment Constrained Iterative Lookahead (CCIL), which iteratively adjusts the policy online according to the evolving posterior distribution over the true environment, while still respecting the commitment semantics. Our empirical results show that CCIL can achieve the same performance as CCL with much less computational overhead.

There are a number of interesting directions for future work. As we have mentioned, one is to develop algorithms that incorporate our commitment semantics in infinite state and action spaces. Moreover, our emphasis in this article has been on how the provider should constrain its behavior for trustworthy commitment achievement. In open systems where trust needs to be earned, interesting questions arise as to how easily one could verify whether an agent has acted in good faith on a commitment when the outcomes of its decisions are observable but the decisions themselves are not. Even with a provider that trustworthily pursues the commitment, making the commitment useful also requires answering how the recipient should properly model the commitment and plan accordingly. Finally, with semantics and mechanisms for representing, pursuing, and modeling a given commitment in place, we are prepared to address the question of which commitments cooperating agents should agree to make.