1 Introduction

The Markov decision process (MDP) is an effective tool for modeling and decision-making in dynamic environments and is used in many disciplines such as robotics, automatic control, economics, manufacturing and so on [1,2,3,4,5]. MDPs capture the intrinsic uncertainty of a system, namely the randomness of its future transitions. However, the indeterminacy of the parameters (transition probabilities and rewards) of an MDP is a source of uncertainty that is not well addressed in many MDP problems. Moreover, in practice it is often very difficult to obtain a precise representation of the MDP parameters. There are various reasons for this, such as (i) imprecise or conflicting information given by the content experts [6]; (ii) insufficient data from which to build structures for parameter estimation [7]; (iii) non-stationary transition probabilities due to insufficient and/or hidden state information [7]; and (iv) unpredictable but prevailing events impacting the system [8].

There are generally two categories of parameter uncertainty in MDPs. In the first, one or both of the parameters are completely unknown. For such problems, the theory of the Bayes-adaptive Markov decision process (BAMDP) has been established using the concept of Bayesian inference [9, 10]. In a BAMDP, prior information about the unknown parameters is represented by a parameterized distribution, and Bayesian inference is used to incorporate any new information and update that distribution, so that the exploration–exploitation problem is addressed during learning and sampling. [11] builds a rigorous framework rooted in information theory for solving MDPs with unknown transition probabilities. These theories, however, are very difficult to implement for real-world problems. In the second category, the parameters lie in a given range; in other words, there is a pre-defined uncertainty set to which the parameters belong. In this context, the more traditional approach to mitigating parameter uncertainty in MDPs is robust dynamic programming. Standard robust dynamic programming is a “max–min” approach in which the decision maker seeks a policy that maximizes the worst-case performance when the transition probabilities are allowed to vary within an uncertainty set [6, 12,13,14,15]. One of the key results is that the max–min problem is tractable for instances that satisfy the rectangularity property [12, 13]. Essentially, rectangularity means that observing the realization of a transition probability parameter gives no information about the values of the parameters for any other state-action-time triplet. Because the parameter values for different state-action-time triplets are independent of one another, the problem can be decomposed so that each worst-case parameter is found via an optimization problem called the inner problem. [12, 13] provide algorithms for solving the max–min problem for a variety of uncertainty sets by providing polynomial-time methods for solving the corresponding inner problem. While rectangular uncertainty sets are desirable from a computational perspective, they can give rise to policies that are overly conservative.

Recently, there has been a line of research on mitigating the impact of parameter uncertainty by incorporating multiple scenarios of the parameters into the solution of the MDP. The parameter uncertainty in the MDP is encoded by allowing multiple scenarios of the rewards and transition probabilities. A scenario is one possible realization of the uncertainty and may come from a possible realization of the system, an observation of the system, or a simulation model; this model therefore falls into the second category of parameter uncertainty above. Each scenario is referred to as a “model” of the MDP in [4], where all models are defined on the same state space, action space, and set of decision epochs, but each model may vary in terms of its rewards and transition probabilities. Such an MDP is known as the multi-model Markov decision process in [4] and the concurrent Markov decision process in [5]. We refer to it as the Markov decision process with multiple scenarios of the parameters, or the multi-scenario MDP. [4, 5, 16, 17] propose designing policies that maximize a weighted value across multiple scenarios. [18] proposes minimizing the maximum regret with respect to each scenario for finite horizon multi-scenario MDPs. [17] and [16, 19] propose a percentile optimization approach for finite horizon and infinite horizon multi-scenario MDPs, respectively. So far, the exact solution methods for this problem have relied on mixed-integer programming (MIP) formulations in which binary decision variables encode the policy and continuous variables encode the value functions for each scenario of the multi-scenario MDP [4]. Although the MIP can be used to find exact solutions, these problems are NP-hard [4, 17], and the MIP solution methods have been limited to small multi-scenario MDPs.

The study of multi-scenario MDPs is motivated by parameters estimated under statistical uncertainty [4]: the resulting estimates can be dramatically different or even conflict with one another. This is especially common in clinical/medical research, where MDPs have been used successfully to design optimal treatment and screening protocols, yet the longitudinal observational data used to characterize the MDP model are often very limited due to the patient population and the cost of acquisition. In this context, there is a need to design policies that are generally favorable across several parameter possibilities, each known as a model or scenario in the related literature. Moreover, we believe that when a policy is actually implemented on an individual case in practice, the realized state transitions contain information that can be used to infer how likely the individual is to follow each scenario. And with an initial set of weights, or distribution, over the scenarios, the policy is developed considering all future possibilities, so that it is optimal even before any information is collected or any action is implemented.

So in this article, we propose a new theoretical framework for the weighted value problem of multi-scenario MDPs together with methods for computing its exact solution. We introduce the concepts of the scenario belief and the scenario expectation to formulate the expected total discounted reward of a policy, and establish a new framework named the double-factored Markov decision process (DFMDP). The two factors in “double-factored” are the physical state and the scenario belief of the system, which together act as a sufficient statistic for the past history of observations and actions in the multi-scenario MDP. We will demonstrate that these two factors summarize all the information gained up to the current stage and thus provide complete information for future decision-making. We analyze four classes of policies for the finite horizon DFMDPs: double-factored history-dependent randomized, double-factored history-dependent deterministic, double-factored Markovian randomized, and double-factored Markovian deterministic policies. It is shown that there exists a double-factored Markovian deterministic policy which is optimal among all of these policies, and an exact solution method over the double-factored Markovian deterministic policies is proposed. We also formulate the infinite horizon DFMDPs in a later section, but the focus of this paper remains on finite horizon problems. With this framework, we are able to develop algorithms that scale to real-world problems and solve exactly the weighted value problem of the multi-scenario MDPs, with the side benefit of continually updating the scenario distribution based on the full history of observations and actions. So beyond the weighted value problem, which is our primary goal, this framework can also be extended to learning or recognition problems.

Beyond the clinical setting, where multiple well-established authorities or studies may be available to build the MDP parameters, consider the context of e-commerce, where MDP-based recommendation systems have been used [20] and there are ample data to partition the customer population into sub-categories based on census data and shopping behavior. When a new customer shops online, we can make better tailored recommendations as we gain confidence in the sub-category the customer fits, based on his/her past reactions. Yet even before the customer performs any action, we can still make recommendations under an initial belief, or set of weights, over all sub-categories that maximizes the expected objective, measured by either sales or click-through rate.

This article is organized as follows. In Sect. 2, the definition and notation of the multi-scenario MDPs are presented. In Sect. 3, the finite horizon and the infinite horizon DFMDPs are formulated. Four classes of policies are discussed and an exact solution method is proposed for the finite horizon DFMDPs. The computational experiments involving three sets of test instances for comparing several solution methods for the finite horizon DFMDPs are illustrated in Sect. 4. Section 5 summarizes the work presented in this article and shares plans for future research on the relevant topics.

2 Background and Notation

2.1 Standard MDP

We denote the standard MDP by a 5-tuple (T, S, A, (p,r), s0) where T = {0, 1,\(\cdots\), Z − 1} is the finite discrete set of decision epochs, S is the finite state set, A is the finite action set, and (p,r) is the parameter pair in which p is the |S| ×|S| ×|A| transition probability matrix whose element \(p_{{s,{\text{s}}^{\prime } }}^{a}\) is the probability of ending in state s′ ∈ S if the system performs action a ∈ A in state s ∈ S, and r is the |S| ×|S| ×|A| reward matrix whose element \(r_{{s,{\text{s}}^{\prime } }}^{a}\) is the reward obtained by ending in state s′ ∈ S if the system performs action a ∈ A in state s ∈ S; s0 ∈ S is the initial state. Moreover, we denote the terminal reward by \({\varvec{r}}^{0} = \left[ {r_{1}^{0} , \cdots ,r_{\left| S \right|}^{0} } \right]^{\top}\) and the discount factor by γ ∈ [0, 1].

Decision epoch t is also called time t. The interval between two successive decision epochs is called a period. We consider the Markov deterministic policy π = (x0, x1, ···, xZ-1) ∈ Π where xt: S → A is the decision rule at epoch t ∈ T, which assigns an action at = xt (st) to each state st ∈ S, and Π is the policy set. The goal of the decision maker is to specify a policy π ∈ Π that maximizes the expected total discounted reward over the planning horizon [1]

$$\mathop {\max }\limits_{\pi \in \Pi } {\text{E}}_{{s_{0} }}^{\pi } \left\{ {\mathop \sum \limits_{t = 0}^{Z - 1} \gamma^{t} r_{{s_{t} }}^{{a_{t} }} + \gamma^{Z} r_{{s_{Z} }}^{0} } \right\},\forall\, s_{0} \in S,$$
(1)

where \(r_{s}^{a} = \sum\limits_{s' \in S} {p_{{s,s^{\prime } }}^{a} } r_{{s,s^{\prime } }}^{a}\). The optimal policy of (1) can be found by the backward induction algorithm in [1].
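For concreteness, the following is a minimal Python/NumPy sketch of the backward induction recursion for problem (1). The function name `backward_induction` and the array layouts are our own illustrative assumptions, not code from the paper.

```python
import numpy as np

def backward_induction(p, r, r_term, gamma, Z):
    """Backward induction for the finite horizon problem (1).

    p[a, s, s1] : probability of ending in s1 when action a is taken in state s
    r[a, s]     : immediate expected reward r_s^a
    r_term[s]   : terminal reward r_s^0
    gamma       : discount factor, Z : number of decision epochs
    Returns value functions V[t, s] and decision rules x[t, s].
    """
    nA, nS, _ = p.shape
    V = np.zeros((Z + 1, nS))
    x = np.zeros((Z, nS), dtype=int)
    V[Z] = r_term                       # boundary condition: V_Z = r^0
    for t in range(Z - 1, -1, -1):      # t = Z-1, ..., 0
        Q = r + gamma * (p @ V[t + 1])  # Q[a, s] = r_s^a + gamma * sum_{s1} p_{s,s1}^a V_{t+1}(s1)
        x[t] = Q.argmax(axis=0)         # optimal decision rule x_t
        V[t] = Q.max(axis=0)
    return V, x
```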

The Bellman equations for the infinite horizon MDP are

$$V^{*} \left( s \right) = \mathop {{\text{max}}}\limits_{a \in A} \left\{ {r_s^a + \gamma \mathop \sum \limits_{{s^{\prime } \in {\text{S}}}} p_{s,s'}^a V^{*} \left( {s^{\prime } } \right)} \right\}, \forall\, s \in S,$$
(2)

where \({V}^{*}\left(s\right)\) is the optimal value function of state s for the infinite horizon MDP. The optimal stationary and deterministic policy of this problem can be found by the value iteration or the policy iteration in [1].
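For completeness, a minimal sketch of value iteration for the Bellman equations (2), using the same illustrative array layout as the sketch above; the tolerance and iteration cap are our own choices.

```python
import numpy as np

def value_iteration(p, r, gamma, tol=1e-8, max_iter=100_000):
    """Value iteration for the Bellman equations (2); same array layout as above."""
    nA, nS, _ = p.shape
    V = np.zeros(nS)
    for _ in range(max_iter):
        V_new = (r + gamma * (p @ V)).max(axis=0)   # one application of the Bellman operator
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = (r + gamma * (p @ V)).argmax(axis=0)   # greedy stationary deterministic policy
    return V, policy
```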

2.2 Multi-Scenario MDP

We denote the multi-scenario MDP by a 6-tuple \((T,S,A,C,{J}_{{\varvec{p}},{\varvec{r}}},\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle )\) where T, S and A are as defined above, C is the finite discrete set of scenarios, Jp,r = {(p1, r1),···,(p|C|, r|C|)} is the set of parameter pairs in which (pk, rk) is the parameter pair of scenario k\(\in\)C, whose elements \(p_{{s,{\text{s}}^{\prime } ,k}}^{a}\) and \(r_{{s,{\text{s}}^{\prime } ,k}}^{a}\) have the same definitions as \(p_{{s,{\text{s}}^{\prime } }}^{a}\) and \(r_{{s,{\text{s}}^{\prime } }}^{a}\) above, and \(\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle\) is the ordered pair consisting of the initial state s0\(\in\)S and the initial probability distribution of the scenarios, \({{\varvec{b}}}_{{s}_{0}}={[{b}_{{s}_{0},1},\cdots,{b}_{{s}_{0},|C|}]}^{\top}\in {\Delta }^{C}\), where ΔC denotes the set of probability distributions on the set C.

For ease of description, we refer to the standard MDP with the parameter pair (pk, rk), k ∈ C, as MDP k, as in [4]. We use \(r_{s,k}^{a} = {\mathop \sum\limits_{s' \in S}} {p_{{s,s^{\prime } ,k}}^{a} } r_{s,s\prime ,k}^{a}\) to denote the immediate expected reward in MDP k when action a ∊ A is taken in state s\(\in\)S and \({{\varvec{r}}}_{k}^{0}={[{r}^{0}_{1,k},\,\dots\,,{r}_{|S|,k}^{0}]}^{\top}\) to denote the terminal reward in MDP k.

Given a multi-scenario MDP \((T,S,A,C,{J}_{{\varvec{p}},{\varvec{r}}},\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle )\), the value of a policy π ∈ Π in MDP k, for a specific k ∈ C, is given by its expected total discounted reward evaluated with the parameter pair (pk, rk):

$${V}_{k}^{\pi }\left({s}_{0}\right)={E}_{{s}_{0}}^{\pi }\left[\sum_{t=0}^{Z-1}{\gamma }^{t}{r}_{{s}_{t,}k}^{{a}_{t}}+{\gamma }^{Z}{r}_{{s}_{Z},k}^{0}\right].$$
(3)

The weighted value of any policy π ∈ Π in the multi-scenario MDP is defined by

$${W}^{\pi }({s}_{0},{b}_{{s}_{0}})=\sum_{k\in C}{b}_{{s}_{0},k}{V}_{k}^{\pi }\left({s}_{0}\right)=\sum_{k\in C}{b}_{{s}_{0},k}{E}_{{s}_{0}}^{\pi }\left[\sum_{t=0}^{Z-1}{\gamma }^{t}{r}_{{s}_{t},k}^{{a}_{t}}+{\gamma }^{Z}{r}_{{s}_{Z},k}^{0}\right],$$
(4)

and the weighted value problem (WVP) is defined as the problem of finding a solution to

$${W}^{*}\left({s}_{0},{b}_{{s}_{0}}\right)=\underset{\pi \in \Pi }{\text{max}}{W}^{\pi }\left({s}_{0},{b}_{{s}_{0}}\right)=\underset{\pi \in \Pi }{\text{max}}\left\{\sum_{k\in C}{b}_{{s}_{0},k}{E}_{{s}_{0}}^{\pi }\left[\sum_{t=0}^{Z-1}{\gamma }^{t}{r}_{{s}_{t},k}^{{a}_{t}}+{\gamma }^{Z}{r}_{{s}_{Z},k}^{0}\right]\right\}.$$
(5)

We also have the set of policies Π* = {π: Wπ = W*} \(\subseteq\) Π that achieve the maximum in (5) (see [4]).

Ref. [4] proposes two approximation methods and an MIP formulation for Eq. (5) and shows that the WVP over the Markov deterministic policy class is an NP-hard problem.
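Although finding the maximizer in (5) is NP-hard, evaluating the weighted value of a fixed Markov deterministic policy is straightforward: run a separate policy evaluation (3) in each scenario and weight the results as in (4). The following is a minimal sketch of this evaluation; the function names and array layouts are illustrative assumptions, not part of any existing code base.

```python
import numpy as np

def scenario_value(p_k, r_k, r0_k, policy, gamma, s0):
    """V_k^pi(s0) in (3): evaluate a Markov deterministic policy in the single MDP k.

    p_k[a, s, s1], r_k[a, s], r0_k[s] : parameters (p_k, r_k) and terminal reward of MDP k
    policy[t, s] : action chosen at epoch t in state s
    """
    Z, nS = policy.shape
    V = r0_k.copy()                                  # V_Z = terminal reward r_k^0
    for t in range(Z - 1, -1, -1):
        a = policy[t]                                # actions for all states at epoch t
        P = p_k[a, np.arange(nS)]                    # P[s, s1] = p_{s,s1,k}^{x_t(s)}
        V = r_k[a, np.arange(nS)] + gamma * (P @ V)
    return V[s0]

def weighted_value(scenarios, b0, policy, gamma, s0):
    """W^pi(s0, b_{s0}) in (4): the belief-weighted sum of per-scenario values.

    scenarios : list of (p_k, r_k, r0_k) tuples, one per scenario k; b0 : weights b_{s0}
    """
    return sum(b0[k] * scenario_value(p_k, r_k, r0_k, policy, gamma, s0)
               for k, (p_k, r_k, r0_k) in enumerate(scenarios))
```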

3 Double-Factored Markov Decision Process

In this section, a novel framework named the double-factored Markov decision process (DFMDP) is formulated for the WVP of the multi-scenario MDP. A backward induction algorithm is proposed for computing the exact solutions of DFMDPs.

3.1 Scenario Belief and Its Update

The weights in (5) are one’s estimate of the probabilities that the scenarios occur in the multi-scenario MDP. We regard \({{\varvec{b}}}_{{s}_{0}}\) as only the initial probability distribution of the scenarios; it will change as the system evolves. In fact, the system’s state transitions under specific actions contain information about the scenario probabilities. The decision maker can use this information to update the scenario probabilities according to the observed state transitions as the system evolves. For that reason, we introduce the scenario belief bt as follows to describe the scenario probability distribution at time t:

$${{\varvec{b}}}_{t}={[{b}_{t,1},\cdots,{b}_{t,k},\cdots ,{b}_{t,|C|}]}^{\top}, {b}_{t,k}\geqslant 0, \forall\, k\in C \quad \text{and} \quad \sum_{k\in C}{b}_{t,k}=1.$$

In order to derive the update function for the scenario belief, we define gt as the total available information in the multi-scenario MDP at decision epoch t. Since at−1 denotes the action implemented at decision epoch t – 1 and st the state observed at decision epoch t, we have

$${g}_{t}=\left({s}_{t},{a}_{t-1},{g}_{t-1}\right).$$
(6)

Then the elements of the scenario belief are defined as

$${b}_{t,k}=\mathrm{Pr}\left({c}_{k}|{g}_{t}\right), \forall\,k\in C,$$
(7)

where ck denotes the event that the realization of the system’s scenario is k. Substituting (6) into (7) and applying Bayes' rule yields

$${b}_{t,k}=\mathrm{Pr}\left({c}_{k}|{s}_{t},{a}_{t-1},{g}_{t-1}\right)=\frac{\mathrm{Pr}\left({s}_{t}|{c}_{k},{a}_{t-1},{g}_{t-1}\right)\mathrm{Pr}\left({c}_{k}|{a}_{t-1},{g}_{t-1}\right)}{\mathrm{Pr}\left({s}_{t}|{a}_{t-1},{g}_{t-1}\right)}, \forall\,k\in C.$$
(8)

In (8), the first probability in the numerator is the transition probability of scenario k, i.e., \(\mathrm{Pr}\left({s}_{t}|{c}_{k},{a}_{t-1},{g}_{t-1}\right)={p}_{{s}_{t-1},{s}_{t},k}^{{a}_{t-1}}\) since \({g}_{t-1}=({s}_{t-1},{a}_{t-2},{g}_{t-2})\) contains st−1. The prior probability \(\mathrm{Pr}\left({c}_{k}|{a}_{t-1},{g}_{t-1}\right)\) is independent of at−1 given gt−1, since the choice of action at time t − 1 is completely under the decision maker's control and therefore carries no information about the scenario, so \(\mathrm{Pr}\left({c}_{k}|{a}_{t-1},{g}_{t-1}\right)=\mathrm{Pr}\left({c}_{k}|{g}_{t-1}\right)={b}_{t-1,k}\). The denominator is simply the sum of the numerator over all possible values of k, one for each scenario. So we have

$${b}_{t,k}=\frac{{b}_{t-1,k}{p}_{{s}_{t-1},{s}_{t},k}^{{a}_{t-1}}}{{\sum }_{k\mathrm{^{\prime}}\in C}{b}_{t-1,k\mathrm{^{\prime}}}{p}_{{s}_{t-1},{s}_{t},k\mathrm{^{\prime}}}^{{a}_{t-1}}} , \forall\,k\in C.$$
(9)

We can see from (9) that bt is a function of st, st−1, bt−1 and at−1. We therefore use \({{\varvec{b}}}_{{s}_{t}}\) to denote the scenario belief when state st is observed at time t, in order to distinguish it from the beliefs associated with other observed states; the time index t then appears only as the subscript of st, which simplifies the notation. With this in mind, Eq. (9) can be rewritten as

$${b}_{{s}_{t},k}=\frac{{b}_{{s}_{t-1},k}{p}_{{s}_{t-1},{s}_{t},k}^{{a}_{t-1}}}{\sum _{k\mathrm{^{\prime}} \in C}{b}_{{s}_{t-1},k\mathrm{^{\prime}}}{p}_{{s}_{t-1},{s}_{t},k\mathrm{^{\prime}}}^{{a}_{t-1}}}, \forall\,{a}_{t-1}\in A, {s}_{t-1},{s}_{t}\in S, k\in C.$$
(10)

Equation (10) can also be expressed as a function \({{\varvec{b}}}_{{s}_{t}}=\tau \left({s}_{t-1},{{\varvec{b}}}_{{s}_{t-1}},{a}_{t-1},{s}_{t}\right)\), where τ(·) is called the scenario belief update function. We stipulate that \({{\varvec{b}}}_{{s}_{t}}=\mathbf{0}\) if the denominator in (10) equals zero, which corresponds to the situation where state st is unreachable at time t. Whenever the denominator is positive, \({\varvec{b}}_{{s_{t} }} \in \Delta^{C}\), so ΔC is also called the scenario belief space.
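A minimal implementation of the scenario belief update (10), including the zero-belief convention for unreachable states, may look as follows; the function name `tau` mirrors the notation τ(·) in the text, and the array layout is an illustrative assumption.

```python
import numpy as np

def tau(s_prev, b_prev, a_prev, s_new, p):
    """Scenario belief update (10).

    p[k, a, s, s1] : transition probabilities of scenario k
    b_prev         : scenario belief b_{s_{t-1}} (length |C|)
    Returns b_{s_t}; the zero vector if s_new is unreachable (zero denominator).
    """
    likelihood = p[:, a_prev, s_prev, s_new]       # p_{s_{t-1}, s_t, k}^{a_{t-1}} for each k
    unnormalized = b_prev * likelihood
    denom = unnormalized.sum()
    if denom == 0.0:                               # convention: unreachable state s_t
        return np.zeros_like(b_prev)
    return unnormalized / denom
```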

Starting from any state s0\(\in\)S and scenario belief \({{\varvec{b}}}_{{\mathrm{s}}_{0}}\in {\Delta }^{C}\), the scenario belief \({{\varvec{b}}}_{{s}_{t}}\) in state st at time t can be obtained by applying (10) repeatedly whenever st is observed. We denote the combination of st and \({{\varvec{b}}}_{{s}_{t}}\) by an ordered pair \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\), which is referred to as the ordered pair of state and scenario belief at time t, or the state-belief pair for short. It follows that \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle \in S\times {\Delta }^{C}\) where \(S\times {\Delta }^{C}\) is the Cartesian product of S and ΔC. In addition, the scenario belief update function can be written as \({{\varvec{b}}}_{{s}_{t}}=\tau \left(\langle {s}_{t-1},{{\varvec{b}}}_{{s}_{t-1}}\rangle ,{a}_{t-1},{s}_{t}\right)\) to highlight the state-belief pair.

The important feature of (10) is that the calculation of \({{\varvec{b}}}_{{s}_{t}}\) in state \({s}_{t}\) at time t requires only the state-belief pair \(\langle {s}_{t-1},{{\varvec{b}}}_{{s}_{t-1}}\rangle\) at time t − 1; thus, \(\langle {s}_{t-1},{{\varvec{b}}}_{{s}_{t-1}}\rangle\) summarizes all the information gained prior to time t − 1 and represents a sufficient statistic for the complete past history of the process gt−1. This important feature sets the foundation of the double-factored decision theory.

3.2 Scenario Expectation and Expected Total Reward

Starting with any \(\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \in S\times {\Delta }^{C}\) and using (10) repeatedly, all state-belief pairs \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) and actions at for t = 0, 1,\(\cdots\), Z form the transition tree of the system shown in Fig. 1. Within the tree, there are usually |A| actions following each state-belief pair and |S| state-belief pairs following each action. Each \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) therefore usually has |A|·|S| successors \(\langle {s}_{t+1},{{\varvec{b}}}_{{s}_{t+1}}\rangle =\langle {s}_{t+1},\tau (\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle ,{a}_{t},{s}_{t+1})\rangle\). And apart from \(\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle\), each \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) has a unique predecessor.

Fig. 1 The transition tree

There are two types of special state-belief pairs in the transition tree. If \({{\varvec{b}}}_{{s}_{t}}={{\varvec{e}}}_{0}\) in state st at some decision epoch t, where e0 is the |C|-dimensional zero vector, the corresponding state-belief pair \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) is referred to as a pseudo state-belief pair. The pair \(\langle {s}_{t},{{\varvec{e}}}_{0}\rangle\) appears because state st is unreachable; thus, a pseudo state-belief pair is a "leaf" of the tree and has no successors. If \({{\varvec{b}}}_{{s}_{t}}={{\varvec{e}}}_{k}\) in state st at some decision epoch t, where ek is the |C|-dimensional unit vector whose kth element is one, the corresponding state-belief pair \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) is referred to as a degenerate state-belief pair.

Any path \((\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle ,{a}_{0},\langle {s}_{1},{{\varvec{b}}}_{{s}_{1}}\rangle ,\cdots ,{a}_{t-1},\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle )\) along the transition tree is a possible realization of the system's evolution. Thus the history for \(\left(T,S,A,C,{J}_{{\varvec{p}},{\varvec{r}}},\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \right)\) is defined as the sequence of state-belief pairs and actions, i.e., \({h}_{t}=\left(\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle ,{a}_{0},\langle {s}_{1},{{\varvec{b}}}_{{s}_{1}}\rangle ,\cdots ,{a}_{t-1},\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle \right)\), which is referred to as the history up to time t from \(\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle\) onward, or the history up to time t for short. As a convention, \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) denotes the state-belief pair at time t when the history is ht. The history ht follows the recursion \({h}_{t}=\left({h}_{t-1},{a}_{t-1},\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle \right)\) with \({h}_{0}=\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle\). We let Ht denote the set of all possible histories up to time t and call it the history set up to time t from \(\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle\) onward, or the history set up to time t for short.

Similar to the standard MDPs, there are four classes of decision rules xt for the DFMDPs. History-dependent randomized decision rules map the history set Ht into the probability distribution set ΔA on action set A, i.e., \({x}_{t}:{H}_{t}\to {\Delta }^{A}\) or \({\mu }_{{h}_{t}}\left(\cdot \right)={x}_{t}\left({h}_{t}\right),\mu \left(\cdot \right)\in {\Delta }^{A},{h}_{t}\in {H}_{t}\). History-dependent deterministic decision rules map the history set Ht into the action set A, i.e., \({x}_{t}:{H}_{t}\to A\) or \({a}_{t}={x}_{t}\left({h}_{t}\right),{h}_{t}\in {H}_{t}\). Double-factored Markovian randomized decision rules map S × ΔC into ΔA, i.e., \({x}_{t}:S\times {\Delta }^{C}\to {\Delta }^{A}\) or \({\mu }_{\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle }\left(\cdot \right)={x}_{t}\left({s}_{t},{{\varvec{b}}}_{{s}_{t}}\right),\mu \left(\cdot \right)\in {\Delta }^{A},\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle \in S\times {\Delta }^{C}\). Double-factored Markovian deterministic decision rules map S × ΔC into A, i.e., \({x}_{t}:S\times {\Delta }^{C}\to A\) or \({a}_{t}={x}_{t}\left({s}_{t},{{\varvec{b}}}_{{s}_{t}}\right),\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle \in S\times {\Delta }^{C}\). A double-factored Markovian deterministic decision rule specifies the action choice only considering the fact that the system is in state st with belief \({{\varvec{b}}}_{{s}_{t}}\) at time t. We will prove later that it is Markovian (memoryless) because it depends on the previous history only through the current state-belief pair of the system, i.e., \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) is a sufficient statistic of the past history as mentioned above.

The policies composed of the above four classes of decision rules, \(\pi=(x_{0},x_{1},\cdots,x_{Z-1})\), are referred to as double-factored history-dependent randomized policies, double-factored history-dependent deterministic policies, double-factored Markovian randomized policies and double-factored Markovian deterministic policies, respectively. And the sets of all policies of these classes are denoted by ΠDHR, ΠDHD, ΠDMR, and ΠDMD, respectively.

The relationship between the various classes of policies is as follows: \({\Pi }^{\mathrm{DMD}}\subset {\Pi }^{\mathrm{DMR}}\subset {\Pi }^{\mathrm{DHR}}\) and \({\Pi }^{\mathrm{DMD}}\subset {\Pi }^{\mathrm{DHD}}\subset {\Pi }^{\mathrm{DHR}}\). We will show later with a series of theorems that the optimal double-factored Markovian deterministic policies are the best among all classes of policies for the finite horizon DFMDPs.

We next use a one-period decision-making problem to introduce the concept of the scenario expectation. In a one-period problem, Z = 1 and T = {0}. We assume that whenever the system is in state s1 at the end of this period, the decision maker receives a terminal reward V(s1), where V is a given real-valued function on S. Suppose the system sits in state s0 with scenario belief \({{\varvec{b}}}_{{s}_{0}}\) at the start of the period. The decision maker aims to select an action a\(\in\)A that maximizes the expected total discounted reward. Suppose he chooses a deterministic policy π = (x0), which selects action a\(\in\)A at the initial decision epoch.

In this problem, the expected reward for the MDP k is

$${V}_{k}^{a}\left({s}_{0}\right)={r}_{{s}_{0},k}^{a}+\gamma \sum_{{s}_{1}\in S}{p}_{{s}_{0,}{s}_{1,}k}^{a}V({s}_{1}).$$
(11)

Since the scenario realization of the system is uncertain before the decision-making, we can consider the problem from the perspective of expectations over possible scenarios and define the expected total discounted reward as follows:

$${V}^{a}\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)=\sum_{k\in C}{b}_{{s}_{0},k}{V}_{k}^{a}\left({s}_{0}\right)=\sum_{k\in C}{b}_{{s}_{0},k}\left[{r}_{{s}_{0},k}^{a}+\gamma \sum_{{s}_{1}\in S}{p}_{{s}_{0,}{s}_{1,}k}^{a}V({s}_{1})\right].$$
(12)

Letting \({\overline{r} }_{{s}_{0}}^{a}=\sum_{k\in C}{b}_{{s}_{0},k}{r}_{{s}_{0},k}^{a}\) and \({\overline{p} }_{{s}_{0},{s}_{1}}^{a}=\sum_{k\in C}{b}_{{s}_{0},k}{p}_{{s}_{0},{s}_{1},k}^{a}\), (12) is rewritten as

$${V}^{a}\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)={\overline{r} }_{{s}_{0}}^{a}+\gamma \sum_{{s}_{1}\in S}{\overline{p} }_{{s}_{0},{s}_{1}}^{a}V({s}_{1}).$$
(13)

The optimal action a* is obtained by finding the maximum of (13) over all actions in set A:

$${a}^{*}\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)\in \underset{a\in A}{\text{argmax}}\left\{{\overline{r} }_{{s}_{0}}^{a}+\gamma \sum_{{s}_{1}\in S}{\overline{p} }_{{s}_{0},{s}_{1}}^{a}V\left({s}_{1}\right)\right\}.$$

If the decision maker uses a randomized policy with probability μ(a) to select actions in state s0, the expected total discounted reward equals \(\sum_{a\in A}\mu (a)\left\{{\overline{r} }_{{s}_{0}}^{a}+\gamma \sum_{{s}_{1}\in S}{\overline{p} }_{{s}_{0},{s}_{1}}^{a}V({s}_{1})\right\}\), where \(\sum_{a\in A}\mu \left(a\right)=1\) and \(\mu \left(a\right)\geqslant 0\) for a\(\in\)A. Since

$$\underset{\mu \in {\Delta }^{A}}{\text{max}}\left\{\sum_{a\in A}\mu (a)\left[{\overline{r} }_{{s}_{0}}^{a}+\gamma \sum_{{s}_{1}\in S}{\overline{p} }_{{s}_{0},{s}_{1}}^{a}V({s}_{1})\right]\right\}=\underset{a\in A}{\text{max}}\left\{{\overline{r} }_{{s}_{0}}^{a}+\gamma \sum_{{s}_{1}\in S}{\overline{p} }_{{s}_{0},{s}_{1}}^{a}V({s}_{1})\right\},$$

obviously we cannot obtain a larger expected reward in state s0 by means of randomized policies.
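In code, the one-period rule above amounts to forming the scenario-expectation reward and transition probabilities of (13) and taking an argmax over actions. The following is a minimal sketch under the same illustrative array layout as before.

```python
import numpy as np

def one_period_action(s0, b0, p, r, V_term, gamma):
    """Optimal one-period action a*(s0, b_{s0}) from (13).

    p[k, a, s, s1] : transition probabilities, r[k, a, s] : expected rewards r_{s,k}^a
    V_term[s1]     : terminal reward V(s1), b0 : scenario belief at s0
    """
    r_bar = b0 @ r[:, :, s0]                              # scenario-expectation rewards, one per action
    p_bar = np.einsum('k,kaj->aj', b0, p[:, :, s0, :])    # scenario-expectation transition probabilities
    values = r_bar + gamma * (p_bar @ V_term)             # V^a(s0, b_{s0}) for every action a
    return int(values.argmax()), values
```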

We generalize the idea of scenario expectations above to the decision horizon of \(\left(T,S,A,C,{J}_{{\varvec{p}},{\varvec{r}}},\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \right)\) and use it to define the expected total discounted reward generated by a policy. Let \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) be the state-belief pair in state st at decision epoch t. We define

$$\begin{aligned} & \overline{r}_{{s_{t} }}^{{a_{t} }} = \mathop \sum \limits_{k \in C} b_{{s_{t} ,k}} r_{{s_{t} ,k}}^{{a_{t} }} ,\forall\, a_{t} \in A, \\ & \overline{p}_{{s_{t} ,s_{t + 1} }}^{{a_{t} }} = \mathop \sum \limits_{k \in C} b_{{s_{t} ,k}} p_{{s_{t} ,s_{t + 1} ,k}}^{{a_{t} }} ,\forall \,a_{t} \in A,s_{t + 1} \in S,\\ \end{aligned}$$
(14)

where \({\overline{r} }_{{s}_{t}}^{{a}_{t}}\) is called the scenario expectation reward and \({\overline{p} }_{{s}_{t},{s}_{t+1}}^{{a}_{t}}\) the scenario expectation transition probability in state st at decision epoch \(\textit{t}\). Let \({V}^{\pi }({s}_{0},{{\varvec{b}}}_{{s}_{0}})\) represent the expected total discounted reward over the decision-making horizon if policy π is used and the system has an initial state-belief pair \(\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \in S\times {\Delta }^{C}\). For π\(\in\) ΠDHR, it is defined by

$${V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)={E}_{\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle }^{\pi }\left\{\sum_{t=0}^{Z-1}{\gamma }^{t}{\overline{r} }_{{s}_{t}}^{{a}_{t}}+{\gamma }^{Z}{\overline{r} }_{{s}_{Z}}^{0}\right\},$$
(15)

where \({\overline{r} }_{{s}_{Z}}^{0}=\sum_{k\in C}{b}_{{s}_{Z},k}{r}_{{s}_{Z},k}^{0}\). For π\(\in\) ΠDHD and π\(\in\) ΠDMD, the expected total discounted reward can be respectively expressed as

$${V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)={E}_{\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle }^{\pi }\left\{\sum_{t=0}^{Z-1}{\gamma }^{t}{\overline{r} }_{{s}_{t}}^{{x}_{t}\left({h}_{t}\right)}+{\gamma }^{Z}{\overline{r} }_{{s}_{Z}}^{0}\right\},$$
(16)

and

$${V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)={E}_{\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle }^{\pi }\left\{\sum_{t=0}^{Z-1}{\gamma }^{t}{\overline{r} }_{{s}_{t}}^{{x}_{t}({s}_{t},{{\varvec{b}}}_{{s}_{t}})}+{\gamma }^{Z}{\overline{r} }_{{s}_{Z}}^{0}\right\}.$$
(17)

It will be shown later by Theorem 7 that \({V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)\) equals \({W}^{\pi }({s}_{0},{\varvec b}_{{s}_{0}})\) in (4).

3.3 Finite Horizon DFMDP

According to the above theory, the optimization problem for a finite horizon DFMDP is expressed as

$${V}^{*}\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right) =\underset{\pi \in {\Pi }^{\text{DHR}}}{\text{max}}{V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right), \forall\, \langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \in S\times {\Delta }^{C}.$$
(18)

Now we provide a recursive algorithm to evaluate \({V}^{\pi }({s}_{0},{{\varvec{b}}}_{{s}_{0}})\). Let \({U}_{t}^{\pi }:{H}_{t}^{\pi }\to \text{R}\) denote the expected total discounted reward obtained by using a fixed policy π at decision epoch t, t + 1, ···, Z, where \({H}_{t}^{\pi }\) is the history set following the policy π up to time t and \({H}_{t}^{\pi }={H}_{t}\) for π\(\in\) ΠDHR. If the history at decision epoch t is \(h_{t} \in H_{t}^{\pi }\), then we define \({U}_{t}^{\pi }(t<Z)\) by

$${U}_{t}^{\pi }\left({h}_{t}\right)={E}_{{h}_{t}}^{\pi }\left\{\sum_{n=t}^{Z-1}{\gamma }^{n-t}{\overline{r} }_{{s}_{n}}^{{a}_{n}}+{\gamma }^{Z-t}{\overline{r} }_{{s}_{Z}}^{0}\right\}.$$
(19)

To simplify the notation, assume that a deterministic π\(\in\) ΠDHD has been specified. In practice, we will not need to evaluate randomized policies, because subsequent results establish that deterministic policies are optimal under the expected total reward criterion. For a given \(\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle\), the corresponding \({H}_{t}^{\pi }\) is defined by

$$\begin{aligned} & H_{0}^{\pi } = \left\{ {h_{0} } \right\} = \left\{ {\langle s_{0} ,{\varvec{b}}_{{s_{0} }} } \rangle\right\}, \\ & H_{t}^{\pi } = \left\{ {h_{t} :h_{t} = \left( {h_{t - 1} ,x_{t - 1} \left( {h_{t - 1} } \right),\langle s_{t} ,\tau \left( {\langle s_{t - 1} ,{\varvec{b}}_{{s_{t - 1} }} \rangle ,x_{t - 1} \left( {h_{t - 1} } \right),s_{t} } \right)}\rangle \right),\forall\, h_{t - 1} \in H_{t-1}^{\pi} ,\forall\, s_{t} \in S} \right\}. \\ \end{aligned}$$
(20)

It is obvious that \({H}_{t}^{\pi }\subset {H}_{t}\) when \(|A|>1\). If the history at decision epoch t is \(h_{t} \in H_{t}^{\pi }\), \({U}_{t}^{\pi }\) is expressed by

$${U}_{t}^{\pi }\left({h}_{t}\right)={E}_{{h}_{t}}^{\pi }\left\{\sum_{n=t}^{Z-1}{\gamma }^{n-t}{\overline{r} }_{{s}_{n}}^{{x}_{n}\left({h}_{n}\right)}+{\gamma }^{Z-t}{\overline{r} }_{{s}_{Z}}^{0}\right\} .$$
(21)
figure a
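The policy evaluation algorithm itself appears only as figure a. As a reading aid, the following is our own recursive sketch of evaluating \({U}_{t}^{\pi }\left({h}_{t}\right)\) in (21) for a fixed deterministic policy; it is not the authors' figure. It reuses the hypothetical `tau` helper from the belief-update sketch in Sect. 3.1, and the history representation and array layouts are illustrative assumptions.

```python
import numpy as np

def evaluate_policy(h, t, Z, policy, p, r, r_term, gamma):
    """Recursive sketch of U_t^pi(h_t) in (21) for a fixed deterministic policy.

    h       : history h_t, a tuple alternating state-belief pairs (s, b) and actions,
              always ending with the current state-belief pair
    policy  : function (t, h) -> action, a double-factored history-dependent deterministic rule
    p[k, a, s, s1], r[k, a, s], r_term[k, s] : per-scenario parameters of the DFMDP
    """
    s, b = h[-1]
    if t == Z:                                     # boundary: terminal scenario-expectation reward
        return b @ r_term[:, s]
    a = policy(t, h)
    r_bar = b @ r[:, a, s]                         # scenario-expectation reward (14)
    p_bar = b @ p[:, a, s, :]                      # scenario-expectation transition probabilities (14)
    value = r_bar
    for s1 in range(p.shape[-1]):
        if p_bar[s1] == 0.0:                       # pseudo state-belief pair: no contribution
            continue
        b1 = tau(s, b, a, s1, p)                   # scenario belief update (10), see earlier sketch
        value += gamma * p_bar[s1] * evaluate_policy(h + (a, (s1, b1)), t + 1, Z,
                                                     policy, p, r, r_term, gamma)
    return value
```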

The following theorem guarantees that the expected value \({U}_{0}^{\pi }\left({h}_{0}\right)\) generated by the algorithm above is equal to \({V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)\) in (16).

Theorem 1

For any given \({h}_{0}=\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \in S\times {\Delta }^{C}\) and a fixed policy π\(\in\) ΠDHD, suppose \(U_{t}^{\pi } ,t \in T\), has been generated by the policy evaluation algorithm above. Then, for all t\(\leqslant\)Z, (21) holds and \({V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)\) in (16) equals \({U}_{0}^{\pi }\left({h}_{0}\right)\) for any \({h}_{0}=\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \in S\times {\Delta }^{C}\).

For ease of reading, we defer all theorem proofs to “Appendix”.

To generalize the algorithm to randomized policies, an additional summation is required in (22) to account for the probability distribution of the action at decision epoch t under the decision rule xt, as follows:

$${U}_{t}^{\pi }\left({h}_{t}\right)=\sum_{a\in A}{\mu }_{{h}_{t}}(a)\left\{{\overline{r} }_{{s}_{t}}^{a}+\gamma \sum_{{s}_{t+1}\in S}{\overline{p} }_{{s}_{t},{s}_{t+1}}^{a}{U}_{t+1}^{\pi }\left({h}_{t},a,\langle {s}_{t+1},{{\varvec{b}}}_{{s}_{t+1}}\rangle \right)\right\}.$$
(23)

Theorem 1 can be extended to π\(\in\) ΠDHR as follows.

Theorem 2

For any given \(h_{0} = \langle s_{0} ,{\varvec{b}}_{{s_{0} }} \rangle\in S \times {\Delta }^{C}\) and a fixed policy π\(\in\) ΠDHR, suppose \(U_{t}^{\pi } ,t \in T\), has been generated by the policy evaluation algorithm with (23) replacing (22). Then, for all t\(\leqslant\)Z, (19) holds and \(V^{\pi } \left( {s_{0} ,{\varvec{b}}_{{s_{0} }} } \right)\) in (15) equals \(U_{0}^{\pi } \left( {h_{0} } \right)\) for any \(h_{0} = \langle s_{0} ,{\varvec{b}}_{{s_{0} }} \rangle \in S \times {\Delta }^{C}\).

Now let

$${U}_{t}^{*}\left({h}_{t}\right)=\underset{\pi \in {\Pi }^{\text{DHR}}}{\text{max}}{U}_{t}^{\pi }\left({h}_{t}\right).$$

It indicates the maximum over all policies of the expected total discounted reward from decision epoch t onward when the history up to time t is ht.

The optimality equations for the finite horizon DFMDP are given by

$${U}_{t}\left({h}_{t}\right)=\underset{{a}_{t}\in \mathrm{A}}{\text{max}}\left\{{\overline{r} }_{{s}_{t}}^{{a}_{t}}+\gamma \sum_{{s}_{t+1}\in S}{\overline{p} }_{{s}_{t},{s}_{t+1}}^{{a}_{t}}{U}_{t+1}\left({h}_{t},{a}_{t},\langle {s}_{t+1},{{\varvec{b}}}_{{s}_{t+1}}\rangle \right)\right\}$$
(24)

for \({t}\in{T}\) and \({h}_{t}=\left({h}_{t-1},{a}_{t-1},\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle \right)\in {H}_{t}\), where \(\langle {s}_{t+1},{{\varvec{b}}}_{{s}_{t+1}}\rangle =\langle {s}_{t+1},\tau (\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle ,{a}_{t},{s}_{t+1})\rangle\). For \({t}={Z}\) and \({h}_{Z}=\left({h}_{Z-1},{a}_{Z-1},\langle {s}_{Z},{{\varvec{b}}}_{{s}_{Z}}\rangle \right)\in {H}_{Z}\), we add the boundary condition

$${U}_{Z}\left({h}_{Z}\right)={\overline{r} }_{{s}_{Z}}^{0}.$$
(25)

Before stating more theorems, we introduce the following Lemma 1.

Lemma 1

Let f be a real-valued function on an arbitrary finite discrete set A and let μ(·) be a probability distribution on A. Then

$$\underset{a\in A}{\text{sup}} f\left(a\right)\geqslant \sum_{a\in A}\mu \left(a\right)f\left(a\right).$$

The following theorem summarizes the optimality properties of solutions to the optimality equations.

Theorem 3

Let Ht be the history set up to time t from any \({h}_{0}=\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle\) onward. Suppose Ut are solutions of (24) and (25) for t = 0,1,\(\cdots\) , Z. Then

(i) \({U}_{t}\left({h}_{t}\right)={U}_{t}^{*}\left({h}_{t}\right)\) for all ht ∊ Ht, t = 0,1,\(\cdots\) , Z, and

(ii) \({V}^{*}\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)={U}_{0}\left({h}_{0}\right)\).

The following result shows how to use the optimality equations to find the optimal policy and to verify its optimality.

Theorem 4

Suppose \(U_{t}^{*} ,t = 0,1, \cdots ,Z\), are solutions of (24) and (25), and policy \(\pi^{*} = \left( {x_{0}^{*} ,x_{1}^{*},\,\dots\,, x_{Z - 1}^{*} } \right) \in {\Pi }^{{{\text{DHD}}}}\) satisfies

$$\begin{aligned} & \overline{r}_{{s_{t} }}^{{x_{t}^{*} \left( {h_{t} } \right)}} + \gamma \mathop \sum \limits_{{s_{t + 1} \in S}} \overline{p}_{{s_{t} ,s_{t + 1} }}^{{x_{t}^{*} \left( {h_{t} } \right)}} U_{t + 1}^{*} \left( {h_{t} ,x_{t}^{*} \left( {h_{t} } \right),\langle s_{t + 1} ,{\varvec{b}}_{{s_{t + 1} }} \rangle } \right) \\ &= \mathop {{\text{max}}}\limits_{{a_{t} \in A}} \left\{ {\overline{r}_{{s_{t} }}^{{a_{t} }} + \gamma \mathop \sum \limits_{{s_{t + 1} \in S}} \overline{p}_{{s_{t} ,s_{t + 1} }}^{{a_{t} }} U_{t + 1}^{*} \left( {h_{t} ,a_{t} ,\langle s_{t + 1} ,{\varvec{b}}_{{s_{t + 1} }} \rangle } \right)} \right\} \\ \end{aligned}$$
(26)

for \({t}\in{T}\). Then

  1. (i)

    for t = 0,1,\(\cdots\) , Z,

    $$U_{t}^{{\pi^{*} }} \left( {h_{t} } \right) = U_{t}^{*} \left( {h_{t} } \right), \forall\, h_{t} \in H_{t} .$$
  2. (ii)

    π* is the optimal policy, and

    $$V^{{\pi^{*} }} \left( {s_{0} ,{\varvec{b}}_{{s_{0} }} } \right) = V^{*} \left( {s_{0} ,{\varvec{b}}_{{s_{0} }} } \right), \forall \langle s_{0} ,{\varvec{b}}_{{s_{0} }}\rangle \in S \times \Delta^{C} .$$

Note that we have restricted attention to double-factored history-dependent deterministic policies in Theorem 4. This is because, if there existed a double-factored history-dependent randomized policy satisfying the obvious generalization of (26), then, as a result of Lemma 1, we could find a deterministic policy satisfying (26).

Equation (26) can be written as

$$x_{t}^{*} \left( {h_{t} } \right) \in \mathop {{\text{argmax}}}\limits_{{a_{t} \in A}} \left\{ {\overline{r}_{{s_{t} }}^{{a_{t} }} + \gamma \mathop \sum \limits_{{s_{t + 1} \in S}} \overline{p}_{{s_{t} ,s_{t + 1} }}^{{a_{t} }} U_{t + 1}^{*} \left( {h_{t} ,a_{t} ,\langle s_{t + 1} ,{\varvec{b}}_{{s_{t + 1} }} \rangle } \right)} \right\}, t \in T.$$
(27)

The optimal policy π* derived from (27) in Theorem 4 exists because the sets S and A are finite. So we obtain Theorem 5 directly.

Theorem 5

There exists an optimal double-factored history-dependent deterministic policy when the state set and the action set are finite.

We next show that there exists an optimal policy which is double-factored Markovian and deterministic.

Theorem 6

Let \(U_{t}^{*} ,t = 0,1, \cdots ,Z\), be the solutions of (24) and (25). Then

  1. (i)

    for t = 0, 1,\(\cdots\) , Z, \(U_{t}^{*} \left( {h_{t} } \right)\) depends on ht only through \(\langle s_{t} ,{\varvec{b}}_{{s_{t} }}\rangle\), i.e., \(U_{t}^{*} \left( {h_{t} } \right) = U_{t}^{*} \left( {s_{t} ,{\varvec{b}}_{{s_{t} }} } \right)\);

  2. (ii)

    there exists an optimal policy which is double-factored Markovian and deterministic when both S and A are finite.

Theorem 6 shows that there exists a double-factored Markovian deterministic policy that is optimal among all classes of policies. Furthermore, it follows from (A7) in the “Appendix” that the optimality equations can be written in terms of the optimal double-factored value function \({U}_{t}^{*}\left({s}_{t},{{\varvec{b}}}_{{s}_{t}}\right)\):

$${U}_{t}^{*}\left({s}_{t},{{\varvec{b}}}_{{s}_{t}}\right)=\underset{{a}_{t}\in A}{\text{max}}\left\{\sum_{k\in C}{b}_{{s}_{t},k}{r}_{{s}_{t},k}^{{a}_{t}}+\gamma \sum_{{s}_{t+1}\in S}\sum_{k\in C}{b}_{{s}_{t},k}{p}_{{s}_{t},{s}_{t+1},k}^{{a}_{t}}{U}_{t+1}^{*}\left({s}_{t+1},\tau \left(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle ,{a}_{t},{s}_{t+1}\right)\right)\right\}.$$
(28)

So we have established that

$${V}^{*}\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)=\underset{\pi \in {\Pi }^{\text{DHR}}}{\text{max}}{V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)=\underset{\pi \in {\Pi }^{\text{DHD}}}{\text{max}}{V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)=\underset{\pi \in {\Pi }^{\text{DMD}}}{\text{max}}{V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right), \forall\, \langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \in S\times {\Delta }^{C},$$
(29)

where the expected total discounted rewards \({V}^{\pi }({s}_{0},{{\varvec{b}}}_{{s}_{0}})\) generated by policies π\(\in\) ΠDHR, π\(\in\) ΠDHD, and π\(\in\) ΠDMD are expressed by (15), (16) and (17), respectively.

The following theorem establishes the relationship between the WVP in [4] and our finite horizon DFMDP problem.

Theorem 7

The finite horizon DFMDP is equivalent to the WVP of multi-scenario MDP.

Thus a DFMDP can also be denoted by the 6-tuple \(\left(T,S,A,C,{J}_{{\varvec{p}},{\varvec{r}}},\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \right)\) according to the equivalence in Theorem 7.

Now we present the double-factored backward induction algorithm (DFBI) for the exact solutions of the finite horizon DFMDPs based on the above theory.

figure b

The DFBI is an exact solution method for the finite horizon DFMDPs. We know from the transition tree in Fig. 1 that if all state-belief pairs \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) for t = 0,1, ···, Z are reachable, \(|\Psi_t| = (|A|\cdot|S|)^{t}\). So the time complexity of the algorithm is \(O((|A|\cdot|S|\cdot|C|)^{Z})\) and it increases exponentially with the number of decision epochs.
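The DFBI itself is given as figure b. As a reading aid, here is a minimal recursive sketch of the optimality recursion (24)–(25) (equivalently (28)) over the reachable part of the transition tree; it is our own illustration, not the authors' figure, and it again reuses the hypothetical `tau` helper from Sect. 3.1. It deliberately omits the duplicate-detection and storage-saving refinements discussed next, so it revisits identical state-belief pairs and inherits the exponential complexity noted above.

```python
import numpy as np

def dfbi(s, b, t, Z, p, r, r_term, gamma):
    """Sketch of double-factored backward induction: recursion (28) with boundary (25).

    Returns (U_t^*(s, b), optimal action) for the state-belief pair <s, b>.
    p[k, a, s, s1], r[k, a, s], r_term[k, s] : per-scenario parameters of the DFMDP.
    """
    if t == Z:
        return b @ r_term[:, s], None               # boundary condition (25)
    nA, nS = p.shape[1], p.shape[-1]
    best_value, best_action = -np.inf, None
    for a in range(nA):
        r_bar = b @ r[:, a, s]                      # scenario-expectation reward
        p_bar = b @ p[:, a, s, :]                   # scenario-expectation transition probabilities
        value = r_bar
        for s1 in range(nS):
            if p_bar[s1] == 0.0:                    # pseudo state-belief pair: skip
                continue
            b1 = tau(s, b, a, s1, p)                # scenario belief update (10)
            value += gamma * p_bar[s1] * dfbi(s1, b1, t + 1, Z, p, r, r_term, gamma)[0]
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
```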

When actually implementing the method, the following approaches can be used to reduce the computational burden so that the algorithm remains practicable for real-world problems.

(i) For any \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\), t = 1, ···, Z, in step 1, if \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) equals an identical pair \(\langle s_{{t^{\prime } }} ,{\varvec{b}}_{{s_{{t^{\prime } }} }} \rangle\) generated earlier (t′ < t) or an identical pair \(\langle s_{t} ,{\varvec{b}}_{{s_{t} }} \rangle^{\prime }\) at the same level, its successors can be copied directly from those of \(\langle s_{{t^{\prime } }} ,{\varvec{b}}_{{s_{{t^{\prime } }} }} \rangle\) or \(\langle s_{t} ,{\varvec{b}}_{{s_{t} }} \rangle^{\prime }\) to reduce the computation. If \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) is a pseudo state-belief pair, no computation of its successors is required.

(ii) For \({U}_{t}^{*}({s}_{t},{{\varvec{b}}}_{{s}_{t}})\) and \({A}_{t}^{*}({s}_{t},{{\varvec{b}}}_{{s}_{t}})\) of \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle \in {\Psi }_{t}\) in step 4, since the same state-belief pairs in set Ψt have the same \({U}_{t}^{*}\) and \({A}_{t}^{*}\) according to Theorem 6, copies of \({U}_{t}^{*}\) and \({A}_{t}^{*}\) are directly used to avoid repetitive computations. When programming the algorithm, let \({U}_{t}^{*}\left({s}_{t},{{\varvec{b}}}_{{s}_{t}}\right)=0\) and \({A}_{t}^{*}\left({s}_{t},{{\varvec{b}}}_{{s}_{t}}\right)=\) \(\varnothing\) when \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle =\langle {s}_{t},{\varvec e}_{0}\rangle\), and \({U}_{t}^{*}\left({s}_{t},{{\varvec{b}}}_{{s}_{t}}\right)={U}_{t,k}^{*}\left({s}_{t}\right)\) and \({A}_{t}^{*}\left({s}_{t},{{\varvec{b}}}_{{s}_{t}}\right)={A}_{t,k}^{*}\left({s}_{t}\right)\) when \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle =\langle {s}_{t},{\varvec e}_{k}\rangle ,k\in C\), where \({U}_{t,k}^{*}\left({s}_{t}\right)\) and \({A}_{t,k}^{*}\left({s}_{t}\right)\) are the optimal value function and the optimal action set of MDP k at time t.

(iii) When implementing the DFBI, we only need to store the set \({\Psi }_{t}\), together with the parent-child relationships between \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle \in {\Psi }_{t}\) and \(\langle {s}_{t+1},{{\varvec{b}}}_{{s}_{t+1}}\rangle \in {\Psi }_{t+1}\), in order to reduce the storage requirements of the algorithm.

There are three types of scenarios illustrated in [16]. They can be described by three sets of parameter pairs, respectively:

  • (I) For certain transition probabilities and uncertain rewards, \({J}_{{\varvec{p}},{\varvec{r}}}={J}_{{{\varvec{p}}}_{0},{\varvec{r}}}=\{\left({{\varvec{p}}}_{0},{{\varvec{r}}}_{1}\right),\cdots ,\left({{\varvec{p}}}_{0},{{\varvec{r}}}_{|C|}\right)\}\).

  • (II) For uncertain transition probabilities and certain rewards, \({J}_{{\varvec{p}},{\varvec{r}}}={J}_{{\varvec{p}},{{\varvec{r}}}_{0}}=\{\left({{\varvec{p}}}_{1},{{\varvec{r}}}_{0}\right),\cdots ,\left({{\varvec{p}}}_{|C|},{{\varvec{r}}}_{0}\right)\}\).

  • (III) For uncertain transition probabilities and rewards, \({J}_{{\varvec{p}},{\varvec{r}}}=\{\left({{\varvec{p}}}_{1},{{\varvec{r}}}_{1}\right),\cdots ,\left({{\varvec{p}}}_{|C|},{{\varvec{r}}}_{|C|}\right)\}\).

Actually, both the type-I and the type-II scenarios are special cases of the type-III scenarios.

The following theorem once again establishes the relationship between the DFMDPs and the standard MDPs.

Theorem 8

The DFMDP \(\left( {T,S,A,C,J_{{{\varvec{p}},{\varvec{r}}}} ,\langle s_{0} ,{\varvec{b}}_{{s_{0} }} }\rangle \right)\) with the type-I of the multiple scenarios, i.e., \(J_{{{\varvec{p}},{\varvec{r}}}} = \left\{ {\left( {{\varvec{p}}_{0} ,{\varvec{r}}_{1} } \right), \cdots ,\left( {{\varvec{p}}_{0} ,{\varvec{r}}_{\left| C \right|} } \right)} \right\}\), is equivalent to the standard MDP with parameter pair (\({\varvec{p}}_{0}\), \(\overline{\varvec{r}}\)) where \(\overline{\varvec{r}} = \sum\limits_{k \in C} {b_{{s_{0} ,k}} {\varvec{r}}_{k} }\).

The Weight-Select-Update (WSU) approximation algorithm in [4] is also a backward induction algorithm. We now show that the WSU solves the DFMDPs with type-I scenarios exactly.

With the notation of this article, the WSU procedure is as follows:

figure c

For the \(\left(T,S,A,C,{J}_{{\varvec{p}},{\varvec{r}}},\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \right)\) with type-I scenarios, we have \({J}_{{\varvec{p}},{\varvec{r}}}=\{\left({{\varvec{p}}}_{0},{{\varvec{r}}}_{1}\right),\cdots ,\left({{\varvec{p}}}_{0},{{\varvec{r}}}_{|C|}\right)\}\). Let \({U}_{t}\left({s}_{t}\right)=\sum_{k\in C}{b}_{{s}_{0},k}{U}_{t,k}\left({s}_{t}\right),t=\mathrm{0,1},\cdots ,Z\). With parameters \({{\varvec{p}}}_{0}\) and \(\overline{{\varvec{r}} }\), (30) and (31) in the above procedure become:

$${x}_{t}({s}_{t})\leftarrow \underset{{a}_{t}\in A}{\text{argmax}}\left\{{\overline{r} }_{{s}_{t}}^{{a}_{t}}+\sum_{{s}_{t+1}\in S}{p}_{{s}_{t},{s}_{t+1},0}^{{a}_{t}}{U}_{t+1}\left({s}_{t+1}\right)\right\},$$
(30′)

and

$$\begin{aligned} & U_{Z} \left( {s_{Z} } \right) = \overline{r}_{{s_{Z} }}^{0} , \\ & U_{t} \left( {s_{t} } \right) \leftarrow \overline{r}_{{s_{t} }}^{{x_{t} \left( {s_{t} } \right)}} + \mathop \sum \limits_{{s_{t + 1} \in S}} p_{{s_{t} ,s_{t + 1} ,0}}^{{x_{t} \left( {s_{t} } \right)}} U_{t + 1} \left( {s_{t + 1} } \right) \\ \end{aligned}.$$
(31′)

The procedure with (30′) and (31′) is exactly the solution procedure for the standard MDP with parameter pair (\({{\varvec{p}}}_{0}\), \(\overline{{\varvec{r}} }\)).

The mean value problem (MVP) heuristic is another approximation algorithm for the DFMDPs. The MVP is a simplified problem in which all parameters take their expected values. For the DFMDPs, the MVP corresponds to the case where all transition probabilities and rewards are weighted as follows:

$$\tilde{r}_{s}^{a} = \mathop \sum \limits_{k \in C} b_{{s_{0} ,k}} r_{s,k}^{a} , \tilde{p}_{{s,s{^{\prime}}}}^{a} = \mathop \sum \limits_{k \in C} b_{{s_{0} ,k}} p_{{s,s{^{\prime}},k}}^{a} , \forall\, s,s^{\prime} \in S,a \in A.$$

That is, the MVP is a standard MDP problem with the parameter pair \((\widetilde{{\varvec{p}}},\widetilde{{\varvec{r}}})\).
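Given the definition above, an MVP solution can be obtained by averaging the per-scenario parameters with the initial belief and calling an ordinary finite horizon MDP solver, e.g. the hypothetical `backward_induction` sketch from Sect. 2.1. The averaging of the terminal reward with the same weights is our own assumption, as the text defines the weighting only for \(\widetilde{{\varvec{p}}}\) and \(\widetilde{{\varvec{r}}}\).

```python
import numpy as np

def solve_mvp(p, r, r_term, b0, gamma, Z):
    """MVP heuristic: average all scenario parameters with the initial belief b_{s0}
    and solve the resulting standard MDP by backward induction.

    p[k, a, s, s1], r[k, a, s], r_term[k, s] : per-scenario parameters.
    """
    p_tilde = np.einsum('k,kasj->asj', b0, p)   # \tilde p_{s,s'}^a
    r_tilde = np.einsum('k,kas->as', b0, r)     # \tilde r_s^a
    r0_tilde = b0 @ r_term                      # belief-weighted terminal reward (assumption)
    return backward_induction(p_tilde, r_tilde, r0_tilde, gamma, Z)
```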

By means of the computational experiments in Sect. 4, we will compare the solutions obtained by the MVP and WSU approximation methods with those obtained by our method. Furthermore, we obtain the following corollary from the definitions of the MVP and WSU methods above.

Corollary

Both the MVP and the WSU methods are exact solution methods for DFMDPs with type-I scenarios.

3.4 Infinite Horizon DFMDP

Now we formulate the infinite horizon DFMDPs. We assume that the state set, the action set and the scenario set are finite, the transition probabilities and rewards of each scenario are stationary (time homogeneous), and the rewards of each scenario are bounded. The discount factor γ satisfies 0 < γ < 1. We consider the stationary deterministic policy π = (x(s, bs), x(s, bs), ···), where the decision rule x: S × ΔC → A specifies the action to be taken at any state-belief pair.

Under the above assumptions, the optimality equations (28) can also be written as

$$\begin{aligned} &{V}_{n}({s}_{n},{{\varvec{b}}}_{{s}_{n}})=\underset{{a}_{n}\in A}{\text{max}}\left\{\sum_{k\in C}{b}_{{s}_{n},k}{r}_{{s}_{n},k}^{{a}_{n}}+\gamma \sum_{{s}_{n+1}\in S}\sum_{k\in C}{b}_{{s}_{n},k}{p}_{{s}_{n},{s}_{n+1},k}^{{a}_{n}}{V}_{n+1}\left({s}_{n+1},\tau \left(\langle {s}_{n},{{\varvec{b}}}_{{s}_{n}}\rangle ,{a}_{n},{s}_{n+1}\right)\right)\right\}.\end{aligned}$$
(32)

Passing to the limit in (32) suggests that equations of the following form will characterize values and optimal policies for the infinite horizon DFMDPs:

$$V(s,{{\varvec{b}}}_{s})=\underset{a\in A}{\text{max}}\left\{\sum_{k\in C}{b}_{s,k}{r}_{s,k}^{a}+\gamma \sum_{s\mathrm{^{\prime}}\in {S}}\sum_{k\in C}{b}_{s,k}{p}_{s,s\mathrm{^{\prime}},k}^{a}V\left(s\mathrm{^{\prime}},\tau \left(\langle s,{{\varvec{b}}}_{s}\rangle ,a,s\mathrm{^{\prime}}\right)\right)\right\},$$
(33)

where V(s, bs) denotes the optimal value function for \(\langle s,{{\varvec{b}}}_{s}\rangle\). The system of equations (33) are called the optimality equations or Bellman equations for the infinite horizon DFMDPs.

The Bellman equations (33) can also be rewritten in the value function mapping form. Let \(\mathcal{V}\) be the space of real-valued bounded functions V: S × ΔC → \(\mathbb{R}\); we define η: S × ΔC × A × \(\mathcal{V}\) → \(\mathbb{R}\) as

$$\eta \left(s,{{\varvec{b}}}_{s},a,V\right)=\sum_{k\in C}{b}_{s,k}{r}_{s,k}^{a}+\gamma \sum_{s\mathrm{^{\prime}}\in {S}}\sum_{k\in C}{b}_{s,k}{p}_{s,s\mathrm{^{\prime}},k}^{a}V\left(s\mathrm{^{\prime}},\tau \left(\langle s,{{\varvec{b}}}_{s}\rangle ,a,s\mathrm{^{\prime}}\right)\right).$$

Now by defining the value function mapping H: \(\mathcal{V}\) → \(\mathcal{V}\) as \(HV\left( {s,{\varvec{b}}_{s} } \right) = {\text{max}}_{a \in A} \eta \left( {s,{\varvec{b}}_{s} ,a,V} \right)\), the Bellman equations (33) can be written as V = HV.
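Because the belief space ΔC is continuous, V cannot be tabulated directly; a sketch of η and of one application of H at a given state-belief pair ⟨s, b⟩, with V supplied as a callable, may look as follows (illustrative names, reusing the hypothetical `tau` helper from Sect. 3.1).

```python
import numpy as np

def eta(s, b, a, V, p, r, gamma):
    """eta(s, b_s, a, V) as defined above (33); V is a callable V(s, b)."""
    r_bar = b @ r[:, a, s]                          # scenario-expectation reward
    p_bar = b @ p[:, a, s, :]                       # scenario-expectation transition probabilities
    total = r_bar
    for s1 in range(p.shape[-1]):
        if p_bar[s1] > 0.0:
            total += gamma * p_bar[s1] * V(s1, tau(s, b, a, s1, p))
    return total

def H(s, b, V, p, r, gamma):
    """(HV)(s, b_s) = max_a eta(s, b_s, a, V): one application of the Bellman mapping."""
    return max(eta(s, b, a, V, p, r, gamma) for a in range(p.shape[1]))
```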

Theorem 9

Let H be the value function mapping defined above, then H is an isotone mapping and a contraction under the supremum norm \(||V||=\text{sup}_{\langle s,{{\varvec{b}}}_{s}\rangle \in S\times {\Delta }^{C}}|V(s,{{\varvec{b}}}_{s})|\).

Since H is a contraction mapping, there exists a unique V*\(\in\)\(\mathcal{V}\) such that V* = HV* by Banach's fixed-point theorem [1]. And for any V0\(\in\)\(\mathcal{V}\), the sequence {Vn} defined below converges to V* (see [1]):

$${V}_{n}=H{V}_{n-1}={H}^{n}{V}_{0}.$$

The theory above is the basis for developing algorithms that solve the infinite horizon DFMDPs. To limit the scope of this paper, the algorithms and case studies for the infinite horizon DFMDPs will be presented in our follow-up articles.

4 Computational Experiments

This section presents computational experiments on three sets of test instances, comparing several solution methods for the finite horizon DFMDPs in terms of run-time and solution quality. The first set of experiments is based on a randomly generated multi-scenario MDP. The second set of experiments is based on a multi-scenario MDP for determining the most cost-effective human immunodeficiency virus (HIV) treatment policy, which has been used for pedagogical purposes in the medical decision-making literature [21, 22]. The third set of experiments is an illustrative instance of the scenario recognition problem.

As benchmarks, in addition to solving the problems with the DFBI algorithm, we also generate solutions from the WSU and MVP methods and the MIP formulation. We use \({V}_{Z}^{*}\) to denote the optimal value obtained by the DFBI, and \({\widetilde{V}}_{Z}^{*}\) to denote the value obtained from either the WSU or the MVP for the multi-scenario MDP with Z decision epochs. Then let

$$\mathrm{Gap}=\frac{{V}_{Z}^{*}-{\widetilde{V}}_{Z}^{*}}{{V}_{Z}^{*}}\times 100\mathrm{\%}.$$

All solution methods are implemented using MATLAB 2017a.

4.1 Random Instance

To generate the random instances, we first define the number of states, actions, scenarios, and decision epochs for the problem. Then the scenario parameters are randomly sampled: the rewards from a uniform distribution and the transition probabilities from a Dirichlet distribution. The initial beliefs are uninformative priors over the scenarios.

Let |S| = |A| = |C| = 4 and Z = 4. The rewards for all scenarios are randomly generated from a uniform distribution between 0 and 1. The transition probabilities are randomly generated from a Dirichlet distribution characterized by |S| parameters \((\rho {\alpha }_{1},\rho {\alpha }_{2},\cdots,\rho {\alpha }_{|S|})\), where \(({\alpha }_{1},{\alpha }_{2},\cdots,{\alpha }_{|S|})\), with αi > 0 ∀ i, is the base measure of the distribution and ρ > 0 is the concentration parameter. We repeat this process, generating 30 instances for each of ρ = 1, 10, 100. Table 1 reports the run-times of the four methods: the MVP, WSU, MIP and DFBI. We find that the MVP and WSU methods solve these instances very quickly (under 0.05 CPU seconds per instance) and the DFBI is also fairly fast (under 5 CPU seconds per instance) while providing exact solutions. The MIP takes much more run-time to compute the exact solutions on these instances.
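A sketch of the instance generator described above is given below. The uniform base measure (all αi = 1), the sampling of the rewards directly as expected rewards \(r_{s,k}^{a}\), and the random seed are our own illustrative assumptions; the paper does not specify these details.

```python
import numpy as np

def random_instance(nS=4, nA=4, nC=4, rho=10.0, seed=0):
    """Generate one random multi-scenario MDP instance as described in Sect. 4.1."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(nS)                                  # base measure (assumption: alpha_i = 1)
    r = rng.uniform(0.0, 1.0, size=(nC, nA, nS))         # rewards ~ U(0, 1)
    p = rng.dirichlet(rho * alpha, size=(nC, nA, nS))    # each transition row ~ Dirichlet(rho * alpha)
    b0 = np.full(nC, 1.0 / nC)                           # uninformative initial belief
    return p, r, b0
```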

Table 1 The solution time (CPU seconds) of the MVP, WSU, MIP and DFBI methods on random DFMDP test instances

We evaluate the optimality gaps of the MVP and WSU methods for the three types of multiple scenarios. Table 2 reports summary statistics for the optimality gaps of the MVP and WSU methods, obtained from these 50 instances for each type of scenario and each concentration parameter. For the type-I scenarios, the results conform with the Corollary in Sect. 3.3, and both the MVP and the WSU find the exact solutions. For fixed ρ, the optimality gaps of the MVP and WSU methods on the type-III scenarios are much greater than those on the type-II scenarios. For both the type-II and type-III scenarios, the greater the value of ρ, the smaller the optimality gaps of the MVP and WSU methods. This is because a greater value of ρ corresponds to a smaller variance of the transition matrices across scenarios, which approaches the problem with type-I scenarios.

Table 2 The optimality gap of MVP and WSU methods on random DFMDPs with three types of multiple scenarios

4.2 Instance of Medical Decision-Making

An MDP for determining the optimal timing of treatments for HIV is considered. In this MDP, HIV is characterized by 4 health states: Mild, Moderate, Severe, or Dead. The decision maker can choose to start the patient on one of three treatments: Treatment A, Treatment B, and Treatment C. Treatment A is the least effective but also the least expensive, while Treatment C is the most effective but comes at the highest cost. A summary table of parameter values for this MDP, as well as sampling distributions for each parameter, is provided in [21]. In our experiments, we sampled two scenarios of the transition probabilities from the Dirichlet distribution in [21] to simulate findings coming from different clinical studies. They are listed below.

$$\begin{aligned} & {{\varvec{p}}}_{1}^{\mathbf{A}}=\left[\begin{array}{llll}0.710&0.209&0.070&0.011\\ 0&0.581&0.400&0.019\\ 0&0&0.739&0.261\\ 0&0&0&1 \end{array}\right], {{\varvec{p}}}_{1}^{\mathbf{B}}=\left[\begin{array}{llll}0.790&0.151&0.051&0.008\\ 0&0.697&0.290&0.013 \\ 0&0&0.811&0.189 \\ 0&0&0&1 \end{array}\right], \\&{{\varvec{p}}}_{1}^{\mathbf{C}}=\left[\begin{array}{llll}0.898&0.074&0.025&0.003\\ 0&0.852&0.142&0.006\\ 0&0&0.908&0.092\\ 0&0&0&1\end{array}\right] \end{aligned}$$
$$\begin{aligned} & {{\varvec{p}}}_{2}^{\mathbf{A}}=\left[\begin{array}{llll}0.733&0.198&0.064&0.005\\ 0&0.582&0.408&0.010\\ 0&0&0.753&0.247\\ 0&0&0&1 \end{array}\right], {{\varvec{p}}}_{2}^{\mathbf{B}}=\left[\begin{array}{llll}0.806&0.143&0.046&0.005\\ 0&0.697&0.296&0.007\\ 0&0&0.821&0.179\\ 0&0&0&1 \end{array}\right], \\ & {{\varvec{p}}}_{2}^{\mathbf{C}}=\left[\begin{array}{llll}0.862&0.102&0.033&0.003\\ 0&0.784&0.211&0.005\\ 0&0&0.872&0.128\\ 0&0&0&1\end{array}\right].\end{aligned}$$

The reward data are taken directly from [21, 22] and are the same for both scenarios. Let S = {Mild, Moderate, Severe}, A = {A, B, C}, Z = 9, and γ = 1. With these settings, we obtain the optimal policies \({\pi }_{k}^{*}\) for MDP k (k = 1, 2) and \({\pi }^{*}\) for the DFMDP. We assume that the numbers of patients corresponding to the two scenarios are balanced, so \({{\varvec{b}}}_{{s}_{0}}={[0.5, 0.5]}^{\top}\). We then compute the weighted value (i.e., the average net benefit in [21, 22]) of each policy, \({V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right), \pi ={\pi }_{1}^{*},{\pi }_{2}^{*}, {\pi }^{*}\), as listed in Table 3.
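
For a scenario-independent policy such as \({\pi }_{1}^{*}\) or \({\pi }_{2}^{*}\), this weighted value is simply the belief-weighted average of the policy's per-scenario values, each obtained by standard backward induction. The following Python sketch illustrates that computation; the array shapes and names are illustrative assumptions, rewards are treated as stationary for brevity, and evaluating the belief-dependent policy \({\pi }^{*}\) would additionally require tracking the belief update along each trajectory, which is omitted here.

```python
import numpy as np

def policy_value(policy, rewards, transitions, horizon, gamma=1.0):
    """Value of a fixed Markov deterministic policy in a single scenario.

    policy[t, s]         -- action chosen in state s at epoch t
    rewards[s, a]        -- immediate reward
    transitions[s, a, :] -- next-state distribution
    """
    n_states = rewards.shape[0]
    v = np.zeros(n_states)                      # terminal values
    for t in reversed(range(horizon)):
        v = np.array([rewards[s, policy[t, s]]
                      + gamma * transitions[s, policy[t, s]] @ v
                      for s in range(n_states)])
    return v

def weighted_value(policy, scen_rewards, scen_transitions, horizon, belief0, s0, gamma=1.0):
    """Belief-weighted value of a scenario-independent policy at initial state s0."""
    per_scenario = [policy_value(policy, r, p, horizon, gamma)[s0]
                    for r, p in zip(scen_rewards, scen_transitions)]
    return float(np.dot(belief0, per_scenario))
```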

Table 3 shows that, for patients starting in the mild health state, the average net benefit of policy \({\pi }^{*}\) is greater than that of policies \({\pi }_{k}^{*}, k=1, 2\), by 0.44% and 3.05%, respectively. For patients starting in the moderate health state, the average net benefit of policy \({\pi }^{*}\) is greater than that of policies \({\pi }_{k}^{*}, k=1, 2\), by 0.24% and 12.38%, respectively. For patients starting in the severe health state, the average net benefits of policies \({\pi }^{*},{\pi }_{1}^{*},{\pi }_{2}^{*}\) are all zero.

This instance illustrates that the optimal policy of any single-scenario MDP may fail to achieve the best average gain when the parameters admit multiple scenarios, whereas the optimal policy of the DFMDP does.

Table 3 The average net benefits \({V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right), \pi ={\pi }_{1}^{*},{\pi }_{2}^{*}, {\pi }^{*}\) (Unit: US$)

4.3 A Variant of the Medical Instance

An illustrative instance of the scenario recognition problem is given here. Assume that, for the previous medical instance, there are two authoritative findings with different transition probabilities. We represent them by scenarios k = 1, 2, respectively, as follows.

$$\begin{aligned}&{{\varvec{p}}}_{1}^{\mathbf{A}}=\left[\begin{array}{llll}0.750& 0.150 &0.090&0.010\\ 0 &0.600& 0.380&0.020\\ 0& 0 &0.750& 0.250\\ 0& 0& 0& 1 \end{array}\right], {{\varvec{p}}}_{1}^{\mathbf{B}}=\left[\begin{array}{llll}0.812& 0.112& 0.068 &0.008\\ 0& 0.700& 0.285 &0.015\\ 0& 0 &0.812&0.188\\ 0 &0 &0& 1 \end{array}\right], \\ & {{\varvec{p}}}_{1}^{\mathbf{C}}=\left[\begin{array}{llll}0.875 &0.075&0.045 &0.005\\ 0 &0.800& 0.190 &0.010\\ 0 &0 &0.875& 0.125\\ 0 &0& 0 &1\end{array}\right],\end{aligned}$$
$$\begin{aligned} & {{\varvec{p}}}_{2}^{\mathbf{A}}=\left[\begin{array}{llll}0.787&0.145&0.068&0 \\ 0&0.742&0.245&0.013\\ 0&0&0.771&0.229\\ 0& 0& 0& 1 \end{array}\right], {{\varvec{p}}}_{2}^{\mathbf{B}}=\left[\begin{array}{llll}0.842&0.108&0.050&0 \\ 0&0.809&0.182& 0.009\\ 0&0&0.830&0.170\\ 0& 0& 0&1 \end{array}\right], \\ & {{\varvec{p}}}_{2}^{\mathbf{C}}=\left[\begin{array}{llll}0.898&0.102&0&0 \\ 0.120&0.751&0.129& 0 \\ 0&0.084&0.796&0.120\\ 0&0&0&1\end{array}\right],\end{aligned}$$

where the transition probabilities of scenario 1 come from [22] and those of scenario 2 are hypothetical. The reward data are taken directly from [21, 22] and are the same for both scenarios. Let S = {1, 2, 3} = {Mild, Moderate, Severe}, A = {A, B, C}, Z = 8, and γ = 1.

If a patient exhibits transitions that follow a pattern similar to scenario 1, we infer that the patient belongs to patient group I, and similarly for patient group II. If we have no prior knowledge of the population percentages of these two groups, we simply assume equal probabilities, i.e., \({{\varvec{b}}}_{{s}_{0}}={[0.5, 0.5]}^{\top}\). Now consider a patient currently in the mild health state, to whom we apply the optimal treatment strategy while recognizing which group the patient belongs to. The optimal treatment policy for patients starting in the mild health state (s0 = 1) can be found with the DFBI and is shown in Fig. 2.

Fig. 2 The treatment policy of patients starting in the mild health state

For a patient X whose current health state is mild, the first-year treatment plan is C. At the end of year 1, if the health state remains mild or transitions to moderate, we continue treatment plan C for the patient. If the health state becomes severe, the optimal treatment in year 2 is \({x}_{1}^{*}\left(3,{{\varvec{b}}}_{3}\right)={x}_{1}^{*}\left(3,{{\varvec{e}}}_{1}\right)={x}_{1,1}^{*}\left(3\right)=\mathbf{A}\). The state-belief pair \(\langle 3,{{\varvec{e}}}_{1}\rangle\) implies that patient X belongs to group I rather than group II, since \({p}_{1,3,2}^{\mathbf{C}}=0\) means that it is impossible for a group-II patient's health state to transition from mild to severe. Similarly, for another patient Y whose health state becomes mild at the end of year 2, after two consecutive years of treatment plan C, the optimal treatment in year 3 is \({x}_{2}^{*}\left(1,{{\varvec{b}}}_{1}\right)={x}_{2}^{*}\left(1,{{\varvec{e}}}_{2}\right)={x}_{2,2}^{*}\left(1\right)=\mathbf{C}\), and the state-belief pair \(\langle 1,{{\varvec{e}}}_{2}\rangle\) implies that patient Y belongs to group II (the patient has returned from moderate to mild, and \({p}_{2,1,1}^{\mathbf{C}}=0\) means that this transition is impossible for a group-I patient).
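
The group recognition above is the Bayesian belief update at work: after taking action a in state s and observing the successor state s′, each scenario's belief weight is multiplied by the likelihood it assigns to that transition and then renormalized. A minimal Python sketch is given below (illustrative only; `trans[k, s, a, s_next]` is assumed to hold the transition probability under scenario k):

```python
import numpy as np

def update_belief(belief, trans, s, a, s_next):
    """Bayes update of the scenario belief after observing one transition."""
    likelihood = trans[:, s, a, s_next]         # P(s_next | s, a) under each scenario
    posterior = belief * likelihood
    if posterior.sum() == 0.0:                  # transition impossible under every scenario
        raise ValueError("Observed transition has zero probability in all scenarios.")
    return posterior / posterior.sum()

# The mild -> severe transition under Treatment C has probability 0.045 in
# scenario 1 and 0 in scenario 2, so a uniform prior collapses onto scenario 1:
belief = np.array([0.5, 0.5])
likelihood = np.array([0.045, 0.0])
print(belief * likelihood / (belief * likelihood).sum())   # -> [1. 0.]
```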

This is only an illustrative example of very limited scale. To apply this idea to real-world scenario recognition problems, the method needs to be combined with algorithms for infinite horizon DFMDPs, which is one of our main directions for future work.

5 Conclusions and Future Work

For the weighted value problem of MDPs with multiple scenarios of the parameters, we introduce the concept of the scenario belief, which represents the probability distribution over the scenarios realized in the system, and derive its update at every decision epoch based on Bayes' rule. We formulate the expected total discounted reward of a policy by adding the expectation over the scenario beliefs on top of the usual expectation over the intrinsic state transition uncertainty, and establish a new framework named the DFMDP. We show that the state-belief pair is a sufficient statistic of the past history and contains the complete information required for decision-making. We discuss four classes of policies for finite horizon DFMDPs and prove that there exists a double-factored Markovian deterministic policy that is optimal among all classes of policies. We propose the double-factored backward induction algorithm, an efficient exact solution method over double-factored Markovian deterministic policies for finite horizon DFMDPs. We also show that the operator in the optimality equation for infinite horizon DFMDPs is an isotone mapping and a contraction under the supremum norm, which ensures the existence of a solution to the optimality equation. Our work enriches the theory of MDPs and expands their application scope.

Future work will focus on the following items:

(i) The restrictions on the state set and action set can be relaxed so that the DFMDP model applies to more real-world problems.

(ii) Efficient solution methods for infinite horizon DFMDPs will be studied further.

(iii) More efficient offline and online algorithms can be developed to solve large-scale DFMDPs arising in real-world applications.

(iv) Investigating a learning-based approach within the DFMDP framework is another direction for future exploration.