1 Introduction

The Markov decision process (MDP) is an effective tool for modeling and decision-making in dynamic environments and is used in many disciplines such as robotics, automatic control, economics, manufacturing and so on [1,2,3,4,5]. MDPs capture the intrinsic uncertainty of a system, namely the randomness of its future transitions. However, the indeterminacy of the parameters (transition probabilities and rewards) of an MDP is a source of uncertainty that is not well addressed in many MDP problems. Moreover, in practice it is often very difficult to obtain a precise representation of the MDP parameters. There are various reasons for this, such as (i) imprecise or conflicting information given by the content experts [6]; (ii) insufficient data from which to build structures for parameter estimation [7]; (iii) non-stationary transition probabilities due to insufficient and/or hidden state information [7]; and (iv) unpredictable but prevailing events impacting the system [8].

There are generally two categories of parameter uncertainty in MDPs. In the first, one or both of the parameters are completely unknown. For such problems, the theory of the Bayes-adaptive Markov decision process (BAMDP) has been established using the concept of Bayesian inference [9, 10]. In a BAMDP, prior information about the unknown parameters is represented by a parameterized distribution, and Bayesian inference is used to incorporate any new information and update that distribution, so that the exploration–exploitation problem is addressed during learning and sampling. [11] builds a rigorous framework rooted in information theory for solving MDPs with unknown transition probabilities. These theories, however, are very difficult to implement for real-world problems. In the second category, the parameters lie in a given range; in other words, there is a pre-defined uncertainty set to which the parameters belong. In this context, the more traditional approach to mitigating parameter uncertainty in MDPs is robust dynamic programming. Standard robust dynamic programming is a “max–min” approach in which the decision maker seeks a policy that maximizes the worst-case performance when the transition probabilities are allowed to vary within an uncertainty set [6, 12,13,14,15]. One of the key results is that the max–min problem is tractable for instances that satisfy the rectangularity property [12, 13]. Essentially, rectangularity means that observing the realization of a transition probability parameter gives no information about the values of the parameters for any other state-action-time triplet. Because the parameter values for different state-action-time triplets are independent of one another, the problem can be decomposed so that each worst-case parameter is found via an optimization problem called the inner problem. [12, 13] provide algorithms for solving the max–min problem for a variety of uncertainty sets by providing polynomial-time methods for solving the corresponding inner problem. While rectangular uncertainty sets are desirable from a computational perspective, they can give rise to policies that are overly conservative.

Recently, there has been a line of research on mitigating the impact of parameter uncertainty by incorporating multiple scenarios of the parameters into the solution of the MDP. The parameter uncertainty in the MDP is encoded by allowing multiple scenarios of the rewards and transition probabilities. A scenario is one possible realization of the uncertainty and may come from a possible realization of the system, an observation of the system, or a simulation model; this model therefore falls into the second category of parameter uncertainty above. Each scenario is referred to as a “model” of the MDP in [4], where all models are defined on the same state space, action space, and set of decision epochs, but each model may vary in terms of its rewards and transition probabilities. Such an MDP is known as the multi-model Markov decision process in [4] and the concurrent Markov decision process in [5]. We refer to it as the Markov decision process with multiple scenarios of the parameters, or the multi-scenario MDP. [4, 5, 16, 17] propose designing policies that maximize a weighted value across multiple scenarios. [18] proposes minimizing the maximum regret with respect to each scenario for finite horizon multi-scenario MDPs. [17] and [16, 19] propose a percentile optimization approach for finite horizon and infinite horizon multi-scenario MDPs, respectively. So far, the exact solution methods for this problem have relied on mixed-integer programming (MIP) formulations in which binary decision variables encode the policy and continuous variables encode the value functions for each scenario of the multi-scenario MDP [4]. Although the MIP can be used to find exact solutions, these problems are NP-hard [4, 17], and the MIP solution methods have been limited to small multi-scenario MDPs.

The study of multi-scenario MDPs is motivated by parameters estimated under statistical uncertainty [4]: the resulting estimates can be dramatically different or even conflict with one another. This is especially common in clinical/medical research, where MDPs have been used successfully to design optimal treatment and screening protocols, yet the longitudinal observational data used to characterize the MDP model are often very limited due to the patient population and the cost of acquisition. In this context, there is a need to design policies that are generally favorable across several parameter possibilities, each known as a model or scenario in the related literature. Moreover, we believe that when a policy is actually implemented on an individual case in practice, the realized state transitions contain information that can be used to infer how likely the individual is to follow each scenario. And with an initial set of weights, or distribution, over the scenarios, the policy is developed considering all future possibilities, so that it is optimal even before any information is collected or any action is implemented.

So in this article, we propose a new theoretical framework for the weighted value problem of multi-scenario MDPs together with methods for computing its exact solution. We introduce the concepts of the scenario belief and the scenario expectation to formulate the expected total discounted reward of a policy, and establish a new framework named the double-factored Markov decision process (DFMDP). The two factors in “double-factored” are the physical state and the scenario belief of the system, which together act as a sufficient statistic for the past history of observations and actions in the multi-scenario MDP. We will demonstrate that these two factors summarize all the information gained up to the current stage and thus provide complete information for future decision-making. We analyze four classes of policies for the finite horizon DFMDPs: double-factored history-dependent randomized, double-factored history-dependent deterministic, double-factored Markovian randomized, and double-factored Markovian deterministic policies. It is shown that there exists a double-factored Markovian deterministic policy which is optimal among all of these policies, and an exact solution method over the double-factored Markovian deterministic policies is proposed. We also formulate the infinite horizon DFMDPs in a later section, but the focus of this paper remains on finite horizon problems. With this framework, we are able to develop algorithms that scale to real-world problems and solve exactly the weighted value problem of the multi-scenario MDPs, with the side benefit of continually updating the scenario distribution based on the full history of observations and actions. So beyond the weighted value problem, which is our primary goal, this framework can also be extended to learning or recognition problems.

Beyond the clinical setting, where multiple well-established authorities or studies may be available to build the MDP parameters, consider the context of e-commerce, where MDP-based recommendation systems have been used [20] and there are ample data to partition the customer population into sub-categories based on census data and shopping behavior. When a new customer shops online, we can make better tailored recommendations as we gain confidence in the sub-category the customer fits, based on his/her past reactions. Yet even before the customer performs any action, we can still make recommendations under an initial belief, or set of weights, over all sub-categories that maximizes the expected objective, measured by either sales or click-through rate.

This article is organized as follows. In Sect. 2, the definition and notation of the multi-scenario MDPs are presented. In Sect. 3, the finite horizon and the infinite horizon DFMDPs are formulated. Four classes of policies are discussed and an exact solution method is proposed for the finite horizon DFMDPs. The computational experiments involving three sets of test instances for comparing several solution methods for the finite horizon DFMDPs are illustrated in Sect. 4. Section 5 summarizes the work presented in this article and shares plans for future research on the relevant topics.

2 Background and Notation

2.1 Standard MDP

We denote the standard MDP by a 5-tuple (T, S, A, (p,r), s0) where T = {0, 1,\(\cdots\), Z − 1} is the finite discrete set of decision epochs, S is the finite state set, A is the finite action set, and (p,r) is the parameter pair in which p is the |S| ×|S| ×|A| transition probability matrix whose element \(p_{{s,{\text{s}}^{\prime } }}^{a}\) is the probability of ending in state s′ ∈ S if the system performs action a ∈ A in state s ∈ S, and r is the |S| ×|S| ×|A| reward matrix whose element \(r_{{s,{\text{s}}^{\prime } }}^{a}\) is the reward obtained by ending in state s′ ∈ S if the system performs action a ∈ A in state s ∈ S; s0 ∈ S is the initial state. Moreover, we denote the terminal reward by \({\varvec{r}}^{0} = \left[ {r_{1}^{0} , \cdots ,r_{\left| S \right|}^{0} } \right]^{\top}\) and the discount factor by γ ∈ [0, 1].

Decision epoch t is also called time t. The interval between two successive decision epochs is called a period. We consider the Markov deterministic policy π = (x0, x1, ···, xZ-1) ∈ Π where xt: S → A is the decision rule at epoch t ∈ T, which assigns an action at = xt (st) to each state st ∈ S, and Π is the policy set. The goal of the decision maker is to specify a policy π ∈ Π that maximizes the expected total discounted reward over the planning horizon [1]

$$\mathop {\max }\limits_{\pi \in \Pi } {\text{E}}_{{s_{0} }}^{\pi } \left\{ {\mathop \sum \limits_{t = 0}^{Z - 1} \gamma^{t} r_{{s_{t} }}^{{a_{t} }} + \gamma^{Z} r_{{s_{Z} }}^{0} } \right\},\forall\, s_{0} \in S,$$
(1)

where \(r_{s}^{a} = \sum\limits_{s' \in S} {p_{{s,s^{\prime } }}^{a} } r_{{s,s^{\prime } }}^{a}\). The optimal policy of (1) can be found by the backward induction algorithm in [1].
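For concreteness, the following is a minimal Python/NumPy sketch of the backward induction recursion for problem (1). The function name `backward_induction` and the array layouts are our own illustrative assumptions, not code from the paper.

```python
import numpy as np

def backward_induction(p, r, r_term, gamma, Z):
    """Backward induction for the finite horizon problem (1).

    p[a, s, s1] : probability of ending in s1 when action a is taken in state s
    r[a, s]     : immediate expected reward r_s^a
    r_term[s]   : terminal reward r_s^0
    gamma       : discount factor, Z : number of decision epochs
    Returns value functions V[t, s] and decision rules x[t, s].
    """
    nA, nS, _ = p.shape
    V = np.zeros((Z + 1, nS))
    x = np.zeros((Z, nS), dtype=int)
    V[Z] = r_term                       # boundary condition: V_Z = r^0
    for t in range(Z - 1, -1, -1):      # t = Z-1, ..., 0
        Q = r + gamma * (p @ V[t + 1])  # Q[a, s] = r_s^a + gamma * sum_{s1} p_{s,s1}^a V_{t+1}(s1)
        x[t] = Q.argmax(axis=0)         # optimal decision rule x_t
        V[t] = Q.max(axis=0)
    return V, x
```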

The Bellman equations for the infinite horizon MDP are

$$V^{*} \left( s \right) = \mathop {{\text{max}}}\limits_{a \in A} \left\{ {r_s^a + \gamma \mathop \sum \limits_{{s^{\prime } \in {\text{S}}}} p_{s,s'}^a V^{*} \left( {s^{\prime } } \right)} \right\}, \forall\, s \in S,$$
(2)

where \({V}^{*}\left(s\right)\) is the optimal value function of state s for the infinite horizon MDP. The optimal stationary and deterministic policy of this problem can be found by the value iteration or the policy iteration in [1].
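For completeness, a minimal sketch of value iteration for the Bellman equations (2), using the same illustrative array layout as the sketch above; the tolerance and iteration cap are our own choices.

```python
import numpy as np

def value_iteration(p, r, gamma, tol=1e-8, max_iter=100_000):
    """Value iteration for the Bellman equations (2); same array layout as above."""
    nA, nS, _ = p.shape
    V = np.zeros(nS)
    for _ in range(max_iter):
        V_new = (r + gamma * (p @ V)).max(axis=0)   # one application of the Bellman operator
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = (r + gamma * (p @ V)).argmax(axis=0)   # greedy stationary deterministic policy
    return V, policy
```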

2.2 Multi-Scenario MDP

We denote the multi-scenario MDP by a 6-tuple \((T,S,A,C,{J}_{{\varvec{p}},{\varvec{r}}},\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle )\) where T, S and A are as defined above, C is the finite discrete set of scenarios, Jp,r = {(p1, r1),···,(p|C|, r|C|)} is the set of parameter pairs in which (pk, rk) is the parameter pair of scenario k\(\in\)C, whose elements \(p_{{s,{\text{s}}^{\prime } ,k}}^{a}\) and \(r_{{s,{\text{s}}^{\prime } ,k}}^{a}\) have the same definitions as \(p_{{s,{\text{s}}^{\prime } }}^{a}\) and \(r_{{s,{\text{s}}^{\prime } }}^{a}\) above, and \(\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle\) is the ordered pair consisting of the initial state s0\(\in\)S and the initial probability distribution of the scenarios, \({{\varvec{b}}}_{{s}_{0}}={[{b}_{{s}_{0},1},\cdots,{b}_{{s}_{0},|C|}]}^{\top}\in {\Delta }^{C}\), where ΔC denotes the set of probability distributions on the set C.

For ease of description, we refer to the standard MDP with the parameter pair (pk, rk), k ∈ C, as MDP k, as in [4]. We use \(r_{s,k}^{a} = {\mathop \sum\limits_{s' \in S}} {p_{{s,s^{\prime } ,k}}^{a} } r_{s,s\prime ,k}^{a}\) to denote the immediate expected reward in MDP k when action a ∊ A is taken in state s\(\in\)S and \({{\varvec{r}}}_{k}^{0}={[{r}^{0}_{1,k},\,\dots\,,{r}_{|S|,k}^{0}]}^{\top}\) to denote the terminal reward in MDP k.

Given a multi-scenario MDP \((T,S,A,C,{J}_{{\varvec{p}},{\varvec{r}}},\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle )\), the value of a policy π ∈ Π in MDP k, for a specific k ∈ C, is given by its expected total discounted reward evaluated with the parameter pair (pk, rk):

$${V}_{k}^{\pi }\left({s}_{0}\right)={E}_{{s}_{0}}^{\pi }\left[\sum_{t=0}^{Z-1}{\gamma }^{t}{r}_{{s}_{t,}k}^{{a}_{t}}+{\gamma }^{Z}{r}_{{s}_{Z},k}^{0}\right].$$
(3)

The weighted value of any policy π ∈ Π in the multi-scenario MDP is defined by

$${W}^{\pi }({s}_{0},{b}_{{s}_{0}})=\sum_{k\in C}{b}_{{s}_{0},k}{V}_{k}^{\pi }\left({s}_{0}\right)=\sum_{k\in C}{b}_{{s}_{0},k}{E}_{{s}_{0}}^{\pi }\left[\sum_{t=0}^{Z-1}{\gamma }^{t}{r}_{{s}_{t},k}^{{a}_{t}}+{\gamma }^{Z}{r}_{{s}_{Z},k}^{0}\right],$$
(4)

and the weighted value problem (WVP) is defined as the problem of finding a solution to

$${W}^{*}\left({s}_{0},{b}_{{s}_{0}}\right)=\underset{\pi \in \Pi }{\text{max}}{W}^{\pi }\left({s}_{0},{b}_{{s}_{0}}\right)=\underset{\pi \in \Pi }{\text{max}}\left\{\sum_{k\in C}{b}_{{s}_{0},k}{E}_{{s}_{0}}^{\pi }\left[\sum_{t=0}^{Z-1}{\gamma }^{t}{r}_{{s}_{t},k}^{{a}_{t}}+{\gamma }^{Z}{r}_{{s}_{Z},k}^{0}\right]\right\}.$$
(5)

We also have the set of policies Π* = {π: Wπ = W*} \(\subseteq\) Π that achieve the maximum in (5) (see [4]).

Ref. [4] proposes two approximation methods and an MIP formulation for Eq. (5) and shows that the WVP over the Markov deterministic policy class is an NP-hard problem.
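Although finding the maximizer in (5) is NP-hard, evaluating the weighted value of a fixed Markov deterministic policy is straightforward: run a separate policy evaluation (3) in each scenario and weight the results as in (4). The following is a minimal sketch of this evaluation; the function names and array layouts are illustrative assumptions, not part of any existing code base.

```python
import numpy as np

def scenario_value(p_k, r_k, r0_k, policy, gamma, s0):
    """V_k^pi(s0) in (3): evaluate a Markov deterministic policy in the single MDP k.

    p_k[a, s, s1], r_k[a, s], r0_k[s] : parameters (p_k, r_k) and terminal reward of MDP k
    policy[t, s] : action chosen at epoch t in state s
    """
    Z, nS = policy.shape
    V = r0_k.copy()                                  # V_Z = terminal reward r_k^0
    for t in range(Z - 1, -1, -1):
        a = policy[t]                                # actions for all states at epoch t
        P = p_k[a, np.arange(nS)]                    # P[s, s1] = p_{s,s1,k}^{x_t(s)}
        V = r_k[a, np.arange(nS)] + gamma * (P @ V)
    return V[s0]

def weighted_value(scenarios, b0, policy, gamma, s0):
    """W^pi(s0, b_{s0}) in (4): the belief-weighted sum of per-scenario values.

    scenarios : list of (p_k, r_k, r0_k) tuples, one per scenario k; b0 : weights b_{s0}
    """
    return sum(b0[k] * scenario_value(p_k, r_k, r0_k, policy, gamma, s0)
               for k, (p_k, r_k, r0_k) in enumerate(scenarios))
```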

3 Double-Factored Markov Decision Process

In this section, a novel framework named the double-factored Markov decision process (DFMDP) is formulated for the WVP of the multi-scenario MDP. A backward induction algorithm is proposed for computing the exact solutions of DFMDPs.

3.1 Scenario Belief and Its Update

The weights in (5) are one’s estimate of the probabilities that the scenarios occur in the multi-scenario MDP. We regard \({{\varvec{b}}}_{{s}_{0}}\) as only the initial probability distribution of the scenarios; it will change as the system evolves. In fact, the system’s state transitions under specific actions contain information about the scenario probabilities. The decision maker can use this information to update the scenario probabilities according to the observed state transitions as the system evolves. For that reason, we introduce the scenario belief bt as follows to describe the scenario probability distribution at time t:

$${{\varvec{b}}}_{t}={[{b}_{t,1},\cdots,{b}_{t,k},\cdots ,{b}_{t,|C|}]}^{\top}, {b}_{t,k}\geqslant 0, \forall\, k\in C \quad \text{and} \quad \sum_{k\in C}{b}_{t,k}=1.$$

In order to derive the update function for the scenario belief, we define gt as the total available information in the multi-scenario MDP at decision epoch t. Since at−1 denotes the action implemented at decision epoch t – 1 and st the state observed at decision epoch t, we have

$${g}_{t}=\left({s}_{t},{a}_{t-1},{g}_{t-1}\right).$$
(6)

Then the elements of the scenario belief are defined as

$${b}_{t,k}=\mathrm{Pr}\left({c}_{k}|{g}_{t}\right), \forall\,k\in C,$$
(7)

where ck denotes the event that the realization of the system’s scenario is k. Substituting (6) into (7) and applying Bayes' rule yields

$${b}_{t,k}=\mathrm{Pr}\left({c}_{k}|{s}_{t},{a}_{t-1},{g}_{t-1}\right)=\frac{\mathrm{Pr}\left({s}_{t}|{c}_{k},{a}_{t-1},{g}_{t-1}\right)\mathrm{Pr}\left({c}_{k}|{a}_{t-1},{g}_{t-1}\right)}{\mathrm{Pr}\left({s}_{t}|{a}_{t-1},{g}_{t-1}\right)}, \forall\,k\in C.$$
(8)

In (8), the first probability in the numerator is the transition probability of scenario k, i.e., \(\mathrm{Pr}\left({s}_{t}|{c}_{k},{a}_{t-1},{g}_{t-1}\right)={p}_{{s}_{t-1},{s}_{t},k}^{{a}_{t-1}}\) since \({g}_{t-1}=({s}_{t-1},{a}_{t-2},{g}_{t-2})\) contains st−1. The prior probability \(\mathrm{Pr}\left({c}_{k}|{a}_{t-1},{g}_{t-1}\right)\) is independent of at−1 given gt−1, since the choice of action at time t − 1 is completely under the decision maker's control and therefore carries no information about the scenario, so \(\mathrm{Pr}\left({c}_{k}|{a}_{t-1},{g}_{t-1}\right)=\mathrm{Pr}\left({c}_{k}|{g}_{t-1}\right)={b}_{t-1,k}\). The denominator is simply the sum of the numerator over all possible values of k, one for each scenario. So we have

$${b}_{t,k}=\frac{{b}_{t-1,k}{p}_{{s}_{t-1},{s}_{t},k}^{{a}_{t-1}}}{{\sum }_{k\mathrm{^{\prime}}\in C}{b}_{t-1,k\mathrm{^{\prime}}}{p}_{{s}_{t-1},{s}_{t},k\mathrm{^{\prime}}}^{{a}_{t-1}}} , \forall\,k\in C.$$
(9)

We can see from (9) that bt is a function of st, st−1, bt−1 and at−1. We therefore use \({{\varvec{b}}}_{{s}_{t}}\) to denote the scenario belief when state st is observed at time t, in order to distinguish it from the beliefs associated with other observed states; the time index t then appears only as the subscript of st, which simplifies the notation. With this in mind, Eq. (9) can be rewritten as

$${b}_{{s}_{t},k}=\frac{{b}_{{s}_{t-1},k}{p}_{{s}_{t-1},{s}_{t},k}^{{a}_{t-1}}}{\sum _{k\mathrm{^{\prime}} \in C}{b}_{{s}_{t-1},k\mathrm{^{\prime}}}{p}_{{s}_{t-1},{s}_{t},k\mathrm{^{\prime}}}^{{a}_{t-1}}}, \forall\,{a}_{t-1}\in A, {s}_{t-1},{s}_{t}\in S, k\in C.$$
(10)

Equation (10) can also be expressed as a function \({{\varvec{b}}}_{{s}_{t}}=\tau \left({s}_{t-1},{{\varvec{b}}}_{{s}_{t-1}},{a}_{t-1},{s}_{t}\right)\), where τ(·) is called the scenario belief update function. We stipulate that \({{\varvec{b}}}_{{s}_{t}}=\mathbf{0}\) if the denominator in (10) equals zero, which corresponds to the situation where state st is unreachable at time t. Whenever the denominator is positive, \({\varvec{b}}_{{s_{t} }} \in \Delta^{C}\), so ΔC is also called the scenario belief space.
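A minimal implementation of the scenario belief update (10), including the zero-belief convention for unreachable states, may look as follows; the function name `tau` mirrors the notation τ(·) in the text, and the array layout is an illustrative assumption.

```python
import numpy as np

def tau(s_prev, b_prev, a_prev, s_new, p):
    """Scenario belief update (10).

    p[k, a, s, s1] : transition probabilities of scenario k
    b_prev         : scenario belief b_{s_{t-1}} (length |C|)
    Returns b_{s_t}; the zero vector if s_new is unreachable (zero denominator).
    """
    likelihood = p[:, a_prev, s_prev, s_new]       # p_{s_{t-1}, s_t, k}^{a_{t-1}} for each k
    unnormalized = b_prev * likelihood
    denom = unnormalized.sum()
    if denom == 0.0:                               # convention: unreachable state s_t
        return np.zeros_like(b_prev)
    return unnormalized / denom
```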

Starting from any state s0\(\in\)S and scenario belief \({{\varvec{b}}}_{{\mathrm{s}}_{0}}\in {\Delta }^{C}\), the scenario belief \({{\varvec{b}}}_{{s}_{t}}\) in state st at time t can be obtained by applying (10) repeatedly whenever st is observed. We denote the combination of st and \({{\varvec{b}}}_{{s}_{t}}\) by an ordered pair \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\), which is referred to as the ordered pair of state and scenario belief at time t, or the state-belief pair for short. It follows that \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle \in S\times {\Delta }^{C}\) where \(S\times {\Delta }^{C}\) is the Cartesian product of S and ΔC. In addition, the scenario belief update function can be written as \({{\varvec{b}}}_{{s}_{t}}=\tau \left(\langle {s}_{t-1},{{\varvec{b}}}_{{s}_{t-1}}\rangle ,{a}_{t-1},{s}_{t}\right)\) to highlight the state-belief pair.

The important feature of (10) is that the calculation of \({{\varvec{b}}}_{{s}_{t}}\) in state \({s}_{t}\) at time t requires only the state-belief pair \(\langle {s}_{t-1},{{\varvec{b}}}_{{s}_{t-1}}\rangle\) at time t − 1; thus, \(\langle {s}_{t-1},{{\varvec{b}}}_{{s}_{t-1}}\rangle\) summarizes all the information gained prior to time t − 1 and represents a sufficient statistic for the complete past history of the process gt−1. This important feature sets the foundation of the double-factored decision theory.

3.2 Scenario Expectation and Expected Total Reward

Starting with any \(\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \in S\times {\Delta }^{C}\) and using (10) repeatedly, all state-belief pairs \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) and actions at for t = 0, 1,\(\cdots\), Z form the transition tree of the system shown in Fig. 1. Within the tree, there are usually |A| actions following each state-belief pair and |S| state-belief pairs following each action. Each \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) therefore usually has |A|·|S| successors \(\langle {s}_{t+1},{{\varvec{b}}}_{{s}_{t+1}}\rangle =\langle {s}_{t+1},\tau (\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle ,{a}_{t},{s}_{t+1})\rangle\). And apart from \(\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle\), each \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) has a unique predecessor.

Fig. 1 The transition tree

There are two types of special state-belief pairs in the transition tree. If \({{\varvec{b}}}_{{s}_{t}}={{\varvec{e}}}_{0}\) in state st at some decision epoch t, where e0 is the |C|-dimensional zero vector, the corresponding state-belief pair \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) is referred to as a pseudo state-belief pair. The pair \(\langle {s}_{t},{{\varvec{e}}}_{0}\rangle\) appears because state st is unreachable; thus, a pseudo state-belief pair is a "leaf" of the tree and has no successors. If \({{\varvec{b}}}_{{s}_{t}}={{\varvec{e}}}_{k}\) in state st at some decision epoch t, where ek is the |C|-dimensional unit vector whose kth element is one, the corresponding state-belief pair \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) is referred to as a degenerate state-belief pair.

Any path \((\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle ,{a}_{0},\langle {s}_{1},{{\varvec{b}}}_{{s}_{1}}\rangle ,\cdots ,{a}_{t-1},\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle )\) along the transition tree is a possible realization of the system's evolution. Thus the history for \(\left(T,S,A,C,{J}_{{\varvec{p}},{\varvec{r}}},\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \right)\) is defined as the sequence of state-belief pairs and actions, i.e., \({h}_{t}=\left(\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle ,{a}_{0},\langle {s}_{1},{{\varvec{b}}}_{{s}_{1}}\rangle ,\cdots ,{a}_{t-1},\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle \right)\), which is referred to as the history up to time t from \(\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle\) onward, or the history up to time t for short. As a convention, \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) denotes the state-belief pair at time t when the history is ht. The history ht follows the recursion \({h}_{t}=\left({h}_{t-1},{a}_{t-1},\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle \right)\) with \({h}_{0}=\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle\). We let Ht denote the set of all possible histories up to time t and call it the history set up to time t from \(\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle\) onward, or the history set up to time t for short.

Similar to the standard MDPs, there are four classes of decision rules xt for the DFMDPs. History-dependent randomized decision rules map the history set Ht into the probability distribution set ΔA on action set A, i.e., \({x}_{t}:{H}_{t}\to {\Delta }^{A}\) or \({\mu }_{{h}_{t}}\left(\cdot \right)={x}_{t}\left({h}_{t}\right),\mu \left(\cdot \right)\in {\Delta }^{A},{h}_{t}\in {H}_{t}\). History-dependent deterministic decision rules map the history set Ht into the action set A, i.e., \({x}_{t}:{H}_{t}\to A\) or \({a}_{t}={x}_{t}\left({h}_{t}\right),{h}_{t}\in {H}_{t}\). Double-factored Markovian randomized decision rules map S × ΔC into ΔA, i.e., \({x}_{t}:S\times {\Delta }^{C}\to {\Delta }^{A}\) or \({\mu }_{\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle }\left(\cdot \right)={x}_{t}\left({s}_{t},{{\varvec{b}}}_{{s}_{t}}\right),\mu \left(\cdot \right)\in {\Delta }^{A},\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle \in S\times {\Delta }^{C}\). Double-factored Markovian deterministic decision rules map S × ΔC into A, i.e., \({x}_{t}:S\times {\Delta }^{C}\to A\) or \({a}_{t}={x}_{t}\left({s}_{t},{{\varvec{b}}}_{{s}_{t}}\right),\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle \in S\times {\Delta }^{C}\). A double-factored Markovian deterministic decision rule specifies the action choice only considering the fact that the system is in state st with belief \({{\varvec{b}}}_{{s}_{t}}\) at time t. We will prove later that it is Markovian (memoryless) because it depends on the previous history only through the current state-belief pair of the system, i.e., \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) is a sufficient statistic of the past history as mentioned above.

The policies composed of the above four classes of decision rules, \(\pi=(x_{0},x_{1},\cdots,x_{Z-1})\), are referred to as double-factored history-dependent randomized policies, double-factored history-dependent deterministic policies, double-factored Markovian randomized policies and double-factored Markovian deterministic policies, respectively. And the sets of all policies of these classes are denoted by ΠDHR, ΠDHD, ΠDMR, and ΠDMD, respectively.

The relationship between the various classes of policies is as follows: \({\Pi }^{\mathrm{DMD}}\subset {\Pi }^{\mathrm{DMR}}\subset {\Pi }^{\mathrm{DHR}}\) and \({\Pi }^{\mathrm{DMD}}\subset {\Pi }^{\mathrm{DHD}}\subset {\Pi }^{\mathrm{DHR}}\). We will show later with a series of theorems that the optimal double-factored Markovian deterministic policies are the best among all classes of policies for the finite horizon DFMDPs.

We next use a one-period decision-making problem to introduce the concept of the scenario expectation. In a one-period problem, Z = 1 and T = {0}. We assume that whenever the system is in state s1 at the end of this period, the decision maker receives a terminal reward V(s1), where V is a given real-valued function on S. Suppose the system sits in state s0 with scenario belief \({{\varvec{b}}}_{{s}_{0}}\) at the start of the period. The decision maker aims to select an action a\(\in\)A that maximizes the expected total discounted reward. Suppose he chooses a deterministic policy π = (x0), which selects action a\(\in\)A at the initial decision epoch.

In this problem, the expected reward for the MDP k is

$${V}_{k}^{a}\left({s}_{0}\right)={r}_{{s}_{0},k}^{a}+\gamma \sum_{{s}_{1}\in S}{p}_{{s}_{0,}{s}_{1,}k}^{a}V({s}_{1}).$$
(11)

Since the scenario realization of the system is uncertain before the decision-making, we can consider the problem from the perspective of expectations over possible scenarios and define the expected total discounted reward as follows:

$${V}^{a}\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)=\sum_{k\in C}{b}_{{s}_{0},k}{V}_{k}^{a}\left({s}_{0}\right)=\sum_{k\in C}{b}_{{s}_{0},k}\left[{r}_{{s}_{0},k}^{a}+\gamma \sum_{{s}_{1}\in S}{p}_{{s}_{0,}{s}_{1,}k}^{a}V({s}_{1})\right].$$
(12)

Letting \({\overline{r} }_{{s}_{0}}^{a}=\sum_{k\in C}{b}_{{s}_{0},k}{r}_{{s}_{0},k}^{a}\) and \({\overline{p} }_{{s}_{0},{s}_{1}}^{a}=\sum_{k\in C}{b}_{{s}_{0},k}{p}_{{s}_{0},{s}_{1},k}^{a}\), (12) is rewritten as

$${V}^{a}\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)={\overline{r} }_{{s}_{0}}^{a}+\gamma \sum_{{s}_{1}\in S}{\overline{p} }_{{s}_{0},{s}_{1}}^{a}V({s}_{1}).$$
(13)

The optimal action a* is obtained by finding the maximum of (13) over all actions in set A:

$${a}^{*}\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)\in \underset{a\in A}{\text{argmax}}\left\{{\overline{r} }_{{s}_{0}}^{a}+\gamma \sum_{{s}_{1}\in S}{\overline{p} }_{{s}_{0},{s}_{1}}^{a}V\left({s}_{1}\right)\right\}.$$

If the decision maker uses a randomized policy with probability μ(a) to select actions in state s0, the expected total discounted reward equals \(\sum_{a\in A}\mu (a)\left\{{\overline{r} }_{{s}_{0}}^{a}+\gamma \sum_{{s}_{1}\in S}{\overline{p} }_{{s}_{0},{s}_{1}}^{a}V({s}_{1})\right\}\), where \(\sum_{a\in A}\mu \left(a\right)=1\) and \(\mu \left(a\right)\geqslant 0\) for a\(\in\)A. Since

$$\underset{\mu \in {\Delta }^{A}}{\text{max}}\left\{\sum_{a\in A}\mu (a)\left[{\overline{r} }_{{s}_{0}}^{a}+\gamma \sum_{{s}_{1}\in S}{\overline{p} }_{{s}_{0},{s}_{1}}^{a}V({s}_{1})\right]\right\}=\underset{a\in A}{\text{max}}\left\{{\overline{r} }_{{s}_{0}}^{a}+\gamma \sum_{{s}_{1}\in S}{\overline{p} }_{{s}_{0},{s}_{1}}^{a}V({s}_{1})\right\},$$

obviously we cannot obtain a larger expected reward in state s0 by means of randomized policies.
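In code, the one-period rule above amounts to forming the scenario-expectation reward and transition probabilities of (13) and taking an argmax over actions. The following is a minimal sketch under the same illustrative array layout as before.

```python
import numpy as np

def one_period_action(s0, b0, p, r, V_term, gamma):
    """Optimal one-period action a*(s0, b_{s0}) from (13).

    p[k, a, s, s1] : transition probabilities, r[k, a, s] : expected rewards r_{s,k}^a
    V_term[s1]     : terminal reward V(s1), b0 : scenario belief at s0
    """
    r_bar = b0 @ r[:, :, s0]                              # scenario-expectation rewards, one per action
    p_bar = np.einsum('k,kaj->aj', b0, p[:, :, s0, :])    # scenario-expectation transition probabilities
    values = r_bar + gamma * (p_bar @ V_term)             # V^a(s0, b_{s0}) for every action a
    return int(values.argmax()), values
```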

We generalize the idea of scenario expectations above to the decision horizon of \(\left(T,S,A,C,{J}_{{\varvec{p}},{\varvec{r}}},\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \right)\) and use it to define the expected total discounted reward generated by a policy. Let \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) be the state-belief pair in state st at decision epoch t. We define

$$\begin{aligned} & \overline{r}_{{s_{t} }}^{{a_{t} }} = \mathop \sum \limits_{k \in C} b_{{s_{t} ,k}} r_{{s_{t} ,k}}^{{a_{t} }} ,\forall\, a_{t} \in A, \\ & \overline{p}_{{s_{t} ,s_{t + 1} }}^{{a_{t} }} = \mathop \sum \limits_{k \in C} b_{{s_{t} ,k}} p_{{s_{t} ,s_{t + 1} ,k}}^{{a_{t} }} ,\forall \,a_{t} \in A,s_{t + 1} \in S,\\ \end{aligned}$$
(14)

where \({\overline{r} }_{{s}_{t}}^{{a}_{t}}\) is called the scenario expectation reward and \({\overline{p} }_{{s}_{t},{s}_{t+1}}^{{a}_{t}}\) the scenario expectation transition probability in state st at decision epoch \(\textit{t}\). Let \({V}^{\pi }({s}_{0},{{\varvec{b}}}_{{s}_{0}})\) represent the expected total discounted reward over the decision-making horizon if policy π is used and the system has an initial state-belief pair \(\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \in S\times {\Delta }^{C}\). For π\(\in\) ΠDHR, it is defined by

$${V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)={E}_{\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle }^{\pi }\left\{\sum_{t=0}^{Z-1}{\gamma }^{t}{\overline{r} }_{{s}_{t}}^{{a}_{t}}+{\gamma }^{Z}{\overline{r} }_{{s}_{Z}}^{0}\right\},$$
(15)

where \({\overline{r} }_{{s}_{Z}}^{0}=\sum_{k\in C}{b}_{{s}_{Z},k}{r}_{{s}_{Z},k}^{0}\). For π\(\in\) ΠDHD and π\(\in\) ΠDMD, the expected total discounted reward can be respectively expressed as

$${V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)={E}_{\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle }^{\pi }\left\{\sum_{t=0}^{Z-1}{\gamma }^{t}{\overline{r} }_{{s}_{t}}^{{x}_{t}\left({h}_{t}\right)}+{\gamma }^{Z}{\overline{r} }_{{s}_{Z}}^{0}\right\},$$
(16)

and

$${V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)={E}_{\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle }^{\pi }\left\{\sum_{t=0}^{Z-1}{\gamma }^{t}{\overline{r} }_{{s}_{t}}^{{x}_{t}({s}_{t},{{\varvec{b}}}_{{s}_{t}})}+{\gamma }^{Z}{\overline{r} }_{{s}_{Z}}^{0}\right\}.$$
(17)

It will be shown later by Theorem 7 that \({V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)\) equals \({W}^{\pi }({s}_{0},{\varvec b}_{{s}_{0}})\) in (4).

3.3 Finite Horizon DFMDP

According to the above theory, the optimization problem for a finite horizon DFMDP is expressed as

$${V}^{*}\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right) =\underset{\pi \in {\Pi }^{\text{DHR}}}{\text{max}}{V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right), \forall\, \langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \in S\times {\Delta }^{C}.$$
(18)

Now we provide a recursive algorithm to evaluate \({V}^{\pi }({s}_{0},{{\varvec{b}}}_{{s}_{0}})\). Let \({U}_{t}^{\pi }:{H}_{t}^{\pi }\to \text{R}\) denote the expected total discounted reward obtained by using a fixed policy π at decision epoch t, t + 1, ···, Z, where \({H}_{t}^{\pi }\) is the history set following the policy π up to time t and \({H}_{t}^{\pi }={H}_{t}\) for π\(\in\) ΠDHR. If the history at decision epoch t is \(h_{t} \in H_{t}^{\pi }\), then we define \({U}_{t}^{\pi }(t<Z)\) by

$${U}_{t}^{\pi }\left({h}_{t}\right)={E}_{{h}_{t}}^{\pi }\left\{\sum_{n=t}^{Z-1}{\gamma }^{n-t}{\overline{r} }_{{s}_{n}}^{{a}_{n}}+{\gamma }^{Z-t}{\overline{r} }_{{s}_{Z}}^{0}\right\}.$$
(19)

To simplify the notation, assume that a deterministic π\(\in\) ΠDHD has been specified. In practice, we will not need to evaluate randomized policies, because subsequent results establish that deterministic policies are optimal under the expected total reward criterion. For a given \(\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle\), the corresponding \({H}_{t}^{\pi }\) is defined by

$$\begin{aligned} & H_{0}^{\pi } = \left\{ {h_{0} } \right\} = \left\{ {\langle s_{0} ,{\varvec{b}}_{{s_{0} }} } \rangle\right\}, \\ & H_{t}^{\pi } = \left\{ {h_{t} :h_{t} = \left( {h_{t - 1} ,x_{t - 1} \left( {h_{t - 1} } \right),\langle s_{t} ,\tau \left( {\langle s_{t - 1} ,{\varvec{b}}_{{s_{t - 1} }} \rangle ,x_{t - 1} \left( {h_{t - 1} } \right),s_{t} } \right)}\rangle \right),\forall\, h_{t - 1} \in H_{t-1}^{\pi} ,\forall\, s_{t} \in S} \right\}. \\ \end{aligned}$$
(20)

It is obvious that \({H}_{t}^{\pi }\subset {H}_{t}\) when \(|A|>1\). If the history at decision epoch t is \(h_{t} \in H_{t}^{\pi }\), \({U}_{t}^{\pi }\) is expressed by

$${U}_{t}^{\pi }\left({h}_{t}\right)={E}_{{h}_{t}}^{\pi }\left\{\sum_{n=t}^{Z-1}{\gamma }^{n-t}{\overline{r} }_{{s}_{n}}^{{x}_{n}\left({h}_{n}\right)}+{\gamma }^{Z-t}{\overline{r} }_{{s}_{Z}}^{0}\right\} .$$
(21)
figure a
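The policy evaluation algorithm itself appears only as figure a. As a reading aid, the following is our own recursive sketch of evaluating \({U}_{t}^{\pi }\left({h}_{t}\right)\) in (21) for a fixed deterministic policy; it is not the authors' figure. It reuses the hypothetical `tau` helper from the belief-update sketch in Sect. 3.1, and the history representation and array layouts are illustrative assumptions.

```python
import numpy as np

def evaluate_policy(h, t, Z, policy, p, r, r_term, gamma):
    """Recursive sketch of U_t^pi(h_t) in (21) for a fixed deterministic policy.

    h       : history h_t, a tuple alternating state-belief pairs (s, b) and actions,
              always ending with the current state-belief pair
    policy  : function (t, h) -> action, a double-factored history-dependent deterministic rule
    p[k, a, s, s1], r[k, a, s], r_term[k, s] : per-scenario parameters of the DFMDP
    """
    s, b = h[-1]
    if t == Z:                                     # boundary: terminal scenario-expectation reward
        return b @ r_term[:, s]
    a = policy(t, h)
    r_bar = b @ r[:, a, s]                         # scenario-expectation reward (14)
    p_bar = b @ p[:, a, s, :]                      # scenario-expectation transition probabilities (14)
    value = r_bar
    for s1 in range(p.shape[-1]):
        if p_bar[s1] == 0.0:                       # pseudo state-belief pair: no contribution
            continue
        b1 = tau(s, b, a, s1, p)                   # scenario belief update (10), see earlier sketch
        value += gamma * p_bar[s1] * evaluate_policy(h + (a, (s1, b1)), t + 1, Z,
                                                     policy, p, r, r_term, gamma)
    return value
```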

The following theorem guarantees that the expected value \({U}_{0}^{\pi }\left({h}_{0}\right)\) generated by the algorithm above is equal to \({V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)\) in (16).

Theorem 1

For any given \({h}_{0}=\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \in S\times {\Delta }^{C}\) and a fixed policy π\(\in\) ΠDHD, suppose \(U_{t}^{\pi } ,t \in T\), has been generated by the policy evaluation algorithm above. Then, for all t\(\leqslant\)Z, (21) holds and \({V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)\) in (16) equals \({U}_{0}^{\pi }\left({h}_{0}\right)\) for any \({h}_{0}=\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \in S\times {\Delta }^{C}\).

For ease of reading, we defer all theorem proofs to “Appendix”.

To generalize the algorithm to randomized policies, an additional summation is required in (22) to account for the probability distribution of the action at decision epoch t under the decision rule xt, as follows:

$${U}_{t}^{\pi }\left({h}_{t}\right)=\sum_{a\in A}{\mu }_{{h}_{t}}(a)\left\{{\overline{r} }_{{s}_{t}}^{a}+\gamma \sum_{{s}_{t+1}\in S}{\overline{p} }_{{s}_{t},{s}_{t+1}}^{a}{U}_{t+1}^{\pi }\left({h}_{t},a,\langle {s}_{t+1},{{\varvec{b}}}_{{s}_{t+1}}\rangle \right)\right\}.$$
(23)

Theorem 1 can be extended to π\(\in\) ΠDHR as follows.

Theorem 2

For any given \(h_{0} = \langle s_{0} ,{\varvec{b}}_{{s_{0} }} \rangle\in S \times {\Delta }^{C}\) and a fixed policy π\(\in\) ΠDHR, suppose \(U_{t}^{\pi } ,t \in T\), has been generated by the policy evaluation algorithm with (23) replacing (22). Then, for all t\(\leqslant\)Z, (19) holds and \(V^{\pi } \left( {s_{0} ,{\varvec{b}}_{{s_{0} }} } \right)\) in (15) equals \(U_{0}^{\pi } \left( {h_{0} } \right)\) for any \(h_{0} = \langle s_{0} ,{\varvec{b}}_{{s_{0} }} \rangle \in S \times {\Delta }^{C}\).

Now let

$${U}_{t}^{*}\left({h}_{t}\right)=\underset{\pi \in {\Pi }^{\text{DHR}}}{\text{max}}{U}_{t}^{\pi }\left({h}_{t}\right).$$

It indicates the maximum over all policies of the expected total discounted reward from decision epoch t onward when the history up to time t is ht.

The optimality equations for the finite horizon DFMDP are given by

$${U}_{t}\left({h}_{t}\right)=\underset{{a}_{t}\in \mathrm{A}}{\text{max}}\left\{{\overline{r} }_{{s}_{t}}^{{a}_{t}}+\gamma \sum_{{s}_{t+1}\in S}{\overline{p} }_{{s}_{t},{s}_{t+1}}^{{a}_{t}}{U}_{t+1}\left({h}_{t},{a}_{t},\langle {s}_{t+1},{{\varvec{b}}}_{{s}_{t+1}}\rangle \right)\right\}$$
(24)

for \({t}\in{T}\) and \({h}_{t}=\left({h}_{t-1},{a}_{t-1},\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle \right)\in {H}_{t}\), where \(\langle {s}_{t+1},{{\varvec{b}}}_{{s}_{t+1}}\rangle =\langle {s}_{t+1},\tau (\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle ,{a}_{t},{s}_{t+1})\rangle\). For \({t}={Z}\) and \({h}_{Z}=\left({h}_{Z-1},{a}_{Z-1},\langle {s}_{Z},{{\varvec{b}}}_{{s}_{Z}}\rangle \right)\in {H}_{Z}\), we add the boundary condition

$${U}_{Z}\left({h}_{Z}\right)={\overline{r} }_{{s}_{Z}}^{0}.$$
(25)

Before stating more theorems, we introduce the following Lemma 1.

Lemma 1

Let f be a real-valued function on an arbitrary finite discrete set A and let μ(·) be a probability distribution on A. Then

$$\underset{a\in A}{\text{sup}} f\left(a\right)\geqslant \sum_{a\in A}\mu \left(a\right)f\left(a\right).$$

The following theorem summarizes the optimality properties of solutions to the optimality equations.

Theorem 3

Let Ht be the history set up to time t from any \({h}_{0}=\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle\) onward. Suppose Ut are solutions of (24) and (25) for t = 0,1,\(\cdots\) , Z. Then

(i) \({U}_{t}\left({h}_{t}\right)={U}_{t}^{*}\left({h}_{t}\right)\) for all ht ∊ Ht, t = 0,1,\(\cdots\) , Z, and

(ii) \({V}^{*}\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)={U}_{0}\left({h}_{0}\right)\).

The following result shows how to use the optimality equations to find the optimal policy and to verify its optimality.

Theorem 4

Suppose \(U_{t}^{*} ,t = 0,1, \cdots ,Z\), are solutions of (24) and (25), and policy \(\pi^{*} = \left( {x_{0}^{*} ,x_{1}^{*},\,\dots\,, x_{Z - 1}^{*} } \right) \in {\Pi }^{{{\text{DHD}}}}\) satisfies

$$\begin{aligned} & \overline{r}_{{s_{t} }}^{{x_{t}^{*} \left( {h_{t} } \right)}} + \gamma \mathop \sum \limits_{{s_{t + 1} \in S}} \overline{p}_{{s_{t} ,s_{t + 1} }}^{{x_{t}^{*} \left( {h_{t} } \right)}} U_{t + 1}^{*} \left( {h_{t} ,x_{t}^{*} \left( {h_{t} } \right),\langle s_{t + 1} ,{\varvec{b}}_{{s_{t + 1} }} \rangle } \right) \\ &= \mathop {{\text{max}}}\limits_{{a_{t} \in A}} \left\{ {\overline{r}_{{s_{t} }}^{{a_{t} }} + \gamma \mathop \sum \limits_{{s_{t + 1} \in S}} \overline{p}_{{s_{t} ,s_{t + 1} }}^{{a_{t} }} U_{t + 1}^{*} \left( {h_{t} ,a_{t} ,\langle s_{t + 1} ,{\varvec{b}}_{{s_{t + 1} }} \rangle } \right)} \right\} \\ \end{aligned}$$
(26)

for \({t}\in{T}\). Then

  1. (i)

    for t = 0,1,\(\cdots\) , Z,

    $$U_{t}^{{\pi^{*} }} \left( {h_{t} } \right) = U_{t}^{*} \left( {h_{t} } \right), \forall\, h_{t} \in H_{t} .$$
  2. (ii)

    π* is the optimal policy, and

    $$V^{{\pi^{*} }} \left( {s_{0} ,{\varvec{b}}_{{s_{0} }} } \right) = V^{*} \left( {s_{0} ,{\varvec{b}}_{{s_{0} }} } \right), \forall \langle s_{0} ,{\varvec{b}}_{{s_{0} }}\rangle \in S \times \Delta^{C} .$$

Note that we have restricted attention to double-factored history-dependent deterministic policies in Theorem 4. This is because, if there existed a double-factored history-dependent randomized policy satisfying the obvious generalization of (26), then, as a result of Lemma 1, we could find a deterministic policy satisfying (26).

Equation (26) can be written as

$$x_{t}^{*} \left( {h_{t} } \right) \in \mathop {{\text{argmax}}}\limits_{{a_{t} \in A}} \left\{ {\overline{r}_{{s_{t} }}^{{a_{t} }} + \gamma \mathop \sum \limits_{{s_{t + 1} \in S}} \overline{p}_{{s_{t} ,s_{t + 1} }}^{{a_{t} }} U_{t + 1}^{*} \left( {h_{t} ,a_{t} ,\langle s_{t + 1} ,{\varvec{b}}_{{s_{t + 1} }} \rangle } \right)} \right\}, t \in T.$$
(27)

The optimal policy π* derived from (27) in Theorem 4 exists because the sets S and A are finite. So we obtain Theorem 5 directly.

Theorem 5

There exists an optimal double-factored history-dependent deterministic policy when the state set and the action set are finite.

We next show that there exists an optimal policy which is double-factored Markovian and deterministic.

Theorem 6

Let \(U_{t}^{*} ,t = 0,1, \cdots ,Z\), be the solutions of (24) and (25). Then

  1. (i)

    for t = 0, 1,\(\cdots\) , Z, \(U_{t}^{*} \left( {h_{t} } \right)\) depends on ht only through \(\langle s_{t} ,{\varvec{b}}_{{s_{t} }}\rangle\), i.e., \(U_{t}^{*} \left( {h_{t} } \right) = U_{t}^{*} \left( {s_{t} ,{\varvec{b}}_{{s_{t} }} } \right)\);

  2. (ii)

    there exists an optimal policy which is double-factored Markovian and deterministic when both S and A are finite.

Theorem 6 shows that there exists a double-factored Markovian deterministic policy that is optimal among all classes of policies. Furthermore, it follows from (A7) in the “Appendix” that the optimality equations can be written in terms of the optimal double-factored value function \({U}_{t}^{*}\left({s}_{t},{{\varvec{b}}}_{{s}_{t}}\right)\):

$${U}_{t}^{*}\left({s}_{t},{{\varvec{b}}}_{{s}_{t}}\right)=\underset{{a}_{t}\in A}{\text{max}}\left\{\sum_{k\in C}{b}_{{s}_{t},k}{r}_{{s}_{t},k}^{{a}_{t}}+\gamma \sum_{{s}_{t+1}\in S}\sum_{k\in C}{b}_{{s}_{t},k}{p}_{{s}_{t},{s}_{t+1},k}^{{a}_{t}}{U}_{t+1}^{*}\left({s}_{t+1},\tau \left(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle ,{a}_{t},{s}_{t+1}\right)\right)\right\}.$$
(28)

So we have established that

$${V}^{*}\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)=\underset{\pi \in {\Pi }^{\text{DHR}}}{\text{max}}{V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)=\underset{\pi \in {\Pi }^{\text{DHD}}}{\text{max}}{V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)=\underset{\pi \in {\Pi }^{\text{DMD}}}{\text{max}}{V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right), \forall\, \langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \in S\times {\Delta }^{C},$$
(29)

where the expected total discounted rewards \({V}^{\pi }({s}_{0},{{\varvec{b}}}_{{s}_{0}})\) generated by policies π\(\in\) ΠDHR, π\(\in\) ΠDHD, and π\(\in\) ΠDMD are expressed by (15), (16) and (17), respectively.

The following theorem establishes the relationship between the WVP in [4] and our finite horizon DFMDP problem.

Theorem 7

The finite horizon DFMDP is equivalent to the WVP of multi-scenario MDP.

Thus a DFMDP can also be denoted by the 6-tuple \(\left(T,S,A,C,{J}_{{\varvec{p}},{\varvec{r}}},\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \right)\) according to the equivalence in Theorem 7.

Now we present the double-factored backward induction algorithm (DFBI) for the exact solutions of the finite horizon DFMDPs based on the above theory.

figure b

The DFBI is an exact solution method for the finite horizon DFMDPs. We know from the transition tree in Fig. 1 that if all state-belief pairs \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) for t = 0,1, ···, Z are reachable, \(|\Psi_t| = (|A|\cdot|S|)^{t}\). So the time complexity of the algorithm is \(O((|A|\cdot|S|\cdot|C|)^{Z})\) and it increases exponentially with the number of decision epochs.
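The DFBI itself is given as figure b. As a reading aid, here is a minimal recursive sketch of the optimality recursion (24)–(25) (equivalently (28)) over the reachable part of the transition tree; it is our own illustration, not the authors' figure, and it again reuses the hypothetical `tau` helper from Sect. 3.1. It deliberately omits the duplicate-detection and storage-saving refinements discussed next, so it revisits identical state-belief pairs and inherits the exponential complexity noted above.

```python
import numpy as np

def dfbi(s, b, t, Z, p, r, r_term, gamma):
    """Sketch of double-factored backward induction: recursion (28) with boundary (25).

    Returns (U_t^*(s, b), optimal action) for the state-belief pair <s, b>.
    p[k, a, s, s1], r[k, a, s], r_term[k, s] : per-scenario parameters of the DFMDP.
    """
    if t == Z:
        return b @ r_term[:, s], None               # boundary condition (25)
    nA, nS = p.shape[1], p.shape[-1]
    best_value, best_action = -np.inf, None
    for a in range(nA):
        r_bar = b @ r[:, a, s]                      # scenario-expectation reward
        p_bar = b @ p[:, a, s, :]                   # scenario-expectation transition probabilities
        value = r_bar
        for s1 in range(nS):
            if p_bar[s1] == 0.0:                    # pseudo state-belief pair: skip
                continue
            b1 = tau(s, b, a, s1, p)                # scenario belief update (10)
            value += gamma * p_bar[s1] * dfbi(s1, b1, t + 1, Z, p, r, r_term, gamma)[0]
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
```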

When actually implementing the method, the following approaches can be used to reduce the computational burden so that the algorithm remains practicable for real-world problems.

(i) For any \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\), t = 1, ···, Z, in step 1, if \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) equals an identical pair \(\langle s_{{t^{\prime } }} ,{\varvec{b}}_{{s_{{t^{\prime } }} }} \rangle\) generated earlier (t′ < t) or an identical pair \(\langle s_{t} ,{\varvec{b}}_{{s_{t} }} \rangle^{\prime }\) at the same level, its successors can be copied directly from those of \(\langle s_{{t^{\prime } }} ,{\varvec{b}}_{{s_{{t^{\prime } }} }} \rangle\) or \(\langle s_{t} ,{\varvec{b}}_{{s_{t} }} \rangle^{\prime }\) to reduce the computation. If \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) is a pseudo state-belief pair, no computation of its successors is required.

(ii) For \({U}_{t}^{*}({s}_{t},{{\varvec{b}}}_{{s}_{t}})\) and \({A}_{t}^{*}({s}_{t},{{\varvec{b}}}_{{s}_{t}})\) of \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle \in {\Psi }_{t}\) in step 4, since the same state-belief pairs in set Ψt have the same \({U}_{t}^{*}\) and \({A}_{t}^{*}\) according to Theorem 6, copies of \({U}_{t}^{*}\) and \({A}_{t}^{*}\) are directly used to avoid repetitive computations. When programming the algorithm, let \({U}_{t}^{*}\left({s}_{t},{{\varvec{b}}}_{{s}_{t}}\right)=0\) and \({A}_{t}^{*}\left({s}_{t},{{\varvec{b}}}_{{s}_{t}}\right)=\) \(\varnothing\) when \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle =\langle {s}_{t},{\varvec e}_{0}\rangle\), and \({U}_{t}^{*}\left({s}_{t},{{\varvec{b}}}_{{s}_{t}}\right)={U}_{t,k}^{*}\left({s}_{t}\right)\) and \({A}_{t}^{*}\left({s}_{t},{{\varvec{b}}}_{{s}_{t}}\right)={A}_{t,k}^{*}\left({s}_{t}\right)\) when \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle =\langle {s}_{t},{\varvec e}_{k}\rangle ,k\in C\), where \({U}_{t,k}^{*}\left({s}_{t}\right)\) and \({A}_{t,k}^{*}\left({s}_{t}\right)\) are the optimal value function and the optimal action set of MDP k at time t.

(iii) When implementing the DFBI, we only need to store the set \({\Psi }_{t}\), together with the parent-child relationships between \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle \in {\Psi }_{t}\) and \(\langle {s}_{t+1},{{\varvec{b}}}_{{s}_{t+1}}\rangle \in {\Psi }_{t+1}\), in order to reduce the storage requirements of the algorithm.

There are three types of scenarios illustrated in [16]. They can be described by three sets of parameter pairs, respectively:

  • (I) For certain transition probabilities and uncertain rewards, \({J}_{{\varvec{p}},{\varvec{r}}}={J}_{{{\varvec{p}}}_{0},{\varvec{r}}}=\{\left({{\varvec{p}}}_{0},{{\varvec{r}}}_{1}\right),\cdots ,\left({{\varvec{p}}}_{0},{{\varvec{r}}}_{|C|}\right)\}\).

  • (II) For uncertain transition probabilities and certain rewards, \({J}_{{\varvec{p}},{\varvec{r}}}={J}_{{\varvec{p}},{{\varvec{r}}}_{0}}=\{\left({{\varvec{p}}}_{1},{{\varvec{r}}}_{0}\right),\cdots ,\left({{\varvec{p}}}_{|C|},{{\varvec{r}}}_{0}\right)\}\).

  • (III) For uncertain transition probabilities and rewards, \({J}_{{\varvec{p}},{\varvec{r}}}=\{\left({{\varvec{p}}}_{1},{{\varvec{r}}}_{1}\right),\cdots ,\left({{\varvec{p}}}_{|C|},{{\varvec{r}}}_{|C|}\right)\}\).

Actually, both the type-I and the type-II scenarios are special cases of the type-III scenarios.

The following theorem once again establishes the relationship between the DFMDPs and the standard MDPs.

Theorem 8

The DFMDP \(\left( {T,S,A,C,J_{{{\varvec{p}},{\varvec{r}}}} ,\langle s_{0} ,{\varvec{b}}_{{s_{0} }} }\rangle \right)\) with the type-I of the multiple scenarios, i.e., \(J_{{{\varvec{p}},{\varvec{r}}}} = \left\{ {\left( {{\varvec{p}}_{0} ,{\varvec{r}}_{1} } \right), \cdots ,\left( {{\varvec{p}}_{0} ,{\varvec{r}}_{\left| C \right|} } \right)} \right\}\), is equivalent to the standard MDP with parameter pair (\({\varvec{p}}_{0}\), \(\overline{\varvec{r}}\)) where \(\overline{\varvec{r}} = \sum\limits_{k \in C} {b_{{s_{0} ,k}} {\varvec{r}}_{k} }\).

The Weight-Select-Update (WSU) approximation algorithm in [4] is also a backward induction algorithm. We now show that the WSU solves the DFMDPs with type-I scenarios exactly.

With the notation of this article, the WSU procedure is as follows:

figure c

For the \(\left(T,S,A,C,{J}_{{\varvec{p}},{\varvec{r}}},\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \right)\) with type-I scenarios, we have \({J}_{{\varvec{p}},{\varvec{r}}}=\{\left({{\varvec{p}}}_{0},{{\varvec{r}}}_{1}\right),\cdots ,\left({{\varvec{p}}}_{0},{{\varvec{r}}}_{|C|}\right)\}\). Let \({U}_{t}\left({s}_{t}\right)=\sum_{k\in C}{b}_{{s}_{0},k}{U}_{t,k}\left({s}_{t}\right),t=\mathrm{0,1},\cdots ,Z\). With parameters \({{\varvec{p}}}_{0}\) and \(\overline{{\varvec{r}} }\), (30) and (31) in the above procedure become:

$${x}_{t}({s}_{t})\leftarrow \underset{{a}_{t}\in A}{\text{argmax}}\left\{{\overline{r} }_{{s}_{t}}^{{a}_{t}}+\sum_{{s}_{t+1}\in S}{p}_{{s}_{t},{s}_{t+1},0}^{{a}_{t}}{U}_{t+1}\left({s}_{t+1}\right)\right\},$$
(30′)

and

$$\begin{aligned} & U_{Z} \left( {s_{Z} } \right) = \overline{r}_{{s_{Z} }}^{0} , \\ & U_{t} \left( {s_{t} } \right) \leftarrow \overline{r}_{{s_{t} }}^{{x_{t} \left( {s_{t} } \right)}} + \mathop \sum \limits_{{s_{t + 1} \in S}} p_{{s_{t} ,s_{t + 1} ,0}}^{{x_{t} \left( {s_{t} } \right)}} U_{t + 1} \left( {s_{t + 1} } \right) \\ \end{aligned}.$$
(31′)

The procedure with (30′) and (31′) is exactly the solution procedure for the standard MDP with parameter pair (\({{\varvec{p}}}_{0}\), \(\overline{{\varvec{r}} }\)).

The mean value problem (MVP) heuristic is another approximation algorithm for the DFMDPs. The MVP is a simplified problem in which all parameters take their expected values. For the DFMDPs, the MVP corresponds to the case where all transition probabilities and rewards are weighted as follows:

$$\tilde{r}_{s}^{a} = \mathop \sum \limits_{k \in C} b_{{s_{0} ,k}} r_{s,k}^{a} , \tilde{p}_{{s,s{^{\prime}}}}^{a} = \mathop \sum \limits_{k \in C} b_{{s_{0} ,k}} p_{{s,s{^{\prime}},k}}^{a} , \forall\, s,s^{\prime} \in S,a \in A.$$

That is, the MVP is a standard MDP problem with the parameter pair \((\widetilde{{\varvec{p}}},\widetilde{{\varvec{r}}})\).
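Given the definition above, an MVP solution can be obtained by averaging the per-scenario parameters with the initial belief and calling an ordinary finite horizon MDP solver, e.g. the hypothetical `backward_induction` sketch from Sect. 2.1. The averaging of the terminal reward with the same weights is our own assumption, as the text defines the weighting only for \(\widetilde{{\varvec{p}}}\) and \(\widetilde{{\varvec{r}}}\).

```python
import numpy as np

def solve_mvp(p, r, r_term, b0, gamma, Z):
    """MVP heuristic: average all scenario parameters with the initial belief b_{s0}
    and solve the resulting standard MDP by backward induction.

    p[k, a, s, s1], r[k, a, s], r_term[k, s] : per-scenario parameters.
    """
    p_tilde = np.einsum('k,kasj->asj', b0, p)   # \tilde p_{s,s'}^a
    r_tilde = np.einsum('k,kas->as', b0, r)     # \tilde r_s^a
    r0_tilde = b0 @ r_term                      # belief-weighted terminal reward (assumption)
    return backward_induction(p_tilde, r_tilde, r0_tilde, gamma, Z)
```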

By means of the computational experiments in Sect. 4, we will compare the solutions obtained by the MVP and WSU approximation methods with those obtained by our method. Furthermore, we obtain the following corollary from the definitions of the MVP and WSU methods above.

Corollary

Both the MVP and the WSU methods are exact solution methods for DFMDPs with type-I scenarios.

3.4 Infinite Horizon DFMDP

Now we formulate the infinite horizon DFMDPs. We assume that the state set, the action set and the scenario set are finite, the transition probabilities and rewards of each scenario are stationary (time homogeneous), and the rewards of each scenario are bounded. The discount factor γ satisfies 0 < γ < 1. We consider the stationary deterministic policy π = (x(s, bs), x(s, bs), ···), where the decision rule x: S × ΔC → A specifies the action to be taken at any state-belief pair.

Under the above assumptions, the optimality equations (28) can also be written as

$$\begin{aligned} &{V}_{n}({s}_{n},{{\varvec{b}}}_{{s}_{n}})=\underset{{a}_{n}\in A}{\text{max}}\left\{\sum_{k\in C}{b}_{{s}_{n},k}{r}_{{s}_{n},k}^{{a}_{n}}+\gamma \sum_{{s}_{n+1}\in S}\sum_{k\in C}{b}_{{s}_{n},k}{p}_{{s}_{n},{s}_{n+1},k}^{{a}_{n}}{V}_{n+1}\left({s}_{n+1},\tau \left(\langle {s}_{n},{{\varvec{b}}}_{{s}_{n}}\rangle ,{a}_{n},{s}_{n+1}\right)\right)\right\}.\end{aligned}$$
(32)

Passing to the limit in (32) suggests that equations of the following form will characterize values and optimal policies for the infinite horizon DFMDPs:

$$V(s,{{\varvec{b}}}_{s})=\underset{a\in A}{\text{max}}\left\{\sum_{k\in C}{b}_{s,k}{r}_{s,k}^{a}+\gamma \sum_{s\mathrm{^{\prime}}\in {S}}\sum_{k\in C}{b}_{s,k}{p}_{s,s\mathrm{^{\prime}},k}^{a}V\left(s\mathrm{^{\prime}},\tau \left(\langle s,{{\varvec{b}}}_{s}\rangle ,a,s\mathrm{^{\prime}}\right)\right)\right\},$$
(33)

where V(s, bs) denotes the optimal value function for \(\langle s,{{\varvec{b}}}_{s}\rangle\). The system of equations (33) are called the optimality equations or Bellman equations for the infinite horizon DFMDPs.

The Bellman equations (33) can also be rewritten in the value function mapping form. Let \(\mathcal{V}\) be the space of real-valued bounded functions V: S × ΔC → \(\mathbb{R}\); we define η: S × ΔC × A × \(\mathcal{V}\) → \(\mathbb{R}\) as

$$\eta \left(s,{{\varvec{b}}}_{s},a,V\right)=\sum_{k\in C}{b}_{s,k}{r}_{s,k}^{a}+\gamma \sum_{s\mathrm{^{\prime}}\in {S}}\sum_{k\in C}{b}_{s,k}{p}_{s,s\mathrm{^{\prime}},k}^{a}V\left(s\mathrm{^{\prime}},\tau \left(\langle s,{{\varvec{b}}}_{s}\rangle ,a,s\mathrm{^{\prime}}\right)\right).$$

Now by defining the value function mapping H: \(\mathcal{V}\) → \(\mathcal{V}\) as \(HV\left( {s,{\varvec{b}}_{s} } \right) = {\text{max}}_{a \in A} \eta \left( {s,{\varvec{b}}_{s} ,a,V} \right)\), the Bellman equations (33) can be written as V = HV.
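Because the belief space ΔC is continuous, V cannot be tabulated directly; a sketch of η and of one application of H at a given state-belief pair ⟨s, b⟩, with V supplied as a callable, may look as follows (illustrative names, reusing the hypothetical `tau` helper from Sect. 3.1).

```python
import numpy as np

def eta(s, b, a, V, p, r, gamma):
    """eta(s, b_s, a, V) as defined above (33); V is a callable V(s, b)."""
    r_bar = b @ r[:, a, s]                          # scenario-expectation reward
    p_bar = b @ p[:, a, s, :]                       # scenario-expectation transition probabilities
    total = r_bar
    for s1 in range(p.shape[-1]):
        if p_bar[s1] > 0.0:
            total += gamma * p_bar[s1] * V(s1, tau(s, b, a, s1, p))
    return total

def H(s, b, V, p, r, gamma):
    """(HV)(s, b_s) = max_a eta(s, b_s, a, V): one application of the Bellman mapping."""
    return max(eta(s, b, a, V, p, r, gamma) for a in range(p.shape[1]))
```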

Theorem 9

Let H be the value function mapping defined above, then H is an isotone mapping and a contraction under the supremum norm \(||V||=\text{sup}_{\langle s,{{\varvec{b}}}_{s}\rangle \in S\times {\Delta }^{C}}|V(s,{{\varvec{b}}}_{s})|\).

Since H is a contraction mapping, there exists a unique V*\(\in\)\(\mathcal{V}\) such that V* = HV* by Banach's fixed-point theorem [1]. And for any V0\(\in\)\(\mathcal{V}\), the sequence {Vn} defined below converges to V* (see [1]):

$${V}_{n}=H{V}_{n-1}={H}^{n}{V}_{0}.$$

The theory above is the basis for developing algorithms that solve the infinite horizon DFMDPs. To limit the scope of this paper, the algorithms and case studies for the infinite horizon DFMDPs will be presented in our follow-up articles.

4 Computational Experiments

This section presents computational experiments on three sets of test instances, comparing several solution methods for the finite horizon DFMDPs in terms of run-time and solution quality. The first set of experiments is based on a randomly generated multi-scenario MDP. The second set of experiments is based on a multi-scenario MDP for determining the most cost-effective human immunodeficiency virus (HIV) treatment policy, which has been used for pedagogical purposes in the medical decision-making literature [21, 22]. The third set of experiments is an illustrative instance of the scenario recognition problem.

As benchmarks, in addition to solving the problems with the DFBI algorithm, we also generate solutions from the WSU and MVP methods and the MIP formulation. We use \({V}_{Z}^{*}\) to denote the optimal value obtained by the DFBI, and \({\widetilde{V}}_{Z}^{*}\) to denote the value obtained from either the WSU or the MVP for the multi-scenario MDP with Z decision epochs. Then let

$$\mathrm{Gap}=\frac{{V}_{Z}^{*}-{\widetilde{V}}_{Z}^{*}}{{V}_{Z}^{*}}\times 100\mathrm{\%}.$$

All solution methods are implemented using MATLAB 2017a.

4.1 Random Instance

To generate the random instances, we first define the number of states, actions, scenarios, and decision epochs for the problem. Then the scenario parameters are randomly sampled: the rewards from a uniform distribution and the transition probabilities from a Dirichlet distribution. The initial beliefs are uninformative priors over the scenarios.

Let |S| = |A| = |C| = 4 and Z = 4. The rewards for all scenarios are randomly generated from a uniform distribution between 0 and 1. The transition probabilities are randomly generated from a Dirichlet distribution characterized by |S| parameters \((\rho {\alpha }_{1},\rho {\alpha }_{2},\cdots,\rho {\alpha }_{|S|})\), where \(({\alpha }_{1},{\alpha }_{2},\cdots,{\alpha }_{|S|})\), with αi > 0 ∀ i, is the base measure of the distribution and ρ > 0 is the concentration parameter. We repeat this process, generating 30 instances for each of ρ = 1, 10, 100. Table 1 reports the run-times of the four methods: the MVP, WSU, MIP and DFBI. We find that the MVP and WSU methods solve these instances very quickly (under 0.05 CPU seconds per instance) and the DFBI is also fairly fast (under 5 CPU seconds per instance) while providing exact solutions. The MIP takes much more run-time to compute the exact solutions on these instances.
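A sketch of the instance generator described above is given below. The uniform base measure (all αi = 1), the sampling of the rewards directly as expected rewards \(r_{s,k}^{a}\), and the random seed are our own illustrative assumptions; the paper does not specify these details.

```python
import numpy as np

def random_instance(nS=4, nA=4, nC=4, rho=10.0, seed=0):
    """Generate one random multi-scenario MDP instance as described in Sect. 4.1."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(nS)                                  # base measure (assumption: alpha_i = 1)
    r = rng.uniform(0.0, 1.0, size=(nC, nA, nS))         # rewards ~ U(0, 1)
    p = rng.dirichlet(rho * alpha, size=(nC, nA, nS))    # each transition row ~ Dirichlet(rho * alpha)
    b0 = np.full(nC, 1.0 / nC)                           # uninformative initial belief
    return p, r, b0
```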

Table 1 The solution time (CPU seconds) of the MVP, WSU, MIP and DFBI methods on random DFMDP test instances

We evaluate the optimality gaps of the MVP and WSU methods for the three types of multiple scenarios. Table 2 reports summary statistics for the optimality gaps of the MVP and WSU methods, obtained from these 50 instances for each type of scenario and each concentration parameter. For the type-I scenarios, the results conform with the Corollary in Sect. 3.3, and both the MVP and the WSU find the exact solutions. For fixed ρ, the optimality gaps of the MVP and WSU methods on the type-III scenarios are much greater than those on the type-II scenarios. For both the type-II and type-III scenarios, the greater the value of ρ, the smaller the optimality gaps of the MVP and WSU methods. This is because a greater value of ρ corresponds to a smaller variance of the transition matrices across scenarios, which approaches the problem with type-I scenarios.

Table 2 The optimality gap of MVP and WSU methods on random DFMDPs with three types of multiple scenarios

4.2 Instance of Medical Decision-Making

An MDP for determining the optimal timing of treatments for HIV is considered. In this MDP, HIV is characterized by 4 health states: Mild, Moderate, Severe, or Dead. The decision maker can choose to start the patient on one of three treatments: Treatment A, Treatment B, and Treatment C. Treatment A is the least effective but also the least expensive, while Treatment C is the most effective but comes at the highest cost. A summary table of parameter values for this MDP, as well as sampling distributions for each parameter, is provided in [21]. In our experiments, we sampled two scenarios of the transition probabilities from the Dirichlet distribution in [21] to simulate findings coming from different clinical studies. They are listed below.

$$\begin{aligned} & {{\varvec{p}}}_{1}^{\mathbf{A}}=\left[\begin{array}{llll}0.710&0.209&0.070&0.011\\ 0&0.581&0.400&0.019\\ 0&0&0.739&0.261\\ 0&0&0&1 \end{array}\right], {{\varvec{p}}}_{1}^{\mathbf{B}}=\left[\begin{array}{llll}0.790&0.151&0.051&0.008\\ 0&0.697&0.290&0.013 \\ 0&0&0.811&0.189 \\ 0&0&0&1 \end{array}\right], \\&{{\varvec{p}}}_{1}^{\mathbf{C}}=\left[\begin{array}{llll}0.898&0.074&0.025&0.003\\ 0&0.852&0.142&0.006\\ 0&0&0.908&0.092\\ 0&0&0&1\end{array}\right] \end{aligned}$$
$$\begin{aligned} & {{\varvec{p}}}_{2}^{\mathbf{A}}=\left[\begin{array}{llll}0.733&0.198&0.064&0.005\\ 0&0.582&0.408&0.010\\ 0&0&0.753&0.247\\ 0&0&0&1 \end{array}\right], {{\varvec{p}}}_{2}^{\mathbf{B}}=\left[\begin{array}{llll}0.806&0.143&0.046&0.005\\ 0&0.697&0.296&0.007\\ 0&0&0.821&0.179\\ 0&0&0&1 \end{array}\right], \\ & {{\varvec{p}}}_{2}^{\mathbf{C}}=\left[\begin{array}{llll}0.862&0.102&0.033&0.003\\ 0&0.784&0.211&0.005\\ 0&0&0.872&0.128\\ 0&0&0&1\end{array}\right].\end{aligned}$$

The reward data are taken directly from [21, 22] and are the same for both scenarios. Let S = {Mild, Moderate, Severe}, A = {A, B, C}, Z = 9, and γ = 1. With these settings, we obtain the optimal policies \({\pi }_{k}^{*}\) for MDP k (k = 1, 2) and \({\pi }^{*}\) for the DFMDP. We assume that the numbers of patients corresponding to the two scenarios are balanced, so \({{\varvec{b}}}_{{s}_{0}}={[0.5, 0.5]}^{\top}\). We then compute the weighted value (i.e., the average net benefit in [21, 22]) of each policy, \({V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right), \pi ={\pi }_{1}^{*},{\pi }_{2}^{*}, {\pi }^{*}\), as listed in Table 3.
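
For a scenario-independent policy such as \({\pi }_{1}^{*}\) or \({\pi }_{2}^{*}\), this weighted value is simply the belief-weighted average of the policy's per-scenario values, each obtained by standard backward induction. The following Python sketch illustrates that computation; the array shapes and names are illustrative assumptions, rewards are treated as stationary for brevity, and evaluating the belief-dependent policy \({\pi }^{*}\) would additionally require tracking the belief update along each trajectory, which is omitted here.

```python
import numpy as np

def policy_value(policy, rewards, transitions, horizon, gamma=1.0):
    """Value of a fixed Markov deterministic policy in a single scenario.

    policy[t, s]         -- action chosen in state s at epoch t
    rewards[s, a]        -- immediate reward
    transitions[s, a, :] -- next-state distribution
    """
    n_states = rewards.shape[0]
    v = np.zeros(n_states)                      # terminal values
    for t in reversed(range(horizon)):
        v = np.array([rewards[s, policy[t, s]]
                      + gamma * transitions[s, policy[t, s]] @ v
                      for s in range(n_states)])
    return v

def weighted_value(policy, scen_rewards, scen_transitions, horizon, belief0, s0, gamma=1.0):
    """Belief-weighted value of a scenario-independent policy at initial state s0."""
    per_scenario = [policy_value(policy, r, p, horizon, gamma)[s0]
                    for r, p in zip(scen_rewards, scen_transitions)]
    return float(np.dot(belief0, per_scenario))
```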

Table 3 shows that, for patients starting in the mild health state, the average net benefit of policy \({\pi }^{*}\) is greater than that of policies \({\pi }_{k}^{*}, k=1, 2\), by 0.44% and 3.05%, respectively. For patients starting in the moderate health state, the average net benefit of policy \({\pi }^{*}\) is greater than that of policies \({\pi }_{k}^{*}, k=1, 2\), by 0.24% and 12.38%, respectively. For patients starting in the severe health state, the average net benefits of policies \({\pi }^{*},{\pi }_{1}^{*},{\pi }_{2}^{*}\) are all zero.

This instance illustrates that the optimal policy of any single-scenario MDP may fail to achieve the best average gain when the parameters admit multiple scenarios, whereas the optimal policy of the DFMDP does.

Table 3 The average net benefits \({V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right), \pi ={\pi }_{1}^{*},{\pi }_{2}^{*}, {\pi }^{*}\) (Unit: US$)

4.3 A Variant of the Medical Instance

An illustrative instance of the scenario recognition problem is given here. Assume that, for the previous medical instance, there are two authoritative findings with different transition probabilities. We represent them by scenarios k = 1, 2, respectively, as follows.

$$\begin{aligned}&{{\varvec{p}}}_{1}^{\mathbf{A}}=\left[\begin{array}{llll}0.750& 0.150 &0.090&0.010\\ 0 &0.600& 0.380&0.020\\ 0& 0 &0.750& 0.250\\ 0& 0& 0& 1 \end{array}\right], {{\varvec{p}}}_{1}^{\mathbf{B}}=\left[\begin{array}{llll}0.812& 0.112& 0.068 &0.008\\ 0& 0.700& 0.285 &0.015\\ 0& 0 &0.812&0.188\\ 0 &0 &0& 1 \end{array}\right], \\ & {{\varvec{p}}}_{1}^{\mathbf{C}}=\left[\begin{array}{llll}0.875 &0.075&0.045 &0.005\\ 0 &0.800& 0.190 &0.010\\ 0 &0 &0.875& 0.125\\ 0 &0& 0 &1\end{array}\right],\end{aligned}$$
$$\begin{aligned} & {{\varvec{p}}}_{2}^{\mathbf{A}}=\left[\begin{array}{llll}0.787&0.145&0.068&0 \\ 0&0.742&0.245&0.013\\ 0&0&0.771&0.229\\ 0& 0& 0& 1 \end{array}\right], {{\varvec{p}}}_{2}^{\mathbf{B}}=\left[\begin{array}{llll}0.842&0.108&0.050&0 \\ 0&0.809&0.182& 0.009\\ 0&0&0.830&0.170\\ 0& 0& 0&1 \end{array}\right], \\ & {{\varvec{p}}}_{2}^{\mathbf{C}}=\left[\begin{array}{llll}0.898&0.102&0&0 \\ 0.120&0.751&0.129& 0 \\ 0&0.084&0.796&0.120\\ 0&0&0&1\end{array}\right],\end{aligned}$$

where the transition probabilities of scenario 1 come from [22] and those of scenario 2 are hypothetical. The reward data are taken directly from [21, 22] and are the same for both scenarios. Let S = {1, 2, 3} = {Mild, Moderate, Severe}, A = {A, B, C}, Z = 8, and γ = 1.

If a patient exhibits transitions that follow a pattern similar to scenario 1, we infer that the patient belongs to patient group I, and similarly for patient group II. If we have no prior knowledge of the population percentages of these two groups, we simply assume equal probabilities, i.e., \({{\varvec{b}}}_{{s}_{0}}={[0.5, 0.5]}^{\top}\). Now consider a patient currently in the mild health state, to whom we apply the optimal treatment strategy while recognizing which group the patient belongs to. The optimal treatment policy for patients starting in the mild health state (s0 = 1) can be found with the DFBI and is shown in Fig. 2.

Fig. 2 The treatment policy of patients starting in the mild health state

For a patient X whose current health state is mild, the first-year treatment plan is C. At the end of year 1, if the health state remains mild or transitions to moderate, we continue treatment plan C for the patient. If the health state becomes severe, the optimal treatment in year 2 is \({x}_{1}^{*}\left(3,{{\varvec{b}}}_{3}\right)={x}_{1}^{*}\left(3,{{\varvec{e}}}_{1}\right)={x}_{1,1}^{*}\left(3\right)=\mathbf{A}\). The state-belief pair \(\langle 3,{{\varvec{e}}}_{1}\rangle\) implies that patient X belongs to group I rather than group II, since \({p}_{1,3,2}^{\mathbf{C}}=0\) means that it is impossible for a group-II patient's health state to transition from mild to severe. Similarly, for another patient Y whose health state becomes mild at the end of year 2, after two consecutive years of treatment plan C, the optimal treatment in year 3 is \({x}_{2}^{*}\left(1,{{\varvec{b}}}_{1}\right)={x}_{2}^{*}\left(1,{{\varvec{e}}}_{2}\right)={x}_{2,2}^{*}\left(1\right)=\mathbf{C}\), and the state-belief pair \(\langle 1,{{\varvec{e}}}_{2}\rangle\) implies that patient Y belongs to group II (the patient has returned from moderate to mild, and \({p}_{2,1,1}^{\mathbf{C}}=0\) means that this transition is impossible for a group-I patient).
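
The group recognition above is the Bayesian belief update at work: after taking action a in state s and observing the successor state s′, each scenario's belief weight is multiplied by the likelihood it assigns to that transition and then renormalized. A minimal Python sketch is given below (illustrative only; `trans[k, s, a, s_next]` is assumed to hold the transition probability under scenario k):

```python
import numpy as np

def update_belief(belief, trans, s, a, s_next):
    """Bayes update of the scenario belief after observing one transition."""
    likelihood = trans[:, s, a, s_next]         # P(s_next | s, a) under each scenario
    posterior = belief * likelihood
    if posterior.sum() == 0.0:                  # transition impossible under every scenario
        raise ValueError("Observed transition has zero probability in all scenarios.")
    return posterior / posterior.sum()

# The mild -> severe transition under Treatment C has probability 0.045 in
# scenario 1 and 0 in scenario 2, so a uniform prior collapses onto scenario 1:
belief = np.array([0.5, 0.5])
likelihood = np.array([0.045, 0.0])
print(belief * likelihood / (belief * likelihood).sum())   # -> [1. 0.]
```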

This is only an illustrative example of very limited scale. To apply this idea to real-world scenario recognition problems, the method needs to be combined with algorithms for infinite horizon DFMDPs, which is one of our main directions for future work.

5 Conclusions and Future Work

For the weighted value problem of MDPs with multiple scenarios of the parameters, we introduce the concept of the scenario belief, which represents the probability distribution over the scenarios realized in the system, and derive its update at every decision epoch based on Bayes' rule. We formulate the expected total discounted reward of a policy by adding the expectation over the scenario beliefs on top of the usual expectation over the intrinsic state transition uncertainty, and establish a new framework named the DFMDP. We show that the state-belief pair is a sufficient statistic of the past history and contains the complete information required for decision-making. We discuss four classes of policies for finite horizon DFMDPs and prove that there exists a double-factored Markovian deterministic policy that is optimal among all classes of policies. We propose the double-factored backward induction algorithm, an efficient exact solution method over double-factored Markovian deterministic policies for finite horizon DFMDPs. We also show that the operator in the optimality equation for infinite horizon DFMDPs is an isotone mapping and a contraction under the supremum norm, which ensures the existence of a solution to the optimality equation. Our work enriches the theory of MDPs and expands their application scope.

Future work will focus on the following items:

(i) The restrictions on the state set and action set can be relaxed so that the DFMDP model applies to more real-world problems.

(ii) Efficient solution methods for infinite horizon DFMDPs will be studied further.

(iii) More efficient offline and online algorithms can be developed to solve large-scale DFMDPs arising in real-world applications.

(iv) Investigating a learning-based approach within the DFMDP framework is another direction for future exploration.