Computation of weighted sums of rewards for concurrent MDPs

Buchholz, Peter; Scheftelowitsch, Dimitri

doi:10.1007/s00186-018-0653-1

Computation of weighted sums of rewards for concurrent MDPs

Original Article
Published: 31 October 2018

Volume 89, pages 1–42, (2019)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Mathematical Methods of Operations Research Aims and scope Submit manuscript

Computation of weighted sums of rewards for concurrent MDPs

Download PDF

691 Accesses
14 Citations
Explore all metrics

Abstract

We consider sets of Markov decision processes (MDPs) with shared state and action spaces and assume that the individual MDPs in such a set represent different scenarios for a system’s operation. In this setting, we solve the problem of finding a single policy that performs well under each of these scenarios by considering the weighted sum of value vectors for each of the scenarios. Several solution approaches as well as the general complexity of the problem are discussed and algorithms that are based on these solution approaches are presented. Finally, we compare the derived algorithms on a set of benchmark problems.

Tuning the Discount Factor in Order to Reach Average Optimality on Deterministic MDPs

Simple Strategies in Multi-Objective MDPs

Concurrent MDPs with Finite Markovian Policies

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Markov decision processes (MDPs) are commonly used in modeling probabilistic state-based systems with non-deterministic controller actions where the goal is to find a control policy that optimizes some time-based reward measure. Briefly, a Markov decision process describes a Markov chain with inputs that are called actions and rewards that make it possible for a controller to, first, control the evolution of the Markov chain and, second, assess its performance. The controller can then define a policy which selects the right action and, thus, defines a Markov chain with rewards. The existing theory allows one to compute optimal policies for Markov decision processes with respect to various reward measures, such as expected discounted reward, expected average reward, or expected finite-horizon reward (Puterman 2005). Here, the expected discounted reward measure is considered, that is, given a discount factor$\gamma \in [{0, 1})$, we weight the reward from the ith step in the future with $\gamma ^i$, and compute the expected sum of discounted rewards.

The general theory of MDPs is well established and applied in various branches of operations research or artificial intelligence. Several textbooks (Feinberg and Schwartz 2002; Puterman 2005; Sigaud and Buffet 2010) and tutorial papers (White and White 1989) on the subject exist. The final goal is to find an optimal control strategy for a system using the optimal policy resulting from the MDP. However, MDPs, like many other models, are an abstract image of reality which is superposed by several levels of uncertainty. This is reflected in a limited knowledge of the exact values of transition probabilities or rewards. A wide variety of approaches exists to consider uncertainty in MDPs (see the overview in Sect. 1.1). Often uncertainty is coded in the model and then the modified model is analyzed with some assumptions in mind, e.g., being pessimistic or optimistic with respect to the realization of uncertainty. The resulting formalisms that capture additional uncertainty in MDPs share the general property that the computational complexity of finding an optimal policy becomes higher when compared to a “plain” MDP, as it is, for example, the case with stochastic games (Björklund and Vorobyov 2007).

In this paper, we present a specific approach to introduce uncertainty in MDPs by so-called scenarios. A scenario is one possible realization of uncertainty and may result from possible realizations of a system or some observation of a system or from a simulation model. Each scenario defines a new transition probability matrix for an action in the MDP and the complete information specifies a multi-scenario MDP. In this work, we shall concentrate on the weighted optimization problem for this class of models. Formally, this means to compute, given a set of K Markov decision processes with common state and action spaces (but different transition probabilities and rewards) and K weight coefficients, a policy that optimizes the weighted sum of expected discounted rewards assuming that the policy has to be selected before the scenario which finally occurs is known.

This problem can be interpreted in a different way as follows: given a probability distribution on Markov decision processes denoted as scenarios, compute a policy that optimizes the expected discounted reward, without knowing the scenario which is finally realized. This is often denoted as non-anticipative behavior in stochastic optimization (Colvin and Maravelias 2010). We will later show that, on the one hand, this problem is hard to solve but, on the other hand, simple heuristic optimization approaches usually yield optimal or almost optimal results. Before we outline the rest of the paper and introduce our approach, a short overview of related work is given.

1.1 Related work

As already mentioned, there is an enormous amount of work that has been published to describe uncertainty in MDPs. Consequently, we can only give a short summary of approaches that are related to our approach and we briefly outline the differences.

Multi-scenario problems The use of scenarios is one possibility to describe uncertainty. In this case, examples for the realization of uncertainty are given instead of characterizing the whole uncertainty set. Scenarios are commonly used in two-stage or multi-stage stochastic programming (Rockafellar and Wets 1991; Dupacová et al. 2000). In this setting it is usually assumed that information about the realization of a specific scenario is gained over time. Thus, future decisions can take into account additional information by, for example, reducing the set of possible scenarios. This is different from our approach where we assume that a policy has been selected in advance before any information about the concrete scenario becomes available.

Scenarios in MDPs are mentioned in Nilim and Ghaoui (2005, Sect. 7.1) as a simple uncertainty model. In contrast to the scenarios defined here, it is assumed that the scenarios observe the (s, a)-rectangularity property, i.e., they are defined independently for every state-action pair. Concrete application examples for the use of scenarios are product line design problems (Bertsimas and Mišić 2017) where finitely many models for customer behavior are considered, healthcare decision making (Bertsimas et al. 2016), where different screening strategies to detect cancer are available, and project scheduling (Mercier and Hentenryck 2008) where different projects are described by Markov chains. The combination of scenarios and partially observable MDPs is considered in Walraven and Spaan (2015) in the context of planning problems. If one cannot distinguish the current scenario during operation of the system or decisions cannot anticipate the scenario, one has to find a policy that behaves well in all scenarios. This is different from the computation of robust policies which behave good in the worst case but may have a bad performance in the average case.

In our case, every scenario defines its own MDP but all MDPs are coupled by a single policy. The same model has been concurrently developed by Steimle et al. (2018) under the name multi-model MDP. However, Steimle et al. (2018) considers optimal policies for finite horizons whereas we consider optimal stationary policies for discounted infinite horizons. Both problems are shown to be NP-hard. The optimal policy in the finite horizon case depends on the current time step and is shown to be deterministic in Steimle et al. (2018), whereas the optimal stationary policy for the infinite horizon case can be randomized and strictly better than any deterministic policy as shown here. This is similar to partially observable MDPs where the optimal policy can also be randomized (Singh et al. 1994). This interesting relation has also been recognized in Steimle et al. (2018, Sect. 4). Another difference between (Steimle et al. 2018) and this paper are the heuristic algorithms proposed for approximate policies, Steimle et al. (2018) uses an extension of value iteration based on the mean value relaxation of the original problem whereas we propose an approach based on policy iteration on the unrelaxed problem with a guaranteed local convergence.

Furthermore, a similar model has been considered in Raskin and Sankur (2014), however, in a model-checking context of reachability, safety, and parity properties. The results in Raskin and Sankur (2014) show that the qualitative model-checking problems (such as limit-sure reachability) can be solved efficiently while the quantitative problems (such as quantitative reachability and safety) are NP-hard, which corresponds to the hardness results in Steimle et al. (2018) and this paper.

Multi-scenario optimization problems can be interpreted as multi-objective optimization which amounts to optimizing given functions $f_1(x), \ldots , f_K(x)$ subject to $x \in X$ simultaneously. As multi-dimensional spaces can only be partially ordered in a “natural” way, there may exist multiple incomparable optimal solutions. Hence, for multi-objective problems, several approaches are known including the computation of the Pareto frontier, the generation of the convex hull of Pareto optimal solution, the analysis whether a solution exceeds a lower bound and the computation of a weighted sum of function values (Ehrgott 2005). Among these problems, the weighted optimization problem is often the simplest one. Furthermore, the computation of optimal policies for different weight vectors is the central step to characterize the convex hull of Pareto optimal solutions (Roijers et al. 2014).

MDPs with uncertainties Scenarios are only one way of expressing parameter uncertainty. Parameter uncertainty in MDPs has been discussed in a more general context for more than 40 years by various authors. Early references are Satia and Lave (1973) and White and Eldeib (1994). Givan et al. (2000), Iyengar (2005), Klamroth et al. (2013), Nilim and Ghaoui (2005) and Wiesemann et al. (2013) are more recent and important publications. Different forms of representing uncertainty have been proposed. Usually, so-called uncertainty sets of the transition probabilities for the underlying Markov chain and sometimes also the set of rewards are defined. These sets then characterize a usually infinite and not even countable set of MDPs. Very often the optimization problem is then interpreted as a robust optimization problem. Mathematically, robust optimization means finding a policy that maximizes the rewards under the worst possible realization of uncertainty; that is, we solve a problem of the type $\max _{x \in X} \min _{y \in Y} f(x, y)$. It is intuitive that robust mathematical programs are much harder than simple maximization problems. However, in some cases, like for Bounded Parameter MDPs, it is still possible to find optimal policies efficiently (Givan et al. 2000). In contrast to our approach with a fixed number of scenarios, the uncertainty sets define an infinite and usually not countable set of transition kernels. However, the minimum operator usually maps this continuous set on a finite set of optimal transition kernels which then have to be considered in the maximum operation. From this perspective, one can consider the robust problem as a two-stage optimization problem where a scenario-based approach is applied after the inner optimization problem has been solved or an approximate solution for this problem is known. Usually the robust problem is solved for the discounted reward over an infinite horizon.

Concerning the theoretical assumptions behind most MDP problems with uncertainties about the model parameters, we refer to the work of Iyengar (2005). There, the (s, a)-rectangularity property is defined as a generally desirable property of uncertainty sets for transition probability matrices which, intuitively, amounts to separability of the uncertainty set into a product of uncertainty sets of transition probability vectors for each state and each action. Under this assumption, robust policies for MDPs are studied and it is shown that robust policies are pure and can be computed with an effort only modestly larger than the effort of computing optimal policies in MDPs (for details see Iyengar 2005). It is important to note that this assumption holds for the uncertainty models in Givan et al. (2000), Nilim and Ghaoui (2005), Satia and Lave (1973) and White and Eldeib (1994). The most prominent uncertainty model are intervals such that rows of the transition matrix are chosen from sets of probabilistic vectors which are defined by lower and upper bounds for the elements (Givan et al. 2000). Other uncertainty models are a Bayesian formulation (Satia and Lave 1973), likelihood or entropy models (Nilim and Ghaoui 2005).

The (s, a)-rectangularity property has been extended in Wiesemann et al. (2013) to an s-rectangularity property. This intuitively means that dependencies between the uncertain parameters for different decisions in a single state are allowed. It can be shown that for convex uncertainty sets observing the s-rectangularity property, a stationary randomized policy can be found that minimizes the discounted reward. Computation of the optimal policy is harder than for uncertainty sets having the (s, a)-rectangularity property but the effort remains polynomial for convex uncertainty sets and the robust results can be much better as the uncertainty is constrained. However, in Wiesemann et al. (2013) it is also shown that without rectangularity, the problem becomes hard to solve.

Concerning applications of robust MDP models, we refer to a discussion of robust multi-armed bandit problems which have been transformed into MDPs with uncertain parameters observing the rectangularity property (Caro and Das-Gupta 2015), and to the work on product line design problems under model uncertainty (Bertsimas and Mišić 2017).

Stochastic games The $\max \min $ problems mentioned above can also be interpreted as stochastic games with two players. MDPs with uncertainty can be interpreted in this way by assuming that two players which decide in an alternating sequence and one player is trying to minimize the goal function whereas the other one tries to maximize it (Filar and Vrieze 1997). Therefore the approach becomes similar to the robust optimization of MDPs with uncertainties.

Partially observable MDPs The scenario based perspective can also be interpreted as a variant of partially observable MDPs (Kaelbling et al. 1998), where the controller may have only indirect information about the exact state, such as that the state belongs with some probability to a subset of states. Partially observable Markov decision processes have more expressive power and thus high computational solution cost in the general case (Papadimitriou and Tsitsiklis 1987). Here, on the other hand, we have a very specific form of uncertainty since we know that the uncertainty set consists of finitely many possible realizations only.

Coupled MDPs Different MDPs that are combined by coupling constraints are considered for example in Singh and Cohn (1997) to analyze parallel tasks. These models differ from scenario based models because the state and action space of the coupled model consists of the cross product of the MDPs which is different from the scenario based approach where a single state and action space are considered.

Optimization methods Since a major aspect of this paper is the finding of efficient methods to compute or approximate the optimal policy and reward for combined MDPs, we briefly review optimization methods which are relevant for our work. Standard methods for MDPs can be found in Puterman (2005). Convex problems have been discussed in the book of Nesterov and Nemirovskii (1994), further methods of interest are mixed-integer programming (Vielma 2015) and quadratically constrained quadratic programming (Qualizza et al. 2012). Specifically for Markov decision problems, quadratic programming methods have been proposed in Amato et al. (2007). The relationship between robustness and multi-objective problems has been explored in the work of Klamroth et al. (2013).

1.2 Contribution of the paper

The class of concurrent MDPs is defined in this paper. It is based on a finite set of different scenarios which define different transition kernels. The goal is to find a common policy that minimizes or maximizes the weighted sum of rewards if applied to the concurrent MDPs. The approach assumes that the policy has to be found without knowing the scenario. Weights may be interpreted as probabilities that scenarios occur and the expected discounted reward is optimized. The resulting problem is different from two-stage or multi-stage problems, where information about the scenario becomes available during optimization. Problems of the form considered in this paper are important for situations where decisions have to be made beforehand and cannot be modified later when the scenario is known or can be estimated. This is for example the case for decisions made in the planning phase when the coming demands can only be estimated or during the design of a vaccine with only statistical information about future virus variants.

In contrast to Nilim and Ghaoui (2005), where scenarios observe the (s, a)-rectangularity property, scenarios here are general. This allows us to capture also those cases, where the parameters are defined for some model which is a high level description of an MDP. In this case, a single parameter, like the service rate of a queue, appears several times in the transition matrix and has to have the same value for each appearance. This property cannot be described by uncertainty sets that observe s-rectangularity or (s, a)-rectangularity.

The price for introducing this more general form of uncertainty is an increase in the complexity of the resulting optimization problem which is shown to be NP-hard. We will present five different algorithms to compute or approximate the optimal weighted sum of rewards for the concurrent MDPs under a common policy. These algorithms belong to three different classes, namely integer linear programming (ILP), non-linear programming (NLP) and problem-specific local search heuristics. The algorithms are compared empirically by means of several examples. The experiments show that non-linear programming approaches, which might be the natural choice for problems like the one we consider here, often show the worst performance and have convergence issues whereas the problem-specific local search heuristics are guaranteed to find local optima and are much more efficient than general non-linear programming algorithms. A second finding is that, even if the problems are non-convex and pure policies are not sufficient in general, the best pure policy is most times as good as the best general policy computed by a non-linear programming algorithm or by a local search heuristic. Furthermore, it is shown how the best pure policy can be computed from an integer linear program.

1.3 Structure of the work

The formal problem statement is given in Sect. 2, and its mathematical structure is explained and discussed in Sect. 3. Before we dive into optimization approaches, we briefly discuss complexity aspects in Sect. 3.1 (with a proof in “Appendix B”). In Sect. 4, we propose several approaches to solving the optimization problem; the practical evaluation of these algorithms is given in Sect. 5. Finally, Sect. 6 summarizes the work done and gives ideas for future research.

1.4 Notation

Over the course of this work, we shall use a couple of custom notation conventions for brevity and clarity purposes. In particular, multi-dimensional identifiers such as vectors and matrices are written in bold script, such as $\varvec{v}$ or ${\varvec{M}}$; furthermore, matrices are capitalized. To access the individual rows of a matrix ${\varvec{M}}$, we use the ${\varvec{M}}(i\bullet )$ notation to identify the ith row of ${\varvec{M}}$. Analogously, the jth column is identified by ${\varvec{M}}(\bullet j)$.

We also shall use some special identifiers for sets and constants. ${\mathbb {B}}$ is the set $\{0,1\}, \varvec{0}$ is, depending on the context, a matrix or a row vector of zeros (in case of ambiguity, the exact meaning is given), and $\mathbb {1}$ is a column vector of ones; in some cases, we shall write $\mathbb {1}_n$ to clarify that the vector has n dimensions. $\varvec{e}_i$ is the ith basis row vector. For sets of multi-dimensional values over some set ${\mathcal {K}}$ we shall use ${\mathcal {K}}^{n \times m}$ to designate the set of matrices with n rows and m columns where each entry is in ${\mathcal {K}}$. ${\varvec{I}}_N$ is the identity matrix of dimension N.

At some points in this work, Kronecker matrix-analytic operators will be used. We designate by $\otimes $ the Kronecker product.

2 Basic problem

We consider a set ${\mathcal {M}}$ of Markov decision processes (MDPs) with a joint finite state space ${\mathcal {S}}$ and a joint finite action space ${\mathcal {A}}$. Let K be the number of MDPs, N be the number of states and M be the number of actions. We identify the sets by consecutive integers, i.e., ${\mathcal {M}}= \{1,\ldots ,K\}, {\mathcal {S}}= \{1,\ldots ,N\}$ and ${\mathcal {A}}= \{1,\ldots ,M\}$. Each MDP $k \in {\mathcal {M}}$ is defined by

$$\begin{aligned} \left( {\mathcal {S}}, \varvec{\alpha }, \left( {\varvec{P}}^a_k\right) _{a \in {\mathcal {A}}}, \varvec{r}_k\right) \end{aligned}$$

where $\varvec{\alpha }\in {{\mathbb {R}}}^{1 \times N}_{\ge 0}$ with $\varvec{\alpha }\mathbb {1}=1$ is the common initial distribution,

$$\begin{aligned} {\varvec{P}}^a_k = \left( \begin{array}{c} {\varvec{P}}^a_k(1\bullet )\\ \vdots \\ {\varvec{P}}^a_k(N\bullet ) \end{array}\right) \text{ where } {\varvec{P}}^a_k(i\bullet ) \in {{\mathbb {R}}}^{1 \times N}_{\ge 0}, {\varvec{P}}^a_k(i\bullet )\mathbb {1}= 1 \end{aligned}$$

is the stochastic transition matrix for action $a \in {\mathcal {A}}$, and $\varvec{r}_k \in {{\mathbb {R}}}^{N \times 1}_{\ge 0}$ are the non-negative reward vectors. Furthermore, we assume a given discount factor $\gamma \in ({0,1})$.

Here, we consider rewards that are not action-dependent; this does not impose a limitation with respect to modeling power but does simplify some of the mathematical derivations. MDPs with action-dependent rewards can be transformed into equivalent MDPs with rewards that depend only on the state by enlarging the state space. This is formally shown in “Appendix A”.

We consider stationary policies which can be represented by $N \times M$ matrices

$$\begin{aligned} {\varvec{{\varPi }}}= \left( \begin{array}{c} \varvec{\pi }_1\\ \vdots \\ \varvec{\pi }_N \end{array}\right) \text{ where } \varvec{\pi }_i = \left( \varvec{\pi }_i(1), \ldots , \varvec{\pi }_i(M) \right) \in {{\mathbb {R}}}^{1 \times M} \end{aligned}$$

and $\varvec{\pi }_i(a)$ is the probability of choosing $a \in {\mathcal {A}}$ in state $i \in {\mathcal {S}}$. We have $\varvec{\pi }_i(a) \ge 0$ and $\sum _{a \in {\mathcal {A}}} \varvec{\pi }_i(a) = 1$. If for all $i \in {\mathcal {S}}$ it is $\varvec{\pi }_i(a_i)=1$ for some $a_i \in {\mathcal {A}}$, the policy is pure or deterministic otherwise it is randomized. We shall identify policies by the corresponding $N \times M$ matrices. In theory, a pure policy could be represented by a vector of length N which contains in position i the action chosen in state i; however, for the sake of uniformity in the mathematical formalism we will use a more general notation. Let ${\mathcal {P}}$ be the set of stationary policies and ${\mathcal {P}}_{pure}$ the set of pure policies. Policy ${\varvec{{\varPi }}}$ defines for MDP $k \in {\mathcal {M}}$ a stochastic transition matrix

$$\begin{aligned} {\varvec{P}}^{{\varvec{{\varPi }}}}_k = \left( \begin{array}{c} \sum \nolimits _{m=1}^M \varvec{\pi }_1(m){\varvec{P}}^m_k(1\bullet )\\ \vdots \\ \sum \nolimits _{m=1}^M \varvec{\pi }_N(m){\varvec{P}}^j_k(N\bullet ) \end{array}\right) . \end{aligned}$$

Define ${\varvec{C}}_k^{\varvec{{\varPi }}}= {\varvec{I}}- \gamma {\varvec{P}}_k^{\varvec{{\varPi }}}$ with discount factor $\gamma \in (0,1)$. ${\varvec{C}}_k^{\varvec{{\varPi }}}$ is a non-singular M-matrix. The inverse matrix ${\varvec{{\overline{C}}}}_k^{{\varvec{{\varPi }}}} = \left( {\varvec{C}}_k^{\varvec{{\varPi }}}\right) ^{-1}$ exists and is non-negative (Berman and Plemmons 1994). The discounted gain for policy ${\varvec{{\varPi }}}$ and discount factor $\gamma \in ({0,1})$ is given by

$$\begin{aligned} \varvec{g}_k^{\varvec{{\varPi }}}= {\varvec{{\overline{C}}}}_k^{\varvec{{\varPi }}}\varvec{r}_k\quad \text{ and } \quad G^{\varvec{{\varPi }}}_k = \varvec{\alpha }\varvec{g}^{\varvec{{\varPi }}}_k. \end{aligned}$$

(1)

$\varvec{g}_k^{\varvec{{\varPi }}}$ is the value function or value vector of the policy ${\varvec{{\varPi }}}$ for scenario k and $G^{\varvec{{\varPi }}}_k$ is denoted as the (scalar) gain of ${\varvec{{\varPi }}}$ for scenario k under initial distribution $\varvec{\alpha }$. For further details about MDPs we refer to Puterman (2005); the expected discounted reward criterion is covered in Chapter 6 of the textbook.

A set of MDPs of the above form is denoted as a set of concurrent MDPs. We consider the computation of the weighted discounted gain for some weight vector $\varvec{w}\in {{\mathbb {R}}}^{1 \times K}_{\ge 0}$ with $\varvec{w}^T \mathbb {1}= 1$ which is given by

$$\begin{aligned} G^*(\varvec{w}) = \max _{{\varvec{{\varPi }}}\in {\mathcal {P}}} \left( \sum _{k=1}^K \varvec{w}(k)G^{\varvec{{\varPi }}}_k\right) \text{ and } {\varvec{{\varPi }}}^* = \arg \max _{{\varvec{{\varPi }}}\in {\mathcal {P}}} \left( \sum _{k=1}^K \varvec{w}(k)G^{\varvec{{\varPi }}}_k\right) \end{aligned}$$

(2)

which can be considered a stochastic programming problem (Ruszczyński and Shapiro 2009, Chapter 1) as shown in the following section.

If we consider a single MDP $k \in {\mathcal {M}}$, then

$$\begin{aligned} G^*_k = \max _{{\varvec{{\varPi }}}\in {\mathcal {P}}} \left( G^{\varvec{{\varPi }}}_k\right) \quad \text{ and } \quad {\varvec{{\varPi }}}^*_k = \arg \max _{{\varvec{{\varPi }}}\in {\mathcal {P}}} \left( G^{\varvec{{\varPi }}}_k\right) . \end{aligned}$$

(3)

${\varvec{{\varPi }}}^*_k$ is not necessarily unique but it is known that a pure policy exists with gain $G^*_k$ (Puterman 2005).

3 Mathematical structure of the optimization problem

We begin with the computation of an optimal policy for a single MDP. The optimal policy and value vectors can then be computed from a linear program (LP). Thus, the optimal gain for (3) results from the LP (Puterman 2005, Section 6.9.)

$$\begin{aligned} \min _{\varvec{g}_k \in {{\mathbb {R}}}^{N \times 1}} \left( \varvec{\alpha }\varvec{g}_k\right) \text{ subject } \text{ to } \left( \begin{array}{c} \left( {\varvec{I}}- \gamma {\varvec{P}}^1_k\right) \\ \vdots \\ \left( {\varvec{I}}- \gamma {\varvec{P}}^M_k\right) \end{array}\right) \varvec{g}_k \ge \mathbb {1}_M \otimes \varvec{r}_k. \end{aligned}$$

(4)

The LP has N variables and NM constraints. All variables are real-valued, which makes usage of standard LP solving techniques possible. The resulting vector $\varvec{g}_k$ is the value function of the optimal policy. It is important to note that none of the variables in the linear program is a decision variable in the sense that it directly reflects a controller’s action; the decisions are only indirectly visible. Concretely, in an optimal solution, the constraint for state i and action a is tight, i.e., it is $(\varvec{e}_i - \gamma {\varvec{P}}^a_k(i \bullet )) \varvec{g}_k(i) = \varvec{r}_k(i)$ if action a is selected in state i.^{Footnote 1} The dual LP is given, following d’Epenoux (1963), by

$$\begin{aligned} \begin{array}{l} \max _{\varvec{f}_k \in {{\mathbb {R}}}^{MN \times 1}} \varvec{f}_k^T \left( \mathbb {1}_M \otimes \varvec{r}_k\right) \\ \text{ subject } \text{ to } \left( \left( {\varvec{I}}- \gamma {\varvec{P}}^1_k\right) ^T \cdots \left( {\varvec{I}}- \gamma {\varvec{P}}^M_k\right) ^T \right) \varvec{f}_k = \varvec{\alpha }^T \text{ and } \varvec{f}_k \ge 0. \end{array} \end{aligned}$$

(5)

The optimal gain $G^*_k$ and a corresponding optimal pure policy ${\varvec{{\varPi }}}^*_k$ can be computed from the LP or its dual. However, to compute the global optimum $G^*(\varvec{w})$, the LPs have to be coupled, resulting in a non-linear program. To define the non-linear program we first define for $i \in {\mathcal {S}}, k \in {\mathcal {M}}$ the matrices

$$\begin{aligned} {\varvec{A}}^i_k = \left( \begin{array}{c} \varvec{e}_i - \gamma {\varvec{P}}^1_k(i\bullet )\\ \vdots \\ \varvec{e}_i - \gamma {\varvec{P}}^M_k(i\bullet )\ \end{array}\right) \text{ and } \text{ vectors } \varvec{\alpha }_k = \varvec{w}(k)\varvec{\alpha }. \end{aligned}$$

(6)

Then every matrix ${\varvec{C}}_k^{\varvec{{\varPi }}}$ can be represented as

$$\begin{aligned} {\varvec{C}}_k^{\varvec{{\varPi }}}=\left( \begin{array}{c} \varvec{\pi }_1{\varvec{A}}_k^1\\ \vdots \\ \varvec{\pi }_N{\varvec{A}}_k^N \end{array}\right) . \end{aligned}$$

We now derive a non-linear program for the global optimization problem. Equation (4) can be equivalently formulated as

$$\begin{aligned} \min _{{\varvec{{\varPi }}}\in {\mathcal {P}}} \left( \varvec{\alpha }\varvec{g}_k\right) \ \text{ subject } \text{ to } \varvec{\pi }_i{\varvec{A}}_k^i\varvec{g}_k = {\varvec{C}}^{\varvec{{\varPi }}}_k(i \bullet )\varvec{g}_k \ge \varvec{r}_k(i) \text{ for } \text{ all } i \in {\mathcal {S}}. \end{aligned}$$

(7)

To see this, let $\varvec{g}_k^*$ be the minimal solution of (4). Then

$$\begin{aligned}a^*_i = \arg \min _{a \in {\mathcal {A}}} \left( {\varvec{C}}^a_k(i\bullet )\varvec{g}_k^*\right) \end{aligned}$$

exists for $1 \le i \le N$, but must not be unique. By selecting $\varvec{\pi }_i(a^*_i) = 1$, the vector $\varvec{g}_k^*$ becomes the solution of ${\varvec{C}}^{\varvec{{\varPi }}}_k\varvec{g}_k^* = \varvec{r}_k$ and $\varvec{g}^*_k$ becomes a feasible solution of (7). This solution is also optimal because every feasible solution $\varvec{g}_k$ has to observe

$$\begin{aligned} \sum _{m=1}^M\varvec{\pi }_i(m){\varvec{C}}^m_k(i\bullet )\varvec{g}_k \ge \varvec{r}_k \end{aligned}$$

for each $i \in {\mathcal {S}}$ which implies that the choice of the minimum among $a \in {\mathcal {A}}$ results in the smallest vector $\varvec{g}_k$. Observe that the vector $\varvec{g}_k$ depends on ${\varvec{{\varPi }}}$ due to the constraints. We do not use ${\varvec{{\varPi }}}$ as an additional index for $\varvec{g}_k$ to avoid an overloading of notation.

The reformulation of the LP into an NLP (due to the dependencies between $\varvec{g}_k$ and ${\varvec{{\varPi }}}$) is of no use for single problems but allows one to describe the coupling of MDPs via a common policy. The following formulation describes the weighted optimization problem for concurrent MDPs.

$$\begin{aligned} \begin{array}{l} \max _{{\varvec{{\varPi }}}\in {\mathcal {P}}} \left( \min _{\varvec{g}_k \in {{\mathbb {R}}}^{N \times 1}} \sum _{k \in {\mathcal {M}}} \varvec{\alpha }_k \varvec{g}_k\right) \\ \text{ subject } \text{ to } \forall k\in {\mathcal {M}}, \forall i \in {\mathcal {S}}, \forall a \in {\mathcal {A}}:\\ \varvec{\pi }_i{\varvec{A}}_k^i\varvec{g}_k = {\varvec{C}}_k^{\varvec{{\varPi }}}(i \bullet ) \varvec{g}_k \ge \varvec{r}_k(i), \ \sum \limits _{m=1}^M \varvec{\pi }_i(m) = 1, \ \varvec{\pi }_i(a) \ge 0. \end{array} \end{aligned}$$

(8)

It follows from (7) that each feasible vector $\varvec{g}_k$ for MDP k has to observe the constraints. Furthermore it has to be the minimal solution that observes the constraints. The outer maximum operator assures that the maximum of the weighted sum is computed. Vectors $\varvec{\alpha }_k$ encode, by their definition in (6), the weights $\varvec{w}(k)$. The optimization problem has a linear goal function but bilinear constraints which usually define a non-convex feasible region.

The NLP for the dual program can be formulated as

$$\begin{aligned} \begin{array}{ll} \max _{{\varvec{{\varPi }}}\in {\mathcal {P}}} \left( \max _{\varvec{h}_k \in {{\mathbb {R}}}^{N \times 1}} (\varvec{h}_k)^T\varvec{r}_k\right) \\ \text{ subject } \text{ to } \forall i \in {\mathcal {S}}:\\ \sum \limits _{m=1}^M\sum \limits _{l=1}^N\varvec{\pi }_i(m){\varvec{A}}_k^l(m,i)\varvec{h}_k(l) = \left( {\varvec{C}}_k^{\varvec{{\varPi }}}(\bullet i)\right) ^T \varvec{h}_k = \varvec{\alpha }(i) \text{ and } \varvec{h}_k(i) \ge 0. \end{array} \end{aligned}$$

(9)

To show the equivalence to (5) let $\varvec{f}_k^* = \left( (\varvec{f}_k^1)^T,\ldots ,(\varvec{f}_k^M)^T\right) ^T$ be an optimal solution of the dual problem. Then define $\varvec{h}_k = \sum _{j=1}^M \varvec{f}_k^j$ and $\varvec{\pi }_i(j) = \varvec{f}_k^j(i)/\varvec{h}_k(i)$. Then $\varvec{h}_k \ge {\mathbf {0}}$ and

$$\begin{aligned} \begin{array}{lll} \left( {\varvec{C}}_k^{\varvec{{\varPi }}}\right) ^T \varvec{h}_k &{} = \sum \limits _{m=1}^M \left( \left( \begin{array}{ccc} \varvec{\pi }_1(m) \\ &{} \ddots \\ &{}&{} \varvec{\pi }_N(m) \end{array}\right) \left( {\varvec{I}}- \gamma {\varvec{P}}^m_k\right) \right) ^T \varvec{h}_k\\ &{} = \sum \limits _{m=1}^M\left( {\varvec{I}}- \gamma {\varvec{P}}^m_k\right) ^T \left( \begin{array}{ccc} \varvec{\pi }_1(m) \\ &{} \ddots \\ &{}&{} \varvec{\pi }_N(m) \end{array}\right) \varvec{h}_k\\ &{} = \sum \limits _{m=1}^M\left( {\varvec{I}}- \gamma {\varvec{P}}^m_k\right) ^T\varvec{f}_k^m &{} \\ &{} = \varvec{\alpha }\end{array} \end{aligned}$$

which shows that $\varvec{h}_k$ is a feasible solution for the NLP. To show that the solution is also maximal, we assume that a solution $\varvec{{\hat{h}}}_k$ and a policy ${\varvec{{\hat{{\varPi }}}}}$ exist such that the constraints of (9) are observed and $(\varvec{{\hat{h}}}_k)^T\varvec{r}_k^{\varvec{{\hat{\pi }}}} > (\varvec{h}_k)^T\varvec{r}_k$. Then define $\varvec{{\hat{f}}}_k^j(i) = \varvec{{\hat{\pi }}}_i(j)\varvec{{\hat{h}}}_k(i)$, which observes the constraints of (5) and is therefore a feasible solution for the dual LP. Then $\sum _{j=1}^M (\varvec{{\hat{f}}}_k^j)^T\varvec{r}_k = (\varvec{{\hat{h}}}_k)^T\varvec{r}_k \le \sum _{j=1}^M (\varvec{f}_k^j)^T\varvec{r}_k = (\varvec{h}_k)^T\varvec{r}_k$ because $\varvec{f}_k^*$ is assumed to be optimal which implies that $\varvec{{\hat{h}}}_k$ cannot exist and $\varvec{h}_k$ is an optimal solution of (9).

The NLP for the combined MDPs becomes

$$\begin{aligned} \begin{array}{l} \max _{{\varvec{{\varPi }}}\in {\mathcal {P}}} \left( \max _{\varvec{h}_k \in {{\mathbb {R}}}^{N \times 1}} \sum \limits _{k=1}^K (\varvec{h}_k)^T\varvec{r}_k\right) \\ \text{ subject } \text{ to } \forall k\in {\mathcal {M}}, \forall i \in {\mathcal {S}}, \forall a \in {\mathcal {A}}:\\ \sum \limits _{m=1}^M\sum \limits _{l=1}^N\varvec{\pi }_i(m){\varvec{A}}_k^l(m,i)\varvec{h}_k(l) = \left( {\varvec{C}}_k^{\varvec{{\varPi }}}(\bullet i)\right) ^T \varvec{h}_k = \varvec{\alpha }_k(i), \\ \sum \limits _{m=1}^M \varvec{\pi }_i(m) = 1, \ \varvec{\pi }_i(a) \ge 0, \varvec{h}_k(i) \ge 0. \end{array} \end{aligned}$$

(10)

Observe that $\varvec{h}_k(i)$ reflects the mean discounted number of visits in state i of MDP k if the process starts with initial probability $\varvec{\alpha }$ and MDP k is chosen with probability $\varvec{w}(k)$.

Both problems (8) and (10), belong to the class of non-convex quadratically constrained linear programs (QCLP) (Amato et al. 2007) which are a subset of the more general class of quadratically constrained quadratic programs (QCQP) and have some specific properties that can be exploited in optimization algorithms.

3.1 Computational complexity

We now briefly show that the general problem is NP-hard which means that it is commonly agreed that no efficient algorithms exist for this kind of problems. First a decision variant of the optimization problem for concurrent MDPs is defined and then it is shown that this problem is NP-complete.

Definition 1

(Decision problem) Given a set of concurrent MDPs ${\mathcal {M}}$ and real (represented with ${\mathcal {O}}\left( NMK\right) $ bits) vectors and numbers $\varvec{w}\in {{\mathbb {R}}}^K, g \in {{\mathbb {R}}}$, decide if there is a policy ${\varvec{{\varPi }}}\in {\mathcal {P}}$ such that $\sum _{k=1}^{K} \varvec{w}(k) G^{{\varvec{{\varPi }}}}_k \ge g$.

The first part of our NP-completeness proof is to show that the given problem is, in fact, in NP. One can see that the representation of a stationary policy is polynomial as long as the representation of a real number is assumed to consume ${\mathcal {O}}\left( NMK\right) $ bits. Then one can define a non-deterministic algorithm that guesses a policy by guessing ${\mathcal {O}}\left( NMK \times NM\right) $ bits that represent NM real numbers which define a policy ${\varvec{{\varPi }}}$ and then verifies if ${\varvec{{\varPi }}}$ fulfills the relation above.

Now we can prove the more interesting part of the completeness statement.

Theorem 1

The decision problem defined in Definition 1 is NP-complete.

The proof can be found in “Appendix B”. It is based on the reduction of 3-SAT (Garey and Johnson 1978) to the above decision problem with two concurrent MDPs.

4 Computation of optimal policies and rewards

Theorem 1 shows that algorithms to compute the optimal policy either require a potentially very long time or stop prematurely with a non-optimal and thus approximate solution, if a feasible solution has been found at all. We introduce five different approaches to approximate or compute the optimal policy and value vectors and analyze afterwards their performance.

4.1 A solution approach using a mixed integer linear program

If we restrict the policies to ${\mathcal {P}}_{pure}$, then the problem can be formulated as a Mixed Integer Linear Program (MILP). In this case ${\varvec{{\varPi }}}$ is a $N \times M$ matrix in ${{\mathbb {B}}}$ with one element equal to 1 in each row. The MILP solution uses the dual version of the problem given in (5) and (10).

We first define some vectors and matrices. Let

$$\begin{aligned} \begin{array}{ll} \varvec{d}^T=\left( \mathbb {1}^T_M \otimes (\varvec{r}_1)^T,\ldots , \mathbb {1}^T_M \otimes (\varvec{r}_K)^T\right) \in {{\mathbb {R}}}^{1 \times KMN}, \\ {\varvec{B}}_k = \left( \left( {\varvec{I}}-\gamma {\varvec{P}}^1_k\right) ^T,\ldots ,\left( {\varvec{I}}-\gamma {\varvec{P}}^M_k\right) ^T\right) \in {{\mathbb {R}}}^{N \times NM}, \\ {\varvec{B}}= \left( \begin{array}{ccc} {\varvec{B}}_1 \\ &{} \ddots \\ &{} &{} {\varvec{B}}_K \end{array}\right) \in {{\mathbb {R}}}^{KN \times KMN},\\ \varvec{b}= \left( \varvec{\alpha }_1,\ldots ,\varvec{\alpha }_K \right) ^T \in {{\mathbb {R}}}^{KN \times 1},\\ \varvec{y}^T = \left( (\varvec{y}_1^1)^T,\ldots ,(\varvec{y}_1^M)^T,(\varvec{y}_2^1)^T,\ldots ,(\varvec{y}_K^M)^T\right) \in {{\mathbb {R}}}^{1 \times KMN} \text{ where } \varvec{y}_k^j \in {{\mathbb {R}}}^{N \times 1},\\ \varvec{\pi }= \left( \varvec{\pi }_1,\ldots ,\varvec{\pi }_N\right) \text{ where } \varvec{\pi }_n \in {{\mathbb {B}}}^{1 \times M}. \end{array} \end{aligned}$$

(11)

The variable $\varvec{y}_k^m(n)$ describes the discounted number of decisions m in state n of MDP k and characterizes the product $\varvec{\pi }_n(m)\varvec{h}_k(n)$. For the optimal policy ${\varvec{{\varPi }}}^*$ the corresponding vectors are $\varvec{y}_k^*$. The introduction of additional variables for products is a common approach for bilinear functions (Qualizza et al. 2012). The integer optimization problem for the computation of the optimal policy is then defined by

$$\begin{aligned} \begin{array}{l} \max _{\varvec{y}\in {{\mathbb {R}}}^{KMN \times 1}} \varvec{d}^T \varvec{y}\\ \text{ subject } \text{ to } \forall m\in \{1,\ldots ,M\}, \forall k \in \{1,\ldots ,K\}, \forall n \in \{1,\ldots ,N\}:\\ {\varvec{B}}\varvec{y}\le \varvec{b}, \ \sum \limits _{m=1}^M\varvec{\pi }_n(m) = N-1, \ \varvec{\pi }_n(m)\varvec{y}^m_k(n) = {\mathbf {0}}, \ \varvec{y}\ge {\mathbf {0}}. \end{array} \end{aligned}$$

(12)

The program is not an ILP since it contains products of Boolean and real variables $\varvec{\pi }_n(m)\varvec{y}^m_k(n)=0$. However, using a standard trick from ILP modeling (Vielma 2015) each product of this form can be substituted by an additional constraint

$$\begin{aligned} \varvec{y}_k^m(n) + \varvec{\pi }_n(m)U \le U \end{aligned}$$

(13)

where U is an upper bound for the value of $\varvec{y}^m_k(n)$. The value of $\varvec{y}^m_k(n)$ is the number of times the system is in state n and action m is chosen multiplied with the weight $\varvec{w}(k)$. Thus, $\varvec{y}^m_k(n) \le \varvec{h}_k(n)$ has to hold. A simple upper bound for $\varvec{h}_k(n)$ is $(1-\gamma )^{-1}$. For our ILP solution $U = (1-\gamma )^{-1}$ is used; more sophisticated upper bounds will be introduced below.

The resulting ILP contains KMN real variables, MN Boolean variables, N equality constraints, $KMN+KN$ inequality constraints and KMN non-negativity constraints. Modern ILP solvers in principle allow one to compute a global maximum and a policy reaching the maximum. However, for larger problem instances, the effort to close the optimality gap between the best available solution and the computed upper bound might take a long time.

4.2 Using algorithms for non-convex programs

QCQPs have attracted some attention in the literature and until very recently new algorithms for local and global optimization have been published (Castillo et al. 2018; Park and Boyd 2017; Qualizza et al. 2012). Algorithms for global optimization of QCQPs are based on spatial decomposition of the domain of the variables and apply mixed integer linear programs (MILPs) to compute bounds for the optimal solution (Castillo et al. 2018). Unfortunately, spatial decomposition results in an enormous number of new binary variables in the MILP such that the algorithms can only be applied for small problem instances as already argued in Park and Boyd (2017) and thus are not applicable in our context. However, below we adopt some of the ideas applied in the optimization of general QCQPs for the specific problems resulting from concurrent MDPs. Here we first consider the use of general optimization algorithms for non-convex problems which are at most able to compute local optima.

A large number of optimization algorithms for non-convex optimization problems exist. Usually these algorithms cannot exploit the very specific structure of our problem. We present here two generic formulations that can be used as input for different solvers for constrained non-convex problems.

The general representation equals

$$\begin{aligned} \begin{array}{l} \max _{\varvec{z}} f(\varvec{z}) \text{ with } \text{ gradient } f'(\varvec{z})\\ \text{ and } \text{ the } \text{ constraints: } g(\varvec{z}) = {\mathbf {0}} \text{ and } \varvec{z}\ge {\mathbf {0}}. \end{array} \end{aligned}$$

(14)

$f(\cdot )$ is a scalar function, the other functions are vector-valued functions. The above representation can be used as input for standard optimization algorithms like fmincon from MATLAB or sqp from Octave.

Again we solve (10) and define

$$\begin{aligned} \varvec{z}= \left( \varvec{h}_1^T,\ldots , \varvec{h}_K^T, \varvec{\pi }_1, \ldots , \varvec{\pi }_N\right) ^T \in {{\mathbb {R}}}^{N(K+M) \times 1}. \end{aligned}$$

Then

$$\begin{aligned} f(\varvec{z})&= \sum _{k=1}^K \varvec{h}_k^T\varvec{r}_k, \ f'(\varvec{z}) = \left( \varvec{r}_1^T,\ldots ,\varvec{r}_K^T, {\mathbf {0}}\right) \\ g(\varvec{z})&= \left( \begin{array}{c} \varvec{\pi }_1 \mathbb {1}- 1 \\ \vdots \\ \varvec{\pi }_N \mathbb {1}- 1 \\ \left( {\varvec{C}}_1^{\varvec{{\varPi }}}\right) ^T\varvec{h}_1 - \varvec{\alpha }_1\\ \vdots \\ \left( {\varvec{C}}_K^{\varvec{{\varPi }}}\right) ^T\varvec{h}_K - \varvec{\alpha }_K \end{array}\right) \text{ and } \varvec{z}\ge {\mathbf {0}}. \end{aligned}$$

The problem contains $N(K+M)$ variables, $N(K + 1)$ equality constraints, and $N(K+M)$ non-negativity constraints. This formulation of the problem results in a non-convex QCLP. Our observation is that general optimization algorithms often show a bad convergence behavior for this kind of problems. The main reason for the convergence problems of the algorithms seems to be the strong coupling between the variables in ${\varvec{{\varPi }}}$ and $\varvec{h}_k$ with the equality constraints which often hinders the algorithms to find a successful search direction for larger problems.

The above problem description contains redundant information because policy ${\varvec{{\varPi }}}$ completely determines $\varvec{h}_k$. This will now be exploited in a reformulation of the problem with a more complex objective function and less constraints using a problem-specific reformulation. As already mentioned, for each policy ${\varvec{{\varPi }}}$, matrix ${\varvec{C}}^{\varvec{{\varPi }}}_k$ is a non-singular M-matrix which implies that it has a non-negative inverse. Thus, the gain for policy ${\varvec{{\varPi }}}$ can be represented as a function f defined on the domain of policy matrices with

$$\begin{aligned} f({\varvec{{\varPi }}}) = \sum _{k=1}^K \varvec{\alpha }_k {\varvec{{\overline{C}}}}^{\varvec{{\varPi }}}_k \varvec{r}_k. \end{aligned}$$

(15)

To compute the gradient, we have to consider the change of one element in ${\varvec{{\varPi }}}$. Formally, ${\varvec{{\varPi }}}$ can be interpreted as vector $\varvec{\pi }$ of length MN. Changing $\varvec{\pi }_n(m)$ by $\lambda $ means to add $\lambda \varvec{e}_n^T {\varvec{A}}_k^{n}(m \bullet ) = \lambda \varvec{e}_n^T (\varvec{e}_n - \gamma {\varvec{P}}_k^m(n \bullet ))$ to ${\varvec{C}}^{\varvec{{\varPi }}}_k$. This is a rank-one update such that the inverse matrix can be computed as in Hager (1989). If ${\varvec{{\varPi }}}'$ results from ${\varvec{{\varPi }}}$ by changing $\varvec{\pi }_n(m) \mathrel {+}=\lambda $ (with $\lambda \le \varvec{\pi }_n(m)$), then

$$\begin{aligned} f({\varvec{{\varPi }}}') = f({\varvec{{\varPi }}}) - \lambda \sum _{k = 1}^{K}\varvec{\alpha }_k{\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k(\bullet n)\left( \frac{ {\varvec{A}}_k^{n}(m \bullet ) {\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k \varvec{r}_k}{1 + \lambda {\varvec{A}}_k^n(m \bullet ) {\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k(\bullet n)}\right) + o(\lambda ^2) \end{aligned}$$

(16)

for $\lambda $ small enough such that the inverse matrix exists. The derivative is computed, as usual, with $\lim _{\lambda \rightarrow 0} \frac{f({\varvec{{\varPi }}}') - f({\varvec{{\varPi }}})}{\lambda }$ resulting in

$$\begin{aligned} \begin{array}{lll} \nabla f({\varvec{{\varPi }}})= & {} -\left( \sum \limits _{k=1}^K\varvec{\alpha }_k{\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k(\bullet 1) {\varvec{A}}_k^{1}(1 \bullet ){\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k \varvec{r}_k, \ldots , \sum \limits _{k=1}^K\varvec{\alpha }_k{\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k(\bullet N) {\varvec{A}}_k^{N}(M \bullet ){\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k \varvec{r}_k \right) \end{array}\nonumber \\ \end{aligned}$$

(17)

and $g({\varvec{{\varPi }}}) = {\varvec{{\varPi }}}\mathbb {1}-\mathbb {1}$. The functions $f({\varvec{{\varPi }}}), f'({\varvec{{\varPi }}})$ and $g({\varvec{{\varPi }}})$ can be used as input for a non-linear optimization algorithm, and the inequality constraints are no longer needed. The resulting optimization problem has N simple linear constraints, but a complex objective function. As it turns out, it usually shows a much better convergence than the original QCLP, because we can start with a feasible solution, but again, there is no guaranteed convergence of general optimization algorithms applied to the problem formulation.

4.3 Step-wise computation of increasing lower bounds

The previous approaches are standard techniques to solve the resulting optimization problem which reach their limits when problems become larger. The general problem is non-convex and, therefore, hard to solve. However, we now present a step-wise approach which computes increasing lower bounds resulting in a local optimum. Thus, we provide problem-specific local optimization heuristics for the problem.

We first introduce the theoretical base before the new algorithm is introduced. Based on (15) we develop an iterative algorithm. The objective function is linear in $\varvec{h}_k$ and is uniquely determined by the policy ${\varvec{{\varPi }}}$ since $\varvec{h}_k^T = \varvec{\alpha }_k {\varvec{{\overline{C}}}}_k^{\varvec{{\varPi }}}$. The set of policies ${\mathcal {P}}$ is a convex set built by N distributions of order M. Let ${\mathcal {H}}$ be the set of all vectors $\varvec{h}= \left( \varvec{h}_1,\ldots ,\varvec{h}_K\right) ^T$ such that $\varvec{h}_k^T = \varvec{\alpha }_k {\varvec{{\overline{C}}}}_k^{\varvec{{\varPi }}}$ for some policy ${\varvec{{\varPi }}}$. Unfortunately, the set ${\mathcal {H}}$ does not need to be convex for $K > 1$ because for two vectors $\varvec{h},\varvec{h}'\in {\mathcal {H}}$ and some $\lambda \in ({0,1})$ the convex combination $\lambda \varvec{h}+(1-\lambda )\varvec{h}'$ does not have to belong to ${\mathcal {H}}$, even if it can be assured that in each scenario k, the convex combination $\lambda \varvec{h}_k + (1-\lambda )\varvec{h}'_k$ can be realized with some stationary policy. The reason is that matrix inversion that translates between matrices ${\varvec{C}}_k$ and ${\varvec{{\overline{C}}}}_k$ is non-linear in $\lambda $, hence, the required policies in the scenarios usually differ. We will clarify this property by means of small examples in Sect. 5.1. Thus, the best we can expect are local optima which will be computed next.

We now consider local modifications of a policy ${\varvec{{\varPi }}}$. Let $\varvec{\pi }_n(m) > 0, p \in \{1,\ldots ,M\}{\setminus }\{m\}$ and $\lambda \in [0,\varvec{\pi }_n(m)]$. ${\varvec{{\varPi }}}'$ is the policy which results from ${\varvec{{\varPi }}}$ by setting $\varvec{\pi }_n(m)$ to 0 and adding $\varvec{\pi }_n(m)$ to $\varvec{\pi }_n(p)$. It is easy to show that ${\varvec{{\varPi }}}'$ is a valid policy. Let $\varvec{u}_k = \gamma ({\varvec{P}}^{m}_k(n\bullet )-{\varvec{P}}^{p}_k(n\bullet ))$, then ${\varvec{C}}^{{\varvec{{\varPi }}}'}_k = {\varvec{C}}^{\varvec{{\varPi }}}_k-\varvec{\pi }_n(m)\varvec{e}_n^T \varvec{u}_k$ results from a rank-1 update. For the value of the goal function the following relation holds based on the modifications due to rank-1 updates in the inverse matrices (Hager 1989):

$$\begin{aligned} \begin{array}{lll} G^{{\varvec{{\varPi }}}',{\varvec{{\varPi }}}, \lambda }(\varvec{w}) &{} = \sum \limits _{k=1}^K \varvec{\alpha }_k {\varvec{{\overline{C}}}}_k^{{\varvec{{\varPi }}}',{\varvec{{\varPi }}},\lambda }\varvec{r}_k \\ &{} = G^{{\varvec{{\varPi }}}}(\varvec{w}) + \sum \limits _{k=1}^K \underbrace{\lambda \frac{\varvec{\alpha }_k{\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k(\bullet n) \varvec{u}_k{\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k\varvec{r}_k}{1-\lambda \varvec{u}_k {\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k(\bullet n)}}_{g_k(\lambda )}, \end{array} \end{aligned}$$

(18)

where the triple ${\varvec{{\varPi }}}',{\varvec{{\varPi }}},\lambda $ in $G^{{\varvec{{\varPi }}}',{\varvec{{\varPi }}},\lambda }$ and ${\varvec{{\overline{C}}}}_k^{{\varvec{{\varPi }}}', {\varvec{{\varPi }}}, \lambda }$ is a short notation for $\lambda {\varvec{{\varPi }}}'+(1-\lambda ){\varvec{{\varPi }}}$ (for $0 \le \lambda \le \varvec{\pi }_n(m)$). Now define

$$\begin{aligned} \zeta _k = \varvec{\alpha }_k{\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k(\bullet n) \varvec{u}_k{\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k\varvec{r}_k \text{ and } \eta _k = \varvec{u}_k {\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k(\bullet n). \end{aligned}$$

Both values can be negative and positive but $\eta _k < 1/\varvec{\pi }_n(m)$ holds because it is known that the inverse matrix exists and ${\varvec{C}}^{{\varvec{{\varPi }}}'}_k$ is an M-matrix. Therefore

$$\begin{aligned} g_k(\lambda ) = \frac{\lambda \zeta _k}{1-\lambda \eta _k} \end{aligned}$$

describes the change of the value vector in the kth component if the policy changes from ${\varvec{{\varPi }}}$ to $\lambda {\varvec{{\varPi }}}'+(1-\lambda ){\varvec{{\varPi }}}$. To optimize the gain, the following optimization problem has to be solved.

$$\begin{aligned} \max _{\lambda \in [0,\varvec{\pi }_n(m)]}\left( \sum _{k=1}^Kg_k(\lambda )\right) \end{aligned}$$

(19)

The first two derivatives of the functions are given by

$$\begin{aligned} g'_k(\lambda ) = \frac{\zeta _k}{(1-\lambda \eta _k)^2} \text{ and } g''_k(\lambda ) = \frac{2\zeta _k \eta _k}{(1-\lambda \eta _k)^3}. \end{aligned}$$

The first derivative shows that each function $g_k(\lambda )$ is absolutely monotonic (or constant). The optimum of (19) is either in one of the endpoints, i.e. $\lambda \in \{0, \varvec{\pi }_n(m)\}$, or

$$\begin{aligned} \sum _{k=1}^K\frac{\zeta _k}{(1-\lambda \eta _k)^2} = 0\quad \text{ and } \quad \sum _{k=1}^K\frac{2\zeta _k\eta _k}{(1-\lambda \eta _k)^3} < 0 \end{aligned}$$

has to hold. Since the first derivative is a polynomial of degree $2K-2$, the roots can be computed and checked for optimality.

Let $\lambda ^*$ be the maximum resulting from (19), then the new policy ${\varvec{{\varPi }}}'$ results from changing $\varvec{\pi }_n(m)$ to $\varvec{\pi }_n(m)-\lambda ^*$ and $\varvec{\pi }_n(p)$ to $\varvec{\pi }_m(p)+\lambda ^*$. The resulting value of the objective function is given by (18).

For some policy ${\varvec{{\varPi }}}\in {\mathcal {P}}$ we define

$$\begin{aligned} \begin{array}{ll} {\mathcal {N}}({\varvec{{\varPi }}}) = \left\{ {\varvec{{\varPi }}}' | {\varvec{{\varPi }}}' \in {\mathcal {P}}\wedge {\varvec{{\varPi }}}' \text{ results } \text{ from } {\varvec{{\varPi }}} \text{ by } \text{ a } \text{ rank-1 } \text{ update } \right\} \\ \text{ and } \\ {\mathcal {N}}_1({\varvec{{\varPi }}}) = \left\{ {\varvec{{\varPi }}}' | {\varvec{{\varPi }}}'={\varvec{{\varPi }}}- \lambda \varvec{e}^T_n(\varvec{e}_m - \varvec{e}_p) \text{ for } n\in {\mathcal {S}}, m,p \in {\mathcal {A}}, \lambda \in (0,\varvec{\pi }_n(m)] \right\} . \end{array} \end{aligned}$$

Obviously ${\mathcal {N}}_1({\varvec{{\varPi }}}) \subseteq {\mathcal {N}}({\varvec{{\varPi }}})$. ${\mathcal {N}}_1({\varvec{{\varPi }}})$ contains all policies that are considered in (18). The sets ${\mathcal {N}}({\varvec{{\varPi }}})$ and ${\mathcal {N}}_1({\varvec{{\varPi }}})$ can also be defined for policies from ${\mathcal {P}}_{pure}$ rather than ${\mathcal {P}}$.

Theorem 2

If $G^{\varvec{{\varPi }}}< G^{{\varvec{{\varPi }}}'}$ for some ${\varvec{{\varPi }}}' \in {\mathcal {N}}({\varvec{{\varPi }}})$, then a policy ${\varvec{{\varPi }}}'' \in {\mathcal {N}}_1({\varvec{{\varPi }}})$ exists such that $G^{{\varvec{{\varPi }}}}<G^{{\varvec{{\varPi }}}''}$.

Proof

Since ${\varvec{{\varPi }}}'$ results from ${\varvec{{\varPi }}}$ by a rank-one update and both matrices describe valid policies (i.e., they have unit row sums), they can only differ in one row. Assume that they differ in row n which implies that also ${\varvec{C}}_k^{\varvec{{\varPi }}}$ and ${\varvec{C}}_k^{{\varvec{{\varPi }}}'}$ differ also in row n. Let $\varvec{c}= {\varvec{{\varPi }}}(n\bullet ) - {\varvec{{\varPi }}}'(n\bullet )$, then it can be represented as $\varvec{c}= \sum _{h=1}^H \varvec{c}_h$ where $\varvec{c}_h \mathbb {1}= 0$ and $\varvec{c}_h$ contains only two non-zero elements such that ${\varvec{{\varPi }}}-\varvec{e}_n^T\varvec{c}_h \in {\mathcal {N}}_1({\varvec{{\varPi }}})$. Define $\varvec{u}_k^h = {\varvec{C}}_k^{{\varvec{{\varPi }}}} - {\varvec{C}}_k^{{\varvec{{\varPi }}}-\varvec{e}_n^T\varvec{c}_h}$, then ${\varvec{C}}_k^{{\varvec{{\varPi }}}'} = {\varvec{C}}_k^{{\varvec{{\varPi }}}} - \varvec{e}^T_n\sum _{h=1}^H \varvec{u}_k^h$. Then

$$\begin{aligned} \begin{array}{lll} G^{{\varvec{{\varPi }}}'}(\varvec{w}) &{} = G^{{\varvec{{\varPi }}}}(\varvec{w}) + \sum \limits _{k=1}^K \frac{\varvec{\alpha }_k{\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k(\bullet n) \left( \sum \nolimits _{h=1}^H\varvec{u}_k^h\right) {\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k\varvec{r}_k}{1-\varvec{u}_k {\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k(\bullet n)} \\ &{} = G^{{\varvec{{\varPi }}}}(\varvec{w}) + \sum \limits _{h=1}^H\sum \limits _{k=1}^K \frac{\varvec{\alpha }_k{\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k(\bullet n) \varvec{u}_k^h{\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k\varvec{r}_k}{1-\varvec{u}_k {\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k(\bullet n)} \end{array} \end{aligned}$$

The denominator is always positive because every matrix for a valid policy is an M-matrix with a non-negative inverse. Thus, it depends on the numerator whether the new policy has a smaller or larger gain. Since $G^{{\varvec{{\varPi }}}'}(\varvec{w})>G^{{\varvec{{\varPi }}}}(\varvec{w})$, for at least one h the numerator has to be positive which implies

$$\begin{aligned} \begin{array}{lll} G^{{\varvec{{\varPi }}}+\varvec{e}_n^T\varvec{c}_h}(\varvec{w})&= G^{{\varvec{{\varPi }}}}(\varvec{w}) + \sum \limits _{k=1}^K \frac{\varvec{\alpha }_k{\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k(\bullet n) \varvec{u}_k^h{\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k\varvec{r}_k}{1-\varvec{u}_k^h {\varvec{{\overline{C}}}}^{{\varvec{{\varPi }}}}_k(\bullet n)}&\ge G^{\varvec{{\varPi }}}(\varvec{w}) \end{array} \end{aligned}$$

The inequality holds because the denominator is again positive and the numerator is by assumption positive. Since $\varvec{c}_h$ contains only two non-zero elements ${\varvec{C}}-\varvec{e}_n^T\varvec{c}_h \in {\mathcal {N}}_1({\varvec{{\varPi }}})$ and $G^{{\varvec{{\varPi }}}+\varvec{e}_n^T\varvec{c}_h}(\varvec{w}) > G^{\varvec{{\varPi }}}(\varvec{w})$ which completes the proof. $\square $

The previous steps naturally define Algorithm 1, a local optimization algorithm which can be alternatively formulated as a policy iteration algorithm for the concurrent MDP. The algorithm computes general stationary policies. It will be shown by the following examples that it is often sufficient to compute pure policies. Algorithm 2 is an alternative version of the algorithm which searches for a locally optimal pure policy.

Theorem 3

Algorithm 1 converges towards a policy ${\varvec{{\varPi }}}\in {\mathcal {P}}$ which is locally optimal, i.e., $G^{\varvec{{\varPi }}}(\varvec{w})\ge G^{{\varvec{{\varPi }}}'}(\varvec{w})$ for all ${\varvec{{\varPi }}}' \in {\mathcal {N}}({\varvec{{\varPi }}})$.

Algorithm 2 converges towards a policy ${\varvec{{\varPi }}}\in {\mathcal {P}}_{pure}$ which is locally optimal, i.e., $G^{\varvec{{\varPi }}}(\varvec{w})\ge G^{{\varvec{{\varPi }}}'}(\varvec{w})$ for all ${\varvec{{\varPi }}}' \in {\mathcal {N}}({\varvec{{\varPi }}})\cap {\mathcal {P}}_{pure}$.

Proof

We consider Algorithm 1, the proof for Algorithm 2 is similar. The algorithm starts with a valid policy and transforms a valid policy into another valid policy (line 10) with a larger gain. In the nested for-loops (lines 7–9) all policies from the set ${\mathcal {N}}_1({\varvec{{\varPi }}})$ are analyzed, where ${\varvec{{\varPi }}}$ is the current policy under investigation. If a policy with a larger gain is found, then this policy is selected as current policy. The nested for loops assure that all policies from ${\mathcal {N}}_1$ are checked until the first one that improves the gain is found and the algorithm stops if no policy in ${\mathcal {N}}_1({\varvec{{\varPi }}})$ with larger gain exists. By Theorem 2 this implies that no policy with a larger gain exists in ${\mathcal {N}}({\varvec{{\varPi }}})$ and ${\varvec{{\varPi }}}$ is locally optimal. Thus, the algorithm generates an increasing sequence of gains and the corresponding policies and stops if a locally optimal policy is found. Since the optimal gain is bounded, convergence to a local optimum is guaranteed. $\square $

The worst case effort of a single policy improvement step in Algorithm 1 is in $O(KN^3M^2)$. The nested for-loops can test up to $NM^2$ candidates for a new policy and each test has an effort in $O(KN^2)$ to evaluated (18). An improvement step in Algorithm 2 has a worst case effort of $O(KN^3M)$ because the number of pure policies in the neighborhood is restricted to $N(M-1)$ for pure policies. The required number of improvements to reach a local optimum is in the worst case exponential in the parameters N, M and K. However, experiments indicate that only a small number of steps is required to reach a local optimum which is a similar behavior to that of the simplex algorithm for the solution of LPs.

4.4 Computation of upper bounds

Apart from the MILP solver, all algorithms compute only local optima which are lower bounds for the global optimum. If a MILP solver stops prematurely after finding at least one feasible solution, then it provides a lower bound resulting from the best feasible solution and an upper bound resulting from a relaxation of the problem. Now we present a similar approach by introducing the computation of upper bounds. These bounds can be combined with the local optimization approaches to compute lower bounds. The difference between both bounds is denoted as the optimality gap.

Bounds are computed from a relaxation of the NLP (9). Let $G^*(\varvec{w})$ be the optimal gain, ${\varvec{{\varPi }}}^*$ an optimal policy for this program and $\varvec{h}^*=(\varvec{h}^*_1,\ldots ,\varvec{h}^*_K)$ the optimal value vector. As usual we describe policy ${\varvec{{\varPi }}}^*$ by vectors $\varvec{\pi }_i^*$ for $i \in \{1,\ldots ,N\}$. Furthermore, we use the variables $y_k^m(n) = \varvec{\pi }_n(m)\varvec{h}_k(n)$. $\varvec{y}_k^T=((\varvec{y}_k^1)^T,\ldots ,(\varvec{y}_k^M)^T)$ with $\varvec{y}_k^m=(\varvec{y}_k^m(1),\ldots ,\varvec{y}_k^m(N))^T$ which have already been defined before. For the optimal policy, value vectors are denoted as $\varvec{y}_k^*$. The relaxed version of (9) becomes

$$\begin{aligned} \begin{array}{lll} \max _{\varvec{y}_1,\ldots \varvec{y}_K} \left( \sum _{k=1}^K(\varvec{r}_k)^T\left( \sum _{m=1}^M \varvec{y}_k^m\right) \right) \\ s.t. \left( \begin{array}{ccc} {\varvec{B}}_1 \\ &{} \ddots \\ &{} &{} {\varvec{B}}_K \end{array}\right) \left( \begin{array}{c} \varvec{y}_1\\ \vdots \\ \varvec{y}_K \end{array}\right) = \left( \begin{array}{c} \varvec{\alpha }_1\\ \vdots \\ \varvec{\alpha }_K \end{array}\right) , \ \varvec{y}_k \ge {\mathbf {0}} \text{ for } 1 \le k \le K. \end{array} \end{aligned}$$

(20)

Matrices ${\varvec{B}}_k$ are defined in (11). (20) defines an LP which is a relaxation of the NLP because the relation between $y_k^m(n), \varvec{h}_k(n)$ and $\varvec{\pi }_n(m)$ is neglected. The optimal solution of (20) corresponds to the optimal solutions for the MDPs $1,\ldots ,K$. To obtain stronger relations, additional constraints have to be added. Assume that bounds $h_k^-(n) \le h_k^*(n) \le h_k^+(n)$ are known. Then the following set of constraints can be added for $k \in \{1,\ldots ,K\}, n \in \{1,\ldots ,N\}, m \in \{1,\ldots ,M\}$ (Qualizza et al. 2012):

$$\begin{aligned} \varvec{h}_k^+(n) \varvec{\pi }_n(m) - \varvec{y}_k^m(n) \ge 0, ~~&(1-\varvec{\pi }_n(m))\varvec{h}_k^+(n) - \sum _{r=1,r \ne m}^M \varvec{y}_k^r(n) \ge 0 \nonumber \\ \varvec{h}_k^-(n) \varvec{\pi }_n(m) - \varvec{y}_k^m(n) \le 0,&(1-\varvec{\pi }_n(m))\varvec{h}_k^-(n) - \sum _{r=1,r \ne m}^M \varvec{y}_k^r(n) \le 0 \nonumber \\ \sum _{m=1}^M \varvec{\pi }_n(m) = 1,&\varvec{\pi }_n(m) \ge 0 \end{aligned}$$

(21)

Experimental evaluations show that it is often sufficient to consider only the first bound in each of the first two rows. The other two bounds have no or only a minor effect.

Theorem 4

If $\varvec{h}_k^+(n) = \varvec{h}_k^*(n)$ for all $k \in \{1,\ldots ,K\}, n \in \{1,\ldots ,N\}, m \in \{1,\ldots ,M\}$, then the solution of the LP defined by (20) with the additional constraints (21) equals $G^*(\varvec{w})$ and the vectors $\varvec{y}_k^*$ define an optimal policy with $\varvec{\pi }^*_n(m) = \varvec{y}_k^{m*}(n) / \sum _{r=1}^M \varvec{y}_k^{r*}(n)$.

Proof

$\varvec{\pi }^*$ and $\varvec{y}_k^{m*} = \varvec{h}_k^*(n)\varvec{\pi }_n(m)$ fulfill the constraints and define a feasible solution for the LP. Now assume that another feasible solution $\varvec{\pi }_n, \varvec{h}_k$ and $\varvec{y}_k$ exists which results in a larger gain. Then

$$\begin{aligned} \sum _{k=1}^K\sum _{n=1}^N\varvec{r}_k(n) \sum _{m=1}^M\varvec{y}_k^m(n) > \sum _{k=1}^K\sum _{n=1}^N\varvec{r}_k(n) \sum _{m=1}^M\varvec{y}_k^{m*}(n) \end{aligned}$$

which implies for some k, m

$$\begin{aligned} \sum _{m=1}^M\varvec{y}_k^m(n) > \sum _{m=1}^M\varvec{y}_k^{m*}(n) = \varvec{h}_k^*. \end{aligned}$$

However, due to the additional constraints we have

$$\begin{aligned} \sum _{m=1}^M \varvec{h}_k^*(n)\varvec{\pi }_n(m) - \sum _{m=1}^M \varvec{y}_k^m(n) \ge 0 \ \Rightarrow \ \sum _{m=1}^M \varvec{y}_k^{m*}(n) \ge \sum _{m=1}^M \varvec{y}_k^m(n) \end{aligned}$$

which implies that $\varvec{\pi }_n$ cannot exist. $\square $

Theorem 5

If $\varvec{h}_k^+(n) \ge \varvec{h}_k^*(n)$ for all $k \in \{1,\ldots ,K\}, n \in \{1,\ldots ,N\}, m \in \{1,\ldots ,M\}$, then the solution of the LP defined by (20) with the additional constraints (21) is an upper bound for $G^*(\varvec{w})$.

Proof

Since $\varvec{\pi }^*$ and $\varvec{y}_k^{m*} = \varvec{h}_k^*(n)\varvec{\pi }_n(m)$ fulfill the constraints and define a feasible solution for the LP, the optimal solution of the LP has to be at least as large as the gain resulting from this feasible solution. $\square $

Simple bounds for $\varvec{h}_k^*(n)$ are $\varvec{h}_k^-(n) = 0$ and $\varvec{h}_k^+(n) = (1-\gamma )^{-1}$. To compute tighter bounds, we first notice that $\varvec{h}_k(n)$ equals the discounted number of visits in state n of MDP k, if the process starts with distribution $\varvec{\alpha }$. Bounds can then be computed from the following LP problem.

$$\begin{aligned} \begin{array}{ll} \varvec{h}_k^-(n) = \varvec{w}(k) \min \limits _{\varvec{y}_k^1,\ldots ,\varvec{y}_k^M} \sum \limits _{m=1}^M \varvec{y}_k^m(n) \text{ and } \varvec{h}_k^+(n) = \varvec{w}(k) \max \limits _{\varvec{y}_k^1,\ldots ,\varvec{y}_k^M} \sum \limits _{m=1}^M \varvec{y}_k^m(n) \\ \text{ s.t. } \left( \begin{array}{ccc} \left( {\varvec{I}}- \gamma {\varvec{P}}_k^1\right) ^T &{} \cdots &{} \left( {\varvec{I}}- \gamma {\varvec{P}}_k^M\right) ^T \end{array}\right) \left( \begin{array}{c} \varvec{y}_k^1 \\ \vdots \\ \varvec{y}_k^M \end{array}\right) \ge \varvec{\alpha }_k \end{array} \end{aligned}$$

(22)

The bounds might be further improved if a solution from one of the local optimization algorithms presented in the previous paragraphs is available. Let ${\varvec{{\varPi }}}^L$ be the policy and $\varvec{h}_k^L$ the value vector. Then define $G^L(\varvec{w}) =\sum _{k=1}^K (\varvec{r}_k)^T\varvec{h}_k^L$. Since the upper bound is at least as large as the best known solution, the following constraint can be added to the LP problem (22).

$$\begin{aligned} (\varvec{r}_k)^T\varvec{h}_k \ge G^L(\varvec{w}) - \sum _{l=1, l \ne k}^K G_k^*(\varvec{w}) \end{aligned}$$

(23)

where $G_k^*(\varvec{w}) = \max _{{\varvec{{\varPi }}}\in {\mathcal {P}}} \left( G_k^{\varvec{{\varPi }}}(\varvec{w})\right) $.

For the computation of an upper bound starting from an available locally optimal solution, first the bounds $\varvec{h}_k^\pm $ have to be computed. This requires the solution of $2KN+1$ LP problems with NM inequality constraints. Alternatively, the trivial bounds, 0 and $(1-\gamma )^{-1}$, can be used. Finally, an LP with $KNM+NM$ variables, $KN + 4KNM$ inequality and N equality constraints has to be solved. Since the problems have a very regular structure, large instances can be solved with current LP solvers.

Let $G^U(\varvec{w})$ be the solution of the LP problem (20) with the additional constraints (21). The difference $G^U(\varvec{w})-G^L(\varvec{w})$ defines the optimality gap between the best known solution and the theoretical upper bound. To reduce the gap, additional cutting planes have to be computed which can be done from appropriate LP problems. Observe that the constraints in (20) and (21) are independent of the rewards and it is known that for any subsets ${\widehat{{\mathcal {M}}}} \subseteq {\mathcal {M}}$ and ${\widehat{{\mathcal {S}}}}_k \subseteq {\mathcal {S}}$ for $k \in {\widehat{{\mathcal {M}}}}$ the following relation holds.

$$\begin{aligned} \sum _{k \in {\widehat{{\mathcal {M}}}}}\sum _{i \in {\widehat{{\mathcal {S}}}}_k}\varvec{h}_k^+(i) \ge \sum _{k \in {\widehat{{\mathcal {M}}}}}\sum _{i \in {\widehat{{\mathcal {S}}}}_k} \sum _{m \in {\mathcal {A}}} \varvec{y}_k^{*m}(i) \ge \sum _{k \in {\widehat{{\mathcal {M}}}}}\sum _{i \in {\widehat{{\mathcal {S}}}}_k}\varvec{h}_k^-(i) \end{aligned}$$

(24)

We can define the following LP to obtain better bounds.

$$\begin{aligned} \begin{array}{ll} v^-/v^+ = \min {\!/\!}\max \sum _{k \in {\widehat{{\mathcal {M}}}}}\sum _{i \in {\widehat{{\mathcal {S}}}}_k} \sum _{m \in {\mathcal {A}}} \varvec{y}_k^{m}(i)\\ \text{ s.t. } \text{ the } \text{ constraints } \text{ defined } \text{ in } \text{(20), } \text{(21) } \text{ and } \sum \limits _{k \in {\mathcal {M}}}(\varvec{r}_k^T)^T\sum \limits _{m \in {\mathcal {A}}} \varvec{y}_k^m \ge G^L(\varvec{w}) \end{array} \end{aligned}$$

(25)

If $v^- > \sum _{k \in {\widehat{{\mathcal {M}}}}}\sum _{i \in {\widehat{{\mathcal {S}}}}_k}\varvec{h}^-_k(i)$ or $v^+ < \sum _{k \in {\widehat{{\mathcal {M}}}}}\sum _{i \in {\widehat{{\mathcal {S}}}}_k}\varvec{h}^+_k(i)$, the constraints

$$\begin{aligned} \begin{array}{ll} v^- \le \sum \limits _{k \in {\widehat{{\mathcal {M}}}}}\sum \limits _{i \in {\widehat{{\mathcal {S}}}}_k} \sum \limits _{m \in {\mathcal {A}}} \varvec{y}_k^{*m}(i)&~ ~ v^+ \ge \sum \limits _{k \in {\widehat{{\mathcal {M}}}}}\sum \limits _{i \in {\widehat{{\mathcal {S}}}}_k} \sum \limits _{m \in {\mathcal {A}}} \varvec{y}_k^{*m}(i) \end{array} \end{aligned}$$

can be added to the LP. If we choose in (24) a single MDP k and a single state i, then the objective function of (25) becomes

$$\begin{aligned} v^-/v^+ = \min {\!/\!}\max \sum \limits _{m \in {\mathcal {A}}} \varvec{y}_k^{m}(i). \end{aligned}$$

(26)

Then $\varvec{h}^+_k(i) = v^+$ and $\varvec{h}^-_k(i) = v^-$ can be used as better bounds in the original LP. This step does not increase the size of the LP because only the spread of the bounds is reduced.

Algorithm 3 uses an iterative approach that tries to improve the bounds $\varvec{h}_k^+(i)$ or $\varvec{h}_k^-(i)$ with the largest difference to $\varvec{h}_k^m(n)$. It is not guaranteed that the gap between upper and lower bound can be closed by reducing upper and increasing lower bounds of single variables only. Therefore, the algorithm should be stopped if no progress is made or if the number of iterations reaches a threshold. In this case more complex cutting planes as defined in (25) have to be used. However, the number of potential cuts grows then with NK! and it is hard to decide in advance which of more complex cuts results in a successful reduction of the optimality gap. Experience with a large number of models shows that Algorithm 3 usually reduces the optimality gap, without completely closing it.

5 Examples

We analyze the algorithms on different examples. All experiments were run on a PC with an Intel 3.6 GHz processor and 16 GB main memory. Algorithms are implemented in MATLAB where the solvers fmincon and intlinproc are used.^{Footnote 2} We use the following abbreviations for the algorithms.

ILP for the mixed integer linear programming approach presented in Sect. 4.1.
QCLP for the non-linear programming approach presented in Sect. 4.2 that solves the quadratically constrained linear program.
NLP for the non-linear programming approach presented in Sect. 4.2 solving the non-linear program with the general objective function.
IMGP for Algorithm 1.
IMPP for Algorithm 2.
Up-simple for the upper bound with the local bound $(1-\gamma )^{-1}$.
Up-general for the upper bound with local bound computed from (22).

We first consider two small non-convex problem instances, then randomly generated problem instances are evaluated and afterwards a control problem for queues is presented.

5.1 Small examples with an optimal random policies

Although the general problem is non-convex as shown, most instances of the problem have a fairly regular surface such that local search algorithms converge to global optimum and often the best pure policy is optimal or almost optimal. We show this in the following paragraph where randomly generated models are analyzed. Here, we present two small examples which have local optima.

We begin with a set of deterministic MDPs with two states and two actions. The following matrices and vectors define the MDPs.

$$\begin{aligned} \begin{array}{llll} {\varvec{P}}^a_1 = \left( \begin{array}{cc} 0 &{} 1 \\ 1 &{} 0 \end{array}\right) , &{}\quad {\varvec{P}}^b_1 = \left( \begin{array}{cc} 1 &{} 0 \\ 0 &{} 1 \end{array}\right) , &{}\quad \varvec{r}_1 = \left( \begin{array}{c} 2/3 \\ 1/3 \end{array}\right) \\ {\varvec{P}}^a_2 = \left( \begin{array}{cc} 0 &{} 1 \\ 0 &{} 1 \end{array}\right) , &{}\quad {\varvec{P}}^b_2 = \left( \begin{array}{cc} 0 &{} 1 \\ 1 &{} 0 \end{array}\right) , &{}\quad \varvec{r}_2 = \left( \begin{array}{c} 4/5 \\ 1/5 \end{array}\right) \\ {\varvec{P}}^a_3 = \left( \begin{array}{cc} 0 &{} 1 \\ 0 &{} 1 \end{array}\right) , &{}\quad {\varvec{P}}^b_3 = \left( \begin{array}{cc} 0 &{} 1 \\ 0 &{} 1 \end{array}\right) , &{}\quad \varvec{r}_3 = \left( \begin{array}{c} 2/5 \\ 3/5 \end{array}\right) \end{array} \end{aligned}$$

For $\varvec{p}= (0.5, 0.5)$, the optimal pure policy equals

$$\begin{aligned} {\varvec{{\varPi }}}_{{ det}} = \left( \begin{array}{cc} 0 &{}\quad 1 \\ 0 &{}\quad 1 \end{array}\right) \end{aligned}$$

with gain $G^{{\varvec{{\varPi }}}_{{ det}}}=4.7$ and the optimal randomized policy is

$$\begin{aligned} {\varvec{{\varPi }}}_{{ rand}} = \left( \begin{array}{cc} 0 &{}\quad 1 \\ 0.2261 &{}\quad 0.7739 \end{array}\right) \end{aligned}$$

with gain $G^{{\varvec{{\varPi }}}_{{ rand}}}=4.9611$. Figure 1 shows the gain which is obtained if in state 1 action b is selected and in state 2 action a is selected with probability $p_2$ ranging from 0 through 1. It can be seen that two local optima exist, namely, the global optimum at $p_2 = 0.2261$ and another local optimum for $p_2 = 1$.

As a second example we consider a model with 4 MDPs with 3 states and 2 actions which is characterized by the following matrices and vectors.

$$\begin{aligned} \begin{array}{llll} {\varvec{P}}_1^a = \left( \begin{array}{c@{\quad }c@{\quad }c} 0 &{} 1 &{} 0\\ 0 &{} 0 &{} 1 \\ 0 &{} 0 &{} 1 \end{array} \right) , &{} {\varvec{P}}_1^b = \left( \begin{array}{c@{\quad }c@{\quad }c} 0 &{} 1 &{} 0 \\ 0 &{} 1 &{} 0\\ 0 &{} 1 &{} 0 \end{array}\right) , &{} \varvec{r}_1 = \left( \begin{array}{c} 7/16 \\ 1/2 \\ 1/16 \end{array}\right) \\ {\varvec{P}}_2^a = \left( \begin{array}{c@{\quad }c@{\quad }c} 0 &{} 0 &{} 1 \\ 0 &{} 0 &{} 1 \\ 0 &{} 1 &{} 0 \end{array} \right) , &{} {\varvec{P}}_2^b = \left( \begin{array}{c@{\quad }c@{\quad }c} 1 &{} 0 &{} 0 \\ 0 &{} 1 &{} 0 \\ 0 &{} 0 &{} 1 \end{array}\right) , &{} \varvec{r}_2 = \left( \begin{array}{c} 1/4 \\ 1/4 \\ 1/2 \end{array}\right) \\ {\varvec{P}}_3^a = \left( \begin{array}{c@{\quad }c@{\quad }c} 0 &{} 0 &{} 1\\ 0 &{} 0 &{} 1\\ 1 &{} 0 &{} 0 \end{array} \right) , &{} {\varvec{P}}_3^b = \left( \begin{array}{c@{\quad }c@{\quad }c} 0 &{} 1 &{} 0 \\ 1 &{} 0 &{} 0\\ 1 &{} 0 &{} 0 \end{array}\right) , &{} \varvec{r}_3 = \left( \begin{array}{c} 1/9 \\ 2/3 \\ 2/9 \end{array}\right) \\ {\varvec{P}}_4^a = \left( \begin{array}{c@{\quad }c@{\quad }c} 1 &{} 0 &{} 0\\ 0 &{} 1 &{} 0 \\ 0 &{} 1 &{} 0 \end{array} \right) , &{} {\varvec{P}}_4^b = \left( \begin{array}{c@{\quad }c@{\quad }c} 0 &{} 0 &{} 1\\ 0 &{} 0 &{} 1\\ 0 &{} 0 &{} 1 \end{array}\right) , &{} \varvec{r}_4 = \left( \begin{array}{c} 3/8 \\ 3/8 \\ 1/4 \end{array}\right) \end{array} \end{aligned}$$

The optimal deterministic policy equals for this example

$$\begin{aligned} {\varvec{{\varPi }}}_{{ det}} = \left( \begin{array}{c@{\quad }c} 0 &{} 1\\ 0 &{} 1\\ 0 &{} 1 \end{array}\right) \end{aligned}$$

with gain $G^{{\varvec{{\varPi }}}_{{ det}}} = 3.5208$ and the optimal randomized policy is

$$\begin{aligned} {\varvec{{\varPi }}}_{{ rand}} = \left( \begin{array}{c@{\quad }c} 0.1778 &{} 0.8223\\ 0.056 &{} 0.944\\ 0 &{} 1 \end{array}\right) \end{aligned}$$

and gain $G^{{\varvec{{\varPi }}}_{{ rand}}} = 3.623$. The surface of the gain function for different values $p_1$ (probability of choosing a in state 1) and $p_2$ (probability of choosing a in state 2), and selecting b in state 3 is shown in Fig. 2. The surface has four local optima. The deterministic policy with $p_1 = p_2 = 1$ with gain 3.227 is locally optimal, the randomized policies with $p_1=0.18, p_2=1$ and gain 3.4 and with $p_1=1, p_2=0.06$ and gain 3.451 are also locally optimal. Finally, the global optimum at $p_1 = 0.1778$ and $p_2 = 0.056$ is, of course, also locally optimal.

5.2 Random matrices

The five algorithms are compared on sets of randomly generated problems. After fixing K, M, N and $\gamma $, stochastic matrices ${\varvec{P}}^m_k$ are generated from uniformly [0, 1] distributed random values, vectors $\varvec{w}$ and $\varvec{\alpha }$ are also generated from [0, 1] uniformly distributed numbers. To obtain distributions, the rows of ${\varvec{P}}^m_k$ and the vectors $\varvec{w}, \varvec{\alpha }$ are normalized. Vectors $\varvec{r}_k$ are generated from uniformly [0, 10] distributed numbers.

For the local optimization algorithms, we consider one initial policy without restarts; for IMPP the starting point is the policy which deterministically chooses the first action in all states, and for IMGP we consider the optimal policy for the Kth MDP as the starting point.

The first set of experiments that is presented here is generated for $\gamma =0.9$ and varying values of K, M, N. Results are presented in Table 1. For each configuration the five algorithms run on 100 randomly generated problem instances. ti is the average CPU time per instance, and diff is the average deviation from the best value considering all runs where the deviation is at least 0.1%. div contains the number of problem instances where the algorithm was not able to find a feasible solution. This cannot happen in the algorithms IMPP and IMGP where all solutions are feasible, the algorithms improve policies step by step. In ILP it could happen that no feasible solution is found by the ILP solver, if the algorithm does not find an integer solution. However, this did not happen in our examples and it also did not happen for NLP, therefore div is only shown for QCLP.

The algorithms QCLP and NLP perform at most 100,000 function evaluations which is much more than the predefined value in MATLAB, other solver parameters are not modified. There are three horizontal lines in the table. After the first line, the time for the ILP solver is limited to 600 s CPU time to avoid very long solution runs. This also implies that the solver stops prematurely with a non-optimal solution. However, the results indicate that the difference to the optimal policies is small in theses cases. After the second horizontal line we do not use QCLP any longer since it is too costly and not able to compute feasible solutions for large problem instances. After the third line also NLP was no longer used because it became too costly. For the largest configuration only IMPP and IMGP can be used. In this case even ILP solvers failed because they were not able to find an integer solution within the given time budget of 600 s, this holds for the ILP solver in MATLAB as well as for CPLEX, which is one of the most efficient available ILP solvers. Analyzing the different runs we can make the following observations:

Even if the problem is not convex and the optimum needs not to be a pure policy, the optimal pure policy is usually very close to the computed optimum and in most cases the computed pure policy is optimal.
If the problem instances become larger, then, in all cases we considered, the optimal policy that was found is pure.
The local optimization algorithms IMPP and IMGP are fairly robust and fast. ILP is slightly less stable and requires much time for larger instances.
Among the non-linear solvers NLP is much more efficient and reliable than QCLP. However, for large problem instances also NLP fails.

In a second series of experiments we increase $\gamma $ to 0.999. The corresponding results are shown in Table 2. It can be seen that the results are very similar, the choice of $\gamma $ seems to have only a minor effect on the behavior of the solvers. The only exception is the solver QCLP which almost completely fails because it was not able to compute feasible solutions most of the times.

A third series of experiments concerned itself with deterministic concurrent MDP models. The discount factor was set to $\gamma = 0.9$, and the instances that were generated had the property that the transition probabilities in the individual MDPs were either one or zero. We observe that deterministic models are harder to optimize with exact methods: the ILP solver exceeds the time budget on smaller models, QCLP has much more difficulties to converge, and the NLP solver yields more often only locally optimal policies. In contrast, performance of local optimization heuristics does not suffer much in comparison to exact methods. We believe that this effect is due to larger differences between deterministic models, that is, the difference between transition probability vectors is not only large across different actions in a state but also across the MDPs in the concurrent model. On larger instances where only local heuristics can be executed efficiently, we observe another interesting effect: if IMPP and IMGP both use the maximal time budget, the resulting policy of IMPP can be slightly better. We conjecture that this can be attributed to, first, smaller steps by IMGP in policy space, and second, larger structural differences between the MDPs in the concurrent model which introduce more local optima to the optimization problem (Table 3).

Table 1 Summary of results for randomly generated instances ($\gamma = 0.9$, 100 problem instances)

Full size table

Table 2 Summary of results for randomly generated instances ($\gamma = 0.999$, 100 problem instances)

Full size table

Table 3 Summary of results for randomly generated deterministic MDPs ($\gamma = 0.9$, 100 problem instances)

Full size table

In another series of experiments we compare the upper bounds with the computed results for IMPP and IMGP. Again randomly generated examples are used. Table 4 shows the maximal and the average relative difference between the upper bounds and the policies found by the algorithms. For the iterative improvement, up to 100 iterations are allowed to improve lower and upper bounds for the values in vector $\varvec{h}_k$. It can be seen that the average differences are small but in a few cases a larger gap occurs. However, these large gaps only occur for small problem instances and the difference between lower bound computed with IMGP and the upper bound is at most 3.1% in all cases and after iterative improvement, it is even smaller. For large models, the maximal deviation from the upper bound is below 1% which is almost negligible.

Table 4 Difference between the bounds and the optimal policies for general MDPs found by IMPP and IMGP ($\gamma = 0.9, 100$ problem instances)

Full size table

5.3 A multi-server multi-queue model

To exemplify the differences between the individual scenarios, we provide a more involved example. We consider a simple queue with a capacity of m jobs and c servers. Inter-arrival and service times are exponentially distributed with rate $\lambda $ and $\mu $, respectively. Customers arriving at a system where the queue is full are lost. Each server can be in one of three states on, starting, and off. A server in state on can be idle or working, depending whether customers to serve are available or not. To reduce energy consumption, servers can be switched off, which implies that a server changes its state from on to off. To reduce the population in the system and the response time for customers, servers in state off can be switched on, which means that a server changes its state from off to starting. Afterwards it takes an exponentially distributed time with rate $\nu $ until it changes the state to on and is ready to serve customers. Decisions to switch servers on or off can be made upon the arrival or departure of jobs. Models like this one may be used as abstract models for server farms (Gandhi et al. 2010). The reward is defined by the sum of the population in the system and the required energy. This measure, which is often used to combine performance and energy consumption (Wierman et al. 2012), is minimized. We observe that the methods defined above can be used for minimization of the objective function if the reward vectors are negated and a constant is added to ensure non-negative rewards.

Usually, these models are analyzed under some heuristic control strategy by assuming that all parameters are completely known. Here, we consider the situation that some parameters are uncertain and we analyze an optimal strategy using MDPs. We assume that the uncertain parameters keep their values for some time which is too short to anticipate them in an adaptive control strategy.

The model can be naturally defined as an MDP with $(m+1)(c+2)(c+1)/2$ states. The number of decisions depends on the state. To restrict the number of decisions we assume that in a state at most 2 servers can be switched on or off which implies that up to 5 decisions are available in a state.

As an example we consider a system with $\mu =1, \nu =0.1, m=9, c=4$ and a uniformly distributed initial vector. The energy consumption in the states off, starting and on equals 0.1, 0.8 and 1.0, respectively. The resulting MDP has 150 states. $\gamma $ is set to 0.999. We assume that the system may be used in low load $\lambda _{\text {low}}=0.05$, medium load $\lambda _{\text {med}} = 0.5$ and high load $\lambda _{\text {high}} = 0.95$. The three rates define three different scenarios. The model describes a continuous time MDP. However, by using uniformization (Serfozo 1979) it can be transformed into an equivalent discrete time MDP.

We compute optimal policies for the three scenarios. These policies can be computed from a simple MDP and can be computed in less than a second. Furthermore, policies ${\varvec{{\varPi }}}_{{ weight}}$ are computed for different weight vectors. For ${\varvec{{\varPi }}}_{{ weight}}$IMPP takes about 2 s, IMGP takes about 8 s. ILP requires the complete time budget of 600 s, NLP takes 200 through 300 s, if it converges, and QCLP does most times not converge. The policies for the single scenarios can be expressed by the weight vectors (1, 0, 0), (0, 1, 0) and (0, 0, 1), respectively.

The different policies are then analyzed for the three scenarios and the accumulated gains are computed. Results are summarized in Table 5. The first column contains the weight vector for which the policy is computed and the following three columns include the gain for the scenarios when the policy is applied. In the last column the sums of gains over all scenarios are shown.

Table 5 Rewards for different weight vectors and general policies in the three scenarios

Full size table

Results in Table 5 have been computed with algorithm IMGP. To speed up convergence of the algorithm, policy computation for one weight vector is initialized with the optimal policy for the previous weight vector, where policies are computed according to the order in the table. Apart from the policies for the unit weight vectors most policies are not pure. E.g., for $\varvec{w} = (1/3, 1/3, 1/3)$ the best pure policy results in the gains 2176, 6244, and 9507 for the three scenarios which sums up to 17,926. Thus, according to the sum of gains, a general policy, with a gain of 17,749, improves the result by 177 units or 1% compared to a pure policy. Yet the differences in the gains for the different scenarios are significant. It can be seen that the weighted policies are better if the scenario changes during policy execution and the policy is fixed. Of course, the optimal policy depends on the probability of observing each scenario. With equal probabilities, the weight vector (1 / 3, 1 / 3, 1 / 3) results in the best policy which improves the gain between 7.4 and 17.8% compared to the policies that have been computed for single scenarios.

6 Conclusions

In this paper, we discuss a stochastic multi-scenario optimization problem for MDPs, which amounts to optimizing a weighted sum of expected discounted rewards in several MDPs with shared action and state spaces under a common policy. It turns out that this specific problem is computationally hard, which motivates the usage of general tools from mathematical optimization and heuristics. A further property of our problem is that the set of deterministic policies may not contain the optimal policy.

In this work, we have designed and compared several optimization approaches to the aforementioned problem, namely, mixed-integer programming, general non-linear optimization, quadratically constrained linear programming, and local search heuristics. It turns out that from the exact methods used, mixed-integer programming and the NLP formulation yield good and often optimal results for small problem instances. However, the methods are limited in their performance and take a significant amount of time on instances with more than 300 states.

On the other hand, the local search heuristics that we have designed seem to perform well not only with respect to time complexity but also with respect to the quality of solutions. Specifically, the local search heuristics showed an acceptable performance even on instances with 1000 states. The deviation from the optimal solutions also has been small: the largest deviation was around 0.4%; for larger instances the difference dropped to almost zero.

A further investigation direction was the difference between randomized and deterministic policies. Here, a similar picture can be seen: the largest deviation occurred on small instances and was 3.7%, for larger instances we could observe smaller deviations of at most 0.6%.

This allows us to conclude that, while the general problem is computationally hard, heuristics seem to perform fairly well. Furthermore, even with stochastic policies being optimal in the general case, restricting oneself to deterministic policies does not seem to impose great limitations with respect to the policy performance.

Future work We think of several possible extensions of this work. First, one may extend the results from the expected discounted reward measure to the expected average reward measure. Furthermore, the model may be extended by replacing MDPs with more general formalisms such as BMDPs (Givan et al. 2000), generic MDPs with uncertainties (Satia and Lave 1973; White and Eldeib 1994), or stochastic games (Filar and Vrieze 1997). More specifically, our results can be applied as a discretization method for continuous uncertainty sets with a probability measure on them. Furthermore, it is possible to extend the methods to the finite horizon case and compare the results with those derived in Steimle et al. (2018).

Another possibility lies in considering more general, not necessarily stationary policies to reduce the uncertainty on the scenario. It might be possible to infer the transition probabilities after a limited amount of time steps and use then a policy that is better suited for the specific scenario.

Notes

This corresponds also to the corresponding slack variable being zero.
Alternatively, an Octave implementation with solvers sqp and glpk is also available and shows a similar performance.

References

Amato C, Bernstein DS, Zilberstein S (2007) Solving POMDPs using quadratically constrained linear programs. In: Proceedings of the 20th international joint conference on artificial intelligence, IJCAI 2007. Hyderabad, India, January 6–12, 2007, pp 2418–2424
Berman A, Plemmons RJ (1994) Nonnegative matrices in the mathematical sciences. Classics in applied mathematics. SIAM, Philadelphia
Book MATH Google Scholar
Bertsimas D, Mišić VV (2017) Robust product line design. Oper Res 65(1):19–37
Article MathSciNet MATH Google Scholar
Bertsimas D, Silberholz J, Trikalinos T (2016) Optimal healthcare decision making under multiple mathematical models: application in prostate cancer screening. Health Care Manag Sci 21:105–118
Article Google Scholar
Björklund H, Vorobyov S (2007) A combinatorial strongly subexponential strategy improvement algorithm for mean payoff games. Discrete Appl Math 155(2):210–229. https://doi.org/10.1016/j.dam.2006.04.029
Article MathSciNet MATH Google Scholar
Caro F, Das-Gupta A (2015) Robust control of the multi-armed bandit problem. Ann Oper Res. https://doi.org/10.1007/s10479-015-1965-7
Castillo AC, Castro PM, Mahalec V (2018) Global optimization of MIQCPs with dynamic piecewise relaxations. J Glob Optim 71(4):691–716. https://doi.org/10.1007/s10898-018-0612-7
Article MathSciNet MATH Google Scholar
Colvin M, Maravelias CT (2010) Modeling methods and a branch and cut algorithm for pharmaceutical clinical trial planning using stochastic programming. Eur J Oper Res 203(1):205–215
Article MATH Google Scholar
d’Epenoux F (1963) A probabilistic production and inventory problem. Manag Sci 10(1):98–108. https://doi.org/10.1287/mnsc.10.1.98
Article Google Scholar
Dupacová J, Consigli G, Wallace SW (2000) Scenarios for multistage stochastic programs. Ann Oper Res 100(1–4):25–53. https://doi.org/10.1023/A:1019206915174
Article MathSciNet MATH Google Scholar
Ehrgott M (2005) Multicriteria optimization, 2nd edn. Springer, Berlin. https://doi.org/10.1007/3-540-27659-9
Book MATH Google Scholar
Feinberg EA, Schwartz A (eds) (2002) Handbook of Markov decision processes. Kluwer, Boston
MATH Google Scholar
Filar J, Vrieze K (1997) Competitive Markov decision processes. Springer, New York
MATH Google Scholar
Gandhi A, Gupta V, Harchol-Balter M, Kozuch MA (2010) Optimality analysis of energy-performance trade-off for server farm management. Perform Eval 67(11):1155–1171
Article Google Scholar
Garey MR, Johnson DS (1978) Computers and intractability: a guide to the theory of NP-completeness. Freeman, San Francisco
MATH Google Scholar
Givan R, Leach SM, Dean TL (2000) Bounded-parameter Markov decision processes. Artif Intell 122(1–2):71–109
Article MathSciNet MATH Google Scholar
Hager WW (1989) Updating the inverse of a matrix. SIAM Rev 31(2):221–239
Article MathSciNet MATH Google Scholar
Iyengar GN (2005) Robust dynamic programming. Math Oper Res 30(2):257–280
Article MathSciNet MATH Google Scholar
Kaelbling LP, Littman ML, Cassandra AR (1998) Planning and acting in partially observable stochastic domains. Artif Intell 101(1–2):99–134
Article MathSciNet MATH Google Scholar
Klamroth K, Köbis E, Schöbel A, Tammer C (2013) A unified approach for different concepts of robustness and stochastic programming via non-linear scalarizing functionals. Optimization 62(5):649–671
Article MathSciNet MATH Google Scholar
Mercier L, Hentenryck PV (2008) Amsaa: a multistep anticipatory algorithm for online stochastic combinatorial optimization. In: Perron L, Trick MA (eds) Integration of AI and OR techniques in constraint programming for combinatorial optimization problems, 5th international conference, CPAIOR 2008, Paris, France, May 20–23, 2008, Proceedings. Lecture Notes in Computer Science, vol 5015, pp 173–187. Springer
Nesterov Y, Nemirovskii A (1994) Interior-point polynomial algorithms in convex programming. Society for Industrial and Applied Mathematics, Philadelphia
Book MATH Google Scholar
Nilim A, Ghaoui LE (2005) Robust control of Markov decision processes with uncertain transition matrices. Oper Res 53(5):780–798
Article MathSciNet MATH Google Scholar
Papadimitriou CH, Tsitsiklis JN (1987) The complexity of Markov decision processes. Math Oper Res 12(3):441–450
Article MathSciNet MATH Google Scholar
Park J, Boyd S (2017) Heuristics for nonconvex quadratically constrained quadratic programming. CoRR arXiv:1703.07870v2
Puterman ML (2005) Markov decision processes. Wiley, London
MATH Google Scholar
Qualizza A, Belotti P, Margot F (2012) Linear programming relaxations of quadratically constrained quadratic programs. In: Lee J, Leyffer S (eds) Mixed integer nonlinear programming, vol 154. Springer, New York
Chapter MATH Google Scholar
Raskin J, Sankur O (2014) Multiple-environment Markov decision processes. CoRR arXiv:1405.4733
Rockafellar RT, Wets RJ (1991) Scenarios and policy aggregation in optimization under uncertainty. Math Oper Res 16(1):119–147
Article MathSciNet MATH Google Scholar
Roijers DM, Scharpff J, Spaan MTJ, Oliehoek FA, de Weerdt M, Whiteson S (2014) Bounded approximations for linear multi-objective planning under uncertainty. In: Chien SA, Do MB, Fern A, Ruml W (eds) Proceedings of the twenty-fourth international conference on automated planning and scheduling, ICAPS 2014, Portsmouth, New Hampshire, USA, June 21–26, 2014. http://www.aaai.org/ocs/index.php/ICAPS/ICAPS14/paper/view/7929
Ruszczyński A, Shapiro A (2009) Lectures on stochastic programming. SIAM, Philadelphia. https://doi.org/10.1137/1.9780898718751
Book MATH Google Scholar
Satia JK, Lave RE (1973) Markovian decision processes with uncertain transition probabilities. Oper Res 21(3):728–740
Article MathSciNet MATH Google Scholar
Serfozo RF (1979) An equivalence between continuous and discrete time Markov decision processes. Oper Res 27(3):616–620
Article MathSciNet MATH Google Scholar
Sigaud O, Buffet O (eds) (2010) Markov decision processes in artificial intelligence. Wiley-ISTE, London
MATH Google Scholar
Singh SP, Cohn D (1997) How to dynamically merge Markov decision processes. In: Jordan MI, Kearns MJ, Solla SA(eds) Advances in neural information processing systems 10, [NIPS Conference, Denver, Colorado, USA, 1997]. The MIT Press, pp 1057–1063
Singh SP, Jaakkola TS, Jordan MI (1994) Learning without state-estimation in partially observable Markovian decision processes. In: Cohen WW, Hirsh H (eds) Machine learning, proceedings of the eleventh international conference, Rutgers University, New Brunswick, NJ, USA, July 10–13, 1994, pp 284–292
Steimle LN, Kaufman DL, Denton BT (2018) Multi-model Markov decision processes. Technical report, Optimization-online
Vielma JP (2015) Mixed integer linear programming formulation techniques. SIAM Rev 57(1):3–57
Article MathSciNet MATH Google Scholar
Walraven E, Spaan MTJ (2015) Planning under uncertainty with weighted state scenarios. In: Meila M, Heskes T (eds) Proceedings of the thirty-first conference on uncertainty in artificial intelligence, UAI 2015, July 12–16, 2015, Amsterdam, The Netherlands, pp 912–921. AUAI Press
White CC, Eldeib HK (1994) Markov decision processes with imprecise transition probabilities. Oper Res 42(4):739–749
Article MathSciNet MATH Google Scholar
White CC, White DJ (1989) Markov decision processes. Eur J Oper Res 39(6):1–16
Article MathSciNet MATH Google Scholar
Wierman A, Andrew LL, Tang A (2012) Power-aware speed scaling in processor sharing systems: optimality and robustness. Perform Eval 69(12):601–622
Article Google Scholar
Wiesemann W, Kuhn D, Rustem B (2013) Robust Markov decision processes. Math Oper Res 38(1):153–183
Article MathSciNet MATH Google Scholar

Download references

Author information

Authors and Affiliations

Informatik IV, TU Dortmund, Dortmund, Germany
Peter Buchholz & Dimitri Scheftelowitsch

Authors

Peter Buchholz
View author publications
You can also search for this author in PubMed Google Scholar
Dimitri Scheftelowitsch
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Peter Buchholz.

Appendices

Appendix A: Action dependent rewards

Assume that rewards of a concurrent MDP depend on actions and successor states. Then ${\varvec{R}}_k^a(s,s')$ is the reward that is obtained in state s of MDP k if action a is chosen and $s'$ is the successor state. Let

$$\begin{aligned} \left( {\mathcal {S}}, \varvec{\alpha }, \left( {\varvec{P}}^a_k\right) _{a \in {\mathcal {A}}}, \left( {\varvec{R}}^a_k\right) _{a \in {\mathcal {A}}}\right) \end{aligned}$$

(27)

be a concurrent MDP with action-dependent rewards. We assume that this MDP should be analyzed for the discounted reward with discount factor $\gamma \in ({0,1})$. The transformation into an MDP with rewards that do not depend on the action results in the following MDP which depends on $\gamma $.

$$\begin{aligned} \begin{array}{l} \left( {\tilde{{\mathcal {S}}}}, \varvec{{\tilde{\alpha }}}, \left( {\varvec{{\tilde{P}}}}^a_k\right) _{a \in {\mathcal {A}}}, \varvec{{\tilde{r}}}\right) , \text{ where } \\ {\tilde{{\mathcal {S}}}} = {\mathcal {S}}\cup \left\{ (s,a,s') | s, s' \in {\mathcal {S}}, a \in {\mathcal {A}}, {\varvec{P}}_k^a(s,a,s') > 0\right\} \\ \varvec{{\tilde{\alpha }}}\in {{\mathbb {R}}}_{\ge 0}^{|{\tilde{{\mathcal {S}}}}|\times 1}, \varvec{{\tilde{\alpha }}}\mathbb {1}= 1 \text{ with } \varvec{{\tilde{\alpha }}}(\varvec{\sigma }) = \left\{ \begin{array}{ll} \varvec{\alpha }(s) &{} \text{ if } \varvec{\sigma } = s, \\ 0 &{} \text{ otherwise, } \end{array}\right. \\ {\varvec{{\tilde{P}}}}_k^a \in {{\mathbb {R}}}^{|{\tilde{{\mathcal {S}}}}| \times |{\tilde{{\mathcal {S}}}}|}_{\ge 0}, {\varvec{{\tilde{P}}}}_k^a\mathbb {1}= \mathbb {1} \text{ with } {\varvec{{\tilde{P}}}}_k^a(\varvec{\sigma },\varvec{\sigma '}) = \left\{ \begin{array}{ll} {\varvec{P}}_k^a(s,s') &{} \text{ if } \varvec{\sigma } = s \text{ and } \varvec{\sigma '} = (s,a,s'),\\ 1.0 &{} \text{ if } \varvec{\sigma } = (s, a, s') \text{ and } \varvec{\sigma '} = s', \\ 0 &{} \text{ otherwise, } \end{array}\right. \\ \varvec{{\tilde{r}}}\in {{\mathbb {R}}}^{1 \times |{\tilde{{\mathcal {S}}}}|} \text{ with } \varvec{{\tilde{r}}}(\varvec{\sigma }) = \left\{ \begin{array}{ll} \frac{{\varvec{R}}^a_k(s,s')}{\sqrt{\gamma }} &{} \text{ if } \varvec{\sigma } = (s, a, s'),\\ 0 &{} \text{ otherwise. } \end{array}\right. \end{array}\nonumber \\ \end{aligned}$$

(28)

Observe that the state space ${\tilde{{\mathcal {S}}}}$ consists of states $s \in {\mathcal {S}}$ and states $(s,a,s')$. The modified MDP alternates between states $s \in {\tilde{{\mathcal {S}}}} \cap {\mathcal {S}}$ and states $(s,a,s') \in {\tilde{{\mathcal {S}}}} {\setminus } {\mathcal {S}}$. Rewards are only gained in the latter states.

For some policy ${\varvec{{\varPi }}}$ defined on ${\mathcal {S}}$ we define a policy ${\varvec{{\tilde{{\varPi }}}}}$ on ${\tilde{{\mathcal {S}}}}$ as follows

$$\begin{aligned} \varvec{{\tilde{\pi }}}_{\varvec{\sigma }}(a) = \left\{ \begin{array}{ll} \varvec{\pi }_s(a) &{}\quad \text{ if } \varvec{\sigma }=s \in {\mathcal {S}}, \\ 1 &{} \quad \text{ if } \varvec{\sigma } = (s, a, s'),\\ 0 &{} \quad \text{ otherwise. } \end{array}\right. \end{aligned}$$

Thus, a policy for the MDP with action-dependent rewards can be uniquely translated into a policy for the MDP with action-independent rewards. Define the following sequences of vectors

$$\begin{aligned} \begin{array}{lll} \varvec{g}^0 = {\mathbf {0}}\in {{\mathbb {R}}}^{|{\mathcal {S}}| \times 1}, &{} \varvec{g}^h(s) = \sum \limits _{s' \in {\mathcal {S}}} \sum \limits _{a \in {\mathcal {A}}} \varvec{\pi }_s(a){\varvec{P}}_k^a(s,s')\left( {\varvec{R}}_k^a(s,s')+ \gamma \varvec{g}^{h-1}(s')\right) \\ \varvec{{\tilde{g}}}^0 = {\mathbf {0}}\in {{\mathbb {R}}}^{|{\tilde{{\mathcal {S}}}}| \times 1}, &{} \varvec{{\tilde{g}}}^h(\varvec{\sigma }) = \sum \limits _{\varvec{\sigma '} \in {\tilde{{\mathcal {S}}}}} \sum \limits _{a \in {\mathcal {A}}} \varvec{{\tilde{\pi }}}_s(a){\varvec{{\tilde{P}}}}_k^a(\varvec{\sigma },\varvec{\sigma '})\left( \varvec{{\tilde{r}}}(\varvec{\sigma })+ \sqrt{\gamma } \varvec{{\tilde{g}}}^{h-1}(\varvec{\sigma '})\right) \end{array} \end{aligned}$$

The vectors $\varvec{g}^h$ and $\varvec{{\tilde{g}}}^h$ contain the expected discounted rewards over the course of h transitions under the policy ${\varvec{{\varPi }}}$ resp. ${\varvec{{\tilde{{\varPi }}}}}$. Now we show the main relation between $\varvec{g}^h$ and $\varvec{{\tilde{g}}}^h$.

Theorem 6

For all $s \in {\mathcal {S}}$ and all $h \in {{\mathbb {N}}}_0$ it is $\varvec{g}^h(s) = \varvec{{\tilde{g}}}^{2h}(s) = \varvec{{\tilde{g}}}^{2h+1}(s)$.

Proof

We prove the correspondence of the values in the vectors for h and 2h by induction. For $h=0$ we have by definition of the vectors $\varvec{g}^0(s) = \varvec{{\tilde{g}}}^0(s) = 0$. Now assume that the correspondence holds for $h \ge 0$, then

$$\begin{aligned} \begin{array}{lll} \varvec{{\tilde{g}}}^{2h+2}(s) &{} = \sum \limits _{s'\in {\mathcal {S}}}\sum \limits _{a \in {\mathcal {A}}} \varvec{{\tilde{\pi }}}_s(a){\varvec{{\tilde{P}}}}_k^a(s,(s,a,s')) \left( \varvec{{\tilde{r}}}(s) + \sqrt{\gamma } \varvec{{\tilde{g}}}^{2h+1}(s,a,s')\right) \\ &{} = \sum \limits _{s'\in {\mathcal {S}}}\sum \limits _{a \in {\mathcal {A}}} \varvec{\pi }_s(a){\varvec{P}}_k^a(s,s')\left( \sum \limits _{s''\in {\mathcal {S}}}\sum \limits _{a' \in {\mathcal {A}}} \varvec{{\tilde{\pi }}}_{(s,a,s')}(a'){\varvec{{\tilde{P}}}}_k^{a'}((s,a,s'), s'') \right. \\ &{} \quad \left. \times \left( \sqrt{\gamma } \varvec{{\tilde{r}}}(s,a,s') + \gamma \varvec{{\tilde{g}}}^{2h}(s'') \right) \right) \\ &{} = \sum \limits _{s'\in {\mathcal {S}}}\sum \limits _{a \in {\mathcal {A}}} \varvec{\pi }_s(a){\varvec{P}}_k^a(s,s') \left( \sqrt{\gamma } \varvec{{\tilde{r}}}(s,a,s') + \gamma \varvec{{\tilde{g}}}^{2h}(s') \right) \\ &{} = \sum \limits _{s'\in {\mathcal {S}}}\sum \limits _{a \in {\mathcal {A}}} \varvec{\pi }_s(a){\varvec{P}}_k^a(s,s')\left( \sqrt{\gamma } \frac{{\varvec{R}}_k^a(s,s')}{\sqrt{\gamma }} + \gamma \varvec{g}^h(s')\right) \\ &{} = \varvec{g}^{h+1}(s) \end{array} \end{aligned}$$

which implies that it holds for $h+1$. Now we consider $\varvec{{\tilde{g}}}^{2h+1}(s)$. For $h = 0$ we have

$$\begin{aligned} \begin{array}{lll} \varvec{{\tilde{g}}}^1(s) = \sum \limits _{s'\in {\mathcal {S}}}\sum \limits _{a \in {\mathcal {A}}} \varvec{\pi }_s(a){\varvec{{\tilde{P}}}}_k^a(s,(s,a,s')) \left( \varvec{{\tilde{r}}}(s) + \sqrt{\gamma }\varvec{{\tilde{g}}}^{0}(s,a,s')\right) = 0 \end{array} \end{aligned}$$

because $\varvec{{\tilde{r}}}(s) = 0$ and $\varvec{{\tilde{g}}}^0={\mathbf {0}}$. Now assume that the result holds for $2h+1$, we show that it holds for $2h+3$.

$$\begin{aligned} \begin{array}{lll} \varvec{{\tilde{g}}}^{2h+3}(s) &{} = \sum \limits _{s'\in {\mathcal {S}}}\sum \limits _{a \in {\mathcal {A}}} \varvec{{\tilde{\pi }}}_s(a){\varvec{{\tilde{P}}}}_k^a(s,(s,a,s')) \left( \varvec{{\tilde{r}}}(s) + \sqrt{\gamma }\varvec{{\tilde{g}}}^{2h+2}(s,a,s')\right) \\ &{} = \sum \limits _{s'\in {\mathcal {S}}}\sum \limits _{a \in {\mathcal {A}}} \varvec{\pi }_s(a){\varvec{P}}_k^a(s,s')\left( \sum \limits _{s''\in {\mathcal {S}}}\sum \limits _{a' \in {\mathcal {A}}} \varvec{{\tilde{\pi }}}_{(s,a,s')}(a'){\varvec{{\tilde{P}}}}_k^{a'}((s,a,s'), s'')\right. \\ &{} \quad \times \left. \left( \sqrt{\gamma }\varvec{{\tilde{r}}}(s,a,s') + \gamma \varvec{{\tilde{g}}}^{2h+1}(s'') \right) \right) \\ &{} = \sum \limits _{s'\in {\mathcal {S}}}\sum \limits _{a \in {\mathcal {A}}} \varvec{\pi }_s(a){\varvec{P}}_k^a(s,s')\left( \sqrt{\gamma }\frac{{\varvec{R}}_k^a(s,s')}{\sqrt{\gamma }} +\gamma \varvec{g}^h(s')\right) \\ &{} = \varvec{g}^{h+1}(s) \end{array} \end{aligned}$$

$\square $

Let $G^{\varvec{{\varPi }}}(\gamma ) = \varvec{\alpha }\lim _{h \rightarrow \infty } \varvec{g}^h$ and ${\tilde{G}}^{\varvec{{\tilde{{\varPi }}}}}(\sqrt{\gamma }) = \varvec{{\tilde{\alpha }}}\lim _{h \rightarrow \infty } \varvec{{\tilde{g}}}^h$ the discounted gains for the two MDPs with discount factors $\gamma \in ({0,1})$ and $\sqrt{\gamma }$, respectively.

Theorem 7

For any policy ${\varvec{{\varPi }}}$ and any discount factor $\gamma \in ({0,1})$ the relation

$$\begin{aligned} G^{\varvec{{\varPi }}}(\gamma ) = {\tilde{G}}^{\varvec{{\tilde{{\varPi }}}}}(\sqrt{\gamma }) \end{aligned}$$

holds.

Proof

First, we show the existence of $G^{\varvec{{\varPi }}}(\gamma )$. Since ${\varvec{R}}^a_k(s, s')$ is finite for all $a \in {\mathcal {A}}$ and for all $s, s' \in {\mathcal {S}}$ and can be bounded by a real value $R \ge \left| {{\varvec{R}}^a_k(s, s')}\right| $, the sequence $\left( \varvec{g}^h\right) _{h \in {{\mathbb {N}}}_0}$ adheres to

$$\begin{aligned} \left| {\varvec{g}^{h+i}(s) - \varvec{g}^i(s)}\right|&= \sum \limits _{s' \in {\mathcal {S}}} \sum \limits _{a \in {\mathcal {A}}} \varvec{\pi }_s(a) {\varvec{P}}^a_k(s, s') \left| {\varvec{R}}^a_k(s, s') + \gamma \varvec{g}^{h+i - 1}(s') - {\varvec{R}}^a_k(s, s') \right. \\&\quad \left. - \gamma \varvec{g}^{i - 1}(s') \right| \\&\le \gamma \max \limits _{s' \in {\mathcal {S}}} \left| { \varvec{g}^{h+i - 1}(s') - \varvec{g}^{i - 1}(s') }\right| . \end{aligned}$$

By induction, it follows that

$$\begin{aligned} \left| {\varvec{g}^{h+i}(s) - \varvec{g}^i(s)}\right|&\le \gamma ^i \max \limits _{s' \in {\mathcal {S}}} \left| {\varvec{g}^h(s')}\right| \\&\le \gamma ^i R \sum \limits _{j = 0}^h \gamma ^j \\&= \gamma ^i R \frac{1 - \gamma ^h}{1 - \gamma } \\&\le \gamma ^i \frac{R}{1 - \gamma } \end{aligned}$$

for $i \in {{\mathbb {N}}}$ which implies the Cauchy convergence criterion. Then Theorem 6 implies that $\varvec{g}(s) = \lim _{h \rightarrow \infty } \varvec{g}^h(s) = \lim _{h \rightarrow \infty } \varvec{{\tilde{g}}}^{2h}(s) = \lim _{h \rightarrow \infty } \varvec{{\tilde{g}}}^h(s) = \varvec{{\tilde{g}}}(s)$ for all $s \in {\mathcal {S}}$ and we have

$$\begin{aligned} G^{\varvec{{\varPi }}}(\gamma )&= \sum \limits _{s \in {\mathcal {S}}} \varvec{\alpha }(s) \varvec{g}(s) \\&= \sum \limits _{s \in {\mathcal {S}}} \varvec{{\tilde{\alpha }}}(s) \varvec{{\tilde{g}}}(s)&\text {since }\varvec{{\tilde{\alpha }}}(s) = \varvec{\alpha }(s)\text { for }s \in {\mathcal {S}}\\&= \sum \limits _{\sigma \in {\tilde{{\mathcal {S}}}}} \varvec{{\tilde{\alpha }}}(\sigma ) \varvec{{\tilde{g}}}(\sigma )&\text {since }\varvec{{\tilde{\alpha }}}(\sigma ) = 0\text { for }\sigma \not \in {\mathcal {S}}\\&= {\tilde{G}}^{\varvec{{\varPi }}}(\sqrt{\gamma }). \end{aligned}$$

$\square $

Analogously the following result can be shown.

Theorem 8

For any policy ${\varvec{{\tilde{{\varPi }}}}}$ in the action-independent MDP $\left( {{\tilde{{\mathcal {S}}}}, \varvec{{\tilde{\alpha }}}, \left( {{\varvec{{\tilde{P}}}}^a_k}\right) _{a \in {\mathcal {A}}}, \varvec{{\tilde{r}}}}\right) $, the policy ${\varvec{{\varPi }}}$ in the action-dependent MDP derived by projection of ${\varvec{{\tilde{{\varPi }}}}}$ on ${\mathcal {S}}$ is subject to

$$\begin{aligned} G^{\varvec{{\varPi }}}(\gamma ) = {\tilde{G}}^{\varvec{{\tilde{{\varPi }}}}}(\sqrt{\gamma }) \end{aligned}$$

This implies that every policy (including the optimal policy) in the MDP with action-independent rewards yields the same expected discounted reward as its projection in the MDP with action-dependent rewards. Together with Theorem 7 this means that there exists a one-to-one correspondence between policies in both MDPs which yields the same expected discounted reward, and optimal policies in one MDP are also optimal in the other one. This completes the transformation from the action-dependent reward model.

Appendix B: Proof of Theorem 1

Theorem 1

The decision problem defined in Definition 1 is NP-complete.

We perform a reduction from 3-SAT. Given a 3-SAT instance with n variables $x_1, \ldots , x_n$ and m clauses $C_1, \ldots , C_m$ where each clause contains three literals, we construct a concurrent MDP ${\mathcal {M}}= \{1, 2\}$ consisting of two MDPs, a vector $\varvec{w}\in {{\mathbb {R}}}^2$ and a real number $g \in {{\mathbb {R}}}$ such that the instance will be satisfiable if and only if there is a policy ${\varvec{{\varPi }}}$ for the concurrent MDP that yields the value g.

The first part of our construction are the states. They are arranged in three groups.

First, we create a specially designated sink state $s_0$ which yields 0 reward in both MDPs.
Then, we transfer the variables of the Boolean satisfiability problem into states of the MDPs: for each variable x we create two states $s_{x}, s_{x}'$ in both MDPs. The reward is 0 in states of the form $s_{x}$ and 1 in states of the form $s_{x}'$.
Last, we create states for clauses: for each clause C, a state $s_C$ is created. Again, the reward in $s_C$ is zero for all clauses.

The second part of the construction are the actions. We create three actions ${\mathcal {A}}= \{1, 2, 3\}$ with the following semantics.

In sink state $s_0$, it is ${\varvec{P}}^a_k(s_0, s_0) = 1$ for all $a \in {\mathcal {A}}$ and $k \in {\mathcal {M}}$.
In the variable states, we define ${\varvec{P}}^1_1(s_x, s_{x}') = {\varvec{P}}^2_1(s_x, s_0) = {\varvec{P}}^3_1(s_x, s_0) = 1$ and ${\varvec{P}}^1_2(s_x, s_0) = {\varvec{P}}^2_2(s_x, s_{x}') = {\varvec{P}}^3_2(s_x, s_{x}') = 1$, that is, we define actions in $s_x$ to lead to different outcomes in the MDPs. The motivation is to force a mutually exclusive choice of values for the Boolean variables in the concurrent MDP. In the auxiliary variable states, it is ${\varvec{P}}^a_k(s_x', s_x) = 1$ for all actions $a \in {\mathcal {A}}$ and MDPs $k \in {\mathcal {M}}$; the idea behind these states is to exploit non-linearity of the problem. The construction is visualized in Fig. 3 where the upper part corresponds to the first MDP and the lower part corresponds to the second MDP in ${\mathcal {M}}$.
In the clause states, we define actions as follows. In a clause $C = L_1 \vee L_2 \vee L_3$, the chosen action represents the literal that evaluates to true. Hence, we define ${\varvec{P}}^a_k(C, s)$ by setting ${\varvec{P}}^a_k(C, s) = 1$ in the cases
- $L_a = x, k = 1, s = s_x$
- $L_a = \lnot x, k = 1, s = s_0$
- $L_a = \lnot x, k = 2, s = s_x$
- $L_a = x, k = 2, s = s_0$

A graphical sketch of this setup can be seen in Fig. 4. Again, the upper part of the drawing corresponds to the first MDP in ${\mathcal {M}}$ while the lower part corresponds to the second MDP.

The idea behind this construction is to infer functions $\beta :\{1, \ldots , n\} \rightarrow \{0, 1\}$ that map variables to values and $\nu :\{1, \ldots , m\} \rightarrow \{1, 2, 3\}$ that map the clauses to the satisfying variables. This is done to create a mapping from policies to variable assignments in the SAT problem. Furthermore, we define the initial distribution $\alpha $ with $\alpha (s_C) = {1}/{m}$ for all clauses C for both MDPs and weights $\varvec{w}= ({1}/{2}, {1}/{2})$. Concerning the value, we set an auxiliary constant $q := \frac{1}{1 - \gamma ^2}$ and the required value $g := \frac{\gamma ^2 q}{2}$ where $\gamma $ is a non-zero discount factor in the concurrent MDP.

We prove the validity of the reduction. First, we show that if there is an assignment $\beta :\{1, \ldots , n\} \rightarrow \{0, 1\}$ that satisfies the SAT instance, then there also exists a policy $\pi $ such that $\sum _{k=1}^{K} \varvec{w}(k) G^{{\varvec{{\varPi }}}}_k \ge g$. We construct the policy in two steps. In the first step, we set $\pi _{s_x}(1) = 1 \Leftrightarrow \beta (x) = 1$ for all variables x. In the second step, it follows from the existence of a satisfying assignment that in each clause, a literal is satisfied, defining a function $\nu :\{1, \ldots , m\} \rightarrow \{1, 2, 3\}$ that defines the number of a satisfied literal in every clause. Thus, we set $\pi _{s_C}(a) = 1 \Leftrightarrow \nu (C) = a$.

We verify that the constructed policy yields the given value. As in each clause the satisfying literal is chosen, the value of this state will be 0 in one MDP and $\gamma ^2 q$ in the other one.

Now we show that if there is no satisfying assignment, then the value of the concurrent MDP will be lower than g. Given any assignment $\beta $ and any assignment $\nu $, the induced policy will lead from at least one clause state to the sink state $s_0$ with nonzero probability in both MDPs, yielding a lower value. However, we must take care of stationary but not pure policies that still might induce the desired value. One can observe that if the stationary policy is not pure in a state $s_x$ for a variable x, then the cumulative discounted reward in this state is $\frac{p\gamma }{1 - p^2\gamma ^2}$ for some real $0< p < 1$. Deriving the value of a clause state from which one can arrive to this variable state, we get, summing over both MDPs, a summand

$$\begin{aligned} \frac{1}{2} \left( \frac{p \gamma ^2}{1 - p^2 \gamma ^2} + \frac{(1 - p) \gamma ^2}{1 - (1 - p)^2 \gamma ^2} \right) \end{aligned}$$

Let $f(p) = \frac{p}{1 - p^2 \gamma ^2} + \frac{1 - p}{1 - (1 - p)^2 \gamma ^2}$. Computing the derivative, we obtain

$$\begin{aligned} f'(p) = \frac{2 \gamma ^2 p}{ \left( 1 -\gamma ^2 p^2 \right) ^2 } - \frac{2 \gamma ^2 (1 - p)^2}{ \left( 1 -\gamma ^2 (1 - p)^2 \right) ^2 } + \frac{1}{1 - \gamma ^2 p^2} - \frac{1}{1 - \gamma ^2 (1 - p)^2} \end{aligned}$$

which has its roots at

$$\begin{aligned} \frac{1}{2}, \frac{1 \pm \sqrt{1 - 4\gamma ^{-2} + 4\gamma ^{-2} \sqrt{4 - \gamma ^{2}} }}{2}, \frac{1 \pm i \sqrt{1 - 4\gamma ^{-2} + 4\gamma ^{-2} \sqrt{4 - \gamma ^{2}} }}{2}. \end{aligned}$$

The only roots of interest are the real ones, and thus, we investigate the pair

$$\begin{aligned} \frac{1 \pm \sqrt{1 - 4\gamma ^{-2} + 4\gamma ^{-2} \sqrt{4 - \gamma ^{2}} } }{2}. \end{aligned}$$

(29)

It can be seen that for $0< \gamma < 1$, the value $\sqrt{4 - \gamma ^2}$ is at least $\sqrt{3} > 1$, and the root term in (29) is thus greater than one. This means that the whole term (29) is either greater than one or negative. Hence, the possible extreme points of f in [0, 1] may lie at 0, 1, or ${1}/{2}$. We can see that $f(0) = f(1) = q$ while $f({1}/{2}) = \frac{1}{1 - {1}/{4} \gamma ^2} < q$. Hence, a non-pure policy in a variable state will have a lower cumulative discounted reward. Concerning the clause states, we observe that a non-pure policy cannot yield higher rewards than a pure one, as the expected discounted reward in a clause state is linear in the expected discounted rewards in the following variable states; the clause states are not visited again.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Buchholz, P., Scheftelowitsch, D. Computation of weighted sums of rewards for concurrent MDPs. Math Meth Oper Res 89, 1–42 (2019). https://doi.org/10.1007/s00186-018-0653-1

Download citation

Received: 21 August 2017
Accepted: 17 October 2018
Published: 31 October 2018
Issue Date: 19 February 2019
DOI: https://doi.org/10.1007/s00186-018-0653-1

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Computation of weighted sums of rewards for concurrent MDPs

Abstract

Similar content being viewed by others

Tuning the Discount Factor in Order to Reach Average Optimality on Deterministic MDPs

Simple Strategies in Multi-Objective MDPs

Concurrent MDPs with Finite Markovian Policies

1 Introduction

1.1 Related work

1.2 Contribution of the paper

1.3 Structure of the work

1.4 Notation

2 Basic problem

3 Mathematical structure of the optimization problem

3.1 Computational complexity

Definition 1

Theorem 1

4 Computation of optimal policies and rewards

4.1 A solution approach using a mixed integer linear program

4.2 Using algorithms for non-convex programs

4.3 Step-wise computation of increasing lower bounds

Theorem 2

Proof

Theorem 3

Proof

4.4 Computation of upper bounds

Theorem 4

Proof

Theorem 5

Proof

5 Examples

5.1 Small examples with an optimal random policies

5.2 Random matrices

5.3 A multi-server multi-queue model

6 Conclusions

Notes

References

Author information

Authors and Affiliations

Corresponding author

Appendices

Appendix A: Action dependent rewards

Theorem 6

Proof

Theorem 7

Proof

Theorem 8

Appendix B: Proof of Theorem 1

Theorem 1

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation