Abstract
The paper deals with a class of discounted discrete-time Markov control models with non-constant discount factors of the form \(\tilde{\alpha } (x_{n},a_{n},\xi _{n+1})\), where \(x_{n},a_{n},\) and \(\xi _{n+1}\) are the state, the action, and a random disturbance at time \(n,\) respectively, taking values in Borel spaces. Assuming that the one-stage cost is possibly unbounded and that the distributions of \(\xi _{n}\) are unknown, we study the corresponding optimal control problem under two settings. In the first, the random disturbance process \(\left\{ \xi _{n}\right\} \) is formed by observable independent and identically distributed random variables, and we introduce an estimation and control procedure to construct strategies. In the second, \(\left\{ \xi _{n}\right\} \) is assumed to be non-observable, with distributions that may change from stage to stage; in this case the problem is studied as a minimax control problem in which the controller has an opponent selecting the distribution of the corresponding random disturbance at each stage.
1 Introduction
We consider discounted discrete-time Markov control models with non-constant discount factors of the form
where \(x_{n}\) and \(a_{n}\) are the state and the action at time \(n,\) respectively, and \(\left\{ \xi _{n}\right\} \) is a sequence of random variables representing a random disturbance at each time \(n.\) The discount factors (1) play the following role during the evolution of the system. At the initial state \(x_{0},\) the controller chooses an action \(a_{0}.\) Then a cost \(c(x_{0},a_{0})\) is incurred, the system moves to a new state \(x_{1}\) according to a transition law, and the random disturbance \(\xi _{1}\) comes in. Once the system is in state \(x_{1},\) the controller selects an action \(a_{1}\) and incurs a discounted cost \( \tilde{\alpha }(x_{0},a_{0},\xi _{1})c(x_{1},a_{1}).\) Next the system moves to a state \(x_{2}\) and the process is repeated. That is, given the record of states, actions, and random disturbances, the controller incurs, at stage \(n\ge 1,\) the discounted cost
Thus, the costs are discounted at a multiplicative rate. Assuming that \(c(\cdot ,\cdot )\) is possibly unbounded, our objective is to study optimality under the performance index defined by accumulating these costs throughout the evolution of the system.
The main motivation for considering non-constant discount factors comes from applications. Indeed, in economic and financial models, the discount factors are typically functions of the interest rates, which in turn are uncertain. Such uncertainty may be due to the amount of currency and/or the decision-makers’ actions, and furthermore to external random noises whose distributions are unknown. Therefore, in these cases, we have random state–action-dependent discount factors which, in this paper, are supposed to be determined by a function as in (1). Then, assuming that the distributions of the random variables \(\xi _{n}\) are unknown, we study the corresponding control problem under two settings.
First, we assume that the random disturbance process \(\left\{ \xi _{n}\right\} \) is formed by observable, independent and identically distributed (i.i.d.) random variables, independent of the state–action pairs. The common (unknown) distribution, denoted by \(\theta \), is estimated from historical observations of \(\xi _{n}\) using the empirical distribution. Then, we combine this estimation scheme with a suitable minimization process to construct asymptotically optimal strategies. It is worth remarking that the hypotheses of observability and of independence of the state–action pairs are crucial to implement such a statistical estimation and control procedure, also known as the certainty equivalence principle (see, e.g., Hernández-Lerma 1989; Mandl 1974). However, there are situations, perhaps most of them in economic and financial models, where (i) the random variable \(\xi _{n}\) represents a random noise which is impossible to observe; or (ii) \(\left\{ \xi _{n}\right\} \) is a stochastic process which is difficult to handle inside a controlled system. The latter situation might occur, for instance, when \(\left\{ \xi _{n}\right\} \) represents the interest rate. Indeed, in dynamical financial systems (see, e.g., Brigo and Mercurio 2007; Heath et al. 1992; Vasicek 1977), the evolution of the interest rate is generally modeled by a stochastic difference equation (a differential equation if the system is analyzed in continuous time); inserting such an equation into a control system through the discount factor, as would be our case, makes the resulting optimal control problem difficult to study. In view of situations (i) and (ii), our second setting consists in supposing that \(\left\{ \xi _{n}\right\} \) is a sequence of independent and possibly non-observable random variables whose distributions may change from stage to stage.
The only information the controller possesses is that, at each stage, the corresponding distribution belongs to an appropriate set of probability measures \(\Theta .\) In this case, the optimal control problem is studied as a minimax control problem known as a game against nature. Indeed, we suppose that the controller has an opponent, namely “nature”, which, at each stage, selects a distribution from the set \(\Theta \) for the corresponding random disturbance. Hence, the controller is interested in selecting actions directed to minimize the maximum discounted cost—with random state–action-dependent discount factors—generated on the set \(\Theta .\) Thus, the second objective is to show the existence of minimax strategies.
Observe that under the minimax approach it is possible to study the optimal control problem in general scenarios. These include the cases where the random disturbances are observable or unobservable, with constant or non-constant distribution throughout the evolution of the system. Moreover, modeling a control problem as a minimax system simplifies the mathematical analysis since it avoids dealing directly with the disturbance process, as in the case of point (ii). Although these facts constitute certain advantages, it is important to keep in mind that, under this formulation, we obtain minimax strategies instead of optimal strategies, which is the price we must pay. In particular, if the \(\xi _{n}\) are i.i.d. and observable random variables, as in the estimation and control scheme, the minimax procedure works if we assume that the common and unknown distribution \(\theta \) belongs to a set \(\Theta \). Clearly, in this case, we obtain minimax strategies instead of asymptotically optimal strategies.
Our general approach to these two problems is as follows. We introduce a minimax value iteration algorithm which converges geometrically to the minimax value function. This value function is characterized as the unique solution of a minimax equation, and then, by imposing appropriate conditions, we prove the existence of minimax strategies. In addition, taking into account that an optimal control problem is a particular case of a minimax problem, we apply the minimax results to study the estimation and control problem, for which we first prove that Markov strategies are sufficient.
Among the performance indices used to study stochastic optimal control problems, the discounted criterion with a constant, non-random discount factor is the best understood. It has been widely studied under different approaches: dynamic programming (see Bertsekas 1987; Hernández-Lerma 1989; Hernández-Lerma and Lasserre 1996, 1999; Puterman 1994 and references therein); convex analysis (Altman 1999; Borkar 1998; Piunovskiy 1997); linear programming (Altman 1999; Hernández-Lerma and Lasserre 1996, 1999; Hernández-Lerma and González-Hernández 2000; Piunovskiy 1997); Lagrange multipliers (López-Martínez and Hernández-Lerma 2003); adaptive procedures (Gordienko and Minjárez-Sosa 1998; Hilgert and Minjárez-Sosa 2001, 2006); minimax systems (González-Trejo et al. 2003; Iyengar 2005; Jaskiewicz and Nowak 2011). There have also been important, although less frequent, works dealing with the problem with non-constant discount factors under several settings. For instance, Feinberg and Shwartz (1994) study the problem assuming \(K<\infty \) different discount factors \(\alpha _{1},\alpha _{2},\ldots ,\alpha _{K}\) which are independent of the state–action pairs (see also Carmon and Shwartz 2009; Feinberg and Shwartz 1995, 1999). In fact, Carmon and Shwartz (2009) present an extension to the case \(K=\infty .\) In addition, performance indices with multiplicative discount rates as in (2) have been treated in Hinderer (1979) and Schäl (1975). Specifically, in Hinderer (1979) the discount factor is defined as a function of the state–action history of the system, while in Schäl (1975) it is state–action-dependent. Both papers assume one-stage costs bounded from below. Recently, Wei and Guo (2011) extended some results of Schäl (1975) to the unbounded cost case considering state-dependent discount factors. On the other hand, randomized discounted criteria have been analyzed in González-Hernández et al.
(2007, 2008, 2009, 2013, 2014) addressing several issues: existence of optimal strategies, adaptive control, approximation algorithms, and problems with constraints. In these cases, the discount factor is modeled as a stochastic process, independent of the state–action pairs, which is defined in terms of a suitable discrete-time Markov process.
According to this description of the literature, our work presents an alternative approach to discounted optimal control problems with non-constant discount factors. Indeed, to the best of our knowledge, discounted criteria with random state–action-dependent discount factors, and moreover with unknown disturbance distribution, have not been studied before.
The organization of the paper is as follows. In Sect. 2, we present the control models we are concerned with. Next, Sect. 3 contains the optimality criteria, introducing the minimax problem and the estimation and control problem. General assumptions as well as some preliminary results are stated in Sect. 4. The minimax and the asymptotically optimal strategies are constructed in Sects. 5 and 6, respectively. Finally, Sect. 7 presents some examples illustrating our results.
2 The control models
In this section, we present the control models that will be analyzed in the paper. We first introduce the Markov model corresponding to the estimation and control problem, and next we describe the minimax control model to study the case of non-observable random disturbance. We will use the following notation.
Notation
Given a Borel space \(Z\)—that is, a Borel subset of a complete separable metric space—\(\mathcal {B}(Z)\) denotes its Borel \(\sigma \)-algebra, and “measurability” always means measurability with respect to \(\mathcal {B}(Z)\). The class of all probability measures on \(Z\) is denoted by \(\mathbb {P}(Z)\). Given two Borel spaces \(Z\) and \(Z^{\prime }\), a stochastic kernel \(\varphi (\cdot |\cdot )\) on \(Z\) given \(Z^{\prime }\) is a function such that \(\varphi (\cdot |z^{\prime })\) is in \(\mathbb {P}(Z)\) for each \(z^{\prime }\in Z^{\prime }\), and \(\varphi (B|\cdot )\) is a measurable function on \(Z^{\prime }\) for each \(B\in \mathcal {B}(Z).\) Moreover, \(\mathbb {R}_{+}\) denotes the set of nonnegative real numbers, and \(\mathbb {N}\) \((\mathbb {N}_{0},\) resp.\()\) denotes the set of positive (nonnegative, resp.) integers. The class \(\mathbb {P}(Z)\) is endowed with the topology of weak convergence. That is, a sequence \(\left\{ \mu _{n}\right\} \) in \(\mathbb {P}(Z)\) converges weakly to \(\mu \) \((\mu _{n}\rightarrow \mu )\) if
for all bounded and continuous functions \(u.\) In this case, if \(Z\) is a Borel space, then so is \(\mathbb {P}(Z)\).
2.1 Markov control model
We consider the control model with random state–action-dependent discount factors
satisfying the following conditions. The state space \(\mathbb {X}\), the action space \(\mathbb {A}\), and the discount factor disturbance space \(S\) are Borel spaces. To each \(x\in \mathbb {X}\), we associate a nonempty measurable subset \(A(x)\) of \(\mathbb {A}\) denoting the set of admissible controls (or actions) when the system is in state \(x.\) The set
of admissible state–action pairs is assumed to be a Borel subset of the Cartesian product of \(\mathbb {X}\) and \(\mathbb {A}\). The transition law \(Q(\cdot \mid \cdot )\) is a stochastic kernel on \( \mathbb {X}\) given \(\mathbb {K}_{A}\), and \(\tilde{\alpha }:\mathbb {K}_{A}\times S\rightarrow (0,1)\) is a function as in (1) representing the discount factors, where \(\left\{ \xi _{n}\right\} \) is a sequence of observable i.i.d. random variables on a probability space \( (\Omega ,\mathcal {F},P)\) with values in \(S\) and unknown distribution \( \theta \in \mathbb {P}(S).\) Finally, the cost-per-stage \(c\) is a measurable real-valued function on \(\mathbb {K}_{A}\), possibly unbounded.
Throughout the paper, the probability space \((\Omega ,\mathcal {F},P)\) is fixed and a.s. means almost surely with respect to \(P.\)
In this context, since \(\theta \) is unknown and the random disturbance process \(\left\{ \xi _{n}\right\} \) is observable, before choosing the action \(a_{n}\) at stage \(n\in \mathbb {N},\) the controller uses the empirical distribution to get an estimate \(\hat{\theta }_{n}\) of \(\theta .\) That is, \(\left\{ \hat{\theta }_{n}\right\} \subset \!\mathbb {P}(S)\) is obtained by the process:
where \(1_{B}(\cdot )\) denotes the indicator function of the set \(B\in \mathcal {B}(S).\) Next, he/she combines this with the history of the system to select a control \(a=a_{n}(\hat{\theta }_{n})\in A(x_{n}).\) Then, a discounted cost as in (2) is incurred, and the system moves to a new state \(x_{n+1}=x^{\prime }\) according to the transition law
Once the transition to state \(x_{n+1}\) occurs, the process is repeated. The costs are accumulated throughout the evolution of the system in an infinite horizon using a discounted cost criterion with random state–action-dependent discount factors defined below.
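The empirical estimation step (5) admits a simple numerical sketch. The following Python fragment is illustrative only: it assumes a scalar disturbance space \(S\subset \mathbb {R}\), and all names in it are ours, not the paper's.

```python
import numpy as np

def empirical_distribution(xi_history):
    """Empirical estimate theta_hat_n of the unknown disturbance
    distribution theta, built from the observed disturbances
    xi_1, ..., xi_n as in (5):
        theta_hat_n(B) = (1/n) * sum_{i=1}^{n} 1_B(xi_i),
    where a Borel set B is passed as its indicator function."""
    xi = np.asarray(xi_history)
    return lambda indicator: float(np.mean([indicator(s) for s in xi]))

# Usage: disturbances on S = [0, 1); estimate theta([0, 0.5)).
rng = np.random.default_rng(0)
sample = rng.uniform(0.0, 1.0, size=10_000)   # i.i.d. draws from theta
theta_hat = empirical_distribution(sample)
p = theta_hat(lambda s: s < 0.5)              # close to theta([0, 0.5)) = 0.5
```

By the Glivenko–Cantelli theorem, such estimates converge to \(\theta \) almost surely as the number of observations grows, which is what the estimation and control procedure exploits.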
2.2 Minimax control model
Now we are interested in the situation where the random variable \(\xi _{n}\) represents a random noise which is impossible to observe and whose distribution, furthermore, may change from stage to stage. In contrast to the Markov model (3), in this case the controller cannot estimate the unknown distribution by means of statistical methods, and under this scenario we model the control problem as a minimax system. That is, we assume that the controller has an opponent which selects the distribution \(\theta _{n}\) for \( \xi _{n}\) at each time \(n\in \mathbb {N}.\) Specifically, we consider a minimax control model of the form
where \(\mathbb {X},\mathbb {A},Q,\tilde{\alpha },c,\) and \(\mathbb {K}_{A}\) are as in (3) and (4), and \(\Theta \subset \mathbb {P} (S)\) is a Borel subset of probability measures on \(S\), which represents the opponent action space. The set \(\mathbb {K}\in \mathcal {B}(\mathbb {X}\times \mathbb {A}\times \Theta )\) is the constraint set for the opponent. Hence, we suppose that \(\left\{ \xi _{n}\right\} \) is a sequence of independent and possibly non-observable random variables on \((\Omega ,\mathcal {F},P)\) taking values in \(S,\) with corresponding distribution \(\theta _{n}\in \Theta .\) That is,
The model (7) represents a controlled stochastic system which can be seen as a game against nature, whose evolution is as follows. At time \(n\in \mathbb {N}\), the system is in state \(x_{n}\in \mathbb {X},\) the controller chooses an action \(a_{n}\in A(x_{n}),\) and the opponent, “nature”, picks a distribution \(\theta _{n}\in \Theta \) for the random disturbance \(\xi _{n}\). Then the controller incurs a discounted cost
where \(\alpha :\mathbb {K}\rightarrow (0,1)\) is the mean discount factor function
Next, the process moves to a new state according to the transition law \(Q\) and the process is repeated. Thus, the goal of the controller is to minimize the maximum cost generated by nature’s choices. The corresponding minimax control problem will be defined precisely below.
3 Optimality criteria
As will be stated below, some properties of the Markov control model (3), such as the optimality equation, can be deduced from results on the minimax control model (7) by letting \(\Theta =\left\{ \theta \right\} ,\) where \(\theta \) is the common but unknown distribution of the process \(\left\{ \xi _{n}\right\} .\) Taking this fact into account, and for clarity of presentation, we first define the minimax criterion, and then we introduce the performance index corresponding to the model (3).
3.1 Minimax control problem
Let \(\mathbb {H}_{0}:=\mathbb {X},\) \(\mathbb {H}_{0}^{\prime }:=\mathbb {K}_{A},\) and for \(n\in \mathbb {N}\) let \(\mathbb {H}_{n}:=\mathbb {K}^{n}\times \mathbb {X }\) and \(\mathbb {H}_{n}^{\prime }:=\mathbb {K}^{n}\times \mathbb {K}_{A}\). Generic elements of \(\mathbb {H}_{n}\) and \(\mathbb {H}_{n}^{\prime }\) take the form \(h_{n}=(x_{0},a_{0},\theta _{1},\ldots ,x_{n-1},a_{n-1},\theta _{n},x_{n})\) and \(h_{n}^{\prime }=(h_{n},a_{n}),\) respectively. A strategy for the controller is a sequence \(\pi =\{\pi _{n}\}\) of stochastic kernels on \( \mathbb {A}\) given \(\mathbb {H}_{n}\) such that \(\pi _{n}(A(x_{n})|h_{n})=1\ \) for all \(h_{n}\in \mathbb {H}_{n}\) and \(n\in \mathbb {N}_{0}.\) If there exists a sequence \(\varphi =\left\{ \varphi _{n}\right\} \) of stochastic kernels on \(\mathbb {A}\) given \(\mathbb {X}\) such that \(\pi _{n}\left( \cdot |h_{n}\right) =\varphi _{n}\left( \cdot |x_{n}\right) \) then \(\pi \) is called a Markov strategy. We denote by \(\Pi _{\mathbb {A}}\) the set of all strategies for the controller and by \(\Pi _{\mathbb {A}}^{M}\subset \Pi _{ \mathbb {A}}\) the subset of Markov strategies. A strategy \(\varphi \in \Pi _{ \mathbb {A}}^{M}\) is deterministic if there exists a sequence \(\left\{ f_{n}\right\} \) of functions in the set
such that \(\varphi _{n}\left( \cdot |x_{n}\right) \) is concentrated at \( f_{n}(x_{n})\) for each \(n\in \mathbb {N}_{0}.\) If \(f_{n}=f\in \mathbb {F}_{ \mathbb {A}}\) for all \(n,\) then \(\varphi \) is said to be a deterministic stationary strategy for the controller. See, for instance, Hernández-Lerma and Lasserre (1996) for further information on these strategies. Following a standard convention, we denote by \(\mathbb {F}_{\mathbb {A}}\subset \Pi _{\mathbb {A}}^{M}\) the set of deterministic stationary strategies for the controller, and we denote \(\pi \in \mathbb {F}_{\mathbb {A}}\) by \(f.\)
The strategies for the opponent are defined similarly. That is, a strategy for the opponent is a sequence \(\pi ^{\prime }=\{\pi _{n}^{\prime }\}\) of stochastic kernels on \(\Theta \) given \(\mathbb {H}_{n}^{\prime }\) such that \( \pi _{n}^{\prime }\left( \Theta |h_{n}^{\prime }\right) =1\ \) for all \(h_{n}^{\prime }\in \mathbb {H}_{n}^{\prime }\) and \(n\in \mathbb {N}_{0}.\) We denote by \(\Pi _{\Theta }\) the set of all strategies for the opponent, and by \(\mathbb {F} _{\Theta }\subset \Pi _{\Theta }\) the set of all deterministic stationary strategies. We identify a deterministic stationary strategy \(\pi ^{\prime }\in \) \(\mathbb {F}_{\Theta }\) with some measurable function \(g:\mathbb {X} \times \mathbb {A}\rightarrow \Theta \) such that \(\pi _{n}^{\prime }\left( \cdot |h_{n}^{\prime }\right) \) is concentrated at \(g(x_{n},a_{n})\in \Theta \) for all \( h_{n}^{\prime }\in \mathbb {H}_{n}^{\prime }\) and \(n\in \mathbb {N}_{0}.\)
To ease the notation, for each \(f\in \mathbb {F}_{\mathbb {A}}\), we write
According to (8), we are assuming that a cost \(C\) incurred at stage \(n\) is equivalent to a cost \(\Gamma _{n}C\) at time \(0,\) where
and \(\Gamma _{0}=1.\) Hence, for each pair of strategies \((\pi ,\pi ^{\prime })\in \Pi _{\mathbb {A}}\times \Pi _{\Theta }\) and initial state \(x\in \mathbb {X}\), we define the total expected discounted cost—with random state–action-dependent discount factors—as
where \(E_{x}^{\pi \pi ^{\prime }}\) denotes the expectation operator with respect to the probability measure \(P_{x}^{\pi \pi ^{\prime }}\) induced by \( (\pi ,\pi ^{\prime })\in \Pi _{\mathbb {A}}\times \Pi _{\Theta },\) given \( x_{0}=x\) [for the construction of \(P_{x}^{\pi \pi ^{\prime }}\) see, for instance, Dynkin and Yushkevich (1979)].
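To fix ideas, the accumulation of costs under the multiplicative discount rate can be sketched numerically. In the following Python fragment, the functions `alpha_tilde` and `cost` and the random trajectory are hypothetical stand-ins of ours, chosen only so that the discount factors take values in \((0,1)\) as required by (1) and (22); they are not taken from the paper.

```python
import numpy as np

def alpha_tilde(x, a, s):
    """Toy state-action-disturbance-dependent discount factor in (0, 1)."""
    return 0.5 + 0.2 * s / (1.0 + x + a)

def cost(x, a):
    """Toy one-stage cost c(x, a)."""
    return float(x + a)

def discounted_cost(states, actions, disturbances):
    """Accumulate sum_n Gamma_n c(x_n, a_n) with Gamma_0 = 1 and
    Gamma_{n+1} = Gamma_n * alpha_tilde(x_n, a_n, xi_{n+1})."""
    gamma, total = 1.0, 0.0
    for n, (x, a) in enumerate(zip(states, actions)):
        total += gamma * cost(x, a)
        # disturbances[n] plays the role of xi_{n+1}
        gamma *= alpha_tilde(x, a, disturbances[n])
    return total

# Usage: a random (uncontrolled) trajectory with states/actions in {0, 1}.
rng = np.random.default_rng(1)
T = 50
xs = rng.integers(0, 2, size=T)
acts = rng.integers(0, 2, size=T)
xis = rng.uniform(size=T)
v = discounted_cost(xs, acts, xis)
```

Since here \(\tilde{\alpha }\le 0.7\) and \(|c|\le 2\), the accumulated cost is bounded by \(2/(1-0.7)\), in line with the geometric bounds used later in Remark 3(c).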
Thus, the minimax control problem to the controller is to find a strategy \( \pi ^{*}\in \Pi _{\mathbb {A}}\) such that
In this case, the strategy \(\pi ^{*}\) is said to be minimax, whereas \( V^{*}\) is the minimax value function.
3.2 The estimation and control problem
Since the disturbance process \(\left\{ \xi _{n}\right\} \) is a sequence of observable i.i.d. random variables, the actions or controls applied at time \(n\in \mathbb {N}_{0}\) by the controller are selected based on the observed history \(h_{n}=(x_{0},a_{0},\xi _{1},\ldots ,x_{n-1},a_{n-1},\xi _{n},x_{n}).\) That is, we consider \(\mathbb {H}_{0}:=\mathbb {X}\) and \(\mathbb {H}_{n}:=(\mathbb {K}_{A}\times S)^{n}\times \mathbb {X}\) for \(n\in \mathbb {N},\) as the spaces of admissible histories up to time \(n.\) The strategies for the controller in this context are then defined similarly to those for the minimax criterion. For notational convenience, we will keep denoting by \(\Pi _{\mathbb {A}}\) and \(\mathbb {F}_{\mathbb {A}}\) the sets of all strategies and stationary strategies for the controller, respectively.
Now, taking into account (2), when using a strategy \(\pi \in \Pi _{\mathbb {A}},\) given the initial state \(x_{0}=x\in \mathbb {X}\), we define the total expected discounted cost—with random state–action-dependent discount factors—as
where
Then, the optimal control problem associated with the control model (3) is to find an optimal strategy \(\pi ^{*}\in \Pi _{\mathbb {A}}\) such that \(V(x,\pi ^{*})=V(x)\) for all \(\ x\in \mathbb {X},\) where
is the optimal value function.
However, as is proved in Lemma 15 in Sect. 6, the Markov strategies are sufficient to solve the optimal control problem. That is, for each \(\pi \in \Pi _{\mathbb {A}}\) there exists \(\varphi \in \Pi _{\mathbb {A}}^{M}\) such that
Therefore, our problem is to find a strategy \(\varphi ^{*}\in \Pi _{ \mathbb {A}}^{M}\) such that
Furthermore (see Lemma 16, Sect. 6), on the class of Markov strategies \(\Pi _{\mathbb {A}}^{M}\) we can write the performance index (13) in terms of the unknown distribution \(\theta \) of the disturbance process \(\left\{ \xi _{n}\right\} \) as
where \(\Gamma _{n}\) is as in (10). Observe that in this case
The optimal control problem is studied by combining the empirical estimation process (5) of the distribution \(\theta \) with minimization procedures. However, as the performance index (13) depends strongly on the controls selected at the first stages, precisely when the information about the unknown distribution is poor, we cannot ensure, in general, the existence of optimal strategies. Hence, the optimality of strategies constructed under this context will be studied in the following asymptotic sense.
Definition 1
A strategy \(\pi \in \Pi _{\mathbb {A}}\) is said to be asymptotically optimal for the control model \(\mathcal {M}\) if, for all \(x\in \mathbb {X},\)
where
is the total expected discounted cost—with random state–action-dependent discount factors—from stage \(n\) onward, and
The notion of asymptotic optimality was introduced by Schäl (1987) to study a problem of estimation and control in dynamic programming with a constant discount factor (see also Gordienko and Minjárez-Sosa 1998; Hilgert and Minjárez-Sosa 2001).
4 Assumptions and preliminary results
We shall require the following general boundedness and continuity assumptions on the control models. Observe that Assumption 2(a) allows an unbounded cost function \(c(x,a)\) provided that it is majorized by a “bounding” function \( W.\) Under these assumptions, we next establish some preliminary facts which will be used to prove our main results.
Assumption 2
-
(a)
The cost function \(c(x,a)\) is lower semi-continuous (l.s.c.) on \(\mathbb {K}_{A}\). Moreover, there exist a continuous function \(W: \mathbb {X}\rightarrow [1,\infty )\) and a positive constant \(c_{0}\) such that
$$\begin{aligned} \left| c(x,a)\right| \le c_{0}W(x)\ \ \ \ \ \ \forall (x,a)\in \mathbb {K}_{A}. \end{aligned}$$(20) -
(b)
The transition law \(Q\) is weakly continuous, that is, for each continuous and bounded function \(u:\mathbb {X}\rightarrow \mathbb {R}\), the function
$$\begin{aligned} (x,a)\longmapsto \int \limits _{\mathbb {X}}u\left( y\right) Q\left( \mathrm{d}y\mid x,a\right) \end{aligned}$$(21)is continuous on \(\mathbb {K}_{A}\).
-
(c)
The function
$$\begin{aligned} (x,a)\longmapsto \int \limits _{\mathbb {X}}W\left( y\right) Q\left( \mathrm{d}y\mid x,a\right) \end{aligned}$$is continuous on \(\mathbb {K}_{A}\).
-
(d)
The multifunction \(x\rightarrow A(x)\) is upper semi-continuous (u.s.c.), and the set \(A(x)\) is compact for each \(x\in \mathbb {X}\).
-
(e)
The function \(\tilde{\alpha }(x,a,s)\) is continuous on \(\mathbb { K}_{A}\times S,\) and
$$\begin{aligned} \alpha ^{*}:=\sup _{(x,a,s)\in \mathbb {K}_{A}\times S}\tilde{\alpha } (x,a,s)<1. \end{aligned}$$(22) -
(f)
There exists a positive constant \(b\) such that
$$\begin{aligned} 1\le b<(\alpha ^{*})^{-1}, \end{aligned}$$and for all \((x,a)\in \mathbb {K}_{A}\)
$$\begin{aligned} \ \int \limits _{\mathbb {X}}W(y)Q(\mathrm{d}y\mid x,a)\le bW(x). \end{aligned}$$(23)
We denote by \(\mathbb {B}_{W}\) the Banach space of all measurable functions \( u:\mathbb {X}\rightarrow \mathbb {R}\) with the \(W\)-norm
and by \(\mathbb {L}_{W}\) the subspace of l.s.c. functions in \(\mathbb {B}_{W}.\) We will repeatedly use the following inequalities. For any \(u\in \mathbb {B} _{W}\),
and
for all \((x,a)\in \mathbb {K}_{A}.\) The inequality (25) is a consequence of the definition (24), whereas (26) follows from (23) and (25).
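For the reader's convenience, we record the standard weighted-norm formulas behind (24)–(26); they are consistent with (23) and with the bound (29) below:

$$\begin{aligned} \left\| u\right\| _{W}:=\sup _{x\in \mathbb {X}}\frac{\left| u(x)\right| }{W(x)},\qquad \left| u(x)\right| \le \left\| u\right\| _{W}W(x),\quad x\in \mathbb {X}, \end{aligned}$$

and, integrating the second inequality against \(Q\) and applying (23),

$$\begin{aligned} \int \limits _{\mathbb {X}}\left| u\left( y\right) \right| Q(\mathrm{d}y\mid x,a)\le \left\| u\right\| _{W}\int \limits _{\mathbb {X}}W(y)Q(\mathrm{d}y\mid x,a)\le b\left\| u\right\| _{W}W(x),\quad (x,a)\in \mathbb {K}_{A}. \end{aligned}$$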
Remark 3
-
(a)
As is well known (see Hernández-Lerma and Lasserre 1996), Assumption 2(b) can be replaced by the following equivalent condition: for each l.s.c. and bounded below function \(u:\mathbb {X}\rightarrow \mathbb {R}\), the function in (21) is l.s.c. on \(\mathbb {K}_{A}\).
-
(b)
Under Assumption 2(e), the monotone convergence theorem yields that the function \(\alpha :\mathbb {K}\rightarrow (0,1)\) defined in (9) is continuous. Indeed, let \(\left\{ (x_{n},a_{n},\theta _{n})\right\} \in \mathbb {K}\) be a sequence converging to \((x,a,\theta )\in \mathbb {K}\) and denote \(\tilde{\alpha }_{*}(s):=\liminf _{n\rightarrow \infty }\tilde{\alpha }(x_{n},a_{n},s)\) and \( \tilde{\alpha }_{k}(s):=\inf _{j\ge k}\tilde{\alpha }(x_{j},a_{j},s)\). Observe that \(\tilde{\alpha }_{k}(s)\nearrow \tilde{\alpha }_{*}(s),\) as \( k\rightarrow \infty .\) Then, for all \(n\ge k,\) \(\tilde{\alpha } (x_{n},a_{n},\cdot )\ge \tilde{\alpha }_{k}(\cdot ).\) Hence, since \(\theta _{n}\rightarrow \theta \) weakly,
$$\begin{aligned} \liminf \limits _{n\rightarrow \infty }\int \limits _{S}\tilde{\alpha } (x_{n},a_{n},s)\theta _{n}(\mathrm{d}s)\ge \liminf \limits _{n\rightarrow \infty }\int \limits _{S}\tilde{\alpha }_{k}(s)\theta _{n}(\mathrm{d}s)=\int \limits _{S}\tilde{ \alpha }_{k}(s)\theta (\mathrm{d}s). \end{aligned}$$Now, letting \(k\rightarrow \infty ,\) by Assumption 2(e) and the monotone convergence theorem we obtain
$$\begin{aligned} \liminf \limits _{n\rightarrow \infty }\int \limits _{S}\tilde{\alpha } (x_{n},a_{n},s)\theta _{n}(\mathrm{d}s)\ge \int \limits _{S}\tilde{\alpha }_{*}(s)\theta (\mathrm{d}s)=\int \limits _{S}\tilde{\alpha }(x,a,s)\theta (\mathrm{d}s). \end{aligned}$$(27)Similarly we can prove the inequality
$$\begin{aligned} \limsup \limits _{n\rightarrow \infty }\int \limits _{S}\tilde{\alpha } (x_{n},a_{n},s)\theta _{n}(\mathrm{d}s)\le \int \limits _{S}\tilde{\alpha } (x,a,s)\theta (\mathrm{d}s), \end{aligned}$$which combined with (27) implies that \(\alpha :\mathbb {K} \rightarrow (0,1)\) is a continuous function.
-
(c)
From Assumption 2(f), for all \(\ x\in \mathbb {X}\),\(\ n\in \mathbb {N}_{0}\), and \(\ \left( \pi ,\pi ^{\prime }\right) \in \Pi _{\mathbb {A}}\times \Pi _{\Theta },\)
$$\begin{aligned} E_{x}^{\pi \pi ^{\prime }}\left[ W\left( x_{n+1}\right) \right] \le bE_{x}^{\pi \pi ^{\prime }}\left[ W\left( x_{n}\right) \right] . \end{aligned}$$Iterating this inequality we obtain
$$\begin{aligned} E_{x}^{\pi \pi ^{\prime }}\left[ W\left( x_{n+1}\right) \right] \le b^{n+1}W\left( x\right) ,\ \ \ x\in \mathbb {X}. \end{aligned}$$(28)Therefore, from Assumptions 2(a), (e), and (f), for each \( x\in \mathbb {X}\), and \(\left( \pi ,\pi ^{\prime }\right) \in \Pi _{\mathbb {A} }\times \Pi _{\Theta }\),
$$\begin{aligned} \left| V\left( x,\pi ,\pi ^{\prime }\right) \right|&\le E_{x}^{\pi \pi ^{\prime }}\left[ \sum _{n=0}^{\infty }\left( \alpha ^{*}\right) ^{n}\left| c(x_{n},a_{n})\right| \right] \\&\le c_{0}W\left( x\right) \sum _{n=0}^{\infty }\left( b\alpha ^{*}\right) ^{n}=\dfrac{c_{0}W\left( x\right) }{1-b\alpha ^{*}}. \end{aligned}$$This implies that \(V^{*}\in \mathbb {B}_{W}.\) In fact, we have
$$\begin{aligned} \left\| V^{*}\right\| _{W}\le \dfrac{c_{0}}{1-b\alpha ^{*}}. \end{aligned}$$(29)
We next introduce the following family of operators. For each function \(u\) on \(\mathbb {X}\) and \((x,a,\theta )\in \mathbb {K}\) we define:
and
We conclude this section summarizing some useful properties of the operators (30)–(32).
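To make the fixed-point structure concrete, the following Python sketch iterates a finite-model analogue of the operator \(T\): \(Tu(x)=\min _{a}\sup _{\theta }[c(x,a)+\alpha (x,a,\theta )\sum _{y}u(y)Q(y\mid x,a)]\). The states, actions, costs, kernel, and mean discount factors below are hypothetical toy data of ours, not taken from the paper.

```python
import numpy as np

# Toy finite model: two states, two actions, two distributions in Theta,
# entering only through the mean discount factor alpha(x, a, theta) of (9).
C = np.array([[1.0, 2.0],                        # C[x, a] = c(x, a)
              [0.5, 1.5]])
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],          # Q[x, a, y] = Q(y | x, a)
              [[0.5, 0.5], [0.7, 0.3]]])
ALPHA = np.array([[[0.80, 0.85], [0.70, 0.90]],  # ALPHA[x, a, theta]
                  [[0.60, 0.75], [0.85, 0.88]]])

def T(u):
    """Minimax Bellman operator: inner sup over theta, outer min over a."""
    cont = Q @ u                                 # sum_y Q(y|x,a) u(y)
    values = C[:, :, None] + ALPHA * cont[:, :, None]
    return values.max(axis=2).min(axis=1)

# Value iteration: since T is a contraction (modulus at most 0.9 here),
# the iterates converge geometrically to the fixed point V*.
u = np.zeros(2)
for _ in range(200):
    u = T(u)
```

After 200 iterations, the residual \(\Vert Tu-u\Vert \) is of order \(0.9^{200}\), illustrating the geometric convergence of the minimax value iteration algorithm; on this toy data the fixed point is also bounded by \(\max c/(1-0.9)=20\), as in (29) with \(W\equiv 1\) and \(b=1\).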
Lemma 4
If Assumption 2 holds, then:
-
(a)
\(\left| \hat{T}_{a}u(x)\right| \le \hat{M}W(x)\) for all \((x,a)\in \mathbb {K}_{A},u\in \mathbb {B}_{W},\) and some \(\hat{M}<\infty . \)
-
(b)
The mapping \((x,a,\theta )\rightarrow T_{(a,\theta )}u(x)\) is l.s.c. on \(\mathbb {K}\) for all \(u\in \mathbb {L}_{W}.\)
-
(c)
The operator \(T\) is a contraction on \(\mathbb {B}_{W}\) with modulus \(b\alpha ^{*}<1.\) In addition, if \(\Theta \) is a compact set, then
-
(d)
\(T\) maps \(\mathbb {L}_{W}\) into itself.
-
(e)
For each \(u\in \mathbb {L}_{W},\) there exists \(f^{*}\in \mathbb {F}_{\mathbb {A}}\) such that
$$\begin{aligned} Tu(x)=\sup _{\theta \in \Theta }T_{(f^{*},\theta )}u(x),\ \ x\in \mathbb {X }. \end{aligned}$$
Proof
-
(a)
This part follows from (20), (22), (25), and (26) by taking \(\hat{M}:=c_{0}+\alpha ^{*}b\left\| u\right\| _{W}.\)
-
(b)
Observe that if \(u\in \mathbb {L}_{W}\) then, from (24) and Assumption 2(a), the function \(v\left( x\right) :=u\left( x\right) +\left\| u\right\| _{W}W\left( x\right) \) is nonnegative and l.s.c. Thus, Assumption 2(b) (see Remark 3(a)) implies that the mapping \((x,a)\rightarrow \int _{ \mathbb {X}}v(y)Q(\mathrm{d}y\mid x,a)\) is l.s.c. on \(\mathbb {K}_{A}\), which, together with Assumption 2(c) implies that the mapping \( (x,a)\rightarrow \int _{\mathbb {X}}u(y)Q(\mathrm{d}y\mid x,a)\) is l.s.c. on \(\mathbb {K} _{A}.\) Then, Assumptions 2(a), (e) yield the lower semi-continuity of \(T_{(a,\theta )}u(x)\) on \(\mathbb {K}\).
-
(c)
Let \(u,u^{\prime }\in \mathbb {B}_{W}\). Then, from definition (32) of the operator \(T,\) we have, for each \(x\in \mathbb { X}\),
$$\begin{aligned} \left| Tu(x)-Tu^{\prime }(x)\right|&\le \sup _{a\in A(x)}\sup _{\theta \in \Theta }\alpha (x,a,\theta )\int \limits _{\mathbb {X} }\left| u\left( y\right) -u^{\prime }\left( y\right) \right| Q(\mathrm{d}y\mid x,a) \\&\le \alpha ^{*}\left\| u-u^{\prime }\right\| _{W}\int \limits _{ \mathbb {X}}W(y)Q(\mathrm{d}y\mid x,a) \\&\le \alpha ^{*}b\left\| u-u^{\prime }\right\| _{W}W(x), \end{aligned}$$where the last two inequalities follow from (25) and (26). Therefore,
$$\begin{aligned} \left\| Tu-Tu^{\prime }\right\| _{W}\le \alpha ^{*}b\left\| u-u^{\prime }\right\| _{W}. \end{aligned}$$ -
(d)
Let \(\left\{ \left( x_{m},a_{m}\right) \right\} \subset \mathbb {K}_{A}\) be a sequence such that \(\left( x_{m},a_{m}\right) \rightarrow \left( x,a\right) \in \mathbb {K}_{A}\), and let \(\theta \in \Theta \) be arbitrary. Take any sequence \(\left\{ \theta _{m}\right\} \subset \Theta \) such that \(\theta _{m}\rightarrow \theta \) (for instance, the constant sequence \(\theta _{m}=\theta \)). Then, from part (b) of the lemma,
$$\begin{aligned} \underset{m\rightarrow \infty }{\lim \inf }\hat{T}_{a_{m}}u(x_{m})&= \underset{m\rightarrow \infty }{\lim \inf }\sup _{\theta \in \Theta }T_{(a_{m},\theta )}u(x_{m}) \\&\ge \left. \underset{m\rightarrow \infty }{\lim \inf }T_{(a_{m},\theta _{m})}u(x_{m})\right. \\&\ge \left. T_{(a,\theta )}u(x).\right. \end{aligned}$$Since \(\theta \) is arbitrary, we have
$$\begin{aligned} \underset{m\rightarrow \infty }{\lim \inf }\hat{T}_{a_{m}}u(x_{m})\ge \hat{T}_{a}u(x), \end{aligned}$$which implies that \(\hat{T}_{a}u(x)\) is l.s.c. on \(\mathbb {K}_{A}\). Therefore, from part (a), the function \(\hat{T}_{a}u(x)+\hat{M}W\left( x\right) \) is nonnegative and l.s.c. on \(\mathbb {K}_{A}\). Thus, from Assumption 2(d) and due to a well-known result in Schäl (1975) [see also Proposition D.5 in Hernández-Lerma and Lasserre (1996) and Rieder (1978)], we have that
$$\begin{aligned} \inf _{a\in A\left( x\right) }\left\{ \hat{T}_{a}u(x)+\hat{M}W\left( x\right) \right\} =Tu\left( x\right) +\hat{M}W\left( x\right) \end{aligned}$$(33)is a l.s.c. function on \(\mathbb {X}\), which, in turn, implies the lower semi-continuity of \(Tu\left( x\right) \) on \(\mathbb {X}\). Finally, because \( \left\| Tu\right\| _{W}<\infty \) [see part (a) of the lemma and (32)] we have that \(T\) maps \(\mathbb {L}_{W}\) into itself.
-
(e)
Since \(\hat{T}_{a}u(x)+\hat{M}W\left( x\right) \) is a nonnegative and l.s.c. function, from Assumption 2(d) and by applying standard arguments on the existence of minimizers [see, for instance, Rieder (1978); Schäl (1975)] there exists \(f^{*}\in \mathbb {F}_{A}\) such that
$$\begin{aligned} \inf _{a\in A\left( x\right) }\left\{ \hat{T}_{a}u(x)+\hat{M}W\left( x\right) \right\} =\hat{T}_{f^{*}}u(x)+\hat{M}W\left( x\right) , \end{aligned}$$which, together with (33) proves the part (e).
\(\square \)
5 Minimax strategies
In this section, we present the results corresponding to the minimax criterion. First, we define the minimax value iteration functions and prove their geometric convergence to the minimax value function. Then we show the existence of minimax strategies.
It is easy to prove that \(\mathbb {L}_{W}\subset \mathbb {B}_{W}\) is a closed set, which implies that it is a complete subset of \(\mathbb {B}_{W}\). Then, from Lemma 4(c) and Banach's Fixed Point Theorem, there exists a unique function \(\tilde{u}\in \mathbb {L}_{W}\) such that for all \(x\in \mathbb {X}\),
and
Now we define the sequence of minimax value iteration functions \(\left\{ v_{n}\right\} \) in \(\mathbb {L}_{W}\) as
Observe that from (35) and (36), taking \(u=v_{0}\) we get
We state our minimax result as follows.
Theorem 5
If Assumption 2 holds and \(\Theta \) is a compact set, then:
-
(a)
The minimax value function (12) is the unique solution in \( \mathbb {L}_{W}\) satisfying
$$\begin{aligned} V^{*}(x)=TV^{*}(x),\ \ x\in \mathbb {X}. \end{aligned}$$(38) -
(b)
For each \(n\in \mathbb {N},\)
$$\begin{aligned} \left\| v_{n}-V^{*}\right\| _{W}\le c_{0}\dfrac{(b\alpha ^{*})^{n}}{1-b\alpha ^{*}}. \end{aligned}$$ -
(c)
There exists \(f^{*}\in \mathbb {F}_{\mathbb {A}}\) such that
$$\begin{aligned} V^{*}(x)=\sup _{\theta \in \Theta }T_{(f^{*},\theta )}V^{*}(x)= \hat{T}_{f^{*}}V^{*}(x),\ \ x\in \mathbb {X}, \end{aligned}$$(39)and moreover, \(f^{*}\) is a minimax strategy for the controller, that is,
$$\begin{aligned} V^{*}(x)=\sup _{\pi ^{\prime }\in \Pi _{\Theta }}V(x,f^{*},\pi ^{\prime }). \end{aligned}$$
Proof
From (34) and (37), parts (a) and (b) will be proved if we show that \(\tilde{u}=V^{*}.\) To this end, let \(f\in \mathbb {F}_{\mathbb {A}}\) be a selector such that
which exists because of Lemma 4(e). Then, from (30)
Now, for an arbitrary strategy \(\pi ^{\prime }\in \Pi _{\Theta }\) for the opponent, iteration of the inequality (40) yields
Combining (22), (25), and (28), we have
Hence, letting \(m\rightarrow \infty \) in (41), from (11) we get
As \(\pi ^{\prime }\in \Pi _{\Theta }\) is arbitrary, (42) and (12) yield
On the other hand, since \(\alpha \) is a continuous function of \(\theta \) (see Remark 3(b)), from (34) and the compactness of \(\Theta \), there exists \(g:\mathbb {K}_{\mathbb {A}}\rightarrow \Theta \) such that, for each \((x,a)\in \mathbb {K}_{\mathbb {A}},\) the value \(g\left( x,a\right) \in \Theta \) satisfies
Similarly as in (41) and (42), for a strategy \(\pi \in \Pi _{ \mathbb {A}},\) iteration of (44) yields
Thus, since
and, in addition, \(\pi \in \Pi _{\mathbb {A}}\) is arbitrary, (45) implies
Therefore, combining (43) and (46) we get \(\tilde{u} =V^{*}\) which proves the parts (a) and (b) of the theorem.
Finally, the existence of \(f^{*}\in \mathbb {F}_{\mathbb {A}}\) satisfying (39) follows from (38) and Lemma 4(e). Moreover, similarly as in (42), we have that for an arbitrary strategy \(\pi ^{\prime }\in \Pi _{\Theta }\),
which implies that
\(\square \)
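Although the paper treats general Borel spaces, the fixed-point characterization in Theorem 5 is easy to visualize numerically. The sketch below iterates the minimax operator \(T\) of (32) on a toy finite model; the costs, transition kernel, discount factors, and two-point set \(\Theta \) are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Toy finite model (all data are illustrative assumptions, not from the paper):
# two states, two actions, and a two-point set Theta of candidate distributions.
c = np.array([[1.0, 2.0],          # c(x, a): one-stage cost
              [0.5, 1.5]])
Q = np.array([[[0.8, 0.2],         # Q(y | x, a): transition law
               [0.3, 0.7]],
              [[0.6, 0.4],
               [0.1, 0.9]]])
alpha = np.array([[[0.90, 0.85],   # alpha(x, a, theta): discount factors, all < 1
                   [0.80, 0.95]],
                  [[0.70, 0.88],
                   [0.92, 0.75]]])

def T(u):
    """Minimax operator (32):
    Tu(x) = min_a sup_theta { c(x,a) + alpha(x,a,theta) * sum_y u(y) Q(y|x,a) }."""
    Eu = (Q @ u)[:, :, None]                       # expected continuation value
    return np.min(np.max(c[:, :, None] + alpha * Eu, axis=2), axis=1)

# Value iteration v_0 = 0, v_n = T v_{n-1}; by Lemma 4(c) each step contracts
# the error by b * alpha^* < 1, so v_n converges geometrically (Theorem 5(b)).
v = np.zeros(2)
for _ in range(500):
    v = T(v)
print(v, np.max(np.abs(T(v) - v)))                 # fixed-point residual is tiny
```

The geometric rate of Theorem 5(b) is why a few hundred iterations already leave a negligible residual here.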
6 Asymptotically optimal strategies
We consider the control model (3). In this case we assume that \(\left\{ \xi _{n}\right\} \) is a sequence of observable i.i.d. random variables with unknown distribution \(\theta \in \mathbb {P}(S),\) and our objective is to study the optimal control problem (16) which, taking \( \Theta =\left\{ \theta \right\} ,\) can be seen as a particular case of the minimax control problem (12).
Considering this fact and (17) (see (11)), the operator \(T\) defined in (32) takes the form
for each function \(u\) on \(\mathbb {X}\). Hence, we have the following consequences of Theorem 5.
Proposition 6
Suppose that Assumption 2 holds. Then:
-
(a)
The value function (14) (see (16)) satisfies
$$\begin{aligned} V(x)=TV(x),\ \ x\in \mathbb {X}, \end{aligned}$$(47)and moreover (see (29)),
$$\begin{aligned} \left\| V\right\| _{W}\le \dfrac{c_{0}}{1-b\alpha ^{*}}. \end{aligned}$$(48) -
(b)
There exists \(f^{*}\in \mathbb {F}_{\mathbb {A}}\) such that
$$\begin{aligned} V(x)=T_{(f^{*},\theta )}V(x)=c\left( x,f^{*}\right) +\alpha \left( x,f^{*},\theta \right) \int \limits _{\mathbb {X}}V(y)Q(\mathrm{d}y\mid x,f^{*}),\ \ x\in \mathbb {X}. \end{aligned}$$
Since \(\theta \) is unknown, the solution given in Proposition 6 is not accessible to the controller. Under these circumstances, using the empirical distribution \(\hat{\theta }_{n}\) to estimate \(\theta \) (see (5)), our objective is to show the existence of asymptotically optimal strategies.
6.1 Empirical estimation
It is well known that \(\hat{\theta }_{n}\) converges weakly to \(\theta \) a.s. Hence, for each \((x,a)\in \mathbb {K}_{A}\)
That is, as \(n\rightarrow \infty ,\)
However, this convergence is not sufficient for our purposes; specifically, we require uniform convergence on the set \(\mathbb {K}_{A}.\) Hence, to establish a suitable estimation process, we need to impose the following assumption.
Assumption 7
The family of functions
is equicontinuous on \(S.\)
Then, as a consequence of Theorem 6.4 in Ranga Rao (1962) we have the following result.
Lemma 8
Under Assumption 7, as \(n\rightarrow \infty ,\)
Remark 9
An obvious sufficient condition for Assumption 7 is that the disturbance set \(S\) is countable with the discrete topology.
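As an illustration of Lemma 8, the following sketch compares the empirical estimate \(\int \tilde{\alpha }(x,a,s)\hat{\theta }_{n}(\mathrm{d}s)\) with its true value over a grid of state–action pairs. The discount-factor function and the distribution \(\theta \) are hypothetical choices; the family is equi-Lipschitz in \(s\), so Assumption 7 holds for them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discount-factor function on S = [0, 1]: the family
# {s -> alpha(x, a, s) : (x, a)} is equi-Lipschitz in s (Assumption 7).
def alpha(x, a, s):
    return 0.9 / (1.0 + 0.1 * s * (x + a) / (1.0 + x + a))

n = 5000
xi = rng.uniform(0.0, 1.0, size=n)       # observed i.i.d. disturbances, theta = U[0,1]

# The integral against theta_hat_n is just the sample mean over the observations;
# compare it with the true integral (approximated on a fine grid) over a grid of (x, a).
s_fine = np.linspace(0.0, 1.0, 100001)
pairs = [(x, a) for x in (0.0, 1.0, 5.0) for a in (0.0, 0.5, 1.0)]
sup_err = max(abs(alpha(x, a, xi).mean() - alpha(x, a, s_fine).mean())
              for (x, a) in pairs)
print(sup_err)                           # typically of order n**(-1/2)
```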
The uniform convergence (49) is used to obtain a (non-stationary) value iteration algorithm to approximate the value function (16), which will be a key point to construct asymptotically optimal strategies in the next subsection.
Let \(\left\{ V_{n}\right\} \) be a sequence of functions defined as
A straightforward calculation shows that (see (48))
That is, \(V_{n}\in \mathbb {B}_{W}\) a.s., for all \(n\in \mathbb {N}.\)
Proposition 10
Proof
From (47) and (50), by adding and subtracting the term \(\alpha (x,a,\hat{\theta }_{n})\int _{\mathbb {X} }V(y)Q(\mathrm{d}y\mid x,a),\) we have, for each \(x\in \mathbb {X}\) and \(n\in \mathbb {N} ,\)
where the last inequality comes from (22)–(26). Therefore, from (48),
Now, let \(l:=\limsup \nolimits _{n\rightarrow \infty }\left\| V-V_{n}\right\| _{W}<\infty \) (see (48) and (51)). Taking \(\lim \sup \) on both sides of (52), from Lemma 8 we get \(l<\alpha ^{*}bl.\) Then, observing that \( 0<\alpha ^{*}b<1\) (see Assumption 2(f)) we obtain \(l=0\) which proves the proposition. \(\square \)
6.2 Asymptotically optimal strategies
Consider the value iteration function \(V_{n}\) defined in (50). Observe that under Assumption 2, applying standard arguments on the existence of minimizers (see Proposition 6), for each \(n\in \mathbb {N},\) there exists \(f_{n}\in \mathbb {F}_{A}\) such that
Now, let \(\hat{\pi }=\left\{ \hat{\pi }_{n}\right\} \in \Pi _{\mathbb {A}}^{M}\) be the strategy determined by the sequence \(\left\{ f_{n}\right\} ,\) that is, \(\hat{\pi }_{n}(\cdot |h_{n})\) is concentrated on \(f_{n}(x_{n})\) for all \( h_{n}\in \mathbb {H}_{n}\) and \(n\in \mathbb {N},\) with \(\hat{\pi }_{0}\) any fixed action. Then, our objective is to show that \(\hat{\pi }\) is an asymptotically optimal strategy.
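The estimation-and-control procedure — estimate \(\theta \) by the empirical distribution \(\hat{\theta }_{n}\), perform one value-iteration step as in (50), and record the minimizer \(f_{n}\) of (53) — can be sketched on a toy finite model. All model data below (costs, kernel, discount function, and \(\theta =\) Beta(2, 5)) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy finite model (illustrative assumptions): two states, two actions; the
# disturbances xi_n are i.i.d. on S = [0, 1] with distribution theta = Beta(2, 5),
# unknown to the controller and estimated by the empirical distribution.
c = np.array([[1.0, 2.0], [0.5, 1.5]])
Q = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.6, 0.4], [0.1, 0.9]]])

def alpha(x, a, s):
    # hypothetical discount-factor function with values in (0, 1)
    return 0.8 + 0.1 * s / (1.0 + x + a)

def alpha_hat(xi):
    """alpha(x, a, theta_hat_n) = int alpha(x, a, s) theta_hat_n(ds): under the
    empirical distribution this is the sample mean over the observations."""
    return np.array([[alpha(x, a, xi).mean() for a in range(2)]
                     for x in range(2)])

V = np.zeros(2)
xi = np.array([])
for n in range(1, 301):
    xi = np.append(xi, rng.beta(2.0, 5.0))   # observe one more disturbance
    A = alpha_hat(xi)                        # estimated discount factors
    values = c + A * (Q @ V)                 # one-step values, cf. (50)
    f_n = np.argmin(values, axis=1)          # minimizer f_n, cf. (53)
    V = np.min(values, axis=1)               # V_n
print(V, f_n)   # V_n approximates V, and {f_n} induces the strategy pi-hat
```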
Before stating the result, we need to impose the following technical requirement.
Assumption 11
There exist positive constants \(d_{0}<\infty ,\) \(\beta _{0}<1,\) and \(p>1\) such that for all \((x,a)\in \mathbb {K}_{A},\)
Remark 12
-
(a)
Applying Jensen’s inequality to (54) we have, for all \((x,a)\in \mathbb {K}_{A},\)
$$\begin{aligned} \int \limits _{\mathbb {X}}W(y)Q(\mathrm{d}y|x,a)\le \beta ^{\prime }W(x)+d^{\prime }, \end{aligned}$$(55)where \(\beta ^{\prime }=\beta _{0}^{1/p}\) and \(d^{\prime }=d_{0}^{1/p}.\) Moreover, as a consequence of both inequalities (54) and (55) we have
$$\begin{aligned} \sup _{n\in \mathbb {N}_{0}}E_{x}^{\pi }\left[ W^{p}\left( x_{n}\right) \right] <\infty \end{aligned}$$(56)and
$$\begin{aligned} \sup _{n\in \mathbb {N}_{0}}E_{x}^{\pi }\left[ W\left( x_{n}\right) \right] <\infty . \end{aligned}$$(57)Indeed, first note that from (54)
$$\begin{aligned} E_{x}^{\pi }\left[ W^{p}(x_{n})\right] \le \beta _{0}E_{x}^{\pi }\left[ W^{p}(x_{n-1})\right] +d_{0},\ \ n\in \mathbb {N}. \end{aligned}$$Then, iterating this inequality and using the fact \(\beta _{0}<1\) we obtain
$$\begin{aligned} E_{x}^{\pi }\left[ W^{p}(x_{n})\right] \le \beta _{0}^{n}W^{p}(x)+\left( 1+\beta _{0}+\cdots +\beta _{0}^{n-1}\right) d_{0}\le W^{p}(x)+d_{0}/(1-\beta _{0}), \end{aligned}$$which, in turn, implies (56). Similarly, (57) follows from (55).
-
(b)
In addition, since \(W(\cdot )\ge 1,\) observe that if \(\left( \beta _{0}^{1/p}+d_{0}^{1/p}\right) \alpha ^{*}<1,\) the relation (55) implies Assumption 2(f).
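The unrolled bound in Remark 12(a) can be checked numerically; the constants below are illustrative assumptions.

```python
# Unrolling the drift inequality (54), m_n <= beta0 * m_{n-1} + d0 with
# m_n = E[W^p(x_n)], gives m_n <= beta0**n * m_0 + d0 / (1 - beta0).
beta0, d0, m0 = 0.7, 2.0, 10.0        # assumed beta0 < 1; m0 = W^p(x)
m = m0
for n in range(1, 50):
    m = beta0 * m + d0                # worst case allowed by (54)
    assert m <= beta0**n * m0 + d0 / (1 - beta0) + 1e-12
print(m)                              # settles near d0 / (1 - beta0)
```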
Theorem 13
Under Assumptions 2, 7, and 11 the strategy \(\hat{\pi }\) is asymptotically optimal.
The proof of Theorem 13 is based on the following characterization of asymptotic optimality, which is an adaptation of Theorem 4.6.2 in Hernández-Lerma and Lasserre (1996).
Lemma 14
Under Assumption 2, a policy \(\pi \in \Pi _{A}\) is asymptotically optimal for the control model \(\mathcal {M}\) if for all \(x\in \mathbb {X}\),
where
Proof
First observe the following facts. For each \(x\in \mathbb {X},\) \(\pi \in \Pi _{A},\) and \(n\in \mathbb {N},\) from Assumptions 2(e), (f), and relations (28) and (48),
Hence, because \(\alpha ^{*}b\in (0,1),\)
In addition, since \(\Phi \) is nonnegative (see (47)), for each \(x\in \mathbb {X}\) and \(\pi \in \Pi _{A},\)
Now, for each \(x\in \mathbb {X},\ \pi \in \Pi _{A},\) and \(t\in \mathbb {N},\)
where \(h_{t}\) is the history of the system up to time \(t.\) Then, from (18), (19), (60), and applying properties of conditional expectation, for each \(n\ge t,\) \(x\in \mathbb {X},\) and \(\pi \in \Pi _{A},\)
where the last equality follows from (58). Finally, (61), Definition 1 and (59) yield the desired result. \(\square \)
Proof of Theorem 13
According to Lemma 14, to prove the theorem it is sufficient to show
To this end, for each \(n\in \mathbb {N}\), we define the function \(\Phi _{n}:\) \(\mathbb {K}_{A}\rightarrow \mathbb {R}\) as
Thus, from (53)
Now, if \(\left\{ (x_{n},a_{n})\right\} \) is the sequence of state–action pairs corresponding to the application of the strategy \(\hat{\pi },\) observe that from (62)
where
Hence, the remainder of the proof consists in proving
By adding and subtracting the term \(\alpha \left( x,a,\hat{\theta }_{n}\right) \int _{ \mathbb {X}}V\left( y\right) Q(\mathrm{d}y|x,a),\) using (25), (26), and (48), it is easy to see that, for each \(n\in \mathbb {N},\)
Thus, from Lemma 8 and Proposition 10
Note that from the property (69) below we also have that
Furthermore, observe that \(\sup _{n\in \mathbb {N}}\mathcal {L}_{n}\le L_{1}<\infty \) for some constant \(L_{1},\) and from (64) we have the convergence in probability
Also, from (56)
This implies (see, for instance, Lemma 7.6.9 in Ash (1972)) that \(\{W(x_{n})\mathcal {L}_{n}\}\) is \(P_{x}^{\hat{\pi }}\)-uniformly integrable.
On the other hand, for arbitrary positive numbers \(m_{1}\) and \(m_{2},\) we have,
which implies, from Chebyshev’s inequality, that
This relation, together with (65), yields the convergence in probability
Therefore, since \(\{W(x_{n})\mathcal {L}_{n}\}\) is \(P_{x}^{\hat{\pi }}\) -uniformly integrable, we obtain (63) which implies the asymptotic optimality of the strategy \(\hat{\pi }.\) \(\square \)
6.3 Sufficiency of Markov strategies
We conclude by proving relations (15) and (17). To this end, we will use the following well-known properties of the probability measure \(P_{x}^{\pi }\) (see, e.g., Hernández-Lerma and Lasserre 1996, 1999). For each \(\pi =\left\{ \pi _{n}\right\} \in \Pi _{\mathbb {A}}\) and \( x\in \mathbb {X},\)
where \(\delta _{x}\) is the Dirac measure concentrated at \(x\) and \( h_{n}=(x_{0},a_{0},\xi _{1},\ldots ,x_{n-1},a_{n-1},\xi _{n},x_{n})\in \mathbb {H} _{n}:=(\mathbb {K}_{A}\times S)^{n}\times \mathbb {X}\) for \(n\in \mathbb {N}.\)
Lemma 15
For each \(\pi \in \Pi _{\mathbb {A}}\) there exists \( \varphi \in \Pi _{\mathbb {A}}^{M}\) such that
Proof
For each \(\pi \in \Pi _{\mathbb {A}}\), \(x\in \mathbb {X},\) and \(n\in \mathbb {N}_{0},\) we define the finite measures \( M_{x,n}^{\pi }\) on \(\mathbb {X}\times \mathbb {A}\) and \(m_{x,n}^{\pi }\) on \( \mathbb {X}\) as
and
Observe that \(m_{x,n}^{\pi }\) is the marginal of \(M_{x,n}^{\pi }\) on \( \mathbb {X}\). Then, by Corollary 7.27.2 in Bertsekas and Shreve (1978), there exists a stochastic kernel \(\varphi _{n}\) on \(\mathbb {A}\) given \(\mathbb {X}\) such that, for \(X\in \mathcal {B}(\mathbb {X})\) and \(A\in \mathcal {B}(\mathbb {A }),\)
Since \(M_{x,n}^{\pi }\) is concentrated on \(\mathbb {K}_{A}\), we can select versions of \(\varphi _{n}\), \(n\in \mathbb {N}_{0},\) such that \(\varphi _{n}(A(y)|y)=1,\) for \(y\in \mathbb {X}.\) Thus, \(\varphi :=\left\{ \varphi _{n}\right\} \in \Pi _{\mathbb {A}}^{M}.\) Therefore, to prove the lemma, it is enough to prove the equality
Indeed, first note that from (70), applying standard arguments from integration theory, such as linearity and the monotone convergence theorem, we can obtain
for any measurable function \(g:\mathbb {X}\times \mathbb {A}\rightarrow \mathbb {R}.\) Then, from (73) and (74) with \(g=c,\) we get
which, from (13) proves the lemma.
We then proceed to prove (73) by induction. First observe that from (66)
Now we assume that (73) holds for some \(n\in \mathbb {N}_{0}.\) Then, using properties of conditional expectation, from (68) and (69),
Then, taking \(g(y,a)=Q\left( X|y,a\right) \int \nolimits _{S}\tilde{\alpha } (y,a,s)\theta (\mathrm{d}s),\) from (74) we have, for each \(\pi \in \Pi _{ \mathbb {A}},\)
In particular, letting \(\pi =\varphi \), we obtain
which, together with (75), yields
Now we use this fact to prove (73). From (72) and (76),
On the other hand, observe that, similar to (74), we have that for each \(\pi \in \Pi _{\mathbb {A}}\) and \(n\in \mathbb {N}_{0},\)
for any measurable function \(h:\mathbb {X}\rightarrow \mathbb {R}.\) Then, from (70) and (67),
Taking \(h(y)=1_{\left\{ y\in X\right\} }\int \nolimits _{A}\varphi _{n+1}(\mathrm{d}a|y),\) from (78) we get
Thus, (77) and (79) yield (73), which proves the lemma. \(\square \)
Lemma 16
For each \(x\in \mathbb {X}\) and \(\varphi \in \Pi _{\mathbb {A }}^{M}\), the relation (17) holds, that is
where \(\Gamma _{n}=\prod \nolimits _{k=0}^{n-1}\alpha (x_{k},a_{k},\theta )\) if \(n\in \mathbb {N},\) and \(\Gamma _{0}=1.\)
Proof
The relation (17) is a consequence of the properties (66)–(69) of the probability measure \(P_{x}^{\varphi }\) and Lemma 15. Indeed, for each \(x\in \mathbb {X}\) and \( \varphi \in \Pi _{\mathbb {A}}^{M}\),
Applying similar arguments, it is easy to see that, for \(n=2,3,\ldots ,\)
which implies (17). \(\square \)
7 Example
Besides problems with a constant discount factor, our theory also covers some particular cases with non-constant discount factors, for instance, systems with state-dependent discount factors and problems whose discount factor is random but independent of the state–action process; see Wei and Guo (2011) and González-Hernández et al. (2008, 2013, 2014). Application examples corresponding to each particular case are introduced in these works. We now present further examples which satisfy our hypotheses.
7.1 A cash-balance model
We consider a simple discrete-time cash-balance model introduced in Hordijk and Yushkevich (1999) (see also Wei and Guo 2011). The problem consists in controlling the level of a firm's cash balance to meet its demand for cash at minimum total discounted cost.
We define the following variables:
-
\(x_{n}\) is the cash balance at time \(n;\)
-
\(a_{n}\) is a cash withdrawal of size \(-a_{n}\) (if \(a_{n}<0\)) or a cash supply of size \(a_{n}\) (if \(a_{n}>0\)), at time \(n;\)
-
\(w_{n}\) is the demand for cash during the stage \(n.\) A positive demand means cash outflow and a negative demand means a cash inflow.
Then, the cash-balance process \(\left\{ x_{n}\right\} \) evolves on the state space \(\mathbb {X}=\mathbb {R}\) according to the recursive equation
We assume that \(\left\{ w_{n}\right\} \) is a sequence of i.i.d. random variables with standard normal distribution
and furthermore the set of admissible actions when the cash balance is \(x\in \mathbb {R}\) is \(A(x)=\left[ -\left| x\right| ,\left| x\right| \right] .\) The one-stage cost function is an arbitrary l.s.c. function on
such that
for some positive constant \(c_{0}.\)
Under these conditions, clearly Assumptions 2(a) and (d) are satisfied with
To verify Assumptions 2(b) and (c), let us, first, note that the transition law (6) takes the form
Hence, from the continuity of the density \(\rho \) as well as of the function \((x,a,w)\rightarrow x+a+w\), it is easy to prove that the functions
and
are continuous on \(\mathbb {K}_{A}\), for each continuous and bounded function \(u:\mathbb {R}\rightarrow \mathbb {R}\) [see, e.g., Examples C.6 and C.8, Appendix C in Hernández-Lerma and Lasserre (1996)], which yield Assumptions 2(b) and (c).
We will now proceed to verify Assumptions 2(e) and (f). Let us assume that the random disturbance process \(\left\{ \xi _{n}\right\} \) is a sequence of independent random variables taking values on \(S:=[0,1]\) with unknown distribution \(\theta \in \Theta =\mathbb {P}(S).\) In addition, the discount factor function is defined as
for a constant \(\gamma >4.\) Hence, clearly
which implies that Assumption 2(e) holds.
Furthermore, from (81) and (82), observe that
Now, because \(A(x)=\left[ -\left| x\right| ,\left| x\right| \right] ,\) it is easy to see that
Thus, from (84), Assumption 2(f) is satisfied with \(b=4.\)
Therefore, from Theorem 5, there exists a minimax strategy.
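A minimal simulation of the cash-balance dynamics \(x_{n+1}=x_{n}+a_{n}-w_{n}\) under the admissibility constraint \(A(x)=[-|x|,|x|]\) is sketched below. The policy, the one-stage cost \(c(x,a)=|x|\), and the constant discount stand-in are illustrative assumptions, not the optimal solution of the model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Cash-balance dynamics x_{n+1} = x_n + a_n - w_n with w_n standard normal and
# admissible actions A(x) = [-|x|, |x|].  The heuristic policy pushes the balance
# halfway toward 0; it is admissible since |0.5 x| <= |x|.
def policy(x):
    return float(np.clip(-0.5 * x, -abs(x), abs(x)))

x, total_cost, discount = 5.0, 0.0, 1.0
for n in range(100):
    a = policy(x)
    total_cost += discount * abs(x)       # accumulate discounted cost, c(x, a) = |x|
    discount *= 0.9                       # stand-in for the factor alpha(x_n, a_n, xi_{n+1})
    x = x + a - rng.standard_normal()     # demand w_n realizes, balance moves
print(total_cost)
```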
7.2 An autoregressive control model
We consider a controlled process of the form
\(x_{0}\) given, with state space \(\mathbb {X}=[0,\infty )\) and compact action set \(A(x)=\mathbb {A}\subset \mathbb {R},\) \(x\in \mathbb {X},\) where \(G:\mathbb {A} \rightarrow (0,\lambda ]\) is a given measurable function with \(\lambda <1.\) The random disturbance process \(\left\{ w_{n}\right\} \) is formed by i.i.d. and nonnegative random variables with a continuous density \(\rho \) and finite expectation, say \(E\left[ w_{0}\right] =\bar{w}<\infty .\) Hence, the transition law is
In addition, the one-stage cost \(c\) is an arbitrary l.s.c. function on \( \mathbb {K}_{A}=\left\{ (x,a):x\ge 0,\ a\in \mathbb {A}\right\} \) such that
for some constants \(\bar{b}>1\) and \(p>1.\)
Similarly to the previous example, defining \(W(x):=(x+\bar{b})^{1/p}\) we have that Assumptions 2(a)–(d) hold. Moreover, for all \( (x,a)\in \mathbb {K}_{A},\)
which implies Assumption 11 with \(\beta _{0}:=\lambda \) and \( d_{0}:=\bar{b}+\bar{w}.\) From Remark 12(a), we also have
We again assume that the discount factor function is as (83), but where the constant \(\gamma \) satisfies
In addition, \(\xi _{n},\) \(n\ge 0,\) are i.i.d. random variables taking values on \(S=[0,1]\) with unknown distribution \(\theta \in \mathbb {P}(S).\) Then, under these conditions, we have
and, because \(W(\cdot )\ge 1,\) from (85) we get
Therefore, defining \(b:=(\lambda ^{1/p}+\left( \bar{b}+\bar{w}\right) ^{1/p}),\) (86) and (87) yield Assumptions 2(e) and (f).
Finally, observe that the derivative of \(\tilde{\alpha }\) with respect to \(s\) satisfies
This fact implies that the family of functions
is equi-Lipschitz, and therefore equicontinuous on \(S.\) Then Assumption 7 is satisfied.
Therefore, from Theorem 13, there exists an asymptotically optimal strategy.
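For the autoregressive model, the drift condition of Assumption 11 can be checked by Monte Carlo: with assumed choices of \(G\), \(p\), \(\bar{b}\), and an exponential disturbance law (all illustrative, not prescribed by the paper), the moments \(E[W^{p}(x_{n})]\) remain bounded, consistent with (56).

```python
import numpy as np

rng = np.random.default_rng(3)

# Autoregressive dynamics x_{n+1} = G(a_n) x_n + w_n with G: A -> (0, lambda],
# lambda < 1, and nonnegative i.i.d. disturbances.  G, p, b_bar, the exponential
# disturbance law, and the constant action are illustrative assumptions.
lam, p, b_bar = 0.8, 2.0, 2.0
G = lambda a: lam / (1.0 + a * a)            # values in (0, lam]

def W(x):                                     # Lyapunov function W(x) = (x + b_bar)^(1/p)
    return (x + b_bar) ** (1.0 / p)

x = np.full(20000, 5.0)                       # many trajectories, x_0 = 5
for n in range(50):
    w = rng.exponential(1.0, size=x.shape)    # w_bar = E[w_0] = 1
    x = G(0.0) * x + w                        # constant action a_n = 0 for the sketch
# Drift check (54): E[W^p(x_{n+1})] <= lam * E[W^p(x_n)] + (b_bar + w_bar),
# so E[W^p(x_n)] stays bounded, consistent with (56).
print(np.mean(W(x) ** p))
```

Here \(W^{p}(x)=x+\bar{b}\), so the printed moment stabilizes near \(E[x_{n}]+\bar{b}\), well below the bound \(W^{p}(x_{0})+d_{0}/(1-\beta _{0})\) from Remark 12(a).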
References
Altman E (1999) Constrained Markov decision processes. Chapman and Hall, London
Ash RB (1972) Real analysis and probability. Academic Press, New York
Bertsekas DP, Shreve SE (1978) Stochastic optimal control: the discrete-time case. Academic Press, New York
Bertsekas DP (1987) Dynamic programming: deterministic and stochastic models. Prentice-Hall, Englewood Cliffs
Borkar VS (1988) A convex analytic approach to Markov decision processes. Probab Theory Relat Fields 78:583–602
Brigo D, Mercurio F (2007) Interest rate models: theory and practice. Springer, New York
Carmon Y, Shwartz A (2009) Markov decision processes with exponentially representable discounting. Oper Res Lett 37:51–55
Dynkin EB, Yushkevich AA (1979) Controlled Markov processes. Springer, New York
Feinberg EA, Shwartz A (1994) Markov decision models with weighted discounted criteria. Math Oper Res 19:152–168
Feinberg EA, Shwartz A (1995) Constrained Markov decision models with weighted discounted rewards. Math Oper Res 20:302–320
Feinberg EA, Shwartz A (1999) Constrained dynamic programming with two discount factors: applications and an algorithm. IEEE Trans Autom Control 44:628–631
González-Hernández J, López-Martínez RR, Pérez-Hernández R (2007) Markov control processes with randomized discounted cost in Borel space. Math Methods Oper Res 65:27–44
González-Hernández J, López-Martínez RR, Minjárez-Sosa JA (2008) Adaptive policies for stochastic systems under a randomized discounted criterion. Bol Soc Mat Mex 14:149–163
González-Hernández J, López-Martínez RR, Minjárez-Sosa JA (2009) Approximation, estimation and control of stochastic systems under a randomized discounted cost criterion. Kybernetika 45:737–754
González-Hernández J, López-Martínez RR, Minjárez-Sosa JA, Gabriel-Arguelles JR (2013) Constrained Markov control processes with randomized discounted cost criteria: occupation measures and extremal points. Risk Decis Anal 4:163–176
González-Hernández J, López-Martínez RR, Minjárez-Sosa JA, Gabriel-Arguelles JR (2014) Constrained Markov control processes with randomized discounted rate: infinite linear programming approach. Optim Control Appl Methods 35:575–591
González-Trejo TJ, Hernández-Lerma O, Hoyos-Reyes LF (2003) Minimax control of discrete-time stochastic systems. SIAM J Control Optim 41:1626–1659
Gordienko EI, Minjárez-Sosa JA (1998) Adaptive control for discrete-time Markov processes with unbounded costs: discounted criterion. Kybernetika 34:217–234
Heath D, Jarrow R, Morton A (1992) Bond pricing and the term structure of interest rates: a new methodology. Econometrica 60:77–105
Hernández-Lerma O (1989) Adaptive Markov control processes. Springer, New York
Hernández-Lerma O, Lasserre JB (1996) Discrete-time Markov control processes: basic optimality criteria. Springer, New York
Hernández-Lerma O, Lasserre JB (1999) Further topics on discrete-time Markov control processes. Springer, New York
Hernández-Lerma O, González-Hernández J (2000) Constrained Markov control processes in Borel spaces: the discounted case. Math Methods Oper Res 52:271–285
Hilgert N, Minjárez-Sosa JA (2001) Adaptive policies for time-varying stochastic systems under discounted criterion. Math Methods Oper Res 54:491–505
Hilgert N, Minjárez-Sosa JA (2006) Adaptive control of stochastic systems with unknown disturbance distribution: discounted criteria. Math Methods Oper Res 63:443–460
Hinderer K (1970) Foundations of non-stationary dynamic programming with discrete time parameter. In: Lecture Notes in Operations Research and Mathematical Systems, vol 33. Springer, Berlin
Hordijk A, Yushkevich AA (1999) Blackwell optimality in the class of all policies in Markov decision chains with Borel state space and unbounded rewards. Math Methods Oper Res 50:421–448
Iyengar GN (2005) Robust dynamic programming. Math Oper Res 30:257–280
Jaskiewicz A, Nowak AS (2011) Stochastic games with unbounded payoffs: applications to robust control in economics. Dyn Games Appl 1:253–279
López-Martínez RR, Hernández-Lerma O (2003) The Lagrange approach to constrained Markov processes: a survey and extension of results. Morfismos 7:1–26
Mandl P (1974) Estimation and control in Markov chains. Adv Appl Probab 6:40–60
Piunovskiy AB (1997) Optimal control of random sequences in problems with constraints. Kluwer, Dordrecht
Puterman ML (1994) Markov decision processes: discrete stochastic dynamic programming. Wiley, New York
Ranga Rao R (1962) Relations between weak and uniform convergence of measures with applications. Ann Math Stat 33:659–680
Rieder U (1978) Measurable selection theorems for optimization problems. Manuscripta Math 24:115–131
Schäl M (1975) Conditions for optimality and for the limit on \(n\)-stage optimal policies to be optimal. Z Wahrs Verw Gebiete 32:179–196
Vasicek O (1977) An equilibrium characterisation of the term structure. J Financ Econ 5:177–180
Schäl M (1987) Estimation and control in discounted stochastic dynamic programming. Stochastics 20:51–71
Wei Q, Guo X (2011) Markov decision processes with state-dependent discount factors and unbounded rewards/costs. Oper Res Lett 39:369–374
Acknowledgments
Work supported by Consejo Nacional de Ciencia y Tecnología (CONACYT, MEXICO) under Grant CB2010/154612.
Cite this article
Minjárez-Sosa, J.A. Markov control models with unknown random state–action-dependent discount factors. TOP 23, 743–772 (2015). https://doi.org/10.1007/s11750-015-0360-5
Keywords
- Discounted optimality
- Non-constant discount factors
- Estimation and control procedures
- Minimax control systems