1 Introduction

This paper deals with a class of discrete-time controlled stochastic systems composed of a large number N of interacting objects which share a common environment. Denoting by \(X_{n}^{N}(t)\) the state of object n at time t, its evolution is determined by a difference equation, homogeneous in N, of the form

$$\begin{aligned} X_{n}^{N}(t+1)=F\left( X_{n}^{N}(t),C^{N}(t),a_{t},\xi _{t}\right) ,\ \ t=0,1,..., \end{aligned}$$
(1.1)

where F is a known function, \(C^{N}(t)\) is the context of the environment, \(a_{t}\) is the control or action selected by a central controller, and \(\xi _{t}\) is the random disturbance. It is assumed that \(\left\{ \xi _{t}\right\} \) is an observable sequence of independent and identically distributed random variables with a density \(\rho \) which is unknown to the controller. In addition, at each stage, a cost resulting from the movement of the objects and the selected control is incurred. Accordingly, we propose a suitable Markov control model to study this class of systems, in which the controller aims to select actions that minimize a given discounted cost criterion.

The fact that N is very large (\(N\sim \infty \)), together with the lack of knowledge of the density \(\rho \), leads us to formulate an alternative scheme to analyze the corresponding optimal control problem. Indeed, our approach is framed in the context of mean field theory, under which, instead of analyzing a single object, we focus on the number or proportion of objects occupying each state at each stage. This defines a control model \(\mathcal {M}_{N}\) whose states are precisely the proportions of objects, evolving according to a suitable stochastic difference equation depending on N. Then, by taking the limit as N goes to infinity, we obtain the so-called mean field control model \(\mathcal {M}\), whose states are probability measures, arising as limits of the aforementioned proportions, which in turn satisfy a deterministic difference equation. In this way \(\mathcal {M}\) can be considered as an approximating model for \(\mathcal {M}_{N},\) in the sense that any optimal control policy \(\pi ^{*}\) associated to \(\mathcal {M}\) can be used to control the original process (the \(N-\)system) on \(\mathcal {M}_{N}\), and the objective is therefore to measure its optimality deviation. Clearly, the good performance of \(\pi ^{*}\) on \(\mathcal {M}_{N}\) depends on the accuracy with which the mean field model \(\mathcal {M}\) approximates \(\mathcal {M}_{N}\) as \(N\rightarrow \infty .\)

Because the dynamics of the objects depend on the unknown density \(\rho \), the mean field process depends on \(\rho \) as well. Thus, besides the analysis of the limit behavior as \(N\rightarrow \infty \), the controller must implement a statistical estimation procedure for \(\rho \) in order to obtain some information about the dynamics of the objects. To this end, at each stage t, \(\rho \) is estimated from the historical observations \(\xi _{0},\xi _{1},\ldots ,\xi _{t},\) collected during the evolution of the system. This estimation procedure is then combined with the minimization task to obtain control policies. However, as is well known, the discounted criterion depends strongly on the decisions selected in the first stages, precisely where the information on \(\rho \) is rather poor or deficient. This fact implies that, in general, under discounted criteria, estimation and control procedures do not provide optimal policies (see, e.g., [10, 11, 13, 19]). Thus, in this paper we seek optimality results in a weaker sense: the so-called eventual asymptotic optimality.

In recent years, mean field theory has become a useful tool to study systems composed of a large number of objects (particles or agents) under several scenarios: discrete- and continuous-time systems of interacting objects, mean field control problems, and mean field games; all of them under different optimality criteria, and with applications, for instance, in statistical physics, finance, and operations research, among others—see, e.g., [1–3, 5–9, 14–18, 20] and the references therein.

In particular, the motivation for our results comes from the work [7], in which the authors consider the dynamics of each object to be represented by a known stochastic kernel K that depends on both the environment and the actions selected by the controller. The main purpose of that work is to study, among other things, the speed of convergence of the N-system as \(N\rightarrow \infty \), as well as to obtain bounds for the gap between the cost of the N-system and the corresponding cost associated to the mean field model. In contrast, in this paper we assume that the N-system is modeled by the stochastic difference equation established in (1.1), where the density of the random disturbances is unknown to the controller. This constitutes the main feature of our model and the novelty of our paper. That is, our approach consists in analyzing estimation and control schemes on the mean field model \(\mathcal {M}\), and then studying the optimality, on the model \(\mathcal {M}_{N}\), of the resulting policies. Hence, through a joint analysis of the mean field limit (\(N\rightarrow \infty \)) and of the estimation process (\(t\rightarrow \infty \)), we construct control policies that are nearly optimal for the control model \(\mathcal {M}_{N}\), in an asymptotic sense as N goes to infinity. Such policies are called eventually asymptotically optimal policies.

The paper is organized as follows. In Sect. 2 we present the system of N objects together with its corresponding Markov control model, whereas Sect. 3 is devoted to the mean field control model we are concerned with. In both sections we provide optimality results ensuring the existence of minimizers based on the dynamic programming method. In Sect. 4 we introduce the estimation and control procedure in the mean field model to construct control policies. Finally, we conclude, in Sect. 5, with the analysis of the mean field convergence, providing, among other facts, the so-called eventually asymptotically optimal policies. Throughout the paper, we develop a class of consumption-investment problems to illustrate our assumptions and results.

Notation As usual, \(\mathbb {N}\) (respectively \(\mathbb {N}_{0})\) denotes the set of positive (resp. nonnegative) integers; \(\mathbb {R}\) (resp. \(\mathbb {R}_{+})\) denotes the set of real (resp. nonnegative real) numbers.

On the other hand, given a Borel space Z (that is, a Borel subset of a complete and separable metric space), its Borel \(\sigma -\)algebra is denoted by \(\mathcal {B}(Z),\) and the attribute “measurable” will be applied to either Borel measurable sets or Borel measurable functions.

Let \(\mathbb {M}(Z)\) be the set of finite signed measures on Z. If \(Z\subset \mathbb {R}\) is finite, e.g. \(Z=\left\{ 1,2,...,z\right\} \), we will identify any \(p\in \mathbb {M}(Z)\) with the vector \(p:=(p(1),p(2),...,p(z))\). In particular, consider \(\mathbb {P}(Z)\subset \mathbb {M}(Z)\), the set of probability measures on Z. In this case, any \(p\in \mathbb {P}(Z)\) can be expressed in terms of its probability distribution \(\{p(i):i\in Z\}\), where \(p(i)\ge 0,\) \(i\in Z,\) and \(\sum _{i=1}^{z}p(i)=1\). Observe that, under the topology of \(\mathbb {R}\), \(Z=\{1,2,\cdots ,z\}\) becomes a Borel set, and so is \(\mathbb {P}(Z)\). As usual, \(|\cdot |\) will denote the norm on \(\mathbb {R}\).

For Z finite, we endow \(\mathbb {M}(Z)\times \mathbb {R}^{d}\) with the corresponding \(L_{\infty }\) norm; that is, for each vector \((p,c)\in \mathbb {M}(Z)\times \mathbb {R}^{d}\):

$$\begin{aligned} \left\| (p,c)\right\| _{\infty }:=\max \left\{ \left\| p\right\| _{\infty }^{1},\left\| c\right\| _{\infty }^{2}\right\} , \end{aligned}$$

where \(\left\| p\right\| _{\infty }^{1}:=\max \left\{ |p(1)|,...,|p(z)|\right\} \), and \(\left\| c\right\| _{\infty }^{2} :=\max \left\{ \left| c_{1}\right| ,...,\left| c_{d}\right| \right\} ,\) with \(c:=(c_{1},\cdots ,c_{d})\). Furthermore, for a given Borel space A, \(d_{A}\) will represent its associated metric. For all \((p,c,a),(p^{\prime },c^{\prime },a^{\prime })\in \mathbb {P}(Z)\times \mathbb {R}^{d}\times A\) the corresponding \(L_{\infty }-\)distance takes the form

$$\begin{aligned} \left\| (p,c,a)-(p^{\prime },c^{\prime },a^{\prime })\right\| _{\infty } ^{3}:=\max \left\{ \left\| p-p^{\prime }\right\| _{\infty }^{1},\left\| c-c^{\prime }\right\| _{\infty }^{2},d_{A}(a,a^{\prime })\right\} , \end{aligned}$$

whereas for a matrix \(A_{n\times n}\), we will denote its corresponding norm \(\Vert \cdot \Vert _{\infty }^{0}\) as

$$\begin{aligned} \Vert A\Vert _{\infty }^{0}:=\max _{i,j}|A_{ij}|. \end{aligned}$$
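All four norms above are coordinatewise maxima, so they are straightforward to compute; a minimal sketch (the function names are ours, not the paper's):

```python
import numpy as np

def norm_inf_1(p):
    """||p||_inf^1: max absolute component of a signed measure on a finite Z."""
    return float(np.max(np.abs(p)))

def norm_inf_2(c):
    """||c||_inf^2: max absolute coordinate of a vector in R^d."""
    return float(np.max(np.abs(c)))

def norm_inf_pair(p, c):
    """||(p, c)||_inf: the maximum of the two norms above."""
    return max(norm_inf_1(p), norm_inf_2(c))

def norm_inf_0(A):
    """||A||_inf^0: entrywise max absolute value of a matrix."""
    return float(np.max(np.abs(A)))
```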

Let Z and A be Borel spaces. A stochastic kernel \(Q\left( \cdot |\cdot \right) \) is a function \(Q:\mathcal {B}(Z)\times A\rightarrow [0,1]\), such that \(Q\left( \cdot |a\right) \) is a probability measure on \(\mathcal {B}(Z)\) for each fixed \(a\in A,\) and \(Q\left( B|\cdot \right) \) is a measurable function on A for each fixed \(B\in \mathcal {B}(Z).\) Finally, \(\mathbb {B}(Z)\) denotes the class of real-valued bounded functions on Z endowed with the supremum norm \(\left\| v\right\| :=\sup _{z\in Z}\left| v(z)\right| ,\) while \(\mathbb {C}_{b}(Z)\) is the subspace of \(\mathbb {B}(Z)\), consisting of all real-valued bounded continuous functions defined on Z.

We assume the existence of a fixed probability space \((\Omega ,\mathcal {F},P)\); the attribute a.s. means almost surely with respect to P.

2 The N-Objects Markov Control Model

We consider a discrete-time controlled stochastic system composed of a large number N of interacting objects defined as follows. Let \(X_{n}^{N}(t)\), \(n=1,2,\ldots ,N\), \(t\in \mathbb {N}_{0}\), be the state of object n at time t, taking values in a given set \(S=\left\{ 1,2,\ldots ,s\right\} \subseteq \mathbb {N}\). There is a controller (or decision-maker) who, at each stage, can influence the behavior of the objects by means of actions or controls \(a_{t}\) selected from a given Borel set A. Moreover, the objects are assumed to share a common environment which also influences the behavior of the system. Let \(C^{N}(t)\in \mathbb {R}^{d}\) be the context of the environment at time \(t\in \mathbb {N}_{0}.\) Once the environment is specified, the behavior as well as the evolution of the objects are considered to be independent of each other. More specifically, the evolution of the process \(\left\{ X_{n}^{N}(t)\right\} _{t\in \mathbb {N}_{0}}\) is given according to the stochastic difference equation, homogeneous in N, defined in (1.1); that is,

$$\begin{aligned} X_{n}^{N}(t+1)=F\left( X_{n}^{N}(t),C^{N}(t),a_{t},\xi _{t}\right) ,\ \ t=0,1,\ldots , \end{aligned}$$
(2.1)

where \(F:S\times \mathbb {R}^{d}\times A\times \mathbb {R}\rightarrow S\) is a given (known) function and \(\left\{ \xi _{t}\right\} \) is a sequence of independent and identically distributed (i.i.d.) real random variables, defined on the underlying probability space \((\Omega ,\mathcal {F},P)\), with a common density \(\rho \) which is unknown to the controller. As a consequence of the above definitions, it is possible to define the transition law \(K_{\rho }\) of each object in terms of the function F, as follows: for all \(n=1,2,\ldots ,N\),

$$\begin{aligned} K_{ij}^{\rho }(a,c)&:=P \Big [ X_{n}^{N}(t+1)=j \vert X_{n} ^{N}(t)=i,a_{t}=a,C^{N}(t)=c \Big ] \nonumber \\&=\int _{\mathbb {R}}I{_{j}\left[ F(i,c,a,z)\right] \rho (z)dz},\ \ i,j\in S,\ (a,c)\in A\times \mathbb {R}^{d}. \end{aligned}$$
(2.2)

Here \(I_{B}\) stands for the indicator function of the set B. This relation defines the transition law by means of the stochastic kernel \(K_{\rho }=K_{\rho }(a,c)=\left[ K_{ij}^{\rho }(a,c)\right] .\) Notice that \(K_{\rho }\) represents the common conditional distribution of the states.
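Numerically, each entry of \(K_{\rho }\) in (2.2) is the \(\rho \)-mass of the set \(\{z:F(i,c,a,z)=j\}\), and can be approximated by a Riemann sum over a grid of z values. The sketch below uses an illustrative toy dynamics F and a standard normal density as a stand-in for the (in the paper, unknown) \(\rho \):

```python
import numpy as np

S_MAX = 5                               # toy state space S = {0, ..., 5}

def F(i, c, a, z):
    """Illustrative S-valued dynamics (not the paper's F)."""
    return int(np.clip(np.rint(i + a + z), 0, S_MAX))

def rho(z):
    """Stand-in density: standard normal."""
    return np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)

def kernel(a, c, z_grid=np.linspace(-6.0, 6.0, 4001)):
    """Approximate K_ij(a, c) = ∫ 1{F(i,c,a,z) = j} rho(z) dz by a Riemann sum."""
    dz = z_grid[1] - z_grid[0]
    K = np.zeros((S_MAX + 1, S_MAX + 1))
    for i in range(S_MAX + 1):
        for z in z_grid:
            K[i, F(i, c, a, z)] += rho(z) * dz   # mass of {z : F(i,c,a,z) = j}
    return K
```

Each row of the returned matrix is, up to discretization error, a probability distribution over S, i.e., a row of the stochastic kernel \(K_{\rho }(a,c)\).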

Throughout this work it is assumed that the objects are observable only through their states, so that the controller can only determine the number of objects in each of the states \(i\in S\). Accordingly, the behavior of the system can be reformulated by means of the proportions of objects in each state. Namely, let \(M_{i}^{N}(t)\) be the proportion of objects in state \(i\in S\) at time t, defined as

$$\begin{aligned} M_{i}^{N}(t):=\frac{1}{N}\sum \limits _{n=1}^{N}I_{\left\{ X_{n}^{N} (t)=i\right\} },\ i\in S. \end{aligned}$$

Further, we denote by \(\vec {M}^{N}(t)\) the vector whose components are the proportions; that is,

$$\begin{aligned} \vec {M}^{N}(t)=\left( M_{1}^{N}(t),M_{2}^{N}(t),\ldots ,M_{s}^{N}(t)\right) . \end{aligned}$$

Observe that \(\vec {M}^{N}(t)\in \mathbb {P}_{N}(S):=\{p\in \mathbb {P} (S):Np(i)\in \mathbb {N},\ \forall i\in S\}\subset \mathbb {P}(S)\), and it is easy to see that \(\mathbb {P}_{N}(S)\) is a finite set.
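The proportion vector can be computed directly from the array of individual states; a minimal sketch:

```python
import numpy as np

def proportions(states, s):
    """M_i^N = (1/N) * #{n : X_n = i}, for the states i in S = {1, ..., s}."""
    states = np.asarray(states)
    counts = np.array([(states == i).sum() for i in range(1, s + 1)])
    return counts / len(states)

# Six objects with states in {1, 2, 3}: proportions (2/6, 1/6, 3/6). Each
# N * M_i is an integer, so the vector indeed lies in P_N(S).
M = proportions([1, 1, 2, 3, 3, 3], s=3)
```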

In addition, we suppose that the context of the environment is a dynamical system whose evolution is determined by the difference equation:

$$\begin{aligned} C^{N}(t+1)=g\left( C^{N}(t),\vec {M}^{N}(t+1),a_{t}\right) ,t\in \mathbb {N} _{0}, \end{aligned}$$
(2.3)

where \(g:\mathbb {R}^{d}\times \mathbb {P}(S)\times A\rightarrow \mathbb {R}^{d}\) is a known function.

Let us now describe the evolution of \(\vec {M}^{N}(\cdot )\) recursively through a difference equation. Clearly, such an evolution depends strongly on the transition law \(K_{\rho }\) of the objects, and consequently on the unknown density \(\rho \). Hence, we assume the existence of a measurable function \(G_{\rho }^{N}:\mathbb {P}_{N}(S)\times \mathbb {R}^{d}\times A\times \mathbb {R}^{N}\rightarrow \mathbb {P}_{N}(S)\) such that

$$\begin{aligned} \vec {M}^{N}(t+1)=G_{\rho }^{N}\left( \vec {M}^{N}(t),C^{N}(t),a_{t},\vec {w}_{t}\right) , \end{aligned}$$
(2.4)

where \(\left\{ \vec {w}_{t}\right\} \) is a sequence of i.i.d. random vectors on \(\mathbb {R}^{N}\), with common distribution \(\theta \).

For ease of notation, we denote \(\mathbb {Y}_{N}:=\mathbb {P}_{N}(S)\times \mathbb {R}^{d},\) and, writing \(y=(\vec {m},c)\in \mathbb {Y}_{N}\), let \(H_{\rho }^{N}:\mathbb {Y}_{N}\times A\times \mathbb {R}^{N}\rightarrow \mathbb {Y}_{N}\) be the function defined as

$$\begin{aligned} H_{\rho }^{N}\left( y,a,w\right) :=\left( G_{\rho }^{N}(y,a,w),g\left( c,G_{\rho } ^{N}(y,a,w),a\right) \right) . \end{aligned}$$
(2.5)

Then, denoting \(y^{N}(t):=\left( \vec {M}^{N}(t),C^{N}(t)\right) \), according to (2.3) and (2.4), \(H_{\rho }^{N}\) defines the dynamics of the process \(\left\{ y^{N}(t)\right\} ;\) that is,

$$\begin{aligned} y^{N}(t+1)&=\left( G_{\rho }^{N}\left( y^{N}(t),a_{t},\vec {w}_{t}\right) ,g\left( C^{N}(t),\vec {M}^{N}(t+1),a_{t}\right) \right) \nonumber \\&=\left( G_{\rho }^{N}\left( y^{N}(t),a_{t},\vec {w}_{t}\right) ,g\left( C^{N}(t),G_{\rho }^{N} (y^{N}(t),a_{t},\vec {w}_{t}),a_{t}\right) \right) \nonumber \\&=H_{\rho }^{N}\left( y^{N}(t),a_{t},\vec {w}_{t}\right) . \end{aligned}$$
(2.6)

Finally, a cost depending on the proportions of the objects, on the environment, and on the selected control is incurred at each stage. This cost will be represented by a measurable function \(r:\mathbb {P}(S)\times \mathbb {R}^{d}\times A\rightarrow \mathbb {R}\).

Let us consider the space \(\mathbb {Y}:=\mathbb {P}(S)\times \mathbb {R}^{d}.\) Observe that \(\mathbb {Y}_{N}:=\mathbb {P}_{N}(S)\times \mathbb {R}^{d}\subseteq \mathbb {Y}\), so the one-stage cost can then be redefined as \(r:\mathbb {Y}\times A\rightarrow \mathbb {R}\).

2.1 Formulation of the N-Markov Control Model (N-MCM)

We define the discrete-time Markov control model associated to the system of N objects previously introduced (in short N-MCM) as follows:

$$\begin{aligned} \mathcal {M}_{N}:=\left( \mathbb {Y}_{N},A,H_{\rho }^{N},\theta ,r\right) . \end{aligned}$$
(2.7)

The model \(\mathcal {M}_{N}\) describes the performance of the system in the following sense: at time t, the controller observes the state \(y=y^{N}(t)=(\vec {M}^{N}(t),C^{N}(t))\in \mathbb {Y}_{N}\), which is composed of both the proportions of the objects and the context of the environment, and then he/she selects a control \(a=a_{t}\in A\). As a consequence, the following happens: (1) a cost r(y, a) is incurred, and (2) the system moves to a new state \(y^{\prime }=y^{N}(t+1)=(\vec {M}^{N}(t+1),C^{N}(t+1))\) according to the transition law

$$\begin{aligned} Q_{\rho }(B|y,a)&:=P\left[ y^{N}(t+1)\in B|y^{N}(t)=y,a_{t}=a\right] \\&=\int _{\mathbb {R}^{N}}I_{B}\left[ H_{\rho }^{N}\left( y,a,w\right) \right] \theta (dw), \end{aligned}$$

with \(H_{\rho }^{N}\) as in (2.5). Once the transition to the state \(y^{\prime }\) occurs, the procedure is repeated. In addition, we will assume that the one-stage costs are accumulated over an infinite horizon according to a given discounted cost criterion, and therefore the actions selected by the controller aim to minimize the total expected discounted cost introduced in (2.22) below.

In order to ensure the existence of minimizers, we impose the following continuity and compactness conditions on some elements of \(\mathcal {M}_{N}\).

Assumption 2.1

  1. (a)

    The control space A is a compact metric Borel space, whose metric is denoted by \(d_{A}\).

  2. (b)

    The function g in (2.3) is a Lipschitz function with constant \(L_{g}\); that is, for \(c,c^{\prime }\in \mathbb {R}^{d},\) \(\vec {m} ,\vec {m}^{\prime }\in \mathbb {P}(S),\) \(a,a^{\prime }\in A\),

    $$\begin{aligned} \left\| g(c,\vec {m},a)-g(c^{\prime },\vec {m}^{\prime },a^{\prime })\right\| _{\infty }^{2}\le L_{g}\max \left\{ \left\| c-c^{\prime }\right\| _{\infty }^{2},\left\| \vec {m}-\vec {m}^{\prime }\right\| _{\infty } ^{1},d_{A}(a,a^{\prime })\right\} . \end{aligned}$$
    (2.8)

    Without loss of generality, we assume that \(L_{g}\ge 1\).

  3. (c)

    The mapping \(a\longmapsto H_{\rho }^{N}(y,a,w)\) defined in (2.5) is continuous, for all \(y\in \mathbb {Y}_{N}\) and \(w\in \mathbb {R}^{N}\).

  4. (d)

    The one-stage cost r is a bounded and uniformly Lipschitz function with constant \(L_{r}\); that is, for some constant \(R>0\)

    $$\begin{aligned} |r(y,a)|\le R \ \forall (y,a)\in \mathbb {Y}\times A, \end{aligned}$$

    and for every \(y,y^{\prime }\in \mathbb {Y}\),

    $$\begin{aligned} \sup _{(a,a^{\prime })\in A\times A}|r(y,a)-r(y^{\prime },a^{\prime })|\le L_{r}\left\| y-y^{\prime }\right\| _{\infty }. \end{aligned}$$

2.2 A Consumption-Investment Model with Controlled Subsidy/Fee

We consider a consumption-investment system composed of N “small” investors (i.e., economic agents whose actions do not influence the market prices) who invest in various assets with different return rates, but who also consume some specific product. There is a central controller, for instance the government or a public body, who provides a subsidy to assist the investors or imposes a fee that the investors must pay. For simplicity, we shall consider only two assets for the investors: one of them is a risk-free asset with fixed rate \(\tau \), and the other a risky asset with a stochastic return rate \(\xi _{t}\) taking values in a bounded set \(Z\subseteq \mathbb {R}\). The fraction of wealth to be invested in the risky asset is a function \(\varphi _{1}:\mathbb {R}^{d}\rightarrow [0,1]\) that depends on the context of the environment; this context might capture, for example, the uncertainty of the investors, the type of markets in which the investors trade, the frequency of transactions, etc. Analogously, the quantity \((1-\varphi _{1})\) represents the fraction of wealth to be invested in the risk-free asset. On the other hand, we will assume that each investor consumes a quantity \(\varphi _{2}:\mathbb {R}^{d}\rightarrow \mathbb {R}_{+}\) that is also a bounded function of the context of the environment.

In the spirit of our assumptions, since the state space S is denumerable, we shall assume that the use of cents is negligible. Hence, let \(a_{t}\) be the decision of the central controller at time t, which is assumed to satisfy \(a_{t}\in \left\{ 0,\pm 1,\ldots ,\pm a^{*}\right\} =:A\) for some \(a^{*}\ge 0;\) that is,

  • \(a_{t}\) represents a fee of size \(-a_{t}\) (if \(a_{t}<0\)) or a subsidy of size \(a_{t}\) (if \(a_{t}>0\)) at time t.

Denoting by \(X_{n}^{N}(t)\in \{0,1,\cdots ,s\}=S\) the wealth of investor n at time t, we can represent this process by means of the following difference equation:

$$\begin{aligned} X_{n}^{N}(t+1)= & {} \text {int}\left\{ \left[ (1-\varphi _{1}(C^{N}(t)))(1+\tau )+\varphi _{1}(C^{N}(t))\xi _{t}\right] \right. \nonumber \\&\times \left. \left[ X_{n}^{N}(t)-\varphi _{2} (C^{N}(t))+a_{t}\right] \right\} , \end{aligned}$$
(2.9)

where int\(\left\{ x\right\} \) denotes the integer part of x. It is assumed that \(s\in \mathbb {N}_{0}\) is sufficiently large, and that the functions \(\varphi _{m}\), \(m=1,2\), satisfy Lipschitz conditions with constants \(L_{\varphi _{m}}\), \(m=1,2\), respectively, and take values in appropriate sets such that the following holds:

$$\begin{aligned} F(i,c,a,z):=\text {int}\Big \{ \left[ (1-\varphi _{1}(c))(1+\tau )+\varphi _{1}(c)z\right] \left[ i-\varphi _{2}(c)+a\right] \Big \} \in S. \end{aligned}$$
(2.10)

Furthermore, using the Lipschitz properties of \(\varphi _{1}\) and \(\varphi _{2} \), we can deduce that F is in fact a Lipschitz function in the following sense:

$$\begin{aligned} \left| F(i,c,a,z)-F(i,c^{\prime },a^{\prime },z)\right| \le L_{F} \max \left\{ \Vert c-c^{\prime }\Vert _{\infty }^{2},\ |a-a^{\prime }|\right\} , \end{aligned}$$
(2.11)

where

$$\begin{aligned} L_{F}= & {} 1+\left( 1+\tau +\max \nolimits _{z\in Z}|z|\right) \left( L_{\varphi _{1}}s+\bar{L}_{\varphi _{1}}L_{\varphi _{2}}+\bar{L}_{\varphi _{2} }L_{\varphi _{1}}+a^{*}L_{\varphi _{1}}+L_{\varphi _{2}}\right) \\&+(1+\tau )(1+L_{\varphi _{2}}) \end{aligned}$$

and \(\bar{L}_{\varphi _{m}}\) represents some (uniform) bound of \(\varphi _{m}\), \(m=1,2.\)
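The wealth dynamics (2.9) translate directly into code. In the sketch below, phi1 and phi2 are illustrative bounded Lipschitz choices (with a scalar context, d = 1), and the clipping to S is ours; the paper instead assumes the parameters are such that F takes values in S:

```python
TAU = 0.02        # risk-free rate tau (illustrative)
S_MAX = 100       # wealth cap s (assumed sufficiently large)

def phi1(c):
    """Fraction of wealth in the risky asset: bounded in [0, 1], Lipschitz in c."""
    return 0.5 / (1.0 + abs(c))

def phi2(c):
    """Consumption: bounded and Lipschitz in c."""
    return 1.0 / (1.0 + abs(c))

def F(i, c, a, z):
    """Next wealth, following (2.9): int{ [(1 - phi1)(1 + tau) + phi1 * z]
    * [i - phi2 + a] }, clipped to S = {0, ..., s}."""
    gross_return = (1.0 - phi1(c)) * (1.0 + TAU) + phi1(c) * z
    invested = i - phi2(c) + a
    return min(max(int(gross_return * invested), 0), S_MAX)
```

For example, with context c = 0, an investor with wealth 10 who receives a subsidy a = 1 and sees a risky return z = 1.05 keeps the integer part of \(1.035\times 10\), i.e., wealth 10.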

Assuming that \(\rho \) is the density of the random rate \(\xi _{t},\) the transition law turns out to be

$$\begin{aligned} K_{ij}^{\rho }(a,c)=\int _{\mathbb {R}}I{_{j}\left[ F(i,c,a,z)\right] \rho (z)dz} , \end{aligned}$$
(2.12)

for each \(i,j\in S\) and \((a,c)\in A\times \mathbb {R}^{d}\). Further, since F is an \(S-\)valued function and \(S:=\{0,1,\cdots ,s\}\) is finite, it is easy to see that, for all \(i,j\in S,\) \(a,a^{\prime }\in A,\) \(c,c^{\prime }\in \mathbb {R}^{d},\) the indicator function satisfies

$$\begin{aligned} \left| I{_{j}[F(i,c,a,z)]-I_{j}[F(i,c}^{\prime }{,a}^{\prime } {,z)]}\right|&\le \left| {F(i,c,a,z)-F(i,c}^{\prime }{,a}^{\prime }{,z)}\right| \\&\le L_{F}\max \left\{ \left\| c-c^{\prime }\right\| _{\infty }^{2},d_{A}(a,a^{\prime })\right\} , \end{aligned}$$

where the last inequality is due to the Lipschitz property of F given in (2.11). Hence, from (2.2),

$$\begin{aligned} \left| K_{ij}^{\rho }(a,c)-K_{ij}^{\rho }(a^{\prime },c^{\prime })\right|&\le \int \left| I{_{j}[F(i,c,a,z)]-I_{j}[F(i,c}^{\prime }{,a}^{\prime }{,z)]}\right| {\rho (z)dz}\nonumber \\&\le L_{F}\max \left\{ \left\| c-c^{\prime }\right\| _{\infty } ^{2},d_{A}(a,a^{\prime })\right\} , \end{aligned}$$
(2.13)

which implies that \(K_{\rho }\) is Lipschitz.

On the other hand, for each \(i\in S,\) the evolution of the proportions \(M_{i}^{N}(t)\) of the investors can be written recursively as follows (see [7]):

$$\begin{aligned} M_{i}^{N}(t+1)=\frac{1}{N}\sum _{k=0}^{s}\sum _{n=1}^{NM_{k}^{N}(t)} I_{\{A_{ki}^{\rho }(a_{t},C^{N}(t))\}}(w_{n}^{k}(t)), \end{aligned}$$
(2.14)

where \(w_{n}^{k}(t)\) are i.i.d. random variables uniformly distributed on [0, 1],

$$\begin{aligned} A_{ki}^{\rho }(a,c):=\left[ \Gamma _{ki}^{\rho }(a,c),\Gamma _{k,i+1}^{\rho }(a,c)\right] \subseteq [0,1], \end{aligned}$$
(2.15)

and

$$\begin{aligned} \Gamma _{ki}^{\rho }(a,c):=\sum _{l=0}^{i-1}K_{kl}^{\rho }(a,c),\,\,k,i\in S. \end{aligned}$$
(2.16)

For each \(i\in S\) and \(t\in \mathbb {N}_{0},\) we denote

$$\begin{aligned} \vec {w}^{i}(t):=\left( w_{1}^{i}(t),\cdots ,w_{NM_{i}^{N}(t)}^{i}(t)\right) \end{aligned}$$

and

$$\begin{aligned} \vec {w}_{t}:=\left( \vec {w}^{0}(t),\cdots ,\vec {w}^{s}(t)\right) . \end{aligned}$$

It is worth noting that \(\sum _{i=0}^{s}NM_{i}^{N}(t)=N,\) and thus \(\vec {w}_{t}\in [0,1]^{N}.\) This implies that the number of (uniform) random variables involved in the dynamics (2.14) coincides with the number N of small investors; a fact that is expressed in a general way through the dynamics (2.4).
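One step of the dynamics (2.14)–(2.16) can be sketched as follows: build the cumulative thresholds \(\Gamma \) from the kernel, draw one uniform variable per object, and count which interval \(A_{ki}\) each draw falls into. The 2×2 kernel below is a toy stand-in for \(K^{\rho }(a,c)\) at a fixed (a, c):

```python
import numpy as np

def step_proportions(M, K, N, rng):
    """One step of (2.14): for each state k, the N*M[k] objects currently there
    draw i.i.d. uniforms on [0,1]; an object moves to state i iff its draw lands
    between Gamma_{k,i} and Gamma_{k,i+1}, with Gamma built from K via (2.16)."""
    n_states = K.shape[0]
    # Gamma[k] = (0, K_k0, K_k0 + K_k1, ..., 1): cumulative row sums, cf. (2.16)
    Gamma = np.hstack([np.zeros((n_states, 1)), np.cumsum(K, axis=1)])
    counts = np.zeros(n_states)
    for k in range(n_states):
        w = rng.uniform(size=int(round(N * M[k])))
        dest = np.searchsorted(Gamma[k], w, side="right") - 1  # interval of each draw
        counts += np.bincount(dest, minlength=n_states)[:n_states]
    return counts / N

rng = np.random.default_rng(0)
K = np.array([[0.7, 0.3], [0.4, 0.6]])   # toy 2-state kernel (rows sum to 1)
M_next = step_proportions(np.array([0.5, 0.5]), K, N=1000, rng=rng)
```

Conditionally on the current state, the expectation of M_next is the product \(\vec {M}K\), so for large N the update concentrates around this deterministic map.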

Let us now rewrite the above expressions as in (2.6); namely, we define

$$\begin{aligned} G_{\rho ,i}^{N}\left( y^{N}(t),a_{t},\vec {w}_{t}\right) :=\frac{1}{N}\sum _{k=0}^{s} \sum _{n=1}^{NM_{k}^{N}(t)}I_{\{A_{ki}^{\rho }(a_{t},C^{N}(t))\}}(w_{n}^{k}(t)),\ \ i\in S. \end{aligned}$$

This function \(G_{\rho }^{N}\) takes the following vectorial form

$$\begin{aligned} G_{\rho }^{N}(y,a,w)= & {} \left( G_{\rho ,0}^{N}(y,a,w),\ldots ,G_{\rho ,s} ^{N}(y,a,w)\right) ,\ \ (y,a,w)\in \mathbb {Y}_{N}\nonumber \\&\times \,A\times [0,1]^{N}, \end{aligned}$$
(2.17)

yielding the following expression

$$\begin{aligned} \vec {M}^{N}(t+1)=G_{\rho }^{N}\left( \vec {M}^{N}(t),C^{N}(t),a_{t},\vec {w} _{t}\right) . \end{aligned}$$
(2.18)

In addition, recalling that \(\mathbb {P}(S)\) denotes the space of probability measures on S, we assume that \(g:\mathbb {R}^{d}\times \mathbb {P}(S)\times A\rightarrow \mathbb {R}^{d}\) is an arbitrary function satisfying Assumption 2.1(b), such that the context of the environment satisfies

$$\begin{aligned} C^{N}(t+1)=g\left( C^{N}(t),\vec {M}^{N}(t+1),a_{t}\right) ,t\in \mathbb {N} _{0}. \end{aligned}$$
(2.19)

Then, (2.18) and (2.19) define the function

$$\begin{aligned} H_{\rho }^{N}\left( y,a,w\right) :=\left( G_{\rho }^{N}(y,a,w),g(c,G_{\rho } ^{N}(y,a,w),a)\right) , \end{aligned}$$
(2.20)

which determines the dynamics of the process \(\left\{ y^{N}(t)\right\} \), as in (2.6).

Finally, since the action space A is finite (hence discrete), the continuity of \(a\longmapsto H_{\rho }^{N}(\cdot ,a,\cdot ),\) required in Assumption 2.1(c), trivially holds.

Remark 2.2

In the case when \(A\subset \mathbb {R}\) is an arbitrary compact set, say \(A=[-a^{*},a^{*}]\) for some \(a^{*}\ge 0,\) the continuity of the function \(H_{\rho }^{N}\) can be verified as follows. For \(i,j\in S\), \(w\in [0,1]\), \(c\in \mathbb {R}^{d}\), and \(a\in [-a^{*},a^{*}],\) let \(\delta _{w}(A_{ij}^{\rho }(c,a))\) be the Dirac measure corresponding to the indicator function \(I_{\{A_{ij}^{\rho }(c,a)\}}(w)\) (see (2.15), (2.16)). Now take a sequence \(\{a_{k}\}\subset [-a^{*},a^{*}]\) such that \(a_{k}\rightarrow a\in [-a^{*},a^{*}]\). Since \(a\longmapsto K_{ij}^{\rho }(a,c)\) is continuous for all \(i,j\in S\) and \(c\in \mathbb {R}^{d}\), so is the mapping \(a\longmapsto \Gamma _{ij}^{\rho }(a,c)\). Hence, \(A_{ij}^{\rho }(c,a_{k})\rightarrow A_{ij}^{\rho }(c,a)\) as \(k\rightarrow \infty \) in the set sense. Therefore, since \(\delta _{w}(\cdot )\) is a probability measure (hence continuous along such set limits), we conclude that \(\delta _{w}(A_{ij}^{\rho }(c,a_{k}))\rightarrow \delta _{w}(A_{ij}^{\rho }(c,a))\) as \(k\rightarrow \infty \). This fact and the continuity of the function g given in Assumption 2.1(b) yield the continuity of the map \(a\longmapsto H_{\rho }^{N}(\cdot ,a,\cdot )\).

2.3 Optimality in the N-MCM

In this subsection we introduce the elements that define the optimal control problem, as well as results on the existence of optimal policies with respect to the discounted criterion associated to the N-MCM (2.7).

Control policies The actions applied by the controller are selected according to rules known as control policies, which are defined as follows. Let \(\mathbb {H}_{0}^{N}:=\mathbb {Y}_{N}\) and \(\mathbb {H}_{t} ^{N}:=\left( \mathbb {Y}_{N}\times A\times \mathbb {R}\times \mathbb {R} ^{N}\right) ^{t}\times \mathbb {Y}_{N}\), \(t\ge 1,\) be the space of histories up to time t. An element \(h_{t}^{N}\) of \(\mathbb {H}_{t}^{N}\) is written as

$$\begin{aligned} h_{t}^{N}=\left( y^{N}(0),a_{0},\xi _{0},\vec {w}_{0},\ldots ,y^{N}(t-1),a_{t-1} ,\xi _{t-1},\vec {w}_{t-1},y^{N}(t)\right) , \end{aligned}$$

where \(y^{N}(t)=\left( \vec {M}^{N}(t),C^{N}(t)\right) \). A control policy is a sequence \(\pi ^{N}=\left\{ \pi _{t}^{N}\right\} \) of stochastic kernels \(\pi _{t}^{N}\) on A given \(\mathbb {H}_{t}^{N}\) such that \(\pi _{t}^{N}\left( A|h_{t}^{N}\right) =1\) for all \(h_{t}^{N}\in \mathbb {H}_{t}^{N},\) \(t\in \mathbb {N}_{0}\). We denote by \(\Pi ^{N}\) the set of all control policies.

Now, let \(\mathbb {F}\) be the set consisting of all measurable functions \(f:\mathbb {Y}\rightarrow A\), and let \(\mathbb {F}^{N}:=\mathbb {F}|_{\mathbb {Y}_{N}}\) be the restriction of \(\mathbb {F}\) to \(\mathbb {Y}_{N}\). A policy \(\pi ^{N}\in \Pi ^{N}\) is said to be a (deterministic) Markov policy if there exists a sequence \(\left\{ f_{t}^{N}\right\} \subseteq \mathbb {F}^{N}\) such that for all \(t\in \mathbb {N}_{0}\) and \(h_{t}^{N}\in \mathbb {H}_{t}^{N},\) \(\pi _{t}^{N}\left( \cdot |h_{t}^{N}\right) =\delta _{f_{t}^{N}(y^{N}(t))}(\cdot )\). In this case \(\pi ^{N}\) takes the form \(\pi ^{N}=\left\{ f_{t}^{N}\right\} \). In particular, if \(f_{t}^{N}\equiv f^{N}\) for some \(f^{N}\in \mathbb {F}^{N}\) and for all \(t\in \mathbb {N}_{0}\), we say that \(\pi ^{N}\) is a stationary policy. We denote by \(\Pi _{M}^{N}\) the set of all Markov policies, and, following a standard convention, we shall use the same notation \(\mathbb {F}^{N}\) to denote the set of stationary policies.

Remark 2.3

  1. (a)

    We denote by \(\Pi _{M}\) the set of deterministic Markov policies when we use \(\mathbb {F}\) instead of \(\mathbb {F}^{N}\) in the above definition; that is, \(\Pi _{M}\) is the family of sequences of functions \(\left\{ f_{t}\right\} \subset \mathbb {F}\). Observe that any policy \(\pi =\left\{ f_{t}\right\} \in \Pi _{M}\) whose elements \(f_{t}\) are restricted to \(\mathbb {Y}_{N}\) turns out to be an element of \(\Pi ^{N}\).

  2. (b)

    Under standard arguments (see, e.g., [12]), for each \(\pi ^{N}\in \Pi ^{N}\) and initial state \(y^{N}(0)=y\in \mathbb {Y}_{N},\) there exists a probability space \(\left( \Omega ^{\prime },\mathcal {F}^{\prime },P_{y}^{\pi ^{N}}\right) \) consisting of \(\Omega ^{\prime }:=\left( \mathbb {Y}_{N}\times A\times \mathbb {R}\times \mathbb {R}^{N}\right) ^{\infty }\), its respective \(\sigma -\)algebra \(\mathcal {F}^{\prime }\), and a probability measure \(P_{y}^{\pi ^{N}}\) satisfying the following properties: for each \(t\in \mathbb {N}_{0}\),

    1. (i)

      \(P_{y}^{\pi ^{N}}(y^{N}(0)\in B)=\delta _{y}(B),\ \ B\in \mathcal {B}(\mathbb {Y}_{N})\),

    2. (ii)

      \(P_{y}^{\pi ^{N}}(a_{t}\in C|h_{t}^{N})=\pi _{t}^{N}(C|h_{t}^{N} )\),\(\ \ C\in \mathcal {B}(A)\),

    3. (iii)

      (Markov-like property):

      $$\begin{aligned} P_{y}^{\pi ^{N}}\left[ y^{N}(t+1)\in B|h_{t}^{N},a_{t}\right]&=Q_{\rho }\left( B|y^{N}(t),a_{t}\right) \nonumber \\&=\int _{\mathbb {R}^{N}}I_{B}\left[ H_{\rho }^{N}\left( y^{N}(t),a_{t} ,w\right) \right] \theta (dw),\ \ \nonumber \\&\quad B\in \mathcal {B}(\mathbb {Y}_{N}). \end{aligned}$$
      (2.21)

The discounted optimality criterion For each control policy \(\pi ^{N}\in \Pi ^{N}\) and initial state \(y^{N}(0)=y\in \mathbb {Y}_{N}\), we define the total expected discounted cost as

$$\begin{aligned} V^{N}(\pi ^{N},y):=E_{y}^{\pi ^{N}}\sum \limits _{t=0}^{\infty }\alpha ^{t} r\left( y^{N}(t),a_{t}\right) , \end{aligned}$$
(2.22)

where \(\alpha \in (0,1)\) is the so-called discount factor and \(E_{y}^{\pi ^{N}}\) denotes the expectation operator with respect to the probability measure \(P_{y}^{\pi ^{N}}\) induced by the policy \(\pi ^{N}\) given \(y^{N}(0)=y\). We say that \(\pi _{*}^{N}\) is optimal for the N-MCM if and only if

$$\begin{aligned} V_{*}^{N}(y):=\inf _{\pi ^{N}\in \Pi ^{N}}V^{N}\left( \pi ^{N},y\right) =V^{N}\left( \pi _{*} ^{N},y\right) ,\ \ \ y\in \mathbb {Y}_{N}. \end{aligned}$$
(2.23)

In this case, \(V_{*}^{N}\) is said to be the \(N-\)value function.

Under the conditions imposed on the N-MCM \(\mathcal {M}_{N}\), we can state the following well-known result, which characterizes the optimal policies and the \(N-\)value function in terms of the solution of a certain functional equation, the so-called \(N-\)optimality equation (see, e.g., [11, 21]):

Proposition 2.4

  1. (a)

    The \(N-\)value function \(V_{*}^{N}\) satisfies the \(N-\)optimality equation

    $$\begin{aligned} V_{*}^{N}(y)=\min _{a\in A}\left\{ r(y,a)+\alpha \int _{\mathbb {R}^{N} }V_{*}^{N}\left[ H_{\rho }^{N}\left( y,a,w\right) \right] \theta (dw)\right\} ,\ \ y\in \mathbb {Y}_{N}. \end{aligned}$$
    (2.24)

    In addition,

    $$\begin{aligned} \left| V_{*}^{N}(y)\right| \le \frac{R}{1-\alpha },\ \ y\in \mathbb {Y}_{N}, \end{aligned}$$

    with R being the (uniform) bound of the one-stage cost r defined in Assumption 2.1(d), and \(\alpha \) the discount factor in (2.22).

  2. (b)

    There exists \(f_{*}^{N}\in \mathbb {F}^{N}\) such that \(f_{*}^{N}(y)\in A\) attains the minimum in (2.24), i.e.,

    $$\begin{aligned} V_{*}^{N}(y)=r(y,f_{*}^{N})+\alpha \int _{\mathbb {R}^{N}}V_{*} ^{N}\left[ H_{\rho }^{N}\left( y,f_{*}^{N},w\right) \right] \theta (dw),\ \ y\in \mathbb {Y}_{N}, \end{aligned}$$
    (2.25)

    and furthermore, the stationary policy \(\pi _{*}^{N}=\left\{ f_{*} ^{N}\right\} \in \Pi _{M}^{N}\) is optimal for the control model \(\mathcal {M} _{N}.\)

Proposition 2.4 provides a flexible framework for the optimality analysis of the system of interacting objects. From the practical point of view, however, its usefulness is seriously limited, both because N is too large (\(N\sim \infty \)) and because the density \(\rho \) is unknown. Indeed, to analyze equations (2.24) and (2.25), we first need to deal with a multiple integral of dimension N, which can be considerably difficult to compute, and in addition the dynamics of the system depend heavily on the unknown density \(\rho \). Both obstacles will be addressed in the following sections. Specifically, we first introduce a suitable control model \(\mathcal {M}\) that represents the “limit model” of \(\mathcal {M}_{N}\) as \(N\rightarrow \infty \); this new model is referred to as the mean field control model, and of course it also depends on the unknown density \(\rho \). We then pose the mean field control problem, which is independent of N but dependent on \(\rho \). Next, in Sect. 4, a statistical estimation and control procedure is proposed to construct nearly optimal policies for the control model \(\mathcal {M}_{N}\) in an asymptotic sense as \(N\rightarrow \infty \). In other words, \(\mathcal {M}\) is used as an approximating model for \(\mathcal {M}_{N}\), and the hope is that optimal policies in \(\mathcal {M}\) perform well in \(\mathcal {M}_{N}\) whenever \(\mathcal {M}\) approximates \(\mathcal {M}_{N}\) well.

3 The Mean Field Control Model

Recall the set \(\mathbb {Y}=\mathbb {P}(S)\times \mathbb {R}^{d}\). We consider a general controlled deterministic system \(\left\{ \left( \vec {m} (t),c(t)\right) \right\} \in \mathbb {Y}\) that depends implicitly on the density \(\rho \) in (2.2) and whose dynamics are governed by the following difference equations

$$\begin{aligned} \vec {m}(t+1)&= G_{\rho }\big (\vec {m}(t),c(t),a_{t}\big );\end{aligned}$$
(3.1)
$$\begin{aligned} c(t+1)&= g\big (c(t),\vec {m}(t+1),a_{t}\big ), \end{aligned}$$
(3.2)

where \(\left( \vec {m}(0),c(0)\right) =(\vec {m},c)\in \mathbb {Y}\) represents the initial condition, \(a_{t}\in A\) is the control (or action) selected at time t, \(g:\mathbb {R}^{d}\times \mathbb {P}(S)\times A\rightarrow \mathbb {R}^{d}\) is the function defined in (2.3), and \(G_{\rho }:\mathbb {P}(S)\times \mathbb {R}^{d}\times A\rightarrow \mathbb {P}(S)\) is a known Lipschitz function (dependent on \(\rho \)) with constant \(L_{G}\); that is, for \(\vec {m},\vec {m}^{\prime }\in \mathbb {P}(S),\) \(c,c^{\prime }\in \mathbb {R} ^{d},\) and \(a,a^{\prime }\in A,\)

$$\begin{aligned} \left\| G_{\rho }(\vec {m},c,a)-G_{\rho }(\vec {m}^{\prime },c^{\prime },a^{\prime })\right\| _{\infty }^{1}\le L_{G}\max \left\{ \left\| \vec {m}-\vec {m}^{\prime }\right\| _{\infty }^{1},\left\| c-c^{\prime }\right\| _{\infty }^{2},d_{A}(a,a^{\prime })\right\} .\nonumber \\ \end{aligned}$$
(3.3)

Due to the deterministic nature of the process (3.1)–(3.2), the dynamics are completely determined by the sequence of actions \(\left\{ a_{t}\right\} \subset A\) and by the initial condition \((\vec {m},c)\in \mathbb {Y}\). Furthermore, we will assume (see Assumption 5.1 below) that the process \(y(t):=\left( \vec {m} (t),c(t)\right) \) represents the mean field limit; that is, y(t) will be the limit of the process \(y^{N}(t):=(\vec {M}^{N}(t),C^{N}(t))\) in (2.6) as N goes to infinity.

Let \(H_{\rho }:\mathbb {Y}\times A\rightarrow \mathbb {Y}\) be the function that defines the dynamics of the process \(\left\{ \left( \vec {m}(t),c(t)\right) \right\} \); that is,

$$\begin{aligned} H_{\rho }(y,a):=\left( G_{\rho }(\vec {m},c,a),g(c,G_{\rho }(\vec {m} ,c,a),a)\right) ,\ \ y=(\vec {m},c)\in \mathbb {Y},\ a\in A. \end{aligned}$$
(3.4)

From (3.1) and (3.2), we can write

$$\begin{aligned} y(t+1)&=\left( G_{\rho }(\vec {m}(t),c(t),a_{t}),g(c(t),\vec {m} (t+1),a_{t})\right) \nonumber \\&=:H_{\rho }(y(t),a_{t}),\ \ \ t\ge 0, \end{aligned}$$
(3.5)

with \(y(0)=(\vec {m},c)\in \mathbb {P}(S)\times \mathbb {R}^{d}.\) A straightforward calculation yields that the function \(H_{\rho }\) is a Lipschitz function (recall \(G_{\rho }\) and g are Lipschitz functions). Specifically, for \((y,a),(y^{\prime },a^{\prime })\in \mathbb {Y}\times A\),

$$\begin{aligned} \left\| H_{\rho }(y,a)-H_{\rho }(y^{\prime },a^{\prime })\right\| _{\infty }\le L_{H_{\rho }}\max \left\{ \left\| y-y^{\prime }\right\| _{\infty },d_{A}(a,a^{\prime })\right\} , \end{aligned}$$
(3.6)

where \(L_{H_{\rho }}=\max \left\{ L_{g},L_{g}L_{G}\right\} .\) Using the same one-stage cost r defined for the N-MCM (2.7), we can then define the mean field control model as

$$\begin{aligned} \mathcal {M}=(\mathbb {Y},A,H_{\rho },r), \end{aligned}$$

which has a similar interpretation as the \(N-\)MCM \(\mathcal {M}_{N}.\)
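To make the recursion (3.1)–(3.5) concrete, the following sketch iterates a composed map built exactly as in (3.4). The particular maps `G_rho` and `g` are hypothetical toy choices (a mass-mixing update on a three-point state space and a contracting context update); only their structure mirrors the text.

```python
# Iterating the deterministic mean field dynamics (3.1)-(3.5).
# G_rho and g below are hypothetical toy maps; in the paper they are
# given model data (G_rho depends on the unknown density rho).

def G_rho(m, c, a):
    """Toy measure update on S = {0, 1, 2}: mix m with a permuted copy.
    Convex mixing keeps m a probability vector."""
    shift = [m[1], m[2], m[0]]            # fixed permutation of the mass
    w = 0.5 * a                           # action a in [0, 1] sets the weight
    return [(1 - w) * mi + w * si for mi, si in zip(m, shift)]

def g(c, m_next, a):
    """Toy context update: a contraction in c driven by the mean state."""
    mean_state = sum(i * mi for i, mi in enumerate(m_next))
    return 0.5 * c + 0.1 * mean_state + 0.05 * a

def H_rho(y, a):
    """The composed map of (3.4): measure first, then the context sees m(t+1)."""
    m, c = y
    m_next = G_rho(m, c, a)
    return (m_next, g(c, m_next, a))

y = ([1.0, 0.0, 0.0], 0.0)                # y(0) = (m, c)
for t in range(50):                       # the rollout (3.5) under a_t = 0.4
    y = H_rho(y, 0.4)
print([round(mi, 4) for mi in y[0]], round(y[1], 4))
```

With a nonconstant action sequence \(\{a_{t}\}\) the same loop reproduces (3.5) verbatim; here the trajectory settles at a fixed point of \(H_{\rho }(\cdot ,a)\).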

Example 3.1

(Consumption-investment problem) Carrying on with our example, we define the controlled deterministic system \(\left\{ \left( \vec {m}(t),c(t)\right) \right\} \in \mathbb {Y}\) as (see [7])

$$\begin{aligned} \vec {m}(t+1)&=\vec {m}(t)K_{\rho }(a_{t},c(t))\end{aligned}$$
(3.7)
$$\begin{aligned} c(t+1)&=g(c(t),\vec {m}(t+1),a_{t}), \end{aligned}$$
(3.8)

where \(K_{\rho }\) is the matrix \([K_{ij}^{\rho }]\) whose entries are the stochastic kernels defined in (2.12), and \(g:\mathbb {R} ^{d}\times \mathbb {P}(S)\times A\rightarrow \mathbb {R}^{d}\) is the function defined in (2.19). Observe that \(\vec {m}(t+1)\) is the vector with components

$$\begin{aligned} m_{j}(t+1)= {\displaystyle \sum \limits _{i=1}^{s}} m_{i}(t)K_{ij}^{\rho }(a_{t},c(t)), \end{aligned}$$

where \(\vec {m}(0)=m\in \mathbb {P}(S).\) In this case the function \(G_{\rho }\) in (3.1) takes the form

$$\begin{aligned} G_{\rho }(\vec {m},c,a)=\vec {m}K_{\rho }(a,c), \quad \left( \vec {m},c\right) \in \mathbb {Y},a\in A, \end{aligned}$$
(3.9)

and, since the kernel \(K_{\rho }\) is Lipschitz (see (2.13)), so is \(G_{\rho }\), as was stated in (3.3), with some constant \(L_{G}.\)

In Sect. 5 we will show that (3.7)-(3.8) are in fact the limit processes of (2.18)–(2.19). \(\square \)
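The update (3.7) is a plain vector–matrix product with a row-stochastic kernel. A minimal sketch, with a hypothetical two-state kernel standing in for \([K_{ij}^{\rho }(a,c)]\):

```python
# The measure update m(t+1) = m(t) K_rho(a_t, c(t)) of (3.7)/(3.9).
# K_rho below is a hypothetical row-stochastic matrix; in the paper its
# entries are the integrals K_ij^rho(a, c) of (2.12).

def K_rho(a, c):
    """Hypothetical 2-state kernel; each row sums to one."""
    p = min(max(0.3 + 0.2 * a, 0.0), 1.0)
    return [[1 - p, p],
            [p, 1 - p]]

def step(m, a, c):
    """Componentwise: m_j(t+1) = sum_i m_i(t) K_ij^rho(a, c)."""
    K = K_rho(a, c)
    n = len(m)
    return [sum(m[i] * K[i][j] for i in range(n)) for j in range(n)]

m = [1.0, 0.0]                   # m(0), a point mass
for t in range(30):
    m = step(m, a=0.5, c=0.0)
print(m)
```

Because each row of \(K_{\rho }\) sums to one, the update maps \(\mathbb {P}(S)\) into itself, as required of \(G_{\rho }\) in (3.1).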

3.1 Optimality in the Mean Field

In this subsection we present well-known optimality results for the controlled system (3.1)–(3.2) under the deterministic discounted criterion (3.10). Essentially, these results characterize the optimal policies and the corresponding value function as solutions of a functional equation associated with the mean field control model.

As is well known (see, e.g., [4]), for deterministic controlled systems a control policy \(\pi \) is a sequence of decision rules (or selectors) \(\pi =\left\{ f_{t}\right\} \subset \mathbb {F}\). Therefore, according to Remark 2.3(a), we can naturally consider the set \(\Pi _{M}\) as the set of all control policies for the model \(\mathcal {M}\). Hence, given a control policy \(\pi \in \Pi _{M}\) together with the initial condition \(y(0)=y\in \mathbb {Y}\), we define the total discounted cost for the mean field model as

$$\begin{aligned} v(\pi ,y)=\sum \limits _{t=0}^{\infty }\alpha ^{t}r\left( y(t),a_{t}\right) . \end{aligned}$$
(3.10)

Then, the mean field optimal control problem is to find a policy \(\pi _{*}\in \Pi _{M}\) such that

$$\begin{aligned} v_{*}(y):=\inf \limits _{\pi \in \Pi _{M}}v(\pi ,y)=v(\pi _{*},y), \quad y\in \mathbb {Y}, \end{aligned}$$
(3.11)

where \(v_{*}\) is the mean field value function and \(\pi _{*}\) is said to be an optimal policy for the mean field control model \(\mathcal {M}\).
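Because \(|r|\le R\), the tail of the series (3.10) beyond a horizon T is at most \(\alpha ^{T}R/(1-\alpha )\), so \(v(\pi ,y)\) can be evaluated to any accuracy by truncation. A sketch under hypothetical choices of the dynamics H and cost r:

```python
# Truncated evaluation of the deterministic discounted cost (3.10).
# H and r are hypothetical toy choices; R bounds |r| as in Assumption 2.1(d).

import math

alpha, R = 0.9, 1.0

def H(y, a):                      # toy contracting dynamics on the real line
    return 0.5 * y + a

def r(y, a):                      # toy one-stage cost, clipped so |r| <= R
    return min(y * y + a * a, R)

def discounted_cost(policy, y0, tol=1e-8):
    # choose T so that the tail alpha**T * R / (1 - alpha) is below tol
    T = math.ceil(math.log(tol * (1 - alpha) / R) / math.log(alpha))
    total, y = 0.0, y0
    for t in range(T):
        a = policy(y)
        total += alpha ** t * r(y, a)
        y = H(y, a)
    return total

v = discounted_cost(lambda y: 0.0, 1.0)   # stationary "do nothing" policy
print(v)   # for this policy, the exact value is 1 / (1 - alpha / 4)
```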

Observe that from the continuity of the function \(H_{\rho }\) [see (3.6)], the compactness of the control space A, and the continuity of the one-stage cost r, we can state the following result regarding the value function (see, e.g., [11, 21]).

Proposition 3.2

(a) The value function \(v_{*}\) satisfies the mean field optimality equation

$$\begin{aligned} v_{*}(y)=\min _{a\in A}\left\{ r(y,a)+\alpha v_{*}\left[ H_{\rho }\left( y,a\right) \right] \right\} ,\ \ y\in \mathbb {Y}. \end{aligned}$$
(3.12)

Equivalently,

$$\begin{aligned} \min _{a\in A}\Phi (y,a)=0,\ \ y\in \mathbb {Y}, \end{aligned}$$

where

$$\begin{aligned} \Phi (y,a):=r(y,a)+\alpha v_{*}\left[ H_{\rho }\left( y,a\right) \right] -v_{*}(y), \end{aligned}$$
(3.13)

is the so-called discrepancy function. In addition,

$$\begin{aligned} \left| v_{*}(y)\right| \le \frac{R}{1-\alpha },\ \ y\in \mathbb {Y}. \end{aligned}$$

(b) There exists \(f^{*}\in \mathbb {F}\) such that \(f^{*}(y)\in A\) attains the minimum in (3.12), i.e.,

$$\begin{aligned} v_{*}(y)=r(y,f^{*})+\alpha v_{*}\left[ H_{\rho }\left( y,f^{*}\right) \right] ,\quad y\in \mathbb {Y}, \end{aligned}$$
(3.14)

and furthermore, the stationary policy \(\pi ^{*}=\left\{ f^{*}\right\} \in \Pi _{M}\) is optimal for the control model \(\mathcal {M}.\)

Remark 3.3

Let \(\left\{ (y_{t},a_{t})\right\} \) be a sequence of state-action pairs corresponding to the application of a stationary policy \(\pi ^{*}=\left\{ f^{*}\right\} \in \Pi _{M}.\) Observe that, by the optimality principle and dynamic programming arguments, \(\pi ^{*}\) is an optimal policy if and only if \(\Phi (y_{t},f^{*}(y_{t}))=0\) for all \(t\in \mathbb {N}_{0}.\)
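Remark 3.3 suggests a direct numerical test of optimality: compute \(v_{*}\) by value iteration and check that the discrepancy (3.13) vanishes at the greedy action. The finite state space, dynamics, and cost in this sketch are hypothetical stand-ins for \((\mathbb {Y},A,H_{\rho },r)\):

```python
# Checking Remark 3.3 on a toy model: Phi(y, f*(y)) = 0 at the greedy selector.

alpha = 0.8
states = range(5)
actions = range(3)

def H(y, a):                      # hypothetical deterministic dynamics on {0,...,4}
    return (y + a) % 5

def r(y, a):                      # hypothetical bounded one-stage cost
    return abs(y - 2) + 0.1 * a

# value iteration until numerical convergence (alpha-contraction)
v = {y: 0.0 for y in states}
for _ in range(300):
    v = {y: min(r(y, a) + alpha * v[H(y, a)] for a in actions) for y in states}

def Phi(y, a):                    # the discrepancy function (3.13)
    return r(y, a) + alpha * v[H(y, a)] - v[y]

f_star = {y: min(actions, key=lambda a: Phi(y, a)) for y in states}
worst = max(abs(Phi(y, f_star[y])) for y in states)
print(worst)                      # essentially zero at the greedy action
```

At the fixed point, \(\Phi (y,a)\ge 0\) for every pair and the minimum over actions is zero, exactly as in the equivalent form of (3.12).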

Although the optimal value function and the optimal policy are well characterized through Proposition 3.2 and Remark 3.3, equations (3.12)–(3.14), and hence the optimality equation and its minimizers, depend heavily on the density \(\rho \). However, when this density is unknown, as in our case, suitable estimation-control procedures can be applied under certain conditions in order to find optimal policies. This point is studied in the next section.

4 Estimation and Control in the Mean Field

The main problem we address in this paper is to obtain optimality results under the assumption that the density \(\rho \) in (2.2), and as a consequence the function \(H_{\rho }\) in (3.4)–(3.5), are unknown. In this scenario, assuming observability of the random disturbances \(\xi _{0} ,\xi _{1},\ldots ,\) the controller has to appeal to a combination of statistical estimation methods and control procedures on the mean field model \(\mathcal {M}\) in order to gain insight into the evolution of the system. That is, before choosing the action \(a_{t}\) at time t, the controller obtains an estimate \(\rho _{t}\) of \(\rho \), and hence also an estimate \(H_{t} =H_{\rho _{t}}\) of the function \(H_{\rho }\); the controller's decisions are then adapted to this estimate, yielding a control \(a_{t}=a_{t} (\rho _{t}).\)

To fix ideas, let \(\xi _{0},\xi _{1},\ldots ,\xi _{k-1}\) be independent realizations, observed up to time \(k-1,\) of a random variable with the unknown density \(\rho \), and let \(\rho _{k}(\cdot ):=\rho _{k}(\cdot ;\xi _{0},\xi _{1},\ldots ,\xi _{k-1})\) be a density estimator such that, as \(k\rightarrow \infty \),

$$\begin{aligned} \int _{\mathbb {R}}\left| \rho _{k}(z)-\rho (z)\right| dz\rightarrow 0\ \text {a.s.} \end{aligned}$$
(4.1)

and

$$\begin{aligned} \sup _{(y,a)\in \mathbb {Y}\times A}\left\| G_{\rho _{k}}(y,a)-G_{\rho }(y,a)\right\| _{\infty }^{1}\rightarrow 0 \ \text {a.s.,} \end{aligned}$$
(4.2)

where \(y=(\vec {m},c)\), and for each \(k\in \mathbb {N},\) \(G_{\rho _{k}}\) is the function defining the dynamic of the process \(\left\{ \vec {m}(t)\right\} \) [see (3.1)] when the density \(\rho _{k}\) is used instead of \(\rho .\) Thus, \(G_{\rho _{k}}\) defines a new estimated process which is generated by the function [see (3.4), (3.5)]

$$\begin{aligned} H_{k}(y,a):=\left( G_{\rho _{k}}(y,a),g(c,G_{\rho _{k}}(y,a),a)\right) ,\quad y=(\vec {m},c)\in \mathbb {Y},\ a\in A. \end{aligned}$$

It is easy to see that

$$\begin{aligned} \sup _{(y,a)\in \mathbb {Y}\times A}\left\| H_{k}(y,a)-H_{\rho } (y,a)\right\| _{\infty }\rightarrow 0 \ \text {a.s.,} \ \text {as} \ k\rightarrow \infty . \end{aligned}$$
(4.3)

Indeed, since g is a Lipschitz function, we have that, for all \(y=(\vec {m},c)\in \mathbb {Y},\ a\in A\),

$$\begin{aligned} \left\| g(c,G_{\rho _{k}}(y,a),a)-g(c,G_{\rho }(y,a),a)\right\| _{\infty }^{2}\le L_{g}\left\| G_{\rho _{k}}(y,a)-G_{\rho }(y,a)\right\| _{\infty }^{1}. \end{aligned}$$
(4.4)

Then, combining (4.2) and (4.4), we get

$$\begin{aligned} \sup _{(y,a)\in \mathbb {Y}\times A}\left\| g(c,G_{\rho _{k}} (y,a),a)-g(c,G_{\rho }(y,a),a)\right\| _{\infty }^{2}\rightarrow 0 \ \text {a.s.,} \ \text {as} \ k\rightarrow \infty . \end{aligned}$$

Thus, we can easily see that (4.3) holds. Moreover, for each \(\pi \in \Pi _{M}\) and \(y\in \mathbb {Y},\) from (4.3) together with a simple use of the dominated convergence theorem, we can conclude

$$\begin{aligned} E_{y}^{\pi }\left[ \sup _{(x,a)\in \mathbb {Y}\times A}\left\| H_{k} (x,a)-H_{\rho }(x,a)\right\| _{\infty }\right] \rightarrow 0,\ \text {as} \ k\rightarrow \infty , \end{aligned}$$
(4.5)

because \(\rho _{k}\) does not depend on \(\pi \) and y.
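A concrete estimator satisfying the \(L^{1}\) requirement (4.1) can be as simple as a histogram of the observed disturbances \(\xi _{0},\ldots ,\xi _{k-1}\). In the sketch below the "unknown" density is assumed, for the demonstration only, to be standard normal:

```python
# A histogram estimator rho_k and its L1 error against rho, cf. (4.1).
# The standard normal "true" density is an assumption for this demo.

import math, random

random.seed(0)

def rho(z):                              # the density unknown to the controller
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def histogram_density(sample, lo=-4.0, hi=4.0, bins=40):
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in sample:
        if lo <= x < hi:
            counts[int((x - lo) / width)] += 1
    n = len(sample)
    return lambda z: (counts[int((z - lo) / width)] / (n * width)
                      if lo <= z < hi else 0.0)

def l1_error(rho_k, step=0.01):
    # Riemann approximation of the integral in (4.1) over [-4, 4)
    zs = [-4.0 + step * i for i in range(int(8.0 / step))]
    return sum(abs(rho_k(z) - rho(z)) for z in zs) * step

errs = []
for k in (100, 10000):
    rho_k = histogram_density([random.gauss(0.0, 1.0) for _ in range(k)])
    errs.append(l1_error(rho_k))
print(errs)                              # the L1 error shrinks as k grows
```

With a bin width that shrinks suitably in k, histogram estimators are known to be \(L^{1}\)-consistent, which is the sense of convergence required by (4.1).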

Let \(\left\{ v_{k}\right\} \) be the sequence of functions \(v_{k} :\mathbb {Y}\rightarrow \mathbb {R}\) in \(\mathbb {C}_{b}(\mathbb {Y})\) defined as follows:

$$\begin{aligned} v_{0}&\equiv 0;\nonumber \\ v_{k}(y)&=\min _{a\in A}\big \{ r(y,a)+\alpha v_{k-1}\left[ H_{k}\left( y,a\right) \right] \big \} ,\ \ k\in \mathbb {N},\ y\in \mathbb {Y}. \end{aligned}$$
(4.6)

Then, noting that the function \((y,a)\rightarrow H_{k}\left( y,a\right) ,\) \(\ k\in \mathbb {N},\) is continuous and that A is compact, from standard measurable selection theorems (see, e.g., Proposition D5(a) in [12]), for each \(k\in \mathbb {N}\), there exists \(\hat{f}_{k} \in \mathbb {F}\) (dependent on \(\rho _{k}\)), such that

$$\begin{aligned} v_{k}(y)=r(y,\hat{f}_{k})+\alpha v_{k-1}\left[ H_{k}\left( y,\hat{f} _{k}\right) \right] ,\ \ y\in \mathbb {Y}. \end{aligned}$$
(4.7)

We define the control policy \(\hat{\pi }=\left\{ \hat{f}_{k}\right\} \in \Pi _{M}.\) Observe that this policy is completely computable by the controller, and therefore, in accordance with our objective, we are interested in studying its optimality. However, it is worth noting that the discounted criterion depends strongly on the decisions selected in the early stages, precisely where the statistical estimation process yields poor information about the unknown dynamics. This suggests that, in general, it is not possible to ensure that \(\hat{\pi }\) is an optimal policy in the usual sense for the mean field model. Hence, to analyze its optimality we use the following weaker optimality criterion, which is motivated by the comment in Remark 3.3 (see, e.g., [10, 11, 13, 19] for further information about this criterion).
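The scheme (4.6)–(4.7) interleaves one estimation update with one value-iteration backup through the estimated model. A minimal sketch on a hypothetical finite model where, as a simplifying assumption, the unknown density enters only through an unknown mean \(\theta \), estimated by the empirical average of the observed disturbances:

```python
# Sketch of the estimation-and-control iteration (4.6)-(4.7).
# The finite model, the cost, and the way the unknown parameter theta
# enters the dynamics are all hypothetical simplifications.

import random

random.seed(1)

alpha, theta_true = 0.8, 0.3
states, actions = range(6), range(3)

def H(y, a, theta):               # estimated dynamics H_k when theta = theta_k
    return min(max(y + a - 1 + round(theta), 0), 5)

def r(y, a):
    return abs(y - 3) + 0.1 * a

xs = []                           # observed disturbances xi_0, xi_1, ...
v = {y: 0.0 for y in states}      # v_0 = 0, as in (4.6)
policy = {}
for k in range(1, 200):
    xs.append(random.gauss(theta_true, 0.5))
    theta_k = sum(xs) / len(xs)   # estimate used to build H_k
    v_new, policy = {}, {}
    for y in states:              # one backup of (4.6) through H_k
        vals = {a: r(y, a) + alpha * v[H(y, a, theta_k)] for a in actions}
        policy[y] = min(vals, key=vals.get)      # the selector of (4.7)
        v_new[y] = vals[policy[y]]
    v = v_new
print(policy)
```

Early iterates are driven by poor estimates of \(\theta \), which is precisely why only asymptotic (eventual) optimality of the resulting policy can be expected.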

Definition 4.1

We say that a policy \(\pi \in \Pi _{M}\) is eventually optimal for the mean field control model (or simply eventually optimal) if and only if, for any initial condition \(y(0)=y\in \mathbb {Y}\),

$$\begin{aligned} \lim _{t\rightarrow \infty }E_{y}^{\pi }\Phi (y(t),a_{t})=0,\quad y\in \mathbb {Y}, \end{aligned}$$

where \(\Phi \) is the discrepancy function defined in (3.13).

Before establishing the result, we need to impose the following technical requirement.

Assumption 4.2

The constant \(L_{H_{\rho }}\) defined in (3.6) satisfies \(\alpha L_{H_{\rho }}<1.\)

Theorem 4.3

Under Assumptions 2.1 and 4.2, the policy \(\hat{\pi }\) obtained by means of the iterative method described in (4.7) is eventually optimal.

The proof of this theorem is based on several lemmas, so it will be presented at the end of the section.

Example 4.4

(Consumption-investment problem) For the estimator \(\rho _{k},\) we define, in analogy with (2.12), the estimated transition kernel \(K_{k}(a,c)=\left[ K_{ij}^{k}(a,c)\right] \) with components [see (2.10)]

$$\begin{aligned} K_{ij}^{k}(a,c):=\int _{\mathbb {R}}I_{j}[F(i,c,a,z)]\rho _{k}(z)dz, \quad i,j\in S,\ (a,c)\in A\times \mathbb {R}. \end{aligned}$$

Also, we define

$$\begin{aligned} G_{\rho _{k}}(\vec {m},c,a):=\vec {m}K_{k}(a,c), \quad \left( \vec {m},c\right) \in \mathbb {Y},a\in A, \end{aligned}$$

and

$$\begin{aligned} H_{k}(y,a):=\left( \vec {m}K_{k}(a,c),g(c,\vec {m}K_{k}(a,c),a)\right) ,\quad y=(\vec {m},c)\in \mathbb {Y},a\in A. \end{aligned}$$

Observe that for all \(i,j\in S,\ (a,c)\in A\times \mathbb {R}^{d},\)

$$\begin{aligned} \left| K_{ij}^{k}(a,c)-K_{ij}^{\rho }(a,c)\right| \le \int _{\mathbb {R}}\left| \rho _{k}(z)-\rho (z)\right| dz. \end{aligned}$$

Therefore, according to (4.1)

$$\begin{aligned} \sup _{(a,c)\in A\times \mathbb {R}^{d}}\left\| K_{k}(a,c)-K_{\rho }(a,c)\right\| _{\infty }^{0}\rightarrow 0 \ \text {a.s.,} \ \text {as} \ k\rightarrow \infty , \end{aligned}$$
(4.8)

which, in turn, implies (see (3.9))

$$\begin{aligned}&\sup _{(y,a)\in \mathbb {Y}\times A}\left\| G_{\rho _{k}}(y,a)-G_{\rho }(y,a)\right\| _{\infty }^{1}=\sup _{(y,a)\in \mathbb {Y}\times A}\left\| \vec {m}K_{k}(a,c)-\vec {m}K_{\rho }(a,c)\right\| _{\infty }^{1}\\&\quad \rightarrow 0\ \text {a.s.,} \ \text {as} \ k\rightarrow \infty . \end{aligned}$$

\(\square \)

The remainder of this section is devoted to the proof of Theorem 4.3.

Let \(\left\{ u_{t}\right\} \subset \mathbb {C}_{b}(\mathbb {Y})\) be the mean field value iteration functions defined as:

$$\begin{aligned} u_{0}&\equiv 0;\end{aligned}$$
(4.9)
$$\begin{aligned} u_{t}(y)&=\min _{a\in A}\left\{ r(y,a)+\alpha u_{t-1}\left[ H_{\rho }\left( y,a\right) \right] \right\} , \quad t\in \mathbb {N},\ y\in \mathbb {Y}. \end{aligned}$$
(4.10)

As shown in [4, 11, 21], our hypotheses lead to

$$\begin{aligned} v_{*}(y)=\lim _{t\rightarrow \infty }u_{t}(y),\ \ \ y\in \mathbb {Y}, \end{aligned}$$
(4.11)

where \(v_{*}\) is the mean field value function satisfying (3.12).
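The convergence (4.11) reflects the fact that the operator in (4.10) is an \(\alpha \)-contraction in the sup norm, so successive differences \(\Vert u_{t+1}-u_{t}\Vert \) shrink by at least the factor \(\alpha \). A numerical check on a hypothetical toy model:

```python
# The value iteration (4.9)-(4.10) as an alpha-contraction: the sup-norm
# gap between successive iterates decays at rate alpha.  Toy model only.

alpha = 0.7
states, actions = range(4), range(2)

def H(y, a):
    return (y + a) % 4

def r(y, a):
    return 1.0 + y + 2.0 * a      # strictly positive, so iteration never stalls

u = {y: 0.0 for y in states}      # u_0 = 0, as in (4.9)
gaps = []
for t in range(30):
    u_next = {y: min(r(y, a) + alpha * u[H(y, a)] for a in actions)
              for y in states}
    gaps.append(max(abs(u_next[y] - u[y]) for y in states))
    u = u_next

ratios = [gaps[t + 1] / gaps[t] for t in range(5, 15)]
print(max(ratios))                # never exceeds alpha = 0.7
```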

Lemma 4.5

Suppose that Assumption 2.1 holds. Then:

  1. (a)

    For each \(t\in \mathbb {N}_{0}\), the functions \(u_{t}\) generated by means of the iterations (4.9)–(4.10) are Lipschitz continuous with constant

    $$\begin{aligned} L_{u_{t}}:=L_{r} {\displaystyle \sum \limits _{l=0}^{t-1}} \left( \alpha L_{H_{\rho }}\right) ^{l}. \end{aligned}$$
    (4.12)
  2. (b)

    In addition, if Assumption 4.2 holds, then the mean field value function \(v_{*}\) is Lipschitz continuous with constant

    $$\begin{aligned} L_{v_{*}}=\frac{L_{r}}{1-\alpha L_{H_{\rho }}}, \end{aligned}$$
    (4.13)

    where \(L_{r}\) and \(L_{H_{\rho }}\) are the Lipschitz constants in Assumption 2.1(d) and (3.6), respectively.

Proof

(a) We proceed by induction. First, from (4.9), clearly part (a) holds for \(t=0.\) Now we assume that \(u_{t}\) is a Lipschitz function with constant given in (4.12). Then, for \(y,y^{\prime }\in \mathbb {Y},\) from (4.10) we have

$$\begin{aligned} \left| u_{t+1}(y)-u_{t+1}(y^{\prime })\right| \le \sup _{a\in A}\left\{ \left| r(y,a)-r(y^{\prime },a)\right| +\alpha \left| u_{t}\left[ H_{\rho }(y,a)\right] -u_{t}\left[ H_{\rho }(y^{\prime },a)\right] \right| \right\} . \end{aligned}$$

Thus, since r and \(H_{\rho }\) are Lipschitz functions (see Assumption 2.1(d) and (3.6)), using (4.12) we get

$$\begin{aligned}&\left| u_{t+1}(y)-u_{t+1}(y^{\prime })\right| \le L_{r}\left\| y-y^{\prime }\right\| _{\infty }+\alpha L_{u_{t}}L_{H_{\rho }}\left\| y-y^{\prime }\right\| _{\infty }\\&\le \left( L_{r}+\alpha L_{H_{\rho }}L_{r} {\displaystyle \sum \limits _{l=0}^{t-1}} \left( \alpha L_{H_{\rho }}\right) ^{l}\right) \left\| y-y^{\prime }\right\| _{\infty } \le L_{r}\left( 1+ {\displaystyle \sum \limits _{l=0}^{t-1}} \left( \alpha L_{H_{\rho }}\right) ^{l+1}\right) \left\| y-y^{\prime }\right\| _{\infty }\\&=L_{r} {\displaystyle \sum \limits _{l=0}^{t}} \left( \alpha L_{H_{\rho }}\right) ^{l}\left\| y-y^{\prime }\right\| _{\infty }. \end{aligned}$$

Therefore, \(u_{t+1}\) is a Lipschitz function with constant

$$\begin{aligned} L_{u_{t+1}}:=L_{r} {\displaystyle \sum \limits _{l=0}^{t}} \left( \alpha L_{H_{\rho }}\right) ^{l}. \end{aligned}$$

This fact proves part (a).

(b) For \(y,y^{\prime }\in \mathbb {Y},\) adding and subtracting the terms \(u_{t}(y)\) and \(u_{t}(y^{\prime })\) to \(|v_{*}(y)-v_{*}(y^{\prime })|\), we obtain

$$\begin{aligned} \left| v_{*}(y)-v_{*}(y^{\prime })\right|&\le \left| v_{*}(y)-u_{t}(y)\right| +\left| u_{t}(y)-u_{t}(y^{\prime })\right| +\left| u_{t}(y^{\prime })-v_{*}(y^{\prime })\right| \nonumber \\&\le \left| v_{*}(y)-u_{t}(y)\right| +L_{u_{t}}\left\| y-y^{\prime }\right\| _{\infty }+\left| u_{t}(y^{\prime })-v_{*}(y^{\prime })\right| ,\ \ \forall t\in \mathbb {N}_{0}, \end{aligned}$$
(4.14)

where the last inequality is due to part (a). Now observe that under Assumption 4.2

$$\begin{aligned} \lim _{t\rightarrow \infty }L_{u_{t}}=L_{r} {\displaystyle \sum \limits _{l=0}^{\infty }} \left( \alpha L_{H_{\rho }}\right) ^{l}=\frac{L_{r}}{1-\alpha L_{H_{\rho }}}. \end{aligned}$$
(4.15)

Therefore, letting \(t\rightarrow \infty \) in (4.14), we have that (4.11) together with (4.15) yield

$$\begin{aligned} \left| v_{*}(y)-v_{*}(y^{\prime })\right| \le \frac{L_{r} }{1-\alpha L_{H_{\rho }}}\left\| y-y^{\prime }\right\| _{\infty },\qquad y,y^{\prime }\in \mathbb {Y}, \end{aligned}$$

that is, \(v_{*}\) is a Lipschitz continuous function. \(\square \)
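The constants (4.12)–(4.13) come from the scalar recursion \(L_{u_{t+1}}=L_{r}+\alpha L_{H_{\rho }}L_{u_{t}}\), which converges precisely when \(\alpha L_{H_{\rho }}<1\) (Assumption 4.2). A quick numerical check with illustrative constants:

```python
# The Lipschitz-constant recursion behind (4.12)-(4.13); the numerical
# values of alpha, L_r and L_H are illustrative only.

alpha, L_r, L_H = 0.9, 2.0, 1.05
assert alpha * L_H < 1.0          # Assumption 4.2

L = 0.0                           # L_{u_0} = 0, since u_0 = 0
for t in range(200):
    L = L_r + alpha * L_H * L     # L_{u_{t+1}} = L_r + alpha L_H L_{u_t}

limit = L_r / (1.0 - alpha * L_H) # the constant L_{v_*} of (4.13)
print(L, limit)
```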

Lemma 4.6

Let \(\left\{ v_{k}\right\} \) be the family of functions generated by the iterations (4.6) and \(v_{*}\) the value function in (3.11) (see (3.12)). Then, under Assumptions 2.1 and 4.2, for each \(\pi \in \Pi _{M}\) and \(y\in \mathbb {Y},\) \(E_{y}^{\pi }\left\| v_{*}-v_{k}\right\| \rightarrow 0\), as \(k\rightarrow \infty .\)

Proof

From (3.12) and (4.6), we have, for each \(k\in \mathbb {N}\) and \(y\in \mathbb {Y},\)

$$\begin{aligned}&\left| v_{*}(y)-v_{k}(y)\right| \le \alpha \sup _{a\in A}\left| v_{*}\left[ H_{\rho }(y,a)\right] -v_{k-1}\left[ H_{k}(y,a)\right] \right| \\&\le \alpha \sup _{a\in A}\left| v_{*}\left[ H_{\rho }(y,a)\right] -v_{*}\left[ H_{k}(y,a)\right] \right| +\alpha \sup _{a\in A}\left| v_{*}\left[ H_{k}(y,a)\right] -v_{k-1}\left[ H_{k}(y,a)\right] \right| , \end{aligned}$$

where in the last inequality we have added and subtracted the term \(v_{*}\left[ H_{k}(y,a)\right] .\) Hence, from Lemma 4.5 and the fact that \(v_{*},v_{k}\in \mathbb {B}(\mathbb {Y})\) \(\forall k\in \mathbb {N}\),

$$\begin{aligned} 0\le \left\| v_{*}-v_{k}\right\| \le L_{v_{*}}\sup _{(y,a)\in \mathbb {Y}\times A}\left\| H_{\rho }(y,a)-H_{k}(y,a)\right\| _{\infty }+\alpha \left\| v_{*}-v_{k-1}\right\| , \end{aligned}$$
(4.16)

which implies

$$\begin{aligned} E_{y}^{\pi }\left\| v_{*}-v_{k}\right\| \le L_{v_{*}}E_{y}^{\pi }\left[ \sup _{(y,a)\in \mathbb {Y}\times A}\left\| H_{\rho }(y,a)-H_{k} (y,a)\right\| _{\infty }\right] +\alpha E_{y}^{\pi }\left\| v_{*}-v_{k-1}\right\| , \end{aligned}$$
(4.17)

for each \(\pi \in \Pi _{M}\) and \(y\in \mathbb {Y}.\) Let \(l:=\limsup _{k\rightarrow \infty }E_{y}^{\pi }\Vert v_{*}-v_{k}\Vert <\infty \). Hence, taking the limit superior as \(k\rightarrow \infty \) in (4.17), and using the convergence in (4.5), we get \(l\le \alpha l\). Finally, since \(\alpha <1\), we deduce that \(\lim _{k\rightarrow \infty }E_{y}^{\pi }\Vert v_{*}-v_{k}\Vert =0\), which proves the result. \(\square \)
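The limsup argument at the end of the proof is an instance of an elementary fact: if \(a_{k}\le c_{k}+\alpha a_{k-1}\) with \(c_{k}\rightarrow 0\) and \(\alpha \in (0,1)\), then \(a_{k}\rightarrow 0\). A numerical sketch with the hypothetical error sequence \(c_{k}=1/k\) standing in for the estimation term in (4.17):

```python
# If a_k <= c_k + alpha * a_{k-1} with c_k -> 0 and alpha < 1, then a_k -> 0.
# Here c_k = 1/k is a hypothetical stand-in for the estimation error in (4.17).

alpha = 0.9
a = 5.0                           # any finite starting value a_0
history = []
for k in range(1, 2001):
    a = 1.0 / k + alpha * a       # worst case: the inequality holds with equality
    history.append(a)
print(history[99], history[-1])   # the sequence decays toward zero
```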

Proof of Theorem 4.3

We define, for each \(k\in \mathbb {N},\) the approximate discrepancy function \(\Phi _{k} :\mathbb {Y}\times A\rightarrow \mathbb {R}\) as

$$\begin{aligned} \Phi _{k}(y,a):=r(y,a)+\alpha v_{k-1}\left[ H_{k}\left( y,a\right) \right] -v_{k}(y),\ \ (y,a)\in \mathbb {Y}\times A. \end{aligned}$$

Now observe that, for each \(k\in \mathbb {N}\) and \((y,a)\in \mathbb {Y}\times A,\)

$$\begin{aligned} \left| \Phi (y,a)-\Phi _{k}(y,a)\right| \le \left| v_{*}\left[ H_{\rho }(y,a)\right] -v_{k-1}\left[ H_{k}(y,a)\right] \right| +\left| v_{*}(y)-v_{k}(y)\right| . \end{aligned}$$

Then, from Lemma 4.6, letting \(k\rightarrow \infty \) we get

$$\begin{aligned} E_{y}^{\hat{\pi }}\left[ \sup _{(y,a)\in \mathbb {Y}\times A}\left| \Phi (y,a)-\Phi _{k}(y,a)\right| \right] \rightarrow 0\text {.} \end{aligned}$$
(4.18)

On the other hand, observing that \(\Phi _{k}(y,\hat{f}_{k}(y))=0,\) \(y\in \mathbb {Y}\) when using the control policy generated by (4.7), we have

$$\begin{aligned} 0\le \Phi (y(k),\hat{f}_{k}(y(k)))&=\left| \Phi (y(k),\hat{f} _{k}(y(k)))-\Phi _{k}(y(k),\hat{f}_{k}(y(k)))\right| \\&\le \sup _{(y,a)\in \mathbb {Y}\times A}\left| \Phi (y,a)-\Phi _{k}(y,a)\right| . \end{aligned}$$

Thus, from (4.18), we obtain

$$\begin{aligned} \lim _{k\rightarrow \infty }E_{y}^{\hat{\pi }}\Phi (y(k),a_{k})=0. \end{aligned}$$

\(\square \)

5 Mean Field Convergence

In this section we study the performance of the eventually optimal policy \(\hat{\pi }\) obtained in Sect. 4; that is, we are interested in analyzing the optimality deviation of \(\hat{\pi }\) when it is used to control the process \(\left\{ y^{N}(t)\right\} \). Clearly, such an optimality deviation must be measured in terms of the difference between the corresponding optimal value functions \(V_{*}^{N}\) and \(v_{*}\) of the models \(\mathcal {M}_{N}\) and \(\mathcal {M}\), respectively, and moreover, as was pointed out in Sect. 2, it must be analyzed in an asymptotic sense as N goes to infinity. To this end, we impose the following assumption, which concerns the convergence of the trajectories \(y^{N}(\cdot )\) to the trajectories \(y(\cdot )\) defined in (2.6) and (3.5), respectively, in the sense of (5.1) below.

Observe that, according to Propositions 2.4 and 3.2, as well as the definition of the policy \(\hat{\pi },\) we can restrict our analysis to the class of Markov policies \(\Pi _{M}\).

Assumption 5.1

We assume:

  1. (a)

    \((\vec {M}^{N}(0),C^{N}(0))=\left( \vec {m}(0),c(0)\right) =(\vec {m}_{0},c_{0})=y\in \mathbb {Y}_{N}\), for all \(N\in \mathbb {N}\).

  2. (b)

    For any \(y\in \mathbb {Y}_{N}\), \(T\in \mathbb {N}\), and \(\varepsilon >0\), there exist positive constants K and \(\lambda \) such that

    $$\begin{aligned} \sup _{\pi \in \Pi _{M}}P_{y}^{\pi }\left\{ \sup _{0\le t\le T}\left\| y^{N}(t)-y(t)\right\| _{\infty }\ge \gamma _{T}(\varepsilon )\right\} \le KTe^{-\lambda N\varepsilon ^{2}}, \end{aligned}$$
    (5.1)

    where \(\gamma _{T}(\varepsilon )\rightarrow 0\) as \(\varepsilon \rightarrow 0.\)

We will use the following notation: for any fixed policy \(\pi =\left\{ f_{t}\right\} \in \Pi _{M},\) we denote

$$\begin{aligned} a_{t}^{\pi ,N}:=f_{t}(y^{N}(t))\text { and }a_{t}^{\pi }:=f_{t}(y(t)) \end{aligned}$$

the actions at time t corresponding to the application of the policy \(\pi \) under the process \(\left\{ y^{N}(t)\right\} \) and \(\left\{ y(t)\right\} \), respectively.

Now, following ideas similar to those of [7], we show that the example we have been working with satisfies Assumption 5.1(b).

Example 5.2

(Consumption-investment problem) Recall the relations (2.12)–(2.16). Let \(\pi =\left\{ f_{t}\right\} \in \Pi _{M}\) be an arbitrary policy and \(y\in \mathbb {Y}_{N}\subset \mathbb {Y}\) be the initial state. We denote

$$\begin{aligned} B_{inj}^{N\rho }(t):=I_{\left\{ A_{ij}^{\rho }\left( a_{t}^{\pi ,N} ,C^{N}(t)\right) \right\} }(w_{n}^{i}(t)),\ \ i,j\in S,n\in \mathbb {N}, \end{aligned}$$

where \(C^{N}(t)\) is as in (2.3) and \(w_{n}^{i}(t)\) are i.i.d. random variables uniformly distributed on [0, 1]. Observe that, for each \(t\in \mathbb {N}_{0},\left\{ B_{inj}^{N\rho }(t)\right\} _{inj}\) are i.i.d. Bernoulli random variables with mean

$$\begin{aligned}&E_{y}^{\pi }\left[ B_{inj}^{N\rho }(t)|a_{t}^{\pi ,N}=a,C^{N}(t)=c\right] =K_{ij}^{\rho }(a,c)\\&\quad =\int _{\mathbb {R}}I_{j}[F(i,c,a,z)]\rho (z)dz,\ \ i,j\in S,\ (a,c)\in A\times \mathbb {R}^{d}. \end{aligned}$$

Then, for a fixed \(\varepsilon >0,\) by Hoeffding’s inequality, we have

$$\begin{aligned} P_{y}^{\pi }\left[ \left| \sum _{n=1}^{NM_{i}^{N}(t)}B_{inj}^{N\rho }(t)-NM_{i}^{N}(t)K_{ij}^{\rho }\left( a_{t}^{\pi ,N},C^{N}(t)\right) \right| <N\varepsilon \right] >1-2e^{-2N\varepsilon ^{2}}. \end{aligned}$$

Consider the set \(\bar{\Omega }=\left\{ \omega \in \Omega ^{\prime }:\left| \sum _{n=1}^{NM_{i}^{N}(t)}B_{inj}^{N\rho }(t)-NM_{i}^{N}(t)K_{ij}^{\rho }\left( a_{t}^{\pi ,N},C^{N}(t)\right) \right| <N\varepsilon \right\} \subset \Omega ^{\prime }\) (see Remark 2.3(b)), and let \(\varepsilon _{t}\) be a positive number such that \(\left\| y^{N} (t)-y(t)\right\| _{\infty }\le \varepsilon _{t}\); that is,

$$\begin{aligned} \left\| \vec {M}^{N}(t)-\vec {m}(t)\right\| _{\infty }^{1}\le \varepsilon _{t}\ \ \text {and} \ \left\| C^{N}(t)-c(t)\right\| _{\infty }^{2}\le \varepsilon _{t}. \end{aligned}$$
(5.2)

Thus, from (2.14), (3.7), and (5.2), we have that the following relations hold true on \(\bar{\Omega }\):

$$\begin{aligned} \left| M_{j}^{N}(t+1)-m_{j}(t+1)\right|= & {} \left| {\displaystyle \sum \limits _{i=0}^{s}} \frac{1}{N}\left[ \sum _{n=1}^{NM_{i}^{N}(t)}B_{inj}^{N\rho }(t)-Nm_{i} (t)K_{ij}^{\rho }\left( a_{t}^{\pi ,N},c(t)\right) \right] \right| \nonumber \\\le & {} {\displaystyle \sum \limits _{i=0}^{s}} \frac{1}{N}\left| \sum _{n=1}^{NM_{i}^{N}(t)}B_{inj}^{N\rho }(t)-Nm_{i} (t)K_{ij}^{\rho }\left( a_{t}^{\pi ,N},c(t)\right) \right| \nonumber \\\le & {} {\displaystyle \sum \limits _{i=0}^{s}} \frac{1}{N}\left| \sum _{n=1}^{NM_{i}^{N}(t)}B_{inj}^{N\rho }(t)-NM_{i} ^{N}(t)K_{ij}^{\rho }\left( a_{t}^{\pi ,N},C^{N}(t)\right) \right| \nonumber \\&+ {\displaystyle \sum \limits _{i=0}^{s}} \left| M_{i}^{N}(t)-m_{i}(t)\right| K_{ij}^{\rho }\left( a_{t}^{\pi ,N},C^{N}(t)\right) \nonumber \\&+ {\displaystyle \sum \limits _{i=0}^{s}} m_{i}(t)\left| K_{ij}^{\rho }\left( a_{t}^{\pi ,N},C^{N}(t)\right) -K_{ij}^{\rho }\left( a_{t}^{\pi ,N},c(t)\right) \right| \nonumber \\&<(s+1)\varepsilon +(s+1)\varepsilon _{t}+L_{K}\varepsilon _{t}. \end{aligned}$$
(5.3)

Hence, since the right-hand side of this last inequality does not depend on j, we have

$$\begin{aligned} \left\| \vec {M}^{N}(t+1)-\vec {m}(t+1)\right\| _{\infty }^{1} \le (s+1)\varepsilon +(s+1)\varepsilon _{t}+L_{K}\varepsilon _{t}. \end{aligned}$$

On the other hand, since g is a Lipschitz function (see Assumption 2.1), expressions (2.19) and (3.2) together with (5.2) and (5.3) lead to

$$\begin{aligned}&\left\| C^{N}(t+1)-c(t+1)\right\| _{\infty }^{2}\le \left\| g(C^{N}(t),\vec {M}^{N}(t+1),a_{t}^{\pi ,N})\right. \\&\quad \quad \left. -g(c(t),\vec {m}(t+1),a_{t}^{\pi ,N})\right\| _{\infty }^{2}\\&<L_{g}\max \left\{ \varepsilon _{t},(s+1)\varepsilon +(s+1)\varepsilon _{t}+L_{K}\varepsilon _{t}\right\} =L_{g}\left( (s+1)\varepsilon +(s+1)\varepsilon _{t}+L_{K}\varepsilon _{t}\right) , \end{aligned}$$

which implies that on the set \(\bar{\Omega }\) (recall \(L_{g}\ge 1\))

$$\begin{aligned} \left\| y^{N}(t+1)-y(t+1)\right\| _{\infty }<L_{g}\left( (s+1)\varepsilon +(s+1)\varepsilon _{t}+L_{K}\varepsilon _{t}\right) . \end{aligned}$$

Now, since \(\left\| y^{N}(0)-y(0)\right\| _{\infty }=\varepsilon _{0}=0\) (see Assumption 5.1(a)), an inductive procedure and a straightforward calculation yield that, on the set \(\bar{\Omega }\),

$$\begin{aligned} \left\| y^{N}(t+1)-y(t+1)\right\| _{\infty }<L_{g}(s+1)\varepsilon \beta _{t},\ \ t\in \mathbb {N}_{0}, \end{aligned}$$

where \(\left\{ \beta _{t}\right\} \) is an increasing sequence. Then, for a fixed \(T\in \mathbb {N},\)

$$\begin{aligned} \left\| y^{N}(t+1)-y(t+1)\right\| _{\infty }<L_{g}(s+1)\varepsilon \beta _{T},\ \ \forall 0\le t\le T \end{aligned}$$

on the set \(\bar{\Omega }.\) Therefore, under the policy \(\pi \in \Pi _{M}\),

$$\begin{aligned} P_{y}^{\pi }\left[ \sup _{0\le t\le T}\left\| y^{N}(t+1)-y(t+1)\right\| _{\infty }<L_{g}(s+1)\varepsilon \beta _{T}\right] \ge 1-2e^{-2N\varepsilon ^{2} }, \end{aligned}$$

which, setting \(\gamma _{T}(\varepsilon ):=L_{g}(s+1)\varepsilon \beta _{T}\) and \(K=\lambda =2,\) implies

$$\begin{aligned} \sup _{\pi \in \Pi _{M}}P_{y}^{\pi }\left\{ \sup _{0\le t\le T}\left\| y^{N}(t)-y(t)\right\| _{\infty }\ge \gamma _{T}(\varepsilon )\right\} \le KTe^{-\lambda N\varepsilon ^{2}}. \end{aligned}$$

Finally, we observe that \(\gamma _{T}(\varepsilon )\rightarrow 0\) as \(\varepsilon \rightarrow 0.\) \(\square \)
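For clarity, the inductive calculation behind \(\left\{ \beta _{t}\right\} \) in the proof above can be unrolled explicitly. Writing \(\varepsilon _{t+1}:=L_{g}\left( (s+1)\varepsilon +(s+1+L_{K})\varepsilon _{t}\right) \) with \(\varepsilon _{0}=0\), as read off from the preceding displays, one natural choice is

$$\begin{aligned} \varepsilon _{t+1}=L_{g}(s+1)\varepsilon \sum _{k=0}^{t}\left( L_{g}(s+1+L_{K})\right) ^{k}=:L_{g}(s+1)\varepsilon \beta _{t},\ \ \ \beta _{t}=\sum _{k=0}^{t}\left( L_{g}(s+1+L_{K})\right) ^{k}, \end{aligned}$$

and \(\left\{ \beta _{t}\right\} \) is indeed increasing since \(L_{g}\ge 1\). The exact form of \(\beta _{t}\) is immaterial for the argument; only its monotonicity in t is used.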

Now we introduce the following additional notation: For any \(T\in \mathbb {N}\) we denote

$$\begin{aligned} Y_{T}:=\sup _{0\le t\le T}\Vert y^{N}(t)-y(t)\Vert _{\infty } \end{aligned}$$
(5.4)

and

$$\begin{aligned} \mathcal {K}(T):=(L_{g})^{T}\max \{L_{g},diam(A)\}, \end{aligned}$$
(5.5)

where \(L_{g}\ge 1\) is the Lipschitz constant in Assumption 2.1 (b) and \(diam(A):=\sup _{(a,a^{\prime })\in A\times A}d(a,a^{\prime })\).

Recall that for any given \(t\in \mathbb {N}_{0}\),

$$\begin{aligned} \Vert y^{N}(t)-y(t)\Vert _{\infty }=\max \left\{ \Vert \vec {M}^{N}(t)-\vec {m}(t)\Vert _{\infty }^{1}\ ,\ \Vert C^{N}(t)-c(t)\Vert _{\infty }^{2}\right\} . \end{aligned}$$
(5.6)

We are now in a position to state our main results. First, we provide a bound for the gap between the value functions \(V_{*}^{N}\) and \(v_{*}\), which in turn defines an approximation scheme as \(N\rightarrow \infty \). Next, we show that the control policy \(\hat{\pi }\) is eventually optimal on the control model \(\mathcal {M}_{N}\) in an asymptotic sense.

Theorem 5.3

Under Assumptions 2.1, 4.2, and 5.1, the following statements hold true:

  1. (a)

    For each \(T\in \mathbb {N},\) \(0\le t\le T,\) and \(y\in \mathbb {Y}_{N},\)

    $$\begin{aligned}&\sup _{\varphi \in \Pi _{M}}E_{y}^{\varphi }\left| V_{*}^{N}(y^{N} (t))-v_{*}(y(t))\right| \le \frac{2R\alpha ^{T}}{1-\alpha }+L_{r} \frac{1-\alpha ^{T}}{1-\alpha }\nonumber \\&\times \left[ KTe^{-\lambda N\varepsilon ^{2} }(1+\mathcal {K}(T))+\gamma _{T}(\varepsilon )\right] . \end{aligned}$$
    (5.7)
  2. (b)

    The control policy \(\hat{\pi }\in \Pi _{M}\) estimated in (4.7) is eventually asymptotically optimal for the \(N-\)Markov control model \(\mathcal {M}_{N}\) as \(N\rightarrow \infty \); that is,

    $$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{N\rightarrow \infty }E_{y}^{\hat{\pi }}\Phi ^{N}(y^{N}(t),\hat{f}_{t})=0, \end{aligned}$$
    (5.8)

    where

    $$\begin{aligned} \Phi ^{N}(y^{N},a):=r(y^{N},a)+\alpha \int _{\mathbb {R}^{N}}V_{*}^{N}\left[ H_{\rho }^{N}\left( y^{N},a,w\right) \right] \theta (dw)-V_{*}^{N} (y^{N}),\ \ y^{N}\in \mathbb {Y}_{N} \end{aligned}$$
    (5.9)

    is the discrepancy function in the \(N-\)MCM \(\mathcal {M}_{N}\) (see also (3.13)).
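To get a feel for the bound (5.7), its right-hand side can be evaluated numerically. The following is a toy sketch: every constant below (\(\alpha ,R,L_{r},L_{g},L_{K},s,K,\lambda \), diam(A)) is an illustrative placeholder, not a value from the paper, and \(\gamma _{T}(\varepsilon )\) and \(\mathcal {K}(T)\) are taken in the forms obtained in this section.

```python
import math

# Toy numerical evaluation of the right-hand side of (5.7).
# All constants are illustrative placeholders, not values from the paper.
alpha, R, L_r = 0.9, 1.0, 1.0      # discount factor, cost bound, cost Lipschitz constant
L_g, L_K, s = 1.2, 1.0, 3          # Lipschitz constants and number of states minus one
K, lam, diam_A = 2.0, 2.0, 1.0     # constants from Assumption 5.1(b) and diam(A)

def rhs(T, eps, N):
    """Right-hand side of (5.7) for horizon T, accuracy eps, population size N."""
    K_T = L_g**T * max(L_g, diam_A)                    # \mathcal{K}(T) as in (5.5)
    beta_T = sum((L_g * (s + 1 + L_K))**k for k in range(T + 1))
    gamma = L_g * (s + 1) * eps * beta_T               # \gamma_T(eps) from Lemma 5.2
    return (2 * R * alpha**T / (1 - alpha)
            + L_r * (1 - alpha**T) / (1 - alpha)
              * (K * T * math.exp(-lam * N * eps**2) * (1 + K_T) + gamma))

# For fixed T and eps, enlarging N shrinks the exponential term, hence the bound.
assert rhs(20, 1e-4, 10**10) < rhs(20, 1e-4, 10**6)
```

This illustrates the order of the limits implicit in (5.7): one first fixes T, then \(\varepsilon \) small, then lets N grow.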

In the remainder of this section we will assume that Assumptions 2.1, 4.2, and 5.1 hold true. Based on this, the proof of Theorem 5.3 will follow from the propositions below.

Proposition 5.4

  1. (a)

    For each \(\pi \in \Pi _{M}\) and \(T\in \mathbb {N}\),

    $$\begin{aligned} Y_{T}:=\sup _{0\le t\le T}\Vert y^{N}(t)-y(t)\Vert _{\infty }\le \mathcal {K}(T). \end{aligned}$$
    (5.10)
  2. (b)

    For each \(y\in \mathbb {Y}_{N}\) and \(T\in \mathbb {N}\),

$$\begin{aligned} \sup _{\pi \in \Pi _{M}}E_{y}^{\pi }\left[ \sup _{0\le t\le T}\Vert y^{N}(t)-y(t)\Vert _{\infty }\right] \le KTe^{-\lambda N\varepsilon ^{2} }(1+\mathcal {K}(T))+\gamma _{T}(\varepsilon ). \end{aligned}$$
    (5.11)

Proof

(a) To obtain (5.10) it is sufficient to prove that for each \(t\in \mathbb {N}_{0}\) and \(\pi \in \Pi _{M}\)

$$\begin{aligned} \Vert y^{N}(t)-y(t)\Vert _{\infty }\le (L_{g})^{t-1}\max \{L_{g},diam(A)\}. \end{aligned}$$
(5.12)

We thus focus on proving (5.12). Notice that under Assumption 5.1(a) we have \(a_{0}^{\pi ,N}=a_{0}^{\pi }=:a_{0}\in A\) and \(\Vert y^{N}(0)-y(0)\Vert _{\infty }=0\). On the other hand, since \(\vec {M}^{N}(t)\) and \(\vec {m}(t)\) are probability measures, it follows that \(\Vert \vec {M} ^{N}(t)-\vec {m}(t)\Vert _{\infty }^{1}\le 1\) for all \(t\in \mathbb {N}_{0}\). Hence, because \(L_{g}\ge 1,\) the proof reduces to analyzing the norm \(\Vert \cdot \Vert _{\infty }^{2}\) in (5.6). In particular, (5.12) will be proved if we show that

$$\begin{aligned} \Vert C^{N}(t)-c(t)\Vert _{\infty }^{2}\le (L_{g})^{t-1}\max \{L_{g} ,diam(A)\}, \quad \forall t\in \mathbb {N}_{0}. \end{aligned}$$
(5.13)

To this end, we proceed by induction. First, observe that from (2.3) and (3.2) we obtain

$$\begin{aligned} \Vert C^{N}(1)-c(1)\Vert _{\infty }^{2}&=\Vert g(c_{0},\vec {M}^{N} (1),a_{0})-g(c_{0},\vec {m}(1),a_{0})\Vert _{\infty }^{2}\\&\le L_{g}\Vert \vec {M}^{N}(1)-\vec {m}(1)\Vert _{\infty }^{1}\le L_{g} \quad \ \text{(by } \text{(2.8)) }. \end{aligned}$$

Also,

$$\begin{aligned} \Vert C^{N}(2)-c(2)\Vert _{\infty }^{2}&=\Vert g(C^{N}(1),\vec {M} ^{N}(2),a_{1}^{\pi ,N})-g(c(1),\vec {m}(2),a_{1}^{\pi })\Vert _{\infty }^{2}\\&\le L_{g}\max \left\{ \Vert C^{N}(1)-c(1)\Vert _{\infty }^{2},\ \Vert \vec {M}^{N}(2)-\vec {m}(2)\Vert _{\infty }^{1},\ \right. \\&\quad \left. d_{A}(a_{1}^{\pi ,N},a_{1}^{\pi })\right\} \\&\le L_{g}\max \left\{ L_{g},\ 1,\ diam(A)\right\} =L_{g}\max \left\{ L_{g},diam(A)\right\} . \end{aligned}$$

Now, assume that (5.13) holds for some \(t\in \mathbb {N}\). Then

$$\begin{aligned} \Vert C^{N}(t+1)-c(t+1)\Vert _{\infty }^{2}=\Vert g(C^{N}(t),\vec {M} ^{N}(t+1),a_{t}^{\pi ,N})-g(c(t),\vec {m}(t+1),a_{t}^{\pi })\Vert _{\infty }^{2}\\ \le L_{g}\max \left\{ \Vert C^{N}(t)-c(t)\Vert _{\infty }^{2},\ \Vert \vec {M}^{N}(t+1)-\vec {m}(t+1)\Vert _{\infty }^{1},\ d_{A}(a_{t}^{\pi ,N},a_{t}^{\pi })\right\} \quad \text{(by } \text{(2.8)) }\\ \le L_{g}\max \left\{ (L_{g})^{t-1}\max \{L_{g},diam(A)\},1,diam(A)\right\} \quad \text{(by } \text{(5.13)) }\\ \le (L_{g})^{t}\max \{L_{g},diam(A)\}. \end{aligned}$$

This proves (5.13), which in turn yields (5.12) and (5.10).

(b) Observe that for each \(y\in \mathbb {Y}_{N}\), \(\pi \in \Pi _{M},\) \(T\in \mathbb {N}\), and \(\varepsilon >0,\) the expectation in (5.11) satisfies (see (5.4))

$$\begin{aligned}&E_{y}^{\pi }[Y_{T}]=E_{y}^{\pi }\left[ Y_{T}I_{\{Y_{T}\ge \gamma _{T}(\varepsilon )\}}+Y_{T}{I}_{\{Y_{T}<\gamma _{T}(\varepsilon )\}}\right] \nonumber \\&\le E_{y}^{\pi }\left[ Y_{T}{I}_{\{Y_{T}\ge \gamma _{T} (\varepsilon )\}}\right] +\gamma _{T}(\varepsilon )P_{y}^{\pi }(Y_{T}<\gamma _{T}(\varepsilon ))\le E_{y}^{\pi }\left[ Y_{T}{I}_{\{Y_{T}\ge \gamma _{T}(\varepsilon )\}}\right] +\gamma _{T}(\varepsilon ). \end{aligned}$$
(5.14)

On the other hand, by (5.10) and the nonnegativity of \(Y_{T}\), we have

$$\begin{aligned} \frac{Y_{T}}{1+\mathcal {K}(T)}\le \frac{Y_{T}}{1+Y_{T}}\le 1, \end{aligned}$$

which implies

$$\begin{aligned} \frac{Y_{T}}{1+\mathcal {K}(T)}{I}_{\{Y_{T}\ge \gamma _{T}(\varepsilon )\}}\le {I}_{\{Y_{T}\ge \gamma _{T}(\varepsilon )\}}. \end{aligned}$$

This fact, together with the definition of \(Y_{T}\) and Assumption 5.1(b), gives

$$\begin{aligned} \frac{1}{1+\mathcal {K}(T)}E_{y}^{\pi }[Y_{T}{I}_{\{Y_{T}\ge \gamma _{T}(\varepsilon )\}}]\le P_{y}^{\pi }(Y_{T}\ge \gamma _{T}(\varepsilon ))\le KTe^{-\lambda N\varepsilon ^{2}},\ \ \pi \in \Pi _{M}. \end{aligned}$$

Finally, from (5.14) we get

$$\begin{aligned} E_{y}^{\pi }\left[ Y_{T}\right] \le KTe^{-\lambda N\varepsilon ^{2} }(1+\mathcal {K}(T))+\gamma _{T}(\varepsilon ), \quad \pi \in \Pi _{M}, \end{aligned}$$
(5.15)

and by taking the supremum over \(\pi \in \Pi _{M}\) in (5.15) we obtain part (b). \(\square \)
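The truncation device in part (b), splitting \(E_{y}^{\pi }[Y_{T}]\) over \(\{Y_{T}\ge \gamma _{T}(\varepsilon )\}\) and its complement, holds for any nonnegative random variable bounded by a constant. A toy numerical sketch (the uniform distribution and the constants below are illustrative only and play the roles of \(\mathcal {K}(T)\) and \(\gamma _{T}(\varepsilon )\)):

```python
import random

# Toy check of the truncation bound from the proof of Proposition 5.4(b):
# for a nonnegative r.v. Y bounded by K_T,  E[Y] <= (1 + K_T) * P(Y >= g) + g.
random.seed(0)
K_T = 3.0        # plays the role of the deterministic bound K(T)
g = 0.5          # plays the role of gamma_T(epsilon)
samples = [random.uniform(0.0, K_T) for _ in range(100_000)]

e_y = sum(samples) / len(samples)                          # empirical E[Y]
p_tail = sum(1 for y in samples if y >= g) / len(samples)  # empirical P(Y >= g)
bound = (1.0 + K_T) * p_tail + g

assert e_y <= bound
```

The inequality holds sample-wise here: \(Y\,I_{\{Y\ge g\}}\le (1+K_{T})I_{\{Y\ge g\}}\) and \(Y\,I_{\{Y<g\}}\le g\), exactly as in the proof.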

The next results concern the finite-horizon discounted cost criteria for the \(N-\)MCM \(\mathcal {M}_{N}\) and for the mean field control model \(\mathcal {M}\). For any \(\pi \in \Pi _{M}\), \(y\in \mathbb {Y}_{N}\subset \mathbb {Y}\), and \(T\in \mathbb {N}\), we define

$$\begin{aligned} V_{T}^{N}(\pi ,y):=E_{y}^{\pi }\left[ \sum _{k=0}^{T-1}\alpha ^{k}r(y^{N} (k),a_{k}^{\pi ,N})\right] \quad \text{ and }\quad v_{T}(\pi ,y):=\sum _{k=0}^{T-1} \alpha ^{k}r(y(k),a_{k}^{\pi }). \end{aligned}$$

Proposition 5.5

Let \(L_{r}\) and R be the constants in Assumption 2.1(d). Then, for each \(y\in \mathbb {Y}_{N}\), \(\varepsilon >0,\) \(T\in \mathbb {N}\), and \(0\le t\le T\), the following statements hold true:

  1. (a)
    $$\begin{aligned} \sup _{\pi \in \Pi }E_{y}^{\pi }\left| r(y^{N}(t),a_{t}^{\pi ,N})-r(y(t),a_{t} ^{\pi })\right| \le L_{r}\left( KTe^{-\lambda N\varepsilon ^{2} }(1+\mathcal {K}(T))+\gamma _{T}(\varepsilon )\right) ; \end{aligned}$$
    (5.16)
  2. (b)
    $$\begin{aligned}&\sup _{\varphi \in \Pi }E_{y}^{\varphi }\left[ \sup _{\pi \in \Pi }\left| V_{T}^{N}(\pi ,y^{N}(t))-v_{T}(\pi ,y(t))\right| \right] \le L_{r} \frac{1-\alpha ^{T}}{1-\alpha }\nonumber \\&\quad \times \left[ KTe^{-\lambda N\varepsilon ^{2} }(1+\mathcal {K}(T))+\gamma _{T}(\varepsilon )\right] ; \end{aligned}$$
    (5.17)
  3. (c)
    $$\begin{aligned} \sup _{\varphi \in \Pi }E_{y}^{\varphi }\left[ \sup _{\pi \in \Pi }\left| V^{N}(\pi ,y^{N}(t))-V_{T}^{N}(\pi ,y^{N}(t))\right| \right] \le \frac{R\alpha ^{T}}{1-\alpha }; \end{aligned}$$
    (5.18)
  4. (d)
    $$\begin{aligned} \sup _{\varphi \in \Pi }E_{y}^{\varphi }\left[ \sup _{\pi \in \Pi }\left| v(\pi ,y(t))-v_{T}(\pi ,y(t))\right| \right] \le \frac{R\alpha ^{T} }{1-\alpha }. \end{aligned}$$
    (5.19)

Proof

(a) Let us fix any \(\pi \in \Pi _{M}\) and \(T\in \mathbb {N}\). Then Assumption 2.1(d), together with Proposition 5.4, leads to the following relations:

$$\begin{aligned}&E_{y}^{\pi }\left| r(y^{N}(t),a_{t}^{\pi ,N})-r(y(t),a_{t}^{\pi })\right| \le L_{r}E_{y}^{\pi }\left[ \Vert y^{N}(t)-y(t)\Vert _{\infty }\right] \\&\quad \le L_{r}E_{y}^{\pi }\left[ \sup _{0\le t\le T}\Vert y^{N}(t)-y(t)\Vert _{\infty }\right] \le L_{r}\left( KTe^{-\lambda N\varepsilon ^{2} }(1+\mathcal {K}(T))+\gamma _{T}(\varepsilon )\right) .\nonumber \end{aligned}$$
(5.20)

This proves part (a).

(b) For each \(\pi \in \Pi _{M}\),

$$\begin{aligned}&|V_{T}^{N}(\pi ,y^{N}(t))-v_{T}(\pi ,y(t))|=\left| E_{y^{N}(t)}^{\pi }\left\{ \sum _{k=0}^{T-1}\alpha ^{k}r(y^{N}(k),a_{k}^{\pi ,N})\right. \right. \\&\quad \qquad \qquad \qquad \left. \left. -\sum _{k=0}^{T-1}\alpha ^{k}r(y(k),a_{k}^{\pi })\right\} \right| \\&\le \sum _{k=0}^{T-1}\alpha ^{k}E_{y^{N}(t)}^{\pi }\left| r(y^{N} (k),a_{k}^{\pi ,N})-r(y(k),a_{k}^{\pi })\right| \\&\le L_{r}\frac{1-\alpha ^{T}}{1-\alpha }\left[ KTe^{-\lambda N\varepsilon ^{2}}(1+\mathcal {K} (T))+\gamma _{T}(\varepsilon )\right] , \end{aligned}$$

where the last inequality follows from (5.20). This gives

$$\begin{aligned}&\sup _{\pi \in \Pi }\left| V_{T}^{N}(\pi ,y^{N}(t))-v_{T}(\pi ,y(t))\right| \le L_{r}\frac{1-\alpha ^{T}}{1-\alpha }\left[ KTe^{-\lambda N\varepsilon ^{2} }(1+\mathcal {K}(T))+\gamma _{T}(\varepsilon )\right] ,\\&\quad \forall t\in \mathbb {N}_{0}. \end{aligned}$$

Taking the expectation \(E_{y}^{\varphi }\) on both sides of the above expression, and then taking the supremum over \(\varphi \in \Pi _{M}\), we obtain (5.17).

(c) For each \(\pi \in \Pi _{M}\), we have

$$\begin{aligned}&\left| V^{N}(\pi ,y^{N}(t))-V_{T}^{N}(\pi ,y^{N}(t))\right| \\&\le \left| E_{y^{N}(t)}^{\pi }\left\{ \sum _{k=0}^{\infty }\alpha ^{k} r(y^{N}(k),a_{k}^{\pi ,N})-\sum _{k=0}^{T-1}\alpha ^{k}r(y^{N}(k),a_{k}^{\pi ,N})\right\} \right| \\&\le \sum _{k=T}^{\infty }\alpha ^{k}E_{y^{N}(t)}^{\pi }|r(y^{N}(k),a_{k} ^{\pi ,N})|\le R\sum _{k=T}^{\infty }\alpha ^{k}\le \frac{R\alpha ^{T}}{1-\alpha }. \end{aligned}$$

Hence, (5.18) follows immediately.

(d) It follows by the same arguments as in (c). \(\square \)
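Parts (c) and (d) rest on the elementary geometric tail estimate \(R\sum _{k\ge T}\alpha ^{k}=R\alpha ^{T}/(1-\alpha )\). A quick numerical check (the values of \(\alpha \), R, and T are illustrative):

```python
# Toy check of the geometric tail estimate used in parts (c) and (d):
# sum_{k >= T} R * alpha**k  <=  R * alpha**T / (1 - alpha).
alpha, R, T = 0.9, 5.0, 25

tail = sum(R * alpha**k for k in range(T, 2000))  # truncation approximates the infinite tail
bound = R * alpha**T / (1 - alpha)

assert tail <= bound
```

The truncated sum is strictly below the bound, and approaches it as the truncation point grows, since the infinite series equals the bound exactly.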

5.1 Proof of Theorem 5.3(a)

Let \(\pi _{*}^{N}=\left\{ f_{*}^{N}\right\} \in \Pi _{M}^{N}\) be an optimal stationary policy for the \(N-\)MCM \(\mathcal {M}_{N}\) (see Proposition 2.4(b)), and, for an arbitrary selector \(\tilde{f}\in \mathbb {F},\) define the stationary policy \(\bar{\pi }=\left\{ \bar{f}\right\} \in \Pi _{M}\), where \(\bar{f}:\mathbb {Y}\rightarrow A\) is given by

$$\begin{aligned} \bar{f}(y)=f_{*}^{N}(y)I_{\mathbb {Y}_{N}}(y)+\tilde{f}(y)I_{\left[ \mathbb {Y}_{N}\right] ^{c}}(y). \end{aligned}$$

In addition, let \(\varphi \in \Pi _{M}\) be an arbitrary policy and let us denote \(y_{\varphi }^{N}(t)=y^{N}(t)\in \mathbb {Y}_{N}\) and \(y_{\varphi }(t):=y(t)\in \mathbb {Y}.\) Observe that for each \(t\in \mathbb {N}_{0},\)

$$\begin{aligned} V_{*}^{N}(y^{N}(t))=V^{N}(\pi _{*}^{N},y^{N}(t))=V^{N}(\bar{\pi } ,y^{N}(t))\le \sup _{\pi \in \Pi _{M}}V^{N}(\pi ,y^{N}(t)). \end{aligned}$$

Hence,

$$\begin{aligned} V_{*}^{N}(y^{N}(t))-v_{*}(y(t))\le \sup _{\pi \in \Pi _{M}}V^{N}(\pi ,y^{N}(t))-\inf _{\pi \in \Pi _{M}}v(\pi ,y(t)) \end{aligned}$$

which in turn implies

$$\begin{aligned} \left| V_{*}^{N}(y^{N}(t))-v_{*}(y(t))\right| \le \sup _{\pi \in \Pi _{M}}\left| V^{N}(\pi ,y^{N}(t))-v(\pi ,y(t))\right| ,\ \ t\in \mathbb {N}_{0}. \end{aligned}$$

Therefore, for each \(y\in \mathbb {Y}_{N}\) and \(0\le t\le T\),

$$\begin{aligned}&E_{y}^{\varphi }\left| V_{*}^{N}(y^{N}(t))-v_{*}(y(t))\right| \le E_{y}^{\varphi }\left[ \sup _{\pi \in \Pi _{M}}\left| V^{N}(\pi ,y^{N}(t))-v(\pi ,y(t))\right| \right] \\&\le E_{y}^{\varphi }\left[ \sup _{\pi \in \Pi _{M}}\left\{ \left| V^{N}(\pi ,y^{N}(t))-V_{T}^{N}(\pi ,y^{N}(t))\right| + \left| V_{T} ^{N}(\pi ,y^{N}(t))-v_{T}(\pi ,y(t))\right| \right. \right. \\&\quad \left. \left. +\left| v_{T}(\pi ,y(t))-v(\pi ,y(t))\right| \right\} \right] \\&\le E_{y}^{\varphi }\left[ \sup _{\pi \in \Pi _{M}}\left| V^{N}(\pi ,y^{N}(t))-V_{T}^{N}(\pi ,y^{N}(t))\right| \right] \\&\quad +E_{y}^{\varphi }\left[ \sup _{\pi \in \Pi _{M}}\left| V_{T}^{N}(\pi ,y^{N}(t))-v_{T}(\pi ,y(t))\right| \right] \\&\quad +E_{y}^{\varphi }\left[ \sup _{\pi \in \Pi _{M}}\left| v_{T}(\pi ,y(t))-v(\pi ,y(t))\right| \right] \\&\le \frac{2R\alpha ^{T}}{1-\alpha }+L_{r}\frac{1-\alpha ^{T}}{1-\alpha }\left[ KTe^{-\lambda N\varepsilon ^{2}}(1+\mathcal {K}(T))+\gamma _{T}(\varepsilon )\right] , \end{aligned}$$

where the last inequality is due to Proposition 5.5. Finally, by taking supremum over \(\varphi \in \Pi _{M}\), we obtain (5.7).\(\square \)

5.2 Proof of Theorem 5.3(b)

For ease of notation, we let \(\hat{a}_{t}^{N}:=a_{t}^{\hat{\pi },N}\) and \(\hat{a}_{t}:=a_{t}^{\hat{\pi }}\). Then consider \(\left\{ (y^{N}(t),\hat{a} _{t}^{N})\right\} \in \mathbb {Y}_{N}\times A\) and \(\left\{ (y(t),\hat{a} _{t})\right\} \in \mathbb {Y}\times A\), the sequences of state-action pairs corresponding to the application of the policy \(\hat{\pi }\) (see (4.7)). For each \(t\in \mathbb {N}_{0},\) we define the random variable

$$\begin{aligned} \Delta _{t}^{N}:=\left| \Phi ^{N}(y^{N}(t),\hat{a}_{t}^{N})-\Phi (y(t),\hat{a}_{t})\right| . \end{aligned}$$

Then, from the definition of the discrepancy functions \(\Phi ^{N}\) and \(\Phi \) given in (5.9) and (3.13), respectively, we have for each \(t\in \mathbb {N}_{0},\)

$$\begin{aligned}&\Delta _{t}^{N}\le \left| r(y^{N}(t),\hat{a}_{t}^{N})-r(y(t),\hat{a} _{t})\right| +\left| V_{*}^{N}(y^{N}(t))-v_{*}(y(t))\right| \nonumber \\&\quad +\alpha \left| \int _{\mathbb {R}^{N}}V_{*}^{N}\left[ H_{\rho } ^{N}\left( y^{N}(t),\hat{a}_{t}^{N},w\right) \right] \theta (dw)-v_{*}(H_{\rho }\left( y(t),\hat{a}_{t}\right) )\right| \nonumber \\&\le \left| r(y^{N}(t),\hat{a}_{t}^{N})-r(y(t),\hat{a}_{t})\right| +\left| V_{*}^{N}(y^{N}(t))-v_{*}(y(t))\right| \nonumber \\&\quad +\left| \int _{\mathbb {R}^{N}}\left\{ V_{*}^{N}\left[ H_{\rho } ^{N}\left( y^{N}(t),\hat{a}_{t}^{N},w\right) \right] -v_{*}(y(t+1))\right\} \theta (dw)\right| \quad \text{(by } \text{(3.5)) }\nonumber \\&=\left| r(y^{N}(t),\hat{a}_{t}^{N})-r(y(t),\hat{a}_{t})\right| +\left| V_{*}^{N}(y^{N}(t))-v_{*}(y(t))\right| \nonumber \\&\quad +\left| E_{y}^{\hat{\pi }}\left[ V_{*}^{N}(y^{N}(t+1))-v_{*}(y(t+1))\ |\ h_{t}^{N},\hat{a}_{t}^{N}\right] \right| \quad \text{(by } \text{(5.21)) }\nonumber \\&\le \left| r(y^{N}(t),\hat{a}_{t}^{N})-r(y(t),\hat{a}_{t})\right| +\left| V_{*}^{N}(y^{N}(t))-v_{*}(y(t))\right| \nonumber \\&\quad +\,E_{y}^{\hat{\pi }}\left[ \left| V_{*}^{N}(y^{N}(t+1))-v_{*}(y(t+1))\right| \ |\ h_{t}^{N},\hat{a}_{t}^{N}\right] . \end{aligned}$$
(5.21)

Taking the expectation \(E_{y}^{\hat{\pi }}\) in (5.21) and using properties of conditional expectation, we get

$$\begin{aligned} E_{y}^{\hat{\pi }}\left[ \Delta _{t}^{N}\right]&\le E_{y}^{\hat{\pi } }\left| r(y^{N}(t),\hat{a}_{t}^{N})-r(y(t),\hat{a}_{t})\right| + E_{y}^{\hat{\pi }}\left| V_{*}^{N}(y^{N}(t))-v_{*}(y(t))\right| \\&\quad +E_{y}^{\hat{\pi }}\left| V_{*}^{N}(y^{N}(t+1))-v_{*}(y(t+1))\right| . \end{aligned}$$

Furthermore, Proposition 5.5 and Theorem 5.3(a) yield

$$\begin{aligned} E_{y}^{\hat{\pi }}\left[ \Delta _{t}^{N}\right]&\le L_{r}\left[ KTe^{-\lambda N\varepsilon ^{2}}(1+\mathcal {K}(T))+\gamma _{T}(\varepsilon )\right] +\frac{4R\alpha ^{T}}{1-\alpha }\\&\quad +2L_{r}\frac{1-\alpha ^{T}}{1-\alpha }\left[ KTe^{-\lambda N\varepsilon ^{2} }(1+\mathcal {K}(T))+\gamma _{T}(\varepsilon )\right] , \end{aligned}$$

for any arbitrary \(\varepsilon >0\) and \(T>t.\)

Also, observe that

$$\begin{aligned}&E_{y}^{\hat{\pi }}\left[ \Phi ^{N}(y^{N}(t),\hat{a}_{t}^{N})\right] \le E_{y}^{\hat{\pi }}\left[ |\Phi ^{N}(y^{N}(t),\hat{a}_{t}^{N})-\Phi (y(t),\hat{a}_{t})|\right] +E_{y}^{\hat{\pi }}\left[ \Phi (y(t),\hat{a}_{t})\right] \\&\quad =\, E_{y}^{\hat{\pi }}[\Delta _{t}^{N}]+E_{y}^{\hat{\pi }}\left[ \Phi (y(t),\hat{a}_{t})\right] \le L_{r}\left[ KTe^{-\lambda N\varepsilon ^{2} }(1+\mathcal {K}(T))+\gamma _{T}(\varepsilon )\right] +\frac{4R\alpha ^{T} }{1-\alpha }\\&\qquad +\,2L_{r}\frac{1-\alpha ^{T}}{1-\alpha }\left[ KTe^{-\lambda N\varepsilon ^{2} }(1+\mathcal {K}(T))+\gamma _{T}(\varepsilon )\right] +\quad E_{y}^{\hat{\pi }}\left[ \Phi (y(t),\hat{a}_{t})\right] . \end{aligned}$$

Thus, taking the limit as \(N\rightarrow \infty \), we obtain

$$\begin{aligned} 0&\le \lim _{N\rightarrow \infty }E_{y}^{\hat{\pi }}\left[ \Phi ^{N} (y^{N}(t),\hat{a}_{t}^{N})\right] \le L_{r}\gamma _{T}(\varepsilon )+\frac{4R\alpha ^{T}}{1-\alpha }+2L_{r}\frac{1-\alpha ^{T}}{1-\alpha }\gamma _{T}(\varepsilon )\nonumber \\&+E_{y}^{\hat{\pi }}\left[ \Phi (y(t),\hat{f}_{t}(y(t)))\right] . \end{aligned}$$
(5.22)

Finally, since \(\varepsilon \) and T are arbitrary, letting \(t\rightarrow \infty \) in (5.22) and applying Theorem 4.3 shows that

$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{N\rightarrow \infty }E_{y}^{\hat{\pi }}\left[ \Phi ^{N}(y^{N}(t),\hat{a}_{t}^{N})\right] =0 \end{aligned}$$

which proves the desired result.\(\square \)