1 Introduction

Mean-field games have been introduced in [28] and [34] to show the existence of approximate Nash equilibria for fully observed non-cooperative continuous time games, when the number of agents is large but finite. The underlying idea of the mean-field method is to transform the decentralized game problem to a centralized stochastic control problem using the so-called Nash certainty equivalence (NCE) principle [28]. The optimal solution of this control problem, calibrated appropriately using the empirical distribution of the term that (weakly) couples the players, provides an approximate Nash equilibrium for games with a sufficiently large number of agents. To obtain the optimal solution to the associated stochastic control problem, one should simultaneously solve a Fokker–Planck equation evolving forward in time and a Hamilton–Jacobi–Bellman equation evolving backward in time. We refer the reader to [7, 13, 14, 22, 27, 29, 35, 50] for studies of fully observed continuous-time mean-field games with different models and cost functions, such as games with major-minor players, risk-sensitive games, games with Markov jump parameters, and LQG games.

In this paper, we study discrete-time partially observed mean-field games with risk-sensitive optimality criteria. Risk-sensitivity brings in an element of robustness to decision making and has been widely used in many fields, such as control, economics, financial engineering, and operations research, among others. As opposed to risk-neutral optimization where only the mean value of the cost is considered, risk-sensitive one places positive weights on also the higher moments, thus capturing the risk element (see [3, 50, 52]). In the model we study in this paper, we have a large but finite number of agents interacting with each other through their individual dynamics and cost functions via the mean-field term (i.e., the empirical distribution of their states). It is known that establishing the existence of Nash equilibria for these types of games is quite difficult due to the (almost) decentralized and noisy nature of the information structure of the problem [4, 5]. Therefore, it is of interest to find an approximate equilibrium with reduced complexity. To that end, upon letting the number of agents go to infinity, the mean-field term converges to the distribution of the state of a single generic agent. This decouples the dynamics and cost functions of the agents from each other, and because of that, in the limiting case, a generic agent is faced with a stochastic control problem with a constraint on the distribution of the state at each time (i.e., a mean-field game problem). The main goal in these problems is to show the existence of a policy and a state distribution flow such that this policy is an optimal solution of the stochastic control problem when the total population behavior is modeled by the state distribution flow and the resulting distribution of each agent’s state is the same as the state distribution flow when the generic agent applies this policy. This equilibrium condition is called the Nash certainty equivalence (NCE) principle in the literature. In this paper, we first consider the existence of such an equilibrium for the limiting case and then establish that the policy in this equilibrium constitutes an approximate Nash equilibrium for finite-agent games with sufficiently many agents.

In the literature, partially observed mean-field games have not been studied much, especially in the discrete-time setup. Indeed, this work seems to be the first one that studies discrete-time risk-sensitive mean-field games under partial observations. Prior works have mostly considered the risk-neutral continuous-time setup. It is obvious that analyses of continuous-time and discrete-time setups are quite different, requiring different sets of tools. In [30], the authors study a partially observed continuous-time mean-field game with linear individual dynamics. In [43, 45, 46], the authors consider a continuous-time mean-field game with major-minor agents and nonlinear dynamics where the minor agents can partially observe the state of the major agent. In [44, 47], the same authors also develop a nonlinear filtering theory for McKean–Vlasov-type stochastic differential equations that arise as the infinite population limit of the partially observed differential game of the mean-field type. In [12], the authors study the linear quadratic mean-field game with major-minor agents where the minor agents can partially observe the state of the major agent. In [20, 21], the authors consider the linear quadratic mean-field game, again with major-minor agents where, in this case, both the minor agents and the major agent can partially observe the state of the major agent. In [48], the authors study a continuous-time partially observed stochastic control problem of the mean-field type and establish a maximum principle to characterize the optimal control. In [31], the authors consider a continuous-time mean-field game with linear individual dynamics where two types of partial information structure are considered: (i) agents cannot observe the white noise which is common to all agents; (ii) agents can access the additive white-noise version of their own states.

For risk-sensitive cost criteria, existing works are mostly on the continuous-time set-up, with [37], discussed further below, being one exception. Now, in continuous-time set-up, reference [50] studies a class of mean-field games with nonlinear individual dynamics and a risk-sensitive cost function. They characterize the mean-field equilibrium via coupled HJB and FP equations and explicit solutions to these equations are given when the individual state dynamics are linear. In [49], the author considers a continuous-time mean-field game with nonlinear individual dynamics, where state dynamics have \(L^p\)-norm structure. Stochastic maximum principle is used to characterize the optimal solution of the problem. In [17], the authors study a partially observed version of the continuous-time risk-sensitive mean-field game. They establish a stochastic maximum principle for the characterization of the mean-field equilibrium. Reference [36] considers continuous-time risk-sensitive mean-field games with linear individual dynamics and local state information for the players. First a generic risk-sensitive optimal control problem is solved which yields mean-field equilibrium, and then, it is shown that the policies in mean-field equilibrium lead to an approximate Nash equilibrium for games with a sufficiently large number of agents. It is also shown that this approximate Nash equilibrium is partially equivalent to the approximate Nash equilibrium of a certain robust mean-field game problem. Finally, [37] presents the counterparts of these results for the discrete-time linear-quadratic risk-sensitive mean-field game.

Here, we consider discrete-time mean-field games with Polish state, action, and observation spaces (i.e., complete and separable metric spaces) under risk-sensitive optimality criteria for the players. In the infinite population limit of such games, a generic agent should solve a partially observed stochastic control problem under the NCE principle. Due to the constraints induced by NCE principle, common techniques used to analyze partially observed stochastic control problems are not sufficient. To establish the existence of an equilibrium solution in the infinite population limit, we have to bring in the fixed-point approach that is used to obtain equilibria in classical game problems, along with the technique of converting partially observed optimal control problems to fully observed ones on the belief space. The definitions of the finite-agent game and the mean-field game problems are given in Sect. 2 and Sect. 3, respectively. In Sect. 4, we prove the existence of a mean-field equilibrium. In Sects. 5 and 6, we establish that the mean-field equilibrium policy is approximately Nash for finite-agent games with sufficiently many agents. In Sect. 7, we extend previous results to games with infinite-horizon risk-sensitive cost functions. Section 9 concludes the paper.

In an earlier paper [41], we studied the risk-neutral version of this problem under a similar set of assumptions on the system components. There are some parallels between the techniques used in this paper and those in [41] to show the existence of a mean-field equilibrium and to prove that the policies in mean-field equilibrium provide an approximate Nash equilibrium for games with large but finitely many agents. In this paper, we exploit this connection and refer the reader to [41] for proofs of certain results. We note, however, that as far as their analyses go, there are considerable technical differences between risk-sensitive and risk-neutral cost functions. The fact that, in the risk-sensitive case, the cost function is in a multiplicative form leads to complication in the analysis of the optimality condition. Therefore, to establish the existence of a mean-field equilibrium in the infinite-population limit and an approximate Nash equilibrium in the finite-agent case, we need to first transform the risk-sensitive problem to one where the cost function is risk-neutral and in an additive form. However, in this risk-neutral form, the one-stage cost function and the transition probability become non-homogeneous (i.e., time-dependent) as opposed to the risk-neutral problem in [41]. Hence, after a careful execution of this step, we can prove the existence of a mean-field equilibrium by adapting the technique developed in [41] to the non-homogeneous and finite-horizon case. We also note that in [42] we have studied the fully observed version of the same problem under a slightly different set of assumptions on the system components. Indeed, to prove the existence of an approximate Nash equilibrium, here we generalize the results established in [42] to the game models with expanding state spaces and non-homogeneous system components.

Notation. For a metric space \({\mathsf E}\), we let \(C_b({\mathsf E})\) denote the set of all bounded continuous real functions on \({\mathsf E}\), \({\mathcal P}({\mathsf E})\) denote the set of all Borel probability measures on \({\mathsf E}\), and \({\mathcal B}({\mathsf E})\) denote the collection of Borel sets. For any \({\mathsf E}\)-valued random element x, \(\mathcal{L}(x)(\,\cdot \,) \in {\mathcal P}({\mathsf E})\) denotes the distribution of x. A sequence \(\{\mu _n\}\) of measures on \({\mathsf E}\) is said to converge weakly to a measure \(\mu \) if \(\int _{{\mathsf E}} g(e) \mu _n(de)\rightarrow \int _{{\mathsf E}} g(e) \mu (de)\) for all \(g \in C_b({\mathsf E})\). For any \(\nu \in {\mathcal P}({\mathsf E})\) and measurable real function g on \({\mathsf E}\), we define \(\nu (g) = \int g d\nu \). For any subset B of \({\mathsf E}\), we let \(\partial B\) and \(B^c\) denote the boundary and complement of B, respectively. The notation \(v\sim \nu \) means that the random element v has distribution \(\nu \). Unless otherwise specified, the term “measurable" will refer to Borel measurability.

2 Finite Player Game Model

2.1 Original Game Model

Let \({\mathsf S}\), \({\mathsf A}\), and \({\mathsf Y}\) be Polish spaces. We consider a discrete-time partially observed N-agent mean-field game with a state space \({\mathsf S}\), an action space \({\mathsf A}\), and an observation space \({\mathsf Y}\). For every \(i \in \{1,2,\ldots ,N\}\), the state, the action, and the observation of Agent i at time t (\(t=0,1,2,\ldots \)) are, respectively, denoted by \(s^N_i(t) \in {\mathsf S}, \text { } u^N_i(t) \in {\mathsf A}, \text { } \text {and} \text { } g^N_i(t) \in {\mathsf Y}.\) We let \( d_t^{(N)}(\,\cdot \,) = \frac{1}{N} \sum _{i=1}^N \delta _{s_i^N(t)}(\,\cdot \,) \in {\mathcal P}({\mathsf S}) \) denote the empirical distribution of the states (i.e., mean-field term) at time t, where \(\delta _s\in {\mathcal P}({\mathsf S})\) is the Dirac measure at s; that is, \(\delta _s(A) = 1\) if \(s \in A\) and otherwise 0.

At the initial time step \(t=0\), the states \((s^N_1(0),\ldots ,s^N_N(0)) \sim \kappa _0 \otimes \ldots \otimes \kappa _0\) are independent and identically distributed according to \(\kappa _0\). For each \(t \ge 0\), the current-observations \((g^N_1(t),\ldots ,g^N_N(t))\) and the next-states \((s^N_1(t+1),\ldots ,s^N_N(t+1))\) are distributed according to the probability laws:

$$\begin{aligned} \prod ^N_{i=1} l\big ({\text {d}}g^N_i(t)\big |s^N_i(t)\big ) \text { }\text { and } \text { } \prod ^N_{i=1} q\big ({\text {d}}s^N_i(t+1)\big |s^N_i(t),u^N_i(t),d^{(N)}_t\big ), \end{aligned}$$
(1)

where \(q : {\mathsf S}\times {\mathsf A}\times {\mathcal P}({\mathsf S}) \rightarrow {\mathcal P}({\mathsf S})\) is the state transition kernel and \(l: {\mathsf S}\rightarrow {\mathcal P}({\mathsf Y})\) is the observation kernel. Note that the state dynamics of each agent are weakly coupled through the mean-field term \(d^{(N)}_t\).

For any Agent i, define the history spaces \({\mathsf G}_0 = {\mathsf Y}\) and \({\mathsf G}_{t}= ({\mathsf Y}\times {\mathsf A})^t\times {\mathsf Y}\) for \(t=1,2,\ldots \), all endowed with product Borel \(\sigma \)-algebras. A policy for Agent i is a sequence \(\pi ^i=\{\pi _{t}^i\}\) of stochastic kernels on \({\mathsf A}\) given \({\mathsf G}_{t}\); that is, for any \(t\ge 0\), \( u_i^N(t) \sim \pi _t^i(\cdot |\gamma _i^N(t)), \) where \(\gamma ^N_i(t) = \big (g^N_i(t),u^N_i(t-1),g^N_i(t-1)\ldots ,u^N_i(0),g^N_i(0)\big )\) is the observation-action history observed by Agent i up to time t. The set of all policies for Agent i is denoted by \(\varPi _i\). Let \(\tilde{\varPi }_i\) be the set of policies in \(\varPi _i\) which only use the observations; that is, \(\pi \in \tilde{\varPi }_i\) if \(\pi _t: \prod _{k=0}^t {\mathsf Y}\rightarrow {\mathcal P}({\mathsf A})\) for each \(t\ge 0\). Let \({\varvec{\varPi }}^{(N)}= \prod _{i=1}^N \varPi _i \text { and } {\varvec{\tilde{\varPi }}}^{(N)} = \prod _{i=1}^N {\tilde{\varPi }}_i.\) We let \({\varvec{\pi }}^{(N)} = (\pi ^1,\ldots ,\pi ^N)\) (\(\pi ^i \in \varPi _i\)) denote the N-tuple of joint policies of all the agents in the game. Under such an N-tuple of policies, the actions of agents at each time \(t \ge 0\) are obtained with respect to the conditional probability distribution

$$\begin{aligned} \prod ^N_{i=1} \pi ^i_t\big (du^N_i(t)\big |\gamma ^N_i(t)\big ). \end{aligned}$$
(2)

The one-stage cost function for a generic agent is a measurable function \(m : {\mathsf S}\times {\mathsf A}\times {\mathcal P}({\mathsf S}) \rightarrow [0,\infty )\). Then, the agent’s finite-horizon risk-sensitive cost under a policy \({\varvec{\pi }}^{(N)} \in {\varvec{\varPi }}^{(N)}\) is given by

$$\begin{aligned} V_i^{(N)}({\varvec{\pi }}^{(N)})&= \frac{1}{\lambda } \log \biggl ( E^{{\varvec{\pi }}^{(N)}}\biggl [ e^{\lambda \sum _{t=0}^{T}\beta ^{t}m(s_{i}^N(t),u_{i}^N(t),d^{(N)}_t)}\biggr ]\biggr ), \end{aligned}$$

where \(\beta \in (0,1]\) is the discount factor, \(\lambda > 0\) is the risk factor, and T is the finite horizon of the problem. Here, \(E^{{\varvec{\pi }}^{(N)}}\big [\cdot \big ]\) denotes the expectation with respect to the probability law, which is uniquely specified by the kernels in (1) and (2) and the initial state distribution \(\kappa _0\).

Since \(\frac{1}{\lambda }\log (\cdot )\) is a strictly increasing function, without loss of generality, it suffices to consider only the part with expectation:

$$\begin{aligned} W_i^{(N)}({\varvec{\pi }}^{(N)})&= E^{{\varvec{\pi }}^{(N)}}\biggl [ e^{\lambda \sum _{t=0}^{T}\beta ^{t}m(s_{i}^N(t),u_{i}^N(t),d^{(N)}_t)}\biggr ]. \end{aligned}$$

With this cost function, the equilibrium solution for the game is defined as follows:

Definition 1

A policy \({\varvec{\pi }}^{(N*)}= (\pi ^{1*},\ldots ,\pi ^{N*})\) constitutes a Nash equilibrium for the N-player game, if

$$\begin{aligned} W_i^{(N)}({\varvec{\pi }}^{(N*)}) = \inf _{\pi ^i \in \varPi _i} W_i^{(N)}({\varvec{\pi }}^{(N*)}_{-i},\pi ^i) \end{aligned}$$

for each \(i=1,\ldots ,N\), where \({\varvec{\pi }}^{(N*)}_{-i} = (\pi ^{j*})_{j\ne i}\).

As we explained in detail in [41], establishing the existence of Nash equilibria for partially observed mean-field games is challenging due to the (almost) decentralized and noisy nature of the information structure of the problem. To that end, we slightly change the definition of Nash equilibrium in this model and adopt the approximate Nash equilibrium concept instead of exact Nash equilibrium.

Definition 2

A policy \({\varvec{\pi }}^{(N*)} \in \tilde{\varvec{\varPi }}^{(N)}\) is a Nash equilibrium if

$$\begin{aligned} W_i^{(N)}({\varvec{\pi }}^{(N*)})&= \inf _{\pi ^i \in \tilde{\varPi }_i} W_i^{(N)}({\varvec{\pi }}^{(N*)}_{-i},\pi ^i) \end{aligned}$$

for each \(i=1,\ldots ,N\), and an \(\varepsilon \)-Nash equilibrium (for a given \(\varepsilon > 0\)) if

$$\begin{aligned} W_i^{(N)}({\varvec{\pi }}^{(N*)})&\le \inf _{\pi ^i \in \tilde{\varPi }_i} W_i^{(N)}({\varvec{\pi }}^{(N*)}_{-i},\pi ^i) + \varepsilon \end{aligned}$$

for each \(i=1,\ldots ,N\).

According to this definition, the agents can only use their local observations \((g^N_i(t),\ldots ,g^N_i(0))\) to construct their policies. In real-life applications, agents typically have access only to their local observations. Hence, it suffices to establish the existence of an approximate Nash equilibrium for the game with a local information structure. In addition, in the discrete-time mean field literature, it is common to establish the existence of approximate Nash equilibria with local (decentralized) information structures (see [1] [9]). This is true for partially observed case as well (see [41]).

Here, our goal is to establish the existence of approximate Nash equilibria for games with sufficiently many agents. Indeed, if the number of agents is small, it is all but impossible to show even the existence of approximate Nash equilibria for these types of games. Therefore, it is key to assume that the number of agents is large (but finite). With this assumption, we can go to the infinite population limit, for which we can model the mean-field term as an exogenous state-measure flow, which should be consistent with the distribution of a generic agent (i.e., the NCE principle) by the law of large numbers. In this case, to establish the existence of an equilibrium, a generic agent should solve a classical partially observed stochastic control problem with a constraint on the distributions on the states (i.e., mean-field game). Then, we expect that if each agent in the finite-agent N game adopts the equilibrium policy in the infinite-population limit, the resulting policy will be an approximate Nash equilibrium for all sufficiently large N.

Our approach to prove the existence of approximate Nash equilibria can be summarized as follows: (i) Note that in the risk-sensitive criteria, the one-stage cost functions are in a multiplicative form as opposed to the risk-neutral setting. As stated earlier, this makes the analysis of the problem quite complicated. Therefore, we first construct an equivalent non-homogeneous game model, where the cost can be written in an additive form as in the risk-neutral case (see Sect. 2.2). (ii) Then, we introduce the infinite-population limit (\(N \rightarrow \infty \)) of the equivalent game model to approximate the finite-agent setting (see Sect. 3). (iii) By adapting the proof technique in [41] to the non-homogeneous and finite-horizon set-up, we prove the existence of an appropriately defined mean-field equilibrium for this limiting infinite-population game (see Sect. 4). (iv) Then, we return to the finite-N case for the equivalent game model and show that if each agent in the game problem adopts the mean-field equilibrium policy, then the resulting policy will be an approximate Nash equilibrium for all sufficiently large N. Since the equivalent game model is identical to the original game model in terms of cost functions, this establishes the existence of approximate Nash equilibria for the original game model (see Sects. 5 and 6).

Now, proceeding along the lines above, we first introduce the following assumptions, imposed throughout the paper.

Assumption 1

  1. (a)

    The cost function m is bounded and continuous with \(\Vert m \Vert = \sup _{s \in {\mathsf S}} |m(s)| \le K\).

  2. (b)

    The stochastic kernel q is weakly continuous in \((s,u,\kappa )\); i.e., \(q(\,\cdot \,|s(k),u(k),\kappa _k) \rightarrow q(\,\cdot \,|s,u,\kappa )\) weakly when \((s(k),u(k),\kappa _k) \rightarrow (s,u,\kappa )\).

  3. (c)

    The observation kernel l is continuous in s with respect to total variation norm; i.e., for all s, \(l(\,\cdot \,|s_k) \rightarrow l(\,\cdot \,|s)\) in total variation norm when \(s_k \rightarrow s\).

  4. (d)

    \({\mathsf A}\) is compact.

  5. (e)

    There exist a constant \(\alpha \ge 0\) and a continuous moment function \(v: {\mathsf S}\rightarrow [1,\infty )\) (see [25, Definition E.7]) such that

    $$\begin{aligned} \sup _{(u,\kappa ) \in {\mathsf A}\times {\mathcal P}({\mathsf S})} \int _{{\mathsf S}} v(y) q({\text {d}}y|s,u,\kappa ) \le \alpha v(s). \end{aligned}$$
    (3)
  6. (f)

    The initial probability measure \(\kappa _0\) satisfies \( \int _{{\mathsf S}} v(s) \kappa _0({\text {d}}s) = M < \infty . \)

2.2 Equivalent Game Model

In this section, we construct an equivalent game model whose states are the states of the original model plus the one-stage costs incurred up to that time. Namely, the state at time t for Agent i is

$$\begin{aligned} x_i^N(t) = \biggl (s_i^N(t),\sum _{k=0}^{t-1} \beta ^k m(s_i^N(k),u_i^N(k),d_k^{(N)})\biggr ). \end{aligned}$$

In this new model, finite-horizon risk-sensitive cost function can be written in an additive-form like in risk-neutral case. For this new game model, we have been inspired by [6], in which the authors study the classical fully observed risk-sensitive control problem. For a generic agent, this new game model is specified by

$$\begin{aligned} \biggl ( {\mathsf X}, {\mathsf A}, {\mathsf Y}, \{p_t\}_{t = 0}^{T+1}, r, \{c_t\}_{t=0}^{T+1}, \mu _0 \biggr ), \end{aligned}$$

where \( {\mathsf X}= {\mathsf S}\times [0,L] \) is the new state space with \(L = \frac{K}{1-\beta }\), where L is the maximum risk-neutral discounted-cost that can be incurred. For every t, the state transition kernel \(p_t : {\mathsf X}\times {\mathsf A}\times {\mathcal P}({\mathsf X}) \rightarrow {\mathcal P}({\mathsf X})\) is defined as:Footnote 1

$$\begin{aligned} p_t\bigl (B \times D \big | x(t),a(t),\mu _t\bigr ) = q(B|s(t),a(t),\mu _{t,1}) \otimes \delta _{m(t) + \beta ^t m(s(t),a(t),\mu _{t,1})}(D), \end{aligned}$$

where \(B \in {\mathcal B}({\mathsf S})\), \(D \in {\mathcal B}([0,L])\), \( x(t) = (s(t),m(t)), \) and \(\mu _{t,1}\) is the marginal of \(\mu _t\) on \({\mathsf S}\). Here, \(p_t\) is indeed the controlled transition probability of the next state \(s_i^N(t+1)\) and current risk-neutral total discounted cost

$$\begin{aligned}\sum _{k=0}^{t} \beta ^k m(s_i^N(k),a_i^N(k),d_k^{(N)})\end{aligned}$$

given the current state-action pair \((s_i^N(t),a_i^N(t))\) and past risk-neutral total discounted cost \(\sum _{k=0}^{t-1} \beta ^k m(s_i^N(k),a_i^N(k),d_k^{(N)})\) in the original game. The observation kernel \(r: {\mathsf X}\rightarrow {\mathcal P}({\mathsf Y})\) is equivalent to the observation kernel l in the original problem; that is, \(r({\text {d}}y|x) = l({\text {d}}y|s)\) where \(x = (s,m)\). For each t, the one-stage cost function \(c_t: {\mathsf X}\times {\mathsf A}\times {\mathcal P}({\mathsf X}) \rightarrow [0,\infty )\) is defined as:

$$\begin{aligned} c_t(x(t),a(t),\mu _t) = {\left\{ \begin{array}{ll} 0, &{} \text { } \text { if }t \le T \\ e^{\lambda m(t)}, &{} \text { } \text { if }t = T + 1. \end{array}\right. } \end{aligned}$$

Finally, the initial measure \(\mu _0\) is given by \(\mu _0({\text {d}}x(0)) = \kappa _0({\text {d}}s(0)) \otimes \delta _0({\text {d}}m(0))\), where the initial states \(\{x_i^N(0)\}\) are independent and identically distributed according to \(\mu _0\). Note that, in this equivalent game model, the finite-horizon is \(T+1\) instead of T and system components depend on time t. We also define the empirical distribution of the states at time t as follows:

$$\begin{aligned} e_t^{(N)}(\,\cdot \,) = \frac{1}{N} \sum _{i=1}^N \delta _{x_i^N(t)}(\,\cdot \,) \in {\mathcal P}({\mathsf X}). \end{aligned}$$

Suppose that Assumption 1 holds. Then, for each t, the following are true for the new game model:

  1. (I)

    The one-stage cost function \(c_t\) is bounded and continuous.

  2. (II)

    The stochastic kernel \(p_t\) is weakly continuous.

  3. (III)

    The observation kernel r is continuous with respect to the total variation distance.

  4. (IV)

    Let \(w: {\mathsf X}\rightarrow [1,\infty )\) be defined as \(w(x) = w((s,m)) = v(s)\), which is a moment function. Then, we have

    $$\begin{aligned} \sup _{(a,\mu ) \in {\mathsf A}\times {\mathcal P}({\mathsf X})} \int _{{\mathsf X}} w(y) p_t({\text {d}}y|x,a,\mu ) \le \alpha w(x). \end{aligned}$$
    (4)
  5. (V)

    The initial probability measure \(\mu _0\) satisfies \( \int _{{\mathsf X}} w(x) \mu _0({\text {d}}x) = M < \infty . \)

Recall that \(\tilde{\varPi }_i\) denotes the set of policies for Agent i that only use observations in the original game. Note that \(\tilde{\varPi }_i\) is also the set of policies for Agent i that only use observations in the new game model. For Agent i, the finite-horizon risk-neutral total cost under the N-tuple of policies \({\varvec{\pi }}^{(N)} \in \tilde{\varvec{\varPi }}^{(N)}\) is denoted as \(J_i^{(N)}({\varvec{\pi }}^{(N)})\); that is

$$\begin{aligned} J_i^{(N)}({\varvec{\pi }}^{(N)})&= E^{{\varvec{\pi }}^{(N)}}\biggl [ \sum _{t=0}^{T+1} c_t(x_i^N(t),a_i^N(t),e_t^{(N)})\biggr ]. \end{aligned}$$

The following proposition makes the connection between this new model and the original model. The proof is straightforward, and so, we omit the details (see the proof of [42, Proposition 5.1]).

Proposition 1

For any \({\varvec{\pi }}^{(N)} \in \tilde{\varvec{\varPi }}^{(N)}\) and \(i=1,\ldots ,N\), we have \(J_i^{(N)}({\varvec{\pi }}^{(N)}) = W_i^{(N)}({\varvec{\pi }}^{(N)})\).

Proposition 1 states that the new game model is equivalent to the original game model in terms of cost functions. This is true because the new game model consists of the one-stage costs incurred up to the current time as an additional state variable. Therefore, if we take the exponent of this additional state at time \(T+1\) as in the definition of \(c_{T+1}\), we obtain the risk-sensitive cost of the original game model. Hence, in the remainder of this paper, we replace the original game model with the new one; that is, from this point on, we have the following system components satisfying (I)-(V):

$$\begin{aligned} \biggl ( {\mathsf X}, {\mathsf A}, {\mathsf Y}, \{p_t\}_{t = 0}^{T+1}, r, \{c_t\}_{t=0}^{T+1}, \mu _0 \biggr ). \end{aligned}$$

Remark 1

Note that in the new game model, the time horizon is \(T+1\), which means that agents should also design control policies for the time step \(T+1\). However, note that control policies at time step \(T+1\) do not affect the cost function (i.e., one-stage cost at time \(T+1\) is only a function of the state), and thus agents indeed do not need to select these policies in the new game model. Hence, we can in a sense view the time horizons of the two problems as T.

Note that the cost functions \(J_i^{(N)}({\varvec{\pi }}^{(N)})\) of this new game model are in additive form (i.e., risk-neutral). Therefore, we can use a technique similar to the one in [41] to prove the existence of an approximate Nash equilibrium. To this end, we will first consider the infinite-population limit of the new game model and prove the existence of an equilibrium. Then, we will go back to the finite agent case and establish the existence of approximate Nash equilibrium for the new game model using the infinite population equilibrium solution. Since, by Proposition 1, the new game model has the same cost function as the original game model, the last result also implies the existence of an approximate Nash equilibrium for the original game, which was the main goal of this paper.

3 Partially Observed Mean-Field Games and Mean-Field Equilibria

In this section, we introduce the infinite population limit of the new game introduced in the preceding section. Although it is called mean-field game, it is not game in the classical sense: It is a stochastic control problem whose state distribution at each time step should satisfy a certain consistency condition. The optimal solution of this problem is referred to as mean-field equilibrium. In other words, we have a single agent and model the mean-field term by an exogenous state-measure flow \({\varvec{\mu }}:= (\mu _t)_{t = 0}^{T+1} \subset {\mathcal P}({\mathsf X})\) with a given initial condition \(\mu _0\), by the law of large numbers. This measure flow \({\varvec{\mu }}\) should also be consistent with the state distributions of this single agent when the agent acts optimally. The precise mathematical description of the problem is given as follows.

The mean-field game model for a generic agent is specified by

$$\begin{aligned} \biggl ( {\mathsf X}, {\mathsf A}, {\mathsf Y}, \{p_t\}_{t = 0}^{T+1}, r, \{c_t\}_{t=0}^{T+1}, \mu _0 \biggr ), \end{aligned}$$

where, as before, \({\mathsf X}\), \({\mathsf A}\), and \({\mathsf Y}\) are the state, action, and observation spaces, respectively. The stochastic kernel \(p_t : {\mathsf X}\times {\mathsf A}\times {\mathcal P}({\mathsf X}) \rightarrow {\mathcal P}({\mathsf X})\) denotes the transition probability, and \(r: {\mathsf X}\times {\mathcal P}({\mathsf X}) \rightarrow {\mathcal P}({\mathsf Y})\) denotes the observation kernel. The measurable function \(c_t: {\mathsf X}\times {\mathsf A}\times {\mathcal P}({\mathsf X}) \rightarrow [0,\infty )\) is the one-stage cost function and \(\mu _0\) is the distribution of the initial state.

Recall the history spaces \({\mathsf G}_0 = {\mathsf Y}\) and \({\mathsf G}_{t}=({\mathsf Y}\times {\mathsf A})^{t}\times {\mathsf Y}\) for \(t=1,2,\ldots \), all endowed with product Borel \(\sigma \)-algebras. A policy is a sequence \(\pi =\{\pi _{t}\}\) of stochastic kernels on \({\mathsf A}\) given \({\mathsf G}_{t}\). The set of all policies is denoted by \(\varPi \).

We let \({\mathcal M}= \bigl \{{\varvec{\mu }}\in {\mathcal P}({\mathsf X})^{T+2}: \mu _0 \text { is fixed}\bigr \}\) be the set of all state-measure flows with a given initial condition \(\mu _0\). Given any measure flow \({\varvec{\mu }}\in {\mathcal M}\), the evolution of the states, observations, and actions is as follows:

$$\begin{aligned} x(0)&\sim \mu _0, \\ y(t)&\sim r(\,\cdot \,|x(t)), \text { } t=0,1,\ldots \\ x(t)&\sim p_{t-1}(\,\cdot \,|x(t-1),a(t-1),\mu _{t-1}), \text { } t=1,2,\ldots \\ a(t)&\sim \pi _t(\,\cdot \,|\gamma (t)), \text { } t=0,1,\ldots , \end{aligned}$$

where \(\gamma (t) \in {\mathsf G}_t\) is the observation-action history up to time t. An initial distribution \(\mu _0\) on \({\mathsf X}\), a policy \(\pi \), and a state-measure flow \({\varvec{\mu }}\) define a unique probability measure \(P^{\pi }\) on \(({\mathsf X}\times {\mathsf Y}\times {\mathsf A})^{T+2}\). The expectation with respect to \(P^{\pi }\) is denoted by \(E^{\pi }[\,\cdot \,]\). A policy \(\pi ^{*} \in \varPi \) is said to be optimal for \({\varvec{\mu }}\) if \( J_{{\varvec{\mu }}}(\pi ^{*}) = \inf _{\pi \in \varPi } J_{{\varvec{\mu }}}(\pi ), \) where the finite-horizon cost of policy \(\pi \) with measure flow \({\varvec{\mu }}\) is given by

$$\begin{aligned} J_{{\varvec{\mu }}}(\pi )&= E^{\pi }\biggl [ \sum _{t=0}^{T+1} c_t(x(t),a(t),\mu _t) \biggr ] \end{aligned}$$

Using these definitions, we first define the set-valued mapping \( \varPsi : {\mathcal M}\rightarrow 2^{\varPi } \) as \(\varPsi ({\varvec{\mu }}) = \{\pi \in \varPi : \pi \text { is optimal for } {\varvec{\mu }}\}\). Conversely, we define a single-valued mapping \( \varLambda : \varPi \rightarrow {\mathcal M}\) as follows: given \(\pi \in \varPi \), the state-measure flow \({\varvec{\mu }}:= \varLambda (\pi )\) is constructed recursively as:

$$\begin{aligned} \mu _{t+1}(\,\cdot \,) = \int _{{\mathsf X}\times {\mathsf A}} p_t(\,\cdot \,|x(t),a(t),\mu _t) P^{\pi }({\text {d}}a(t)|x(t)) \mu _t({\text {d}}x(t)), \end{aligned}$$

where \(P^{\pi }({\text {d}}a(t)|x(t))\) denotes the conditional distribution of a(t) given x(t) under \(\pi \) and \((\mu _{\tau })_{0\le \tau \le t}\). Using \(\varPsi \) and \(\varLambda \), we now introduce the mean-field equilibrium.

Definition 3

A pair \((\pi ^*,{\varvec{\mu }}^*) \in \varPi \times {\mathcal M}\) is a mean-field equilibrium if \(\pi ^* \in \varPsi ({\varvec{\mu }}^*)\) and \({\varvec{\mu }}^* = \varLambda (\pi ^*)\).

The main result of this section is the existence of a mean-field equilibrium. Later we will show that this mean-field equilibrium constitutes an approximate Nash equilibrium for games with sufficiently many agents.

Theorem 1

The mean-field game \(\bigl ({\mathsf X}, {\mathsf A}, {\mathsf Y}, \{p_t\}_{t = 0}^{T+1}, r, \{c_t\}_{t=0}^{T+1}, \mu _0 \bigr )\) admits a mean-field equilibrium \((\pi ^*,{\varvec{\mu }}^*)\).

The proof of Theorem 1 is given in Sect. 4. Our approach to prove Theorem 1 can be summarized as follows: (i) First, we lift the partially observed stochastic control problem a generic agent is faced with for a given measure flow to a fully observed stochastic control problem; (ii) we then transform the fixed point equation \(\pi \in \varPsi (\varLambda (\pi ))\) characterizing the mean-field equilibrium into a fixed point equation of a set-valued mapping from the set of state-action measure flows into itself using the Bellman optimality operator; (iii) then, we prove that this set-valued mapping has a closed graph; and (iv) finally, we deduce the existence of a mean-field equilibrium using Kakutani’s fixed point theorem.

4 Proof of Theorem 1

Note that any measure flow \({\varvec{\mu }}\in {\mathcal M}\) leads to a non-homogenous partially observed Markov decision process (POMDP). Hence, before starting the proof of Theorem 1, we first review a few relevant results on POMDPs. To this end, fix any \({\varvec{\mu }}\in {\mathcal M}\) and consider the corresponding optimal control problem.

Let \({\mathcal P}_w({\mathsf X}) = \bigl \{\mu \in {\mathcal P}({\mathsf X}): \int _{{\mathsf X}} w(x) \mu ({\text {d}}x) < \infty \bigr \}\). It is known that any POMDP can be reduced to a (completely observable) MDP (see [39, 53]), whose states are the posterior state distributions or beliefs of the observer; that is, the state at time t is

$$\begin{aligned} z(t) = {\mathsf {Pr}}\{x(t) \in \,\cdot \, | y(0),\ldots ,y(t), a(0), \ldots , a(t-1)\} \in {\mathcal P}({\mathsf X}). \end{aligned}$$

We call this equivalent MDP the belief-state MDP. Note that since \(\mathcal{L}(x(t)) \in {\mathcal P}_w({\mathsf X})\) under any policy by (IV)-(V), we have

$$\begin{aligned}{\mathsf {Pr}}\{x(t) \in \,\cdot \, | y(0),\ldots ,y(t), a(0), \ldots , a(t-1)\} \in {\mathcal P}_w({\mathsf X})\end{aligned}$$

almost everywhere. Therefore, the belief-state MDP has state space \({\mathsf Z}= {\mathcal P}_w({\mathsf X})\) and action space \({\mathsf A}\). Here, \({\mathsf Z}\) is endowed with the Borel \(\sigma \)-algebra generated by the topology of weak convergence. Next, we construct the transition probabilities \(\{\eta _t\}_{t=0}^{T+1}\) of the belief-state MDP (see also [24]). Let z denote the generic state variable for the belief-state MDP. Fix any t. First consider the transition probability on \({\mathsf X}\times {\mathsf Y}\) given \({\mathsf Z}\times {\mathsf A}\)

$$\begin{aligned} R_t(x \in A, y \in B|z,a) = \int _{{\mathsf X}} \kappa _t(A,B|x',a) z({\text {d}}x'), \end{aligned}$$

where \(\kappa _t({\text {d}}x,{\text {d}}y|x',a) = r({\text {d}}y|x) \otimes p_t({\text {d}}x|x',a,\mu _t)\). Let us disintegrate \(R_t\) as follows \( R_t({\text {d}}x,{\text {d}}y|z,a) = H_t({\text {d}}y|z,a) \otimes F_t({\text {d}}x|z,a,y). \) Then, we define the mapping \(F_t: {\mathsf Z}\times {\mathsf A}\times {\mathsf Y}\rightarrow {\mathsf Z}\) as:

$$\begin{aligned} F_t(z,a,y)(\,\cdot \,) = F_t(\,\cdot \,|z,a,y) . \end{aligned}$$
(5)

Then, \(\eta _t: {\mathsf Z}\times {\mathsf A}\rightarrow {\mathcal P}({\mathsf Z})\) is defined as:

$$\begin{aligned}&\eta _t(\,\cdot \,|z(t),a(t)) = \int _{{\mathsf Y}} \delta _{F_t(z(t),a(t),y(t+1))}(\,\cdot \,) \text { } H_t({\text {d}}y(t+1)|z(t),a(t)). \end{aligned}$$

The initial point for the belief-state MDP is \(\mu _0\); that is, \(\mathcal{L}(z(0)) \sim \delta _{\mu _0}\). Finally, for each t, the one-stage cost function \(C_t\) of the belief-state MDP is given by

$$\begin{aligned} C_t(z,a) = \int _{{\mathsf X}} c_t(x,a,\mu _t) z({\text {d}}x). \end{aligned}$$
(6)

Hence, the belief-state MDP is a Markov decision process with the components \( \bigl ({\mathsf Z},{\mathsf A},\{\eta _t\}_{t=0}^{T+1},\{C_t\}_{t=0}^{T+1},\delta _{\mu _0}\bigr ). \)

For the belief-state MDP define the history spaces \({\mathsf K}_0 = {\mathsf Z}\) and \({\mathsf K}_{t}=({\mathsf Z}\times {\mathsf A})^{t}\times {\mathsf Z}\), \(t=1,2,\ldots \). A policy is a sequence \(\varphi =\{\varphi _{t}\}\) of stochastic kernels on \({\mathsf A}\) given \({\mathsf K}_{t}\). The set of all policies is denoted by \(\varPhi \). A Markov policy is a sequence \(\varphi =\{\varphi _{t}\}\) of stochastic kernels on \({\mathsf A}\) given \({\mathsf Z}\). The set of Markov policies is denoted by \({\mathsf M}\). Let \(\tilde{J}(\varphi ,\mu _0)\) denote the finite-horizon cost function of policy \(\varphi \in \varPhi \) for initial point \(\mu _0\) of the belief-state MDP. Notice that any history vector \(s(t) = (z(0),\ldots ,z(t),a(0),\ldots ,a(t-1))\) of the belief-state MDP is a function of the history vector \(\gamma (t) = (y(0),\ldots ,y(t),a(0),\ldots ,a(t-1))\) of the POMDP. Let us write this relation as \(i(\gamma (t)) = s(t)\). Hence, for a policy \(\varphi = \{\varphi _t\} \in \varPhi \), we can define a policy \(\pi ^{\varphi } = \{\pi _t^{\varphi }\} \in \varPi \) as \( \pi _t^{\varphi }(\,\cdot \,|\gamma (t)) = \varphi _t(\,\cdot \,|i(\gamma (t))). \) Let us write this as a mapping from \(\varPhi \) to \(\varPi \): \(\varPhi \ni \varphi \mapsto i(\varphi ) = \pi ^{\varphi } \in \varPi \). It is straightforward to show that the cost functions \(\tilde{J}(\varphi ,\mu _0)\) and \(J_{{\varvec{\mu }}}(\pi ^{\varphi })\) are the same. One can also prove that (see [53, 39])

$$\begin{aligned} \inf _{\varphi \in \varPhi } \tilde{J}(\varphi ,\mu _0)&= \inf _{\pi \in \varPi } J_{{\varvec{\mu }}}(\pi ) \end{aligned}$$
(7)

and furthermore, that if \(\varphi \) is an optimal policy for belief-state MDP, then \(\pi ^{\varphi }\) is optimal for the POMDP as well. Therefore, the optimal control problem for the mean-field game is equivalent to the optimal control of belief-state MDP.

We now derive the conditions that are satisfied by belief-state MDP. To that end, define \(W:{\mathsf Z}\rightarrow \mathbb {R}\) as:

$$\begin{aligned} W(z) = \int _{{\mathsf X}} w(x) z({\text {d}}x). \end{aligned}$$

Note that W is a lower semi-continuous moment function on \({\mathsf Z}\). One can prove that (see [41, Section 4]) the belief-state MDP satisfies the following conditions under Assumption 1:

  1. (i)

    The cost functions \(\{C_t\}\) are bounded and continuous.

  2. (ii)

    The stochastic kernels \(\{\eta _t\}\) are weakly continuous.

  3. (iii)

    \({\mathsf A}\) is compact and \({\mathsf Z}\) is \(\sigma \)-compact.

  4. (iv)

    There exists a constant \(\alpha \ge 0\) such that

    $$\begin{aligned} \sup _{a \in {\mathsf A}} \int _{{\mathsf Z}} W(y) \eta _t({\text {d}}y|z,a) \le \alpha W(z), \text { } \text {for all }t. \end{aligned}$$
  5. (v)

    The initial probability measure \(\delta _{\mu _0}\) satisfies \( W(\delta _{\mu _0}) = M < \infty . \)

With these conditions, we are now ready to prove Theorem 1 by adapting techniques in [41] to the non-homogeneous and finite-horizon set-up.

We first define the mapping \({\mathsf B}: {\mathcal P}({\mathsf Z}) \rightarrow {\mathcal P}({\mathsf X})\), which will define the relation between state-measure flows in the mean-field game and state-measure flows in the belief-state MDP, as follows:

$$\begin{aligned} {\mathsf B}(\nu )(\,\cdot \,) = \int _{{\mathsf Z}} z(\,\cdot \,) \text { } \nu ({\text {d}}z). \end{aligned}$$

Using this definition, for any \({\varvec{\nu }}\in {\mathcal P}({\mathsf Z}\times {\mathsf A})^{T+2}\), we define the measure flow \({\varvec{\mu }}^{{\varvec{\nu }}} \in {\mathcal P}({\mathsf X})^{T+2}\) as follows:

$$\begin{aligned} {\varvec{\mu }}^{{\varvec{\nu }}} = \bigl ({\mathsf B}(\nu _{t,1})\bigr )_{t=0}^{T+1}, \end{aligned}$$

where for any \(\nu \in {\mathcal P}({\mathsf Z}\times {\mathsf A})\), we let \(\nu _1\) denote the marginal of \(\nu \) on \({\mathsf Z}\). Let \(\{\eta _t^{{\varvec{\nu }}}\}_{t=0}^{T+1}\) and \(\{C_t^{{\varvec{\nu }}}\}_{t=0}^{T+1}\) be, respectively, the transition probabilities and one-stage cost functions of belief-state MDP induced by the measure flow \({\varvec{\mu }}^{{\varvec{\nu }}}\). We let \(J_{*,t}^{{\varvec{\nu }}}: {\mathsf Z}\rightarrow [0,\infty )\) denote the optimal value function at time t of this belief-state MDP; that is,

$$\begin{aligned} J_{*,t}^{{\varvec{\nu }}}(z) = \inf _{\varphi \in \varPhi } E^{\varphi } \biggl [ \sum _{k=t}^{T+1} C_k^{{\varvec{\nu }}}(z(k),a(k)) \bigg | z(t) = z \biggr ]. \end{aligned}$$

Let \(J_{*}^{{\varvec{\nu }}} = \bigl ( J^{{\varvec{\nu }}}_{*,t}\bigr )_{t=0}^{T+1}\).

To prove the existence of a mean-field equilibrium, we use the technique in [32]. To that end, we first transform the fixed point equation \(\pi \in \varPsi (\varLambda (\pi ))\) characterizing the mean-field equilibrium into a fixed-point equation of a set-valued mapping from the set of state-action measure flows \({\mathcal P}({\mathsf Z}\times {\mathsf A})^{T+2}\) into itself. Then, using Kakutani’s fixed point theorem ([2, Corollary 17.55]), we deduce the existence of a mean-field equilibrium.

For any t, the Bellman optimality operator \(T_t^{{\varvec{\nu }}}: C_{b}({\mathsf Z})\rightarrow C_{b}({\mathsf Z})\) is given by

$$\begin{aligned} T_t^{{\varvec{\nu }}} u(z) = \min _{a \in {\mathsf A}} \biggl [ C^{{\varvec{\nu }}}_t(z,a) + \int _{{\mathsf Z}} u(y) \eta ^{{\varvec{\nu }}}_t({\text {d}}y|z,a) \biggr ]. \end{aligned}$$

Note that \(T_t^{{\varvec{\nu }}} J^{{\varvec{\nu }}}_{*,t+1} = J^{{\varvec{\nu }}}_{*,t}\) for every t. The following theorem is a known result in the theory of nonhomogeneous Markov decision processes (see [26, Theorems 14.4 and 17.1]). For any given \({\varvec{\nu }}\), it characterizes the optimal policy of the belief-state MDP.

Theorem 2

For any \({\varvec{\nu }}\), a policy \(\varphi \in {\mathsf M}\) is optimal if and only if, for all t,

$$\begin{aligned}&\nu _t^{\varphi } \biggl ( \biggr \{ (z,a) : C^{{\varvec{\nu }}}_t(z,a) + \int _{{\mathsf Z}} J_{*,t+1}^{{\varvec{\nu }}}(y) \eta ^{{\varvec{\nu }}}_t({\text {d}}y|z,a) = T_t^{{\varvec{\nu }}} J_{*,t+1}^{{\varvec{\nu }}}(z) \biggr \} \biggr ) = 1, \end{aligned}$$
(8)

where \(\nu _t^{\varphi } = \mathcal{L}\bigl ( z(t),a(t) \bigr )\) under \(\varphi \) and \({\varvec{\nu }}\).

Using Theorem 2, we now define the set-valued map from \({\mathcal P}({\mathsf Z}\times {\mathsf A})^{T+2}\) into itself. To that end, for any \({\varvec{\nu }}\in {\mathcal P}({\mathsf Z}\times {\mathsf A})^{T+2}\), let us define the following sets:

$$\begin{aligned} C({\varvec{\nu }})&= \biggl \{ {\varvec{\nu }}' \in {\mathcal P}({\mathsf Z}\times {\mathsf A})^{T+2}: \nu '_{0,1} = \delta _{\mu _0}, \, \nu '_{t+1,1}(\,\cdot \,) = \int _{{\mathsf Z}\times {\mathsf A}} \eta _t^{{\varvec{\nu }}}(\,\cdot \,|z,a) \nu _t({\text {d}}z,{\text {d}}a)\biggr \} \end{aligned}$$

and

$$\begin{aligned} B({\varvec{\nu }})&= \biggl \{ {\varvec{\nu }}' \in {\mathcal P}({\mathsf Z}\times {\mathsf A})^{T+2}: \forall 0\le t \le T+1, \text { } \\&\nu _t' \biggl ( \biggr \{ (z,a) : C_t^{{\varvec{\nu }}}(z,a) + \int _{{\mathsf Z}} J_{*,t+1}^{{\varvec{\nu }}}(y) \eta _t^{{\varvec{\nu }}}({\text {d}}y|z,a) = T_t^{{\varvec{\nu }}} J^{{\varvec{\nu }}}_{*,t+1}(z) \biggr \} \biggr ) = 1 \biggr \}. \end{aligned}$$

Here, the set \(C({\varvec{\nu }})\) characterizes the consistency of the mean-field term with the state distribution of a generic agent, and the set \(B({\varvec{\nu }})\) characterizes optimality of the policy for the mean-field term. The set-valued mapping \(\varGamma : {\mathcal P}({\mathsf Z}\times {\mathsf A})^{T+2} \rightarrow 2^{{\mathcal P}({\mathsf Z}\times {\mathsf A})^{T+2}}\) is given as follows:

$$\begin{aligned} \varGamma ({\varvec{\nu }}) = C({\varvec{\nu }}) \cap B({\varvec{\nu }}). \end{aligned}$$

Note that the fixed-point equation \(\pi \in \varPsi (\varLambda (\pi ))\) characterizes the behavior of the state distribution and the control law in mean-field equilibrium separately. To establish the existence of mean-field equilibrium via Kakutani’s Fixed Point Theorem or Banach Fixed Point Theorem using this equation, one needs to put some topology on the policy space. However, by combining the state distribution with the control law, which gives the joint distribution of the state and the action, we can characterize via the set-valued mapping \(\varGamma \) the behavior of the state and the control law together in mean-field equilibrium. This will enable us to deduce the existence of a mean-field equilibrium without introducing a topology for the control laws, which is in general the solution technique in continuous time setup (see [28]).

An element \({\varvec{\nu }}\) is a fixed point of \(\varGamma \) if \({\varvec{\nu }}\in \varGamma ({\varvec{\nu }})\). The following proposition makes the connection between mean-field equilibria and fixed points of \(\varGamma \).

Proposition 2

Suppose that \(\varGamma \) has a fixed point \({\varvec{\nu }}= (\nu _t)_{t = 0}^{T+1}\). Construct a Markov policy \(\varphi = \{\varphi _t\}\) for belief-state MDP by disintegrating each \(\nu _t\) as \(\nu _t({\text {d}}z,{\text {d}}a) = \nu _{t,1}({\text {d}}z) \varphi _t({\text {d}}a|z)\). Let \(\pi ^*=\pi ^{\varphi }\) and \({\varvec{\mu }}^* = ({\mathsf B}(\nu _{t,1}))_{t = 0}^{T+1}\). Then, the pair \((\pi ^{*},{\varvec{\mu }}^*)\) is a mean-field equilibrium.

Proof

Note that, since \({\varvec{\nu }}\in C({\varvec{\nu }})\), we have \(\nu _{t} = \mathcal{L}\bigl ( z(t),a(t) \bigr )\) for belief-state MDP under the policy \(\varphi \) and the measure flow \({\varvec{\mu }}^*\). Then, for any \(f \in C_b({\mathsf X})\), we have

$$\begin{aligned} \mu _{t+1}^*(f)&= {\mathsf B}(\nu _{t+1,1})(f) \nonumber \\&= \int _{{\mathsf Z}\times {\mathsf A}} \int _{{\mathsf Z}} z'(f) \eta _t^{{\varvec{\nu }}}({\text {d}}z'|z,a) \nu _t({\text {d}}z,{\text {d}}a) \nonumber \\&= \int _{{\mathsf Z}\times {\mathsf A}} \biggl \{ \int _{{\mathsf X}} \int _{{\mathsf X}} f(y) p_t({\text {d}}y|x,a,\mu _t^*) z({\text {d}}x) \biggr \} \nu _t({\text {d}}z,{\text {d}}a) \nonumber \\&= E^{\varphi } \bigl [ l_t(z(t),a(t)) \bigr ] \,\, {\biggl (\hbox {here} l_t(z,a) = \int _{{\mathsf X}} \int _{{\mathsf X}} f(y) p_t({\text {d}}y|x,a,\mu _t^*) z({\text {d}}x)\biggr )} \nonumber \\&= E^{\pi ^{*}} \biggl [ \int _{{\mathsf X}} f(y) p_t({\text {d}}y|x(t),a(t),\mu _t^*) \biggr ]. \end{aligned}$$
(9)

Since (9) is true for all \(f \in C_b({\mathsf X})\), we have

$$\begin{aligned} \mu _{t+1}^*(\,\cdot \,) = \int _{{\mathsf X}\times {\mathsf A}} p_t(\,\cdot \,|x(t),a(t),\mu _t^*) P^{\pi ^{*}}({\text {d}}a(t)|x(t)) \mu _t^*({\text {d}}x(t)), \end{aligned}$$

where \(P^{\pi ^{*}}({\text {d}}a(t)|x(t))\) denotes the conditional distribution of a(t) given x(t) under \(\pi ^{*}\) and \((\mu _{\tau }^*)_{0\le \tau \le t}\). Hence, \(\varLambda (\pi ^{*}) = {\varvec{\mu }}^*\).

Since \({\varvec{\nu }}\in B({\varvec{\nu }})\), the corresponding Markov policy \(\varphi \) satisfies (8) for \({\varvec{\nu }}\). Therefore, by Theorem 2 and the fact that \(\nu _{t} = \mathcal{L}\bigl ( z(t),a(t) \bigr )\) for belief-state MDP under the policy \(\varphi \) and the measure flow \({\varvec{\mu }}^*\), \(\varphi \) is optimal for belief-state MDP induced by the measure flow \({\varvec{\mu }}^*\) (or, equivalently, \({\varvec{\nu }}\)). Therefore, \(\pi ^{*} \in \varPsi ({\varvec{\mu }}^*)\). \(\square \)

By Proposition 2, it suffices to prove that \(\varGamma \) has a fixed point in order to establish the existence of a mean-field equilibrium. To prove this, we use Kakutani’s fixed point theorem, which is stated below:

Theorem 3

[2, Corollary 17.55] Let K be a non-empty compact convex subset of a locally convex Hausdorff space, and let the set-valued mapping \(\phi : K \rightarrow 2^K\) have closed graph and non-empty convex values. Then, the set of fixed points of \(\phi \) is compact and non-empty.

Hence, in order to use Kakutani’s fixed point theorem, the set-valued mapping \(\varGamma \) should be defined on a convex and compact set. However, the set \({\mathcal P}({\mathsf Z}\times {\mathsf A})^{T+2}\) in the definition of \(\varGamma \) is not compact. To get around that, we will prove that the image of \({\mathcal P}({\mathsf Z}\times {\mathsf A})^{T+2}\) under \(\varGamma \) is in fact a subset of some convex and compact set, and it is sufficient to consider this convex and compact set in the definition of \(\varGamma \). To that end, for each t, define the set

$$\begin{aligned} {\mathcal P}^t({\mathsf Z}) = \biggl \{ \mu \in {\mathcal P}({\mathsf Z}): \int _{{\mathsf Z}} W(z) \mu ({\text {d}}z) \le \alpha ^t M \biggr \}. \end{aligned}$$

Since W is a lower semi-continuous moment function, the set \({\mathcal P}^t({\mathsf Z})\) is compact with respect to the weak topology [25, Proposition E.8, p. 187]. Let us define

$$\begin{aligned} {\mathcal P}^t({\mathsf Z}\times {\mathsf A}) = \bigl \{ \nu \in {\mathcal P}({\mathsf Z}\times {\mathsf A}): \nu _1 \in {\mathcal P}^t({\mathsf Z}) \bigr \}. \end{aligned}$$

Since \({\mathsf A}\) is compact, \({\mathcal P}^t({\mathsf Z}\times {\mathsf A})\) is tight. Furthermore, \({\mathcal P}^t({\mathsf Z}\times {\mathsf A})\) is closed with respect to the weak topology since W is lower semi-continuous. Hence, \({\mathcal P}^t({\mathsf Z}\times {\mathsf A})\) is compact. Let \(\varXi = \prod _{t=0}^{T+1} {\mathcal P}^t({\mathsf Z}\times {\mathsf A})\), which is convex and compact with respect to the product topology.

Proposition 3

We have \(\varGamma \bigl ({\mathcal P}({\mathsf Z}\times {\mathsf A})^{T+2}\bigr ) = \bigl \{{\varvec{\nu }}' : {\varvec{\nu }}' \in \varGamma ({\varvec{\nu }}), \text { } {\varvec{\nu }}\in {\mathcal P}({\mathsf Z}\times {\mathsf A})^{T+2} \bigr \} \subset \varXi \).

Proof

Fix any \({\varvec{\nu }}\in {\mathcal P}({\mathsf Z}\times {\mathsf A})^{T+2}\). It is sufficient to prove that \(C({\varvec{\nu }}) \subset \varXi \) as \(\varGamma ({\varvec{\nu }}) = C({\varvec{\nu }}) \cap B({\varvec{\nu }})\). Let \({\varvec{\nu }}' \in C({\varvec{\nu }})\). We prove by induction that \(\nu '_{t,1} \in {\mathcal P}^t_v({\mathsf Z})\) for all t. The claim trivially holds for \(t=0\) as \(\nu '_{0,1} = \delta _{\mu _0}\). Assume that the claim holds for t and consider \(t+1\). We have

$$\begin{aligned} \int _{{\mathsf Z}} W(y) \nu '_{t+1,1}({\text {d}}y)&= \int _{{\mathsf Z}\times {\mathsf A}} \int _{{\mathsf Z}} W(y) \eta _{t}^{{\varvec{\nu }}}({\text {d}}y|z,a) \nu _{t}({\text {d}}z,{\text {d}}a) \\&\le \int _{{\mathsf Z}} \alpha W(z) \nu _{t,1}({\text {d}}z) \text { }(\text {by (iv)}) \\&\le \alpha ^{t+1} M \text { }(\text {as }\nu _{t,1} \in {\mathcal P}^t_v({\mathsf Z})). \end{aligned}$$

Hence, \(\nu '_{t+1,1} \in {\mathcal P}^{t+1}_v({\mathsf Z})\). \(\square \)

By Proposition 3, we can now consider \(\varGamma \) as a multi-valued mapping from \(\varXi \) into itself. It can be proved that \(C({\varvec{\nu }}) \cap B({\varvec{\nu }}) \ne \emptyset \) for any \({\varvec{\nu }}\in \varXi \). Indeed, for any \(t\ge 0\), we define

$$\begin{aligned} \mu _{t+1}(\,\cdot \,) = \int _{{\mathsf Z}\times {\mathsf A}} \eta _t^{{\varvec{\nu }}}(\,\cdot \,|z,a) \, \nu _t({\text {d}}x,{\text {d}}a). \end{aligned}$$

Moreover, for any \(t\ge 0\), let \(f_t: {\mathsf Z}\rightarrow {\mathsf A}\) be the minimizer of the following optimality equation:

$$\begin{aligned}&C_t^{{\varvec{\nu }}}(z,f_t(z)) + \int _{{\mathsf Z}} J_{*,t+1}^{{\varvec{\nu }}}(y) \eta _t^{{\varvec{\nu }}}({\text {d}}y|z,f_t(z)) = T_t^{{\varvec{\nu }}} J^{{\varvec{\nu }}}_{*,t+1}(z). \end{aligned}$$

Existence of such an \(f_t\) follows from the Measurable Selection Theorem [25, Section D] since \(C_t^{{\varvec{\nu }}}\) is continuous in a, \(\eta _t^{{\varvec{\nu }}}\) is weakly continuous in a, and \({\mathsf A}\) is compact. If we define \(\nu '_t({\text {d}}z,{\text {d}}a) = \mu _t({\text {d}}z) \, \delta _{f_t(z)}({\text {d}}a)\), then it is straightforward to prove that \({\varvec{\nu }}' \in C({\varvec{\nu }}) \cap B({\varvec{\nu }})\), and thus \(C({\varvec{\nu }}) \cap B({\varvec{\nu }}) \ne \emptyset \). Moreover, both \(C({\varvec{\nu }})\) and \(B({\varvec{\nu }})\) are convex, and so, their intersection is also convex. \(\varXi \) is a convex compact subset of a locally convex topological space \({\mathcal M}({\mathsf Z}\times {\mathsf A})^{T+2}\), where \({\mathcal M}({\mathsf Z}\times {\mathsf A})\) denotes the set of all finite signed measures on \({\mathsf Z}\times {\mathsf A}\). Hence, in order to deduce the existence of a fixed point of \(\varGamma \), we only need to prove that it has a closed graph. Before stating this result, we state the following proposition which is a key element of the proof.

Proposition 4

([41, Proposition 4.3]) Let \({\varvec{\nu }}^{(n)} \rightarrow {\varvec{\nu }}\) in product topology. Then, for all t, \(\eta _t^{{\varvec{\nu }}^{(n)}}(\,\cdot \,|z_n,a_n)\) weakly converges to \(\eta _t^{{\varvec{\nu }}}(\,\cdot \,|z,a)\) for all \((z_n,a_n) \rightarrow (z,a) \in {\mathsf Z}\times {\mathsf A}\).

Using Proposition 4, we can now prove the following result.

Proposition 5

The graph of \(\varGamma \), i.e., the set

is closed.

Proof

The graph of \(\varGamma \) is closed if and only if when \(({\varvec{\nu }}^{(n)},{\varvec{\xi }}^{(n)}) \rightarrow ({\varvec{\nu }},{\varvec{\xi }})\) as \(n\rightarrow \infty \) for some \(\bigl \{({\varvec{\nu }}^{(n)},{\varvec{\xi }}^{(n)})\bigr \} \subset \varXi \), then we must have \({\varvec{\xi }}\in \varGamma ({\varvec{\nu }})\). To that end, let be such that \(({\varvec{\nu }}^{(n)},{\varvec{\xi }}^{(n)}) \rightarrow ({\varvec{\nu }},{\varvec{\xi }})\) as \(n\rightarrow \infty \) for some \(({\varvec{\nu }},{\varvec{\xi }}) \in \varXi \times \varXi \). We prove that \({\varvec{\xi }}\in \varGamma ({\varvec{\nu }})\).

Using Proposition 4, we first prove that \({\varvec{\xi }}\in C({\varvec{\nu }})\); that is, for all t, we have

$$\begin{aligned} \xi _{t+1,1}(\,\cdot \,) = \int _{{\mathsf Z}\times {\mathsf A}} \eta _t^{{\varvec{\nu }}}(\,\cdot \,|z,a) \nu _t({\text {d}}z,{\text {d}}a). \end{aligned}$$

For all n and t, we have

$$\begin{aligned} \xi ^{(n)}_{t+1,1}(\,\cdot \,) = \int _{{\mathsf Z}\times {\mathsf A}} \eta _t^{{\varvec{\nu }}^{(n)}}(\,\cdot \,|z,a) \nu ^{(n)}_t({\text {d}}z,{\text {d}}a). \end{aligned}$$
(10)

Since \({\varvec{\xi }}^{(n)} \rightarrow {\varvec{\xi }}\) in \(\varXi \), \(\xi ^{(n+1)}_{t+1} \rightarrow \xi _{t+1}\) weakly. Let \(g \in C_b({\mathsf Z})\). Then, by [33, Theorem 3.5], we have

$$\begin{aligned}&\lim _{n\rightarrow \infty } \int _{{\mathsf Z}\times {\mathsf A}} \int _{{\mathsf Z}} g(z') \eta _t^{{\varvec{\nu }}^{(n)}}({\text {d}}z'|z,a) \nu ^{(n)}_t({\text {d}}z,{\text {d}}a) =\int _{{\mathsf Z}\times {\mathsf A}} \int _{{\mathsf Z}} g(z') \eta _t^{{\varvec{\nu }}}({\text {d}}z'|z,a) \nu _t({\text {d}}x,{\text {d}}a) \end{aligned}$$

since \({\varvec{\nu }}^{(n)}_t \rightarrow {\varvec{\nu }}_t\) weakly and \(\int _{{\mathsf Z}} g(y) \eta _t^{{\varvec{\nu }}^{(n)}}(\,\cdot \,|z,a)\) converges to \(\int _{{\mathsf Z}} g(y) \eta _t^{{\varvec{\nu }}}(\,\cdot \,|z,a)\) continuouslyFootnote 2 (see [33, Theorem 3.5]). This implies that the measure on the right-hand side of (10) converges weakly to \(\int _{{\mathsf Z}\times {\mathsf A}} \eta _t^{{\varvec{\nu }}}(\,\cdot \,|z,a) \nu _t({\text {d}}z,{\text {d}}a)\). Therefore, we have

$$\begin{aligned} \xi _{t+1,1}(\,\cdot \,) = \int _{{\mathsf Z}\times {\mathsf A}} \eta _t^{{\varvec{\nu }}}(\,\cdot \,|z,a) \nu _t({\text {d}}z,{\text {d}}a), \end{aligned}$$

from which we conclude that \({\varvec{\xi }}\in C({\varvec{\nu }})\).

To complete the proof, it suffices to prove that \({\varvec{\xi }}\in B({\varvec{\nu }})\). To that end, for each n and t, let us define the following functions:

$$\begin{aligned} F^{(n)}_t(z,a)&= C_t^{{\varvec{\nu }}^{(n)}}(z,a) + \int _{{\mathsf Z}} J^{{\varvec{\nu }}^{(n)}}_{*,t+1}(y) \eta _t^{{\varvec{\nu }}^{(n)}}({\text {d}}y|z,a) \end{aligned}$$

and

$$\begin{aligned} F_t(z,a)&= C_t^{{\varvec{\nu }}}(z,a) + \int _{{\mathsf Z}} J^{{\varvec{\nu }}}_{*,t+1}(y) \eta _t^{{\varvec{\nu }}}({\text {d}}y|z,a). \end{aligned}$$

By definition, \( J^{{\varvec{\nu }}^{(n)}}_{*,t}(z) = \min _{a \in {\mathsf A}} F^{(n)}_t(z,a) \text { } \text { and } \text { } J^{{\varvec{\nu }}}_{*,t}(z) = \min _{a \in {\mathsf A}} F_t(z,a). \) Define also the following sets:

$$\begin{aligned} A_t^{(n)} = \bigl \{ (z,a): F^{(n)}_t(z,a) = J^{{\varvec{\nu }}^{(n)}}_{*,t}(z) \bigr \} \text { } \text {and} \text { } A_t = \bigl \{ (z,a): F_t(z,a) = J^{{\varvec{\nu }}}_{*,t}(z) \bigr \}. \end{aligned}$$

Since \({\varvec{\xi }}^{(n)} \in B({\varvec{\nu }}^{(n)})\), we have \( \xi ^{(n)}_t\bigl ( A_t^{(n)} \bigr ) = 1 \) for all \(n\) and \(t\). To prove that \({\varvec{\xi }}\in B({\varvec{\nu }})\), we need to show that \( \xi _t\bigl ( A_t \bigr ) = 1 \) for all \(t\).

First note that since both \(F^{(n)}_t\) and \(J^{{\varvec{\nu }}^{(n)}}_{*,t}\) are continuous, \(A_t^{(n)}\) is closed. Moreover, \(A_t\) is also closed as both \(F_t\) and \(J^{{\varvec{\nu }}}_{*,t}\) are continuous. Using Proposition 4, one can also prove as in [40, Proposition 3.10], [42, Proposition 4.4] that \(F_t^{(n)}\) converges to \(F_t\) continuously and \(J^{{\varvec{\nu }}^{(n)}}_{*,t}\) converges to \(J^{{\varvec{\nu }}}_{*,t}\) continuously, as \(n\rightarrow \infty \).

For each \(M\ge 1\), define the closed set \(B_t^M = \bigl \{ (z,a): F_t(z,a) \ge J^{{\varvec{\nu }}}_{*,t}(z) + \epsilon (M) \bigr \}\), where the sequence \(\{\epsilon (M)\}\) is decreasing and \(\epsilon (M) \rightarrow 0\) as \(M\rightarrow \infty \). Since both \(F_t\) and \(J^{{\varvec{\nu }}}_{*,t}\) are continuous, we can choose \(\{\epsilon (M)\}_{M\ge 1}\) so that \(\xi _t(\partial B_t^M) = 0\) for each M. Note that by the monotone convergence theorem, we have

$$\begin{aligned} \xi ^{(n)}_t\big (A_t^c \cap A_t^{(n)}\big ) = \liminf _{M\rightarrow \infty } \xi ^{(n)}_t\big (B^M_t \cap A_t^{(n)}). \end{aligned}$$

This implies that

$$\begin{aligned} 1&= \limsup _{n\rightarrow \infty } \liminf _{M\rightarrow \infty } \biggl \{ \xi ^{(n)}_t\big (A_t \cap A^{(n)}_t\big ) + \xi ^{(n)}_t\big (B^M_t \cap A_t^{(n)}\big )\biggr \} \\&\le \liminf _{M\rightarrow \infty } \limsup _{n\rightarrow \infty } \biggl \{\xi ^{(n)}_t\big (A_t \cap A^{(n)}_t\big ) + \xi ^{(n)}_t\big (B^M_t \cap A_t^{(n)}\big )\biggr \}. \end{aligned}$$

For any fixed M, we prove that the second term in the last expression converges to zero as \(n\rightarrow \infty \). To that end, we first note that \(\xi ^{(n)}_t\) converges weakly to \(\xi _t\), as \(n\rightarrow \infty \), when both measures are restricted to \(B_t^M\), since \(B_t^M\) is closed and \(\xi _t(\partial B_t^M)=0\) [10, Theorem 8.2.3]. Furthermore, since \(F_t^{(n)}\) converges to \(F_t\) continuously and \(J^{{\varvec{\nu }}^{(n)}}_{*,t}\) converges to \(J^{{\varvec{\nu }}}_{*,t}\) continuously, \(1_{A^{(n)}_t \cap B^M_t}\) converges continuously to 0, which implies by [33, Theorem 3.5] that

$$\begin{aligned} \limsup _{n\rightarrow \infty } \xi ^{(n)}_t\big (B^M_t \cap A^{(n)}_t\big ) = 0. \end{aligned}$$

Therefore, we obtain

$$\begin{aligned} 1 \le \limsup _{n\rightarrow \infty } \xi ^{(n)}_t\big (A_t \cap A_t^{(n)}\big ) \le \limsup _{n\rightarrow \infty } \xi _t^{(n)}(A_t) \le \xi _t(A_t), \end{aligned}$$

where the last inequality follows from the Portmanteau theorem [8, Theorem 2.1] and the fact that \(A_t\) is closed. Hence, \(\xi _t(A_t)=1\). Since t is arbitrary, this is true for all t. This means that \({\varvec{\xi }}\in B({\varvec{\nu }})\). Therefore, \({\varvec{\xi }}\in \varGamma ({\varvec{\nu }})\). \(\square \)

As a result of Proposition 5, we can now conclude via Kakutani's fixed point theorem ([2, Corollary 17.55]) that \(\varGamma \) has a fixed point. Therefore, the pair \((\pi ^{*},{\varvec{\mu }}^*)\) in Proposition 2 is a mean-field equilibrium. This completes the proof of Theorem 1.
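Although the existence argument above is non-constructive, the fixed-point characterization \(\varGamma ({\varvec{\nu }}) = C({\varvec{\nu }}) \cap B({\varvec{\nu }})\) suggests a simple numerical heuristic when \({\mathsf Z}\) and \({\mathsf A}\) are finite: alternate a backward induction, which computes an optimal deterministic policy for the current measure flow, with a forward recursion, which regenerates the flow induced by that policy. The sketch below is purely illustrative and is not part of the proof of Theorem 1; the identifiers (`eta`, `cost`, `mu0`, etc.) are placeholders, the finite-space setting is an assumption, and this naive iteration is not guaranteed to converge.

```python
import numpy as np

def best_response(mu, eta, cost, T):
    """Backward induction: an optimal deterministic policy for a fixed
    measure flow mu, where mu[t] is a probability vector over Z,
    eta(t, mu_t)[z, a, z'] is the transition kernel, and
    cost(t, mu_t)[z, a] is the one-stage cost."""
    nZ, nA = cost(0, mu[0]).shape
    J = np.zeros(nZ)                                  # terminal cost-to-go
    policy = [None] * (T + 1)
    for t in reversed(range(T + 1)):
        P, C = eta(t, mu[t]), cost(t, mu[t])
        F = C + np.einsum('zaw,w->za', P, J)          # F_t(z, a)
        policy[t] = F.argmin(axis=1)
        J = F.min(axis=1)
    return policy

def induced_flow(policy, mu0, eta, T):
    """Forward recursion: the measure flow generated by the policy."""
    mu = [np.asarray(mu0, dtype=float)]
    for t in range(T):
        P = eta(t, mu[t])
        nxt = sum(mu[t][z] * P[z, policy[t][z]] for z in range(len(mu[t])))
        mu.append(nxt)
    return mu

def naive_mfe_iteration(mu0, eta, cost, T, n_iter=50):
    """Heuristic fixed-point iteration; a fixed point satisfies the two
    conditions defining Gamma, but plain iteration need not converge."""
    mu = [np.asarray(mu0, dtype=float)] * (T + 1)
    for _ in range(n_iter):
        policy = best_response(mu, eta, cost, T)
        mu = induced_flow(policy, mu0, eta, T)
    return policy, mu
```

If this iteration does reach a fixed point, the resulting pair satisfies exactly the two requirements defining \(\varGamma \): the measure flow is the one generated by the policy, and the policy is optimal for that flow.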

5 Approximation of Nash Equilibria

We are now ready to prove that the policy in the mean-field equilibrium, when applied by every agent, is an approximate Nash equilibrium for games with a sufficiently large number of agents. Let \((\pi ^{'*},{\varvec{\mu }}^*)\) denote the pair in the mean-field equilibrium. In order to prove the existence of an approximate Nash equilibrium, we need Assumption 2 in addition to Assumption 1.

Our approach can be summarized as follows: (i) Assumption 2 enables us to define another mean-field equilibrium, in which the policy depends deterministically and continuously on only the observations; (ii) we then construct an equivalent game model whose states are the states of the game model in Sect. 2.2 plus the current and past observations; (iii) in this equivalent model, the new mean-field equilibrium policy becomes Markov; (iv) using this Markov structure, we prove that the cost function of a generic agent under any policy in the finite-agent regime, where the rest of the agents adopt the mean-field equilibrium policy, converges to the cost function in the infinite-population limit as the number of agents goes to infinity; (v) since the mean-field equilibrium policy is optimal in the infinite-population limit, we establish the existence of an approximate Nash equilibrium via the result in step (iv).

Let \(d_{BL}\) denote the bounded Lipschitz metric on \({\mathcal P}({\mathsf S})\), which metrizes the weak topology [18, Proposition 11.3.2].

Assumption 2

  1. (a)

    \(\omega _q(r) \rightarrow 0\) and \(\omega _m(r) \rightarrow 0\) as \(r\rightarrow 0\), where

    $$\begin{aligned} \omega _{q}(r)&= \sup _{(s,u) \in {\mathsf S}\times {\mathsf A}} \sup _{\begin{array}{c} \mu ,\nu : \\ d_{BL}(\mu ,\nu )\le r \end{array}} \Vert q(\,\cdot \,|s,u,\mu ) - q(\,\cdot \,|s,u,\nu )\Vert _{TV} \\ \omega _{m}(r)&= \sup _{(s,u) \in {\mathsf S}\times {\mathsf A}} \sup _{\begin{array}{c} \mu ,\nu : \\ d_{BL}(\mu ,\nu )\le r \end{array}} |m(s,u,\mu ) - m(s,u,\nu )|. \end{aligned}$$
  2. (b)

    For each \(t\ge 0\), \(\pi _t^{'*}: {\mathsf G}_t \rightarrow {\mathcal P}({\mathsf A})\) is deterministic and weakly continuous; that is, \(\pi _t^{'*}(\,\cdot \,|g(t)) = \delta _{f_t(g(t))}(\,\cdot \,)\) for some measurable function \(f_t:{\mathsf G}_t\rightarrow {\mathsf A}\), and \(g(t) \mapsto \pi _t^{'*}(\,\cdot \,|g(t))\) is continuous with respect to the weak topology on \({\mathcal P}({\mathsf A})\).

In Appendix 1, we give sufficient conditions for Assumption 2-(b) in terms of the system components.

We now construct another mean-field equilibrium in which the policy depends deterministically on only the observations. For each \(t\ge 0\), let \({\mathsf Y}^{t+1} = \prod _{k=0}^t {\mathsf Y}\). Then, for each \(t\ge 1\), define \(\tilde{f}_t:{\mathsf Y}^{t+1}\rightarrow {\mathsf A}\) as:

$$\begin{aligned}&\tilde{f}_t(y(t),\ldots ,y(0)) = f_t\bigl (y(t),\ldots ,y(0),\tilde{f}_{t-1}(y(t-1),\ldots ,y(0)),\ldots ,\tilde{f}_0(y(0))\bigr ), \end{aligned}$$

where \(\tilde{f}_0 = f_0\). Let \(\pi _t^*(\,\cdot \,|y(t),\ldots ,y(0)) = \delta _{\tilde{f}_t(y(t),\ldots ,y(0))}(\,\cdot \,)\). Note that \(\pi _t^*\) is a weakly continuous stochastic kernel on \({\mathsf A}\) given \({\mathsf Y}^{t+1}\) under Assumption 2-(b). Moreover, \(\pi ^*\) and \(\pi ^{'*}\) are equivalent because, for all t, we have

$$\begin{aligned} P^{\pi ^{'*}}\bigl (a(t) \in \,\cdot \,|g(t)\bigr )&= P^{\pi ^{'*}}\bigl (a(t) \in \,\cdot \,|y(t),\ldots ,y(0)\bigr ) \\&= P^{\pi ^*}\bigl (a(t) \in \,\cdot \,|y(t),\ldots ,y(0)\bigr ). \end{aligned}$$

Hence, \((\pi ^*,{\varvec{\mu }}^*)\) is also a mean-field equilibrium. In the sequel, we use \((\pi ^*,{\varvec{\mu }}^*)\) to prove the approximation result. The reason for passing from \(f_t\) to \(\tilde{f}_t\) is that the latter policy becomes Markov in the equivalent game model that will be introduced in the proof of Theorem 4. Then, we can prove the existence of an approximate Nash equilibrium by adapting the proof techniques and results in [40, 42] to the game models with expanding state spaces and non-homogeneous system components.
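To see how this recursion unrolls in practice, the following sketch (illustrative only; it assumes `f` is a list of callables with signature `f[t](y_t, ..., y_0, a_{t-1}, ..., a_0)`, mirroring the display above, which is a calling convention introduced here rather than anything from the paper) computes the actions \(a(t) = \tilde{f}_t(y(t),\ldots ,y(0))\) forward in time; by induction, feeding the previously computed actions back into \(f_t\) reproduces exactly the defining recursion.

```python
def rollout_actions(f, observations):
    """Compute a_t = tilde_f_t(y_t, ..., y_0) for an observation history
    observations = [y_0, y_1, ..., y_T], given callables
    f[t](y_t, ..., y_0, a_{t-1}, ..., a_0)."""
    actions = []
    for t in range(len(observations)):
        ys = list(reversed(observations[:t + 1]))     # (y_t, ..., y_0)
        past = list(reversed(actions))                # (a_{t-1}, ..., a_0)
        actions.append(f[t](*ys, *past))
    return actions
```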

The following theorem is the main result of this section, which states that the policy \({\varvec{\pi }}^{(N,*)} = (\pi ^*,\ldots ,\pi ^*)\), where \(\pi ^*\) is repeated N times, is an \(\varepsilon \)-Nash equilibrium for sufficiently large N. Its proof appears in the next section.

Theorem 4

For any \(\varepsilon >0\), there exists \(N(\varepsilon )\) such that for \(N\ge N(\varepsilon )\), the policy \({\varvec{\pi }}^{(N,*)}\) is an \(\varepsilon \)-Nash equilibrium for the game with N agents that is introduced in Sect. 2.2. Since the original N-agent game model is equivalent to the one in Sect. 2.2 by Proposition 1, the policy \({\varvec{\pi }}^{(N,*)}\) is also an \(\varepsilon \)-Nash equilibrium for the original game with N agents.

Remark 2

Note that to obtain an explicit relation between \(\varepsilon \) and \(N(\varepsilon )\), one needs to establish that the optimal policy \(\pi ^*\) in the mean-field equilibrium is Lipschitz continuous. In the fully observed continuous-time setup, this is in general easy to establish owing to the very restrictive structural assumptions placed on the system components; in the recent monograph [16], for instance, Lipschitz continuity of the optimal policy in the mean-field equilibrium was established in Lemma 3.3 using regularity properties of the system components. In our setup, however, establishing this would require Lipschitz continuity, strong convexity, and differentiability conditions on the one-stage cost functions \(\{C_t\}\) and the transition probabilities \(\{\eta _t\}\) of the fully observed reduction, and establishing Lipschitz continuity of the transition probabilities \(\{\eta _t\}\) is in general prohibitive. Indeed, even weak continuity of the transition probabilities \(\{\eta _t\}\), which is a much weaker requirement than Lipschitz continuity, has been established only relatively recently in [19]. Moreover, it was shown in that paper that even under very restrictive conditions on the system components, weak continuity of the transition probability cannot be strengthened to setwise continuity, which is itself a rather weak notion of continuity used in the stochastic control literature. The same obstacle would arise in the partially observed continuous-time setup, since the above-mentioned result pertains to the fully observed case.

Remark 3

In the mean-field games literature, uniqueness of the mean-field equilibrium can be established using a monotonicity condition as introduced by Lasry and Lions in [34] (see also [15]). However, in addition to the monotonicity condition, we should also have the following conditions in order to have uniqueness (see, e.g., [15, Assumption U]):

  1. a)

    The cost function should be in additive form.

  2. b)

    The one-stage cost function can be additively decomposed into two functions, where the first function is a function of the state and the mean-field term, and the second function is a function of the state and the action.

  3. c)

    The dynamics of a generic agent should be independent of the mean-field term.

  4. d)

    For any state-measure flow, there exists a unique optimal policy.

Under these conditions, one can prove that if \((\pi ^{{\varvec{\mu }}},{\varvec{\mu }})\) and \((\pi ^{{\varvec{\nu }}},{\varvec{\nu }})\) are two mean-field equilibria, then

$$\begin{aligned} J_{{\varvec{\mu }}}(\pi ^{{\varvec{\mu }}}) + J_{{\varvec{\nu }}}(\pi ^{{\varvec{\nu }}}) \ge J_{{\varvec{\mu }}}(\pi ^{{\varvec{\nu }}}) + J_{{\varvec{\nu }}}(\pi ^{{\varvec{\mu }}}) \end{aligned}$$
(11)

in the equivalent game model. This implies that \(J_{{\varvec{\mu }}}(\pi ^{{\varvec{\mu }}}) = J_{{\varvec{\mu }}}(\pi ^{{\varvec{\nu }}})\) and \(J_{{\varvec{\nu }}}(\pi ^{{\varvec{\nu }}}) = J_{{\varvec{\nu }}}(\pi ^{{\varvec{\mu }}})\). Then, conditions c) and d) ensure that these mean-field equilibria must be the same, which implies uniqueness. However, note that to obtain inequality (11), conditions a), b), and c) must hold. Indeed, even to state the monotonicity condition, we need condition b).

In our case, the cost function in the equivalent game model is in additive form, and thus condition a) holds. Moreover, we can assume the decomposition in condition b). However, if we assume that the transition probabilities \(\{p_t\}\) are independent of the mean-field term, then the transition probability q and the one-stage cost function m of the original game model must also be independent of the mean-field term, since

$$\begin{aligned} p_t\bigl (B \times D \big | x(t),a(t),{\mu _t}\bigr ) = q(B|s(t),a(t),{\mu _{t,1}}) \otimes \delta _{m(t) + \beta ^t m(s(t),a(t),{\mu _{t,1}})}(D). \end{aligned}$$

But then the problem reduces to a classical risk-sensitive stochastic control problem rather than a game.

Conversely, if we consider the original game model instead of the equivalent one, then the cost function is not in additive form, and thus we cannot obtain inequality (11), because conditions a) and b) fail; these conditions are needed, along with the monotonicity condition, to have a unique mean-field equilibrium.

6 Proof of Theorem 4

For the game model introduced in Sect. 2.2, the policy \(\pi ^*\) in the mean-field equilibrium is not necessarily Markov, and so the joint process of the state, observation, and mean-field term does not have the Markov property either. To prove Theorem 4, we will first introduce another equivalent game model whose states are the state of the original game model plus the current and past observations. In this new model, the mean-field equilibrium policy automatically becomes Markov.

In the infinite-population limit, this new mean-field game model is specified by

$$\begin{aligned} \biggl ( \{{\mathsf S}_t\}_{t=0}^{T+1}, {\mathsf A}, \{P_t\}_{t=0}^{T+1}, \{{\mathcal C}_t\}_{t=0}^{T+1}, \lambda _0 \biggr ), \end{aligned}$$

where, for each t, \( {\mathsf S}_t = {\mathsf X}\times \underbrace{{\mathsf Y}\times \ldots \times {\mathsf Y}}_{t+1\text {-times}}\) and \({\mathsf A}\) are the Polish state and action spaces at time t, respectively. The stochastic kernel \(P_t : {\mathsf S}_t \times {\mathsf A}\times {\mathcal P}({\mathsf S}_t) \rightarrow {\mathcal P}({\mathsf S}_{t+1})\) is defined as:

$$\begin{aligned}&P_t\bigl (B_{t+1} \times D_{t+1} \times \ldots \times D_0 \big | b(t),a(t),\varDelta _t\bigr ) \\&= \int _{B_{t+1}} r(D_{t+1}|x(t+1)) \prod _{k=0}^t 1_{D_k}(y(k)) p_t({\text {d}}x(t+1)|x(t),a(t),\varDelta _{t,1}) , \end{aligned}$$

where \(B_{t+1} \in {\mathcal B}({\mathsf X})\), \(D_k \in {\mathcal B}({\mathsf Y})\) (\(k=0,\ldots ,t+1\)), \(b(t) = (x(t),y(t),y(t-1),\ldots ,y(0))\), and \(\varDelta _{t,1}\) is the marginal of \(\varDelta _t\) on \({\mathsf X}\). Indeed, \(P_t\) is the controlled transition probability of the next state-observation pair together with the current and past observations, i.e., \(\bigl (x(t+1),y(t+1),y(t),\ldots ,y(0)\bigr ),\) given the current state-observation pair and past observations, i.e., \(\bigl (x(t),y(t),y(t-1),\ldots ,y(0)\bigr ),\) in the original mean-field game. For each t, the one-stage cost function \({\mathcal C}_t: {\mathsf S}_t \times {\mathsf A}\times {\mathcal P}({\mathsf S}_t) \rightarrow [0,\infty )\) (not to be confused with \(C_t\) in Sect. 4) is defined as:

$$\begin{aligned} {\mathcal C}_t(b(t),a(t),\varDelta _t) = c_t(x(t),a(t),\varDelta _{t,1}). \end{aligned}$$

Finally, the initial measure \(\lambda _0\) is given by \(\lambda _0({\text {d}}b) = r({\text {d}}y|x) \mu _0({\text {d}}x)\), where \(b = (x,y)\). Suppose that Assumption 1 and Assumption 2 hold. Then, for each t, the following are satisfied:

  1. (I)

    The one-stage cost function \({\mathcal C}_t\) is bounded and continuous.

  2. (II)

    The stochastic kernel \(P_t\) is weakly continuous.

It is straightforward to prove that (I) and (II) hold since \(c_t\) is continuous, \(p_t\) is weakly continuous, and r is continuous in total variation norm. Recall the set of policies \(\tilde{\varPi }\) in the original mean-field game which only use the observations; that is, \(\pi \in \tilde{\varPi }\) if \(\pi _t:{\mathsf Y}^{t+1} \rightarrow {\mathcal P}({\mathsf A})\) for each \(t\ge 0\). Note that \(\tilde{\varPi }\) is a subset of the set of Markov policies in the new model. For any measure flow \({\varvec{\varDelta }} = (\varDelta _t)_{t\ge 0}\), where \(\varDelta _t \in {\mathcal P}({\mathsf S}_t)\), we denote by \(\hat{J}_{{\varvec{\varDelta }}}(\pi )\) the finite-horizon risk-neutral total cost of the policy \(\pi \in \tilde{\varPi }\) in this new mean-field game model.

We also define the corresponding N-agent game as follows. We have the Polish state spaces \(\{{\mathsf S}_t\}_{t=0}^{T+1}\) and the action space \({\mathsf A}\). For every t and every \(i \in \{1,2,\ldots ,N\}\), let \(b^N_i(t) \in {\mathsf S}_t\) and \(a^N_i(t) \in {\mathsf A}\) denote the state and the action of Agent i at time t, and let

$$\begin{aligned} \varDelta _t^{(N)}(\,\cdot \,) = \frac{1}{N} \sum _{i=1}^N \delta _{b_i^N(t)}(\,\cdot \,) \in {\mathcal P}({\mathsf S}_t) \end{aligned}$$

denote the empirical distribution of the state configuration at time t. The initial states \(b^N_i(0)\) are independent and identically distributed according to \(\lambda _0\), and, for each t, the next-state configuration \((b^N_1(t+1),\ldots ,b^N_N(t+1))\) is generated according to the probability laws

$$\begin{aligned}&\prod ^N_{i=1} P_{t}\big ({\text {d}}b^N_i(t+1)\big |b^N_i(t),a^N_i(t),\varDelta ^{(N)}_t\big ). \end{aligned}$$

Recall that \(\tilde{\varPi }_i\) denotes the set of policies for Agent i in the original game that use only the local observations. Note that policies in \(\tilde{\varPi }_i\) are Markov for the new model, since the observations form part of the state in that model. We let \(\tilde{\varPi }_i^c\) denote the set of all policies in \(\tilde{\varPi }_i\) for Agent i that are weakly continuous; that is, \(\pi =\{\pi _t\}\in \tilde{\varPi }_i^c\) if, for all \(t\ge 0\), \(\pi _t: {\mathsf Y}^{t+1} \rightarrow {\mathcal P}({\mathsf A})\) is continuous when \({\mathcal P}({\mathsf A})\) is endowed with the weak topology. For Agent i, the finite-horizon risk-neutral total cost under the initial distribution \(\lambda _0\) and the N-tuple of policies \({\varvec{\pi }}^{(N)} \in \tilde{\varvec{\varPi }}^{(N)}\) is denoted by \(\hat{J}_i^{(N)}({\varvec{\pi }}^{(N)})\).
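For intuition, the following sketch simulates the N-agent game in this expanded-state formulation. It is illustrative only: `sample_x0`, `r_obs`, `p_next`, and `policy` are hypothetical callables standing in for \(\mu _0\), the observation channel, the state transition \(p_t\), and an observation-history policy such as \(\tilde{f}_t\); the coupling through the list of all hidden states stands in for the empirical mean-field term.

```python
def simulate_expanded_game(N, T, sample_x0, r_obs, p_next, policy, rng):
    """Simulate the N-agent game in the expanded-state formulation.

    sample_x0(rng)           -> initial hidden state x_i(0)
    r_obs(x, rng)            -> observation y drawn given x
    p_next(x, a, xs, rng)    -> next hidden state; xs is the list of all
                                agents' hidden states (the empirical coupling)
    policy[t](y_t, ..., y_0) -> action of the observation-only policy
    Returns the expanded states b_i(T) = (x_i(T), y_i(T), ..., y_i(0))."""
    xs = [sample_x0(rng) for _ in range(N)]
    obs = [[r_obs(x, rng)] for x in xs]                # obs[i] = [y_i(0), y_i(1), ...]
    for t in range(T):
        acts = [policy[t](*reversed(obs[i])) for i in range(N)]
        xs = [p_next(xs[i], acts[i], xs, rng) for i in range(N)]
        for i in range(N):
            obs[i].append(r_obs(xs[i], rng))
    return [(xs[i], *reversed(obs[i])) for i in range(N)]
```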

The following proposition makes the connection between this new model and the original model.

Proposition 6

For any \(N\ge 1\), \({\varvec{\pi }}^{(N)} \in \tilde{\varvec{\varPi }}^{(N)}\), and \(i=1,\ldots ,N\), we have \(\hat{J}_i^{(N)}({\varvec{\pi }}^{(N)}) = J_i^{(N)}({\varvec{\pi }}^{(N)})\). Similarly, for any \(\pi \in \tilde{\varPi }\) and measure flow \({\varvec{\varDelta }}\), we have \(\hat{J}_{\varvec{\varDelta }}(\pi ) = J_{{\varvec{\mu }}}(\pi )\), where \({\varvec{\mu }}= (\varDelta _{t,1})_{t\ge 0}\).

Proof

The result can easily be proved as in [41, Proposition 5.1], and thus, we do not include the details. \(\square \)

By Proposition 6, in the remainder of this section we consider the new game model in place of the one introduced in Sect. 2.2. Define the measure flow \({\varvec{\varDelta }} = (\varDelta _t)_{t\ge 0}\) as follows:

$$\begin{aligned}\varDelta _t = \mathcal{L}(x(t),y(t),\ldots ,y(0)),\end{aligned}$$

where \(\mathcal{L}(x(t),y(t),\ldots ,y(0))\) denotes the probability law of \((x(t),y(t),\ldots ,y(0))\) in the original mean-field game under the policy \(\pi ^*\) in the mean-field equilibrium. For each \(t\ge 0\), define the stochastic kernel \(P_t^{\pi ^*}(\,\cdot \,|b,\varDelta )\) on \({\mathsf S}_{t+1}\) given \({\mathsf S}_{t} \times {\mathcal P}({\mathsf S}_{t})\) as

$$\begin{aligned} P_t^{\pi ^*}(\,\cdot \,|b,\varDelta ) = \int _{{\mathsf A}} P_t(\,\cdot \,|b,a,\varDelta ) \pi _t^*({\text {d}}a|b). \end{aligned}$$

Since \(\pi _t^*\) is weakly continuous, \(P_t^{\pi ^*}(\,\cdot \,|b,\varDelta )\) is also weakly continuous in \((b,\varDelta )\). In the sequel, to ease the notation, we will also write \(P_t^{\pi ^*}(\,\cdot \,|b,\varDelta )\) as \(P_{t,\varDelta }^{\pi ^*}(\,\cdot \,|b)\).

Lemma 1

Measure flow \({\varvec{\varDelta }}\) satisfies

$$\begin{aligned} \varDelta _{t+1}(\,\cdot \,)&= \int _{{\mathsf S}_t} P_{t}^{\pi ^*}(\,\cdot \,|b,\varDelta _t) \varDelta _t({\text {d}}b) \\&= \varDelta _t P_{t,\varDelta _t}^{\pi ^*}(\,\cdot \,). \end{aligned}$$

Proof

The result can easily be proved as in [41, Lemma 5.1], and thus, we do not include the details. \(\square \)

For each \(N\ge 1\), let \(\bigl \{b_i^{N}(t)\bigr \}_{1\le i\le N}\) denote the states of agents at time t in the N-agent new game model under the policy \({\varvec{\pi }}^{(N,*)} = \{\pi ^*,\pi ^*,\ldots ,\pi ^*\}\). Define the empirical distribution

$$\begin{aligned} \varDelta _t^{(N)}(\,\cdot \,) = \frac{1}{N} \sum _{i=1}^N \delta _{b_i^{N}(t)}(\,\cdot \,). \end{aligned}$$

Proposition 7

For all \(t\ge 0\), we have \( \mathcal{L}(\varDelta _t^{(N)}) \rightarrow \delta _{\varDelta _t}\) weakly in \({\mathcal P}({\mathcal P}({\mathsf S}_t))\), as \(N\rightarrow \infty \).

Proof

The weak topology on \({\mathcal P}({\mathsf S}_t)\) can be metrized using the following metric:

$$\begin{aligned} \rho (\mu ,\nu ) = \sum _{m=1}^{\infty } 2^{-(m+1)} | \mu (f_m) - \nu (f_m) |, \end{aligned}$$

where \(\{f_m\}_{m\ge 1}\) is a sequence of real continuous and bounded functions on \({\mathsf S}_t\) such that \(\Vert f_m\Vert \le 1\) for all \(m\ge 1\) (see [38, Theorem 6.6, p. 47]). Define the Wasserstein distance of order 1 on the set of probability measures \({\mathcal P}({\mathcal P}({\mathsf S}_t))\) as follows (see [51, Definition 6.1]):

$$\begin{aligned} W_1(\varPhi ,\varPsi ) = \inf \bigl \{ E[\rho (X,Y)]: \mathcal{L}(X) = \varPhi \text { and } \mathcal{L}(Y) = \varPsi \bigr \}. \end{aligned}$$

Note that since \(\delta _{\varDelta _t}\) is a Dirac measure, any coupling \((X,Y)\) with \(\mathcal{L}(X) = \mathcal{L}(\varDelta _t^{(N)})\) and \(\mathcal{L}(Y) = \delta _{\varDelta _t}\) satisfies \(Y = \varDelta _t\) almost surely. Therefore,

$$\begin{aligned} W_1(\mathcal{L}(\varDelta _t^{(N)}),\delta _{\varDelta _t})&= E\bigl [\rho (\varDelta _t^{(N)},\varDelta _t)\bigr ] \\&= E\biggl [ \sum _{m=1}^{\infty } 2^{-(m+1)} | \varDelta _t^{(N)}(f_m) - \varDelta _t(f_m) | \biggr ]. \end{aligned}$$

Since convergence in \(W_1\) distance implies weak convergence (see [51, Theorem 6.9]), it suffices to prove that

$$\begin{aligned} \lim _{N\rightarrow \infty } E\bigl [|\varDelta _t^{(N)}(f) - \varDelta _t(f)|\bigr ] = 0 \end{aligned}$$

for any \(f \in C_b({\mathsf S}_t)\) and for all t. We prove this by induction on t.

As \(\{b_i^N(0)\}_{1\le i\le N}\) are i.i.d. with common distribution \(\varDelta _0\), the claim is true for \(t=0\). We suppose that the claim holds for t and consider \(t+1\). Fix any \(g \in C_b({\mathsf S}_{t+1})\). Then, we have

$$\begin{aligned}&|\varDelta _{t+1}^{(N)}(g) - \varDelta _{t+1}(g)| \nonumber \\&\le |\varDelta _{t+1}^{(N)}(g) - \varDelta _{t}^{(N)} P^{\pi ^*}_{t,\varDelta _t^{(N)}}(g)| + |\varDelta _t^{(N)} P^{\pi ^*}_{t,\varDelta _t^{(N)}}(g) - \varDelta _t P^{\pi ^*}_{t,\varDelta _t}(g) |. \end{aligned}$$
(12)

We first prove that the expectation of the second term on the right-hand side (RHS) of (12) converges to 0 as \(N\rightarrow \infty \). To that end, define \(F: {\mathcal P}({\mathsf S}_{t}) \rightarrow \mathbb {R}\) as:

$$\begin{aligned} F(\varDelta ) = \varDelta P^{\pi ^*}_{t,\varDelta }(g) = \int _{{\mathsf S}_t} \int _{{\mathsf S}_{t+1}} g(b') P^{\pi ^*}_t({\text {d}}b'|b,\varDelta ) \varDelta ({\text {d}}b). \end{aligned}$$

One can prove that \(F \in C_b({\mathcal P}({\mathsf S}_t))\). Indeed, suppose that \(\varDelta _n\) converges to \(\varDelta \). Let us define

$$\begin{aligned} l_n(b)&= \int _{{\mathsf S}_{t+1}} g(b') P^{\pi ^*}_t({\text {d}}b'|b,\varDelta _n) \text { } \text {and} \text { } l(b) = \int _{{\mathsf S}_{t+1}} g(b') P^{\pi ^*}_t({\text {d}}b'|b,\varDelta ). \end{aligned}$$

Since \(P^{\pi ^*}_t\) is weakly continuous, one can prove that \(l_n\) converges to l continuously. By [33, Theorem 3.5], we have \(F(\varDelta _n) \rightarrow F(\varDelta )\), and so, \(F \in C_b({\mathcal P}({\mathsf S}_t))\). This implies that the expectation of the second term on the RHS of (12) converges to zero as \(\mathcal{L}(\varDelta _t^{(N)}) \rightarrow \delta _{\varDelta _t}\) weakly, by the induction hypothesis.

Now, let us write the expectation of the first term on the RHS of (12) as:

$$\begin{aligned} E\biggl [ E\biggl [ |\varDelta _{t+1}^{(N)}(g) - \varDelta _t^{(N)} P^{\pi ^*}_{t,\varDelta _t^{(N)}}(g)| \biggr | b_1^N(t),\ldots ,b_N^N(t) \biggr ] \biggr ]. \end{aligned}$$

Then, by [11, Lemma A.2], we have

$$\begin{aligned} E\biggl [ |\varDelta _{t+1}^{(N)}(g) - \varDelta _t^{(N)} P^{\pi ^*}_{t,\varDelta _t^{(N)}}(g)| \biggr | b_1^N(t),\ldots ,b_N^N(t) \biggr ] \le 2 \frac{\Vert g\Vert }{\sqrt{N}}. \end{aligned}$$

Therefore, the expectation of the first term on the RHS of (12) also converges to zero as \(N\rightarrow \infty \). Since g was arbitrary, this completes the proof. \(\square \)

Proposition 7 is the key to proving the main theorem. It says that, in the infinite-population limit, the empirical distribution of the states under the mean-field equilibrium policy converges to the deterministic measure flow \({\varvec{\varDelta }}\) (a law-of-large-numbers principle). This result leads to the following important proposition.

Proposition 8

We have

$$\begin{aligned} \lim _{N\rightarrow \infty } \hat{J}_1^{(N)}({\varvec{\pi }}^{(N,*)}) = \hat{J}_{{\varvec{\varDelta }}}(\pi ^*) = \inf _{\pi ' \in \varPi } \hat{J}_{{\varvec{\varDelta }}}(\pi '). \end{aligned}$$

Proof

As the transition probabilities \(P_t(\,\cdot \,|b,a,\varDelta )\) are continuous in \(\varDelta \), the dynamics of the state of a generic agent in the finite-agent game with sufficiently many agents should be close to the dynamics of the state in the mean-field game, under the policies \({\varvec{\pi }}^{(N,*)} = (\pi ^*,\ldots ,\pi ^*)\) and \(\pi ^*\), respectively. Hence, the distributions of the states in these games should also be close, from which we obtain the proposition. The precise mathematical proof is given below.

For each \(t\ge 0\), let us define

$$\begin{aligned} {\mathcal C}_{\pi _t^*}(b,\varDelta ) = \int _{{\mathsf A}} {\mathcal C}_t(b,a,\varDelta ) \pi _t^*({\text {d}}a|b). \end{aligned}$$

Note that random elements \(\bigl (b_1^N(t),\ldots ,b_N^N(t),\varDelta _t^{(N)}\bigr )\) are exchangeable; that is, for any permutation \(\sigma \) of \(\{1,\ldots ,N\}\), we have

$$\begin{aligned} \mathcal{L}\bigl (b_1^N(t),\ldots ,b_N^N(t),\varDelta _t^{(N)}\bigr ) = \mathcal{L}\bigl (b_{\sigma (1)}^N(t),\ldots ,b_{\sigma (N)}^N(t),\varDelta _t^{(N)}\bigr ). \end{aligned}$$

Hence, the cost function at time t can be written as:

$$\begin{aligned} E\bigl [ {\mathcal C}_t(b_1^N(t),a_1^N(t),\varDelta _t^{(N)}) \bigr ]&= \frac{1}{N} \sum _{i=1}^N E\bigl [ {\mathcal C}_t(b_i^N(t),a_i^N(t),\varDelta _t^{(N)}) \bigr ] \\&= E\bigl [ \varDelta _t^{(N)}\bigl ({\mathcal C}_{\pi _t^*}(b,\varDelta _t^{(N)})\bigr ) \bigr ]. \end{aligned}$$

Define \(F: {\mathcal P}({\mathsf S}_t) \rightarrow \mathbb {R}\) as

$$\begin{aligned} F(\varDelta ) = \int _{{\mathsf S}_t} {\mathcal C}_{\pi _t^*}(b,\varDelta ) \varDelta ({\text {d}}b). \end{aligned}$$

One can show that \(F \in C_b({\mathcal P}({\mathsf S}_t))\) as \(\pi _t^*\) is weakly continuous. Hence, by Proposition 7, we obtain

$$\begin{aligned} \lim _{N\rightarrow \infty } E\bigl [ {\mathcal C}_t(b_1^N(t),a_1^N(t),\varDelta _t^{(N)}) \bigr ]&= \lim _{N\rightarrow \infty } E\bigl [ \varDelta _t^{(N)}\bigl ({\mathcal C}_{\pi _t^*}(b,\varDelta _t^{(N)})\bigr ) \bigr ] \nonumber \\&= \lim _{N\rightarrow \infty } E[F(\varDelta _t^{(N)})] \nonumber \\&= F(\varDelta _t) \nonumber \\&= \varDelta _t({\mathcal C}_{\pi _t^*}(\,\cdot \,,\varDelta _t)). \end{aligned}$$
(13)

Note that by Lemma 1, the cost in the mean-field game can be written as:

$$\begin{aligned} \hat{J}_{{\varvec{\varDelta }}}(\pi ^*) = \sum _{t=0}^{T+1} \varDelta _t({\mathcal C}_{\pi _t^*}(\,\cdot \,,\varDelta _t)). \end{aligned}$$

Therefore, by (13) and the dominated convergence theorem, we obtain

$$\begin{aligned} \lim _{N\rightarrow \infty } \hat{J}_1^{(N)}({\varvec{\pi }}^{(N,*)}) = \hat{J}_{{\varvec{\varDelta }}}(\pi ^*), \end{aligned}$$

which completes the proof. \(\square \)

To obtain the approximation result, we should show that if the policy of some agent deviates from the mean-field equilibrium policy, then the corresponding cost of this agent should be close to the cost in the mean-field limit as in Proposition 8, for N sufficiently large. Since the transition probabilities and the one-stage cost functions are identical for all agents in the game model, it is sufficient to change the policy of Agent 1 for each N. To that end, let \(\{{\tilde{\pi }}^{(N)}\}_{N\ge 1} \subset \tilde{\varPi }_1^c\) be an arbitrary sequence of policies for Agent 1; that is, for each \(N\ge 1\) and \(t\ge 0\), \({\tilde{\pi }}_t^{(N)}: {\mathsf Y}^{t+1} \rightarrow {\mathcal P}({\mathsf A})\) is weakly continuous. For each \(N\ge 1\), let \(\bigl \{{\tilde{b}}_i^N(t)\bigr \}_{1\le i \le N}\) be the collection of states in the N-person game under the policy \(\tilde{{\varvec{\pi }}}^{(N)} = \{{\tilde{\pi }}^{(N)},\pi ^*,\ldots ,\pi ^*\}\). Define

$$\begin{aligned} \tilde{\varDelta }_t^{(N)}(\,\cdot \,) = \frac{1}{N} \sum _{i=1}^N \delta _{{\tilde{b}}_i^{(N)}(t)}(\,\cdot \,). \end{aligned}$$

The following result says that the asymptotic behavior of the empirical distribution of the states at each time t is insensitive to local deviations from the mean-field equilibrium policy.

Proposition 9

For all \(t\ge 0\), we have \( \mathcal{L}(\tilde{\varDelta }_t^{(N)}) \rightarrow \delta _{\varDelta _t}\) weakly in \({\mathcal P}({\mathcal P}({\mathsf S}_t))\), as \(N \rightarrow \infty \).

Proof

The proof can be done by slightly modifying the proof of Proposition 7, and therefore will not be included here. \(\square \)

For each \(N\ge 1\), let \(\{{\hat{b}}^N(t)\}_{t\ge 0}\) denote the state trajectory of the generic agent in the mean-field game (i.e., infinite-population limit) under policy \({\tilde{\pi }}^{(N)}\); that is, \({\hat{b}}^N(t)\) evolves as follows:

$$\begin{aligned} {\hat{b}}^N(0) \sim \lambda _0 \text { and } {\hat{b}}^N(t+1) \sim P^{{\tilde{\pi }}^{(N)}}_{t,\varDelta _t}(\,\cdot \,|{\hat{b}}^N(t)). \end{aligned}$$

The cost function of this mean-field game is given by

$$\begin{aligned} \hat{J}_{{\varvec{\varDelta }}}({\tilde{\pi }}^{(N)}) = \sum _{t=0}^{T+1} E\bigl [ {\mathcal C}_t({\hat{b}}^N(t),\hat{a}^N(t),\varDelta _t)\bigr ], \end{aligned}$$
(14)

where the action at each time \(t\ge 0\) is generated according to the probability law

$$\begin{aligned} \tilde{\pi }^{(N)}_t(d\hat{a}^N(t)|{\hat{b}}^N(t)) = \tilde{\pi }^{(N)}_t(d\hat{a}^N(t)|\hat{y}^N(t),\ldots ,\hat{y}^N(0)). \end{aligned}$$

The following result is somewhat technical but essential for proving the main result. Its proof is quite long, and thus it is deferred to Appendix 2.

Proposition 10

For any \(t\ge 0\), we have

$$\begin{aligned} \lim _{N\rightarrow \infty } \bigl | \mathcal{L}({\tilde{b}}_1^N(t))(g_N) - \mathcal{L}({\hat{b}}^N(t))(g_N) \bigr | = 0 \end{aligned}$$

for any sequence \(\{g_N\} \subset C_b({\mathsf S}_t)\) such that \(\sup _{N\ge 1}\Vert g_N\Vert <\infty \) and \(\omega _g(r) \rightarrow 0\) as \(r \rightarrow 0\), where

$$\begin{aligned} \omega _g(r) = \sup _{\begin{array}{c} s \in {\mathsf S} \\ y^t \in {\mathsf Y}^t \end{array}} \sup _{N\ge 1} \sup _{\begin{array}{c} m,m' \\ |m - m'| \le r \end{array}} |g_N(s,m,y^t) - g_N(s,m',y^t)|. \end{aligned}$$

Using Proposition 10, we now prove the following result.

Theorem 5

Let \(\{{\tilde{\pi }}^{(N)}\}_{N\ge 1} \subset \tilde{\varPi }_1^c\) be an arbitrary sequence of policies for Agent 1. Then, we have

$$\begin{aligned} \lim _{N \rightarrow \infty } \bigl | \hat{J}_1^{(N)}({\tilde{\pi }}^{(N)},\pi ^*,\ldots ,\pi ^*) - \hat{J}_{{\varvec{\varDelta }}}({\tilde{\pi }}^{(N)}) \bigr | = 0, \end{aligned}$$

where \(\hat{J}_{{\varvec{\varDelta }}}({\tilde{\pi }}^{(N)})\) is given in (14).

Proof

Since \({\mathcal C}_t = 0\) for \(t \le T\), we set \(t=T+1\). We have

$$\begin{aligned}&\bigl | \hat{J}_1^{(N)}({\tilde{\pi }}^{(N)},\pi ^*,\ldots ,\pi ^*) - \hat{J}_{{\varvec{\varDelta }}}({\tilde{\pi }}^{(N)}) \bigr | = \bigl | E\bigl [ {\mathcal C}_t({\tilde{b}}_1^N(t)) \bigr ] - E\bigl [ {\mathcal C}_t({\hat{b}}^N(t)) \bigr ] \bigr |. \end{aligned}$$

Note that \({\mathcal C}_t(b) = {\mathcal C}_t((s,m,y_0,\ldots ,y_t)) = e^{\lambda m}\) depends only on \(m \in [0,L]\) and is Lipschitz in \(m\) on this interval. Therefore, the term in the above equation converges to zero by Proposition 10 (applied with the constant sequence \(g_N = {\mathcal C}_t\)). \(\square \)

As a corollary of Proposition 8 and Theorem 5, we obtain the following result.

Corollary 1

We have

$$\begin{aligned} \lim _{N \rightarrow \infty } \hat{J}_1^{(N)}({\tilde{\pi }}^{(N)},\pi ^*,\ldots ,\pi ^*)&\ge \inf _{\pi ' \in \tilde{\varPi }} \hat{J}_{{\varvec{\varDelta }}}(\pi ') = \hat{J}_{{\varvec{\varDelta }}}(\pi ^*) \\&= \lim _{N \rightarrow \infty } \hat{J}_1^{(N)}(\pi ^*,\pi ^*,\ldots ,\pi ^*), \end{aligned}$$

where \(\{{\tilde{\pi }}^{(N)}\}_{N\ge 1} \subset \tilde{\varPi }_1^c\) is an arbitrary sequence of policies for Agent 1.

Now, we are ready to prove the main result of this section.

Proof of Theorem 4

One can prove that for any policy \({\varvec{\pi }}^{(N)} \in \tilde{\varvec{\varPi }}^{(N)}\), we have

$$\begin{aligned} \inf _{\pi ^i \in \tilde{\varPi }_i} \hat{J}_i^{(N)}({\varvec{\pi }}^{(N)}_{-i},\pi ^i) = \inf _{\pi ^i \in \tilde{\varPi }_i^c} \hat{J}_i^{(N)}({\varvec{\pi }}^{(N)}_{-i},\pi ^i) \end{aligned}$$

for each \(i=1,\ldots ,N\) (see the proof of [40, Theorem 2.3]). Hence, it is sufficient to consider weakly continuous policies in \(\mathbf{\varPi }^{(N)}\) to establish the existence of an \(\varepsilon \)-Nash equilibrium in the new model.

We prove that, for sufficiently large N, we have

$$\begin{aligned} \hat{J}_i^{(N)}({\varvec{\pi }}^{(N,*)})&\le \inf _{\pi ^i \in \tilde{\varPi }_i^c} \hat{J}_i^{(N)}({\varvec{\pi }}^{(N,*)}_{-i},\pi ^i) + \varepsilon \end{aligned}$$
(15)

for each \(i=1,\ldots ,N\). As indicated earlier, since the transition probabilities and the one-stage cost functions are the same for all agents in the new game, it is sufficient to prove (15) for Agent 1 only. Given \(\varepsilon > 0\), for each \(N\ge 1\), let \({\tilde{\pi }}^{(N)} \in \tilde{\varPi }_1^c\) be such that

$$\begin{aligned} \hat{J}_1^{(N)} ({\tilde{\pi }}^{(N)},\pi ^*,\ldots ,\pi ^*) < \inf _{\pi ' \in \tilde{\varPi }_1^c} \hat{J}_1^{(N)} (\pi ',\pi ^*,\ldots ,\pi ^*) + \frac{\varepsilon }{3}. \end{aligned}$$

Then, by Corollary 1, we have

$$\begin{aligned} \lim _{N\rightarrow \infty } \hat{J}_1^{(N)} ({\tilde{\pi }}^{(N)},\pi ^*,\ldots ,\pi ^*)&= \lim _{N\rightarrow \infty } \hat{J}_{{\varvec{\varDelta }}}({\tilde{\pi }}^{(N)}) \\&\ge \inf _{\pi '} \hat{J}_{{\varvec{\varDelta }}}(\pi ') \\&= \hat{J}_{{\varvec{\varDelta }}}(\pi ^*) \\&= \lim _{N\rightarrow \infty } \hat{J}_1^{(N)} (\pi ^*,\pi ^*,\ldots ,\pi ^*). \end{aligned}$$

Therefore, there exists \(N(\varepsilon )\) such that for \(N\ge N(\varepsilon )\), we have

$$\begin{aligned} \inf _{\pi ' \in \tilde{\varPi }_1^c} \hat{J}_1^{(N)} (\pi ',\pi ^*,\ldots ,\pi ^*) + \varepsilon&> \hat{J}_1^{(N)} ({\tilde{\pi }}^{(N)},\pi ^*,\ldots ,\pi ^*) + \frac{2\varepsilon }{3} \\&\ge \hat{J}_{{\varvec{\varDelta }}}(\pi ^*) + \frac{\varepsilon }{3} \\&\ge \hat{J}_1^{(N)} (\pi ^*,\pi ^*,\ldots ,\pi ^*). \end{aligned}$$

The result then follows from Proposition 6. \(\square \)

7 Infinite Horizon Cost Function

In this section, we extend Theorem 4 to games with infinite-horizon risk-sensitive cost functions; that is, a generic agent’s infinite-horizon risk-sensitive cost under the initial distribution \(\kappa _0\) and the N-tuple of infinite-horizon policies \({\varvec{\pi }}^{(N,\infty )}=(\pi ^{(1,\infty )},\ldots ,\pi ^{(N,\infty )}) \in \mathbf{\varPi }^{(N)}\) is given by

$$\begin{aligned} W_i^{(N,\infty )}({\varvec{\pi }}^{(N,\infty )})&= E^{{\varvec{\pi }}^{(N,\infty )}}\biggl [ e^{\lambda \sum _{t=0}^{\infty }\beta ^{t}m(s_{i}^N(t),u_{i}^N(t),d^{(N)}_t)}\biggr ], \end{aligned}$$

where, for each Agent j, \(\pi ^{(j,\infty )} = \{\pi ^{(j,\infty )}_0,\pi ^{(j,\infty )}_1,\ldots \}\) (i.e., an infinite sequence of stochastic kernels). Note that, by [42, Lemma 4.3], any infinite-horizon risk-sensitive cost can be approximated by its finite T-horizon truncation with an error bound of \(\theta \beta ^{T+1}\) for some constant \(\theta > 0\) that is independent of the policy \({\varvec{\pi }}^{(N,\infty )}\); i.e.,

$$\begin{aligned} \big |W_i^{(N,\infty )}({\varvec{\pi }}^{(N,\infty )}) - W_i^{(N)}({\varvec{\pi }}^{(N,\infty )})\big | \le \theta \beta ^{T+1}. \end{aligned}$$
(16)

Then, the following theorem is a consequence of (16) and Theorem 4.

Theorem 6

For any \(\varepsilon >0\), choose T such that \(\theta \beta ^{T+1} < \frac{\varepsilon }{3}\) and let \(N(\frac{\varepsilon }{3})\) be the constant in Theorem 4 for the finite horizon T. Then, for \(N\ge N(\frac{\varepsilon }{3})\), the policy \({\varvec{\pi }}^{(N,\infty )}\) is an \(\varepsilon \)-Nash equilibrium for the infinite-horizon risk-sensitive game with N agents, where \({\varvec{\pi }}^{(N,\infty )} = (\pi ^{\infty },\ldots ,\pi ^{\infty })\),

$$\begin{aligned}\pi ^{\infty } = \big \{\underbrace{\pi _0^*, \ldots ,\pi _T^*}_ {T+1\text {-times}} , \pi _{T+1},\pi _{T+2}, \ldots \big \},\end{aligned}$$

\(\pi ^* = \{\pi _t^*\}_{t=0}^T\) is the policy in the mean-field equilibrium of the T-horizon game, and \(\{\pi _t\}_{t=T+1}^{\infty }\) is some arbitrary policy.
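For illustration only, with hypothetical values of the constants (they are not taken from the paper), suppose \(\theta = 1\), \(\beta = 0.9\), and \(\varepsilon = 0.3\). The requirement \(\theta \beta ^{T+1} < \varepsilon /3 = 0.1\) amounts to \(T+1 > \ln (0.1)/\ln (0.9) \approx 21.85\), so the truncation horizon \(T = 21\) suffices:

$$\begin{aligned} \theta \beta ^{T+1} = (0.9)^{22} \approx 0.098 < 0.1 = \frac{\varepsilon }{3}. \end{aligned}$$

Theorem 4, applied to the resulting \(T\)-horizon game with tolerance \(\varepsilon /3\), then yields the required \(N(\varepsilon /3)\).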

8 An Example

In this section, we consider an additive noise model to illustrate our results. In this model, the state and observation dynamics of a generic agent for the infinite-population game are given, respectively, by

$$\begin{aligned} s(t+1)&= \int _{{\mathsf S}} f(s(t),u(t),s) d_t({\text {d}}s) + g(s(t),u(t)) w(t) \\&=:F(s(t),u(t),d_t) + g(s(t),u(t)) w(t) \end{aligned}$$

and

$$\begin{aligned} y(t)&= h(s(t)) + v(t), \end{aligned}$$

where \(s(t) \in {\mathsf S}\), \(y(t) \in {\mathsf Y}\), \(u(t) \in {\mathsf A}\), \(w(t) \in {\mathsf W}\), and \(v(t) \in {\mathsf V}\). Here, we assume that \({\mathsf S}= {\mathsf Y}= {\mathsf W}={\mathsf V}= \mathbb {R}\), \({\mathsf A}\subset \mathbb {R}\), and \(\{w(t)\}\) and \(\{v(t)\}\) are sequences of i.i.d. standard normal random variables independent of each other. The one-stage cost function of a generic agent is given by

$$\begin{aligned} m(s(t),u(t),d_t) = \int _{{\mathsf S}} b(s(t),u(t),s) d_t({\text {d}}s), \end{aligned}$$

for some measurable function \(b: {\mathsf S}\times {\mathsf A}\times {\mathsf S}\rightarrow [0,\infty )\).

This model is the infinite-population limit of the N-agent game model with state and observation dynamics

$$\begin{aligned} s_i^N(t+1)&= \frac{1}{N} \sum _{j=1}^N f(s_i^N(t),u_i^N(t),s_j^N(t)) + g(s_i^N(t),u_i^N(t)) w_i^N(t) \\ y_i^N(t)&= h(s_i^N(t)) + v_i^N(t) \end{aligned}$$

and the one-stage cost function

$$\begin{aligned} m(s_i^N(t),u_i^N(t),d_t^{(N)})&= \frac{1}{N} \sum _{j=1}^N b(s_i^N(t),u_i^N(t),s_j^N(t)). \end{aligned}$$

For this model, Assumption 1 holds with \(v(s) = 1 + s^2\) and \(\alpha = \max \{1 + \Vert f\Vert ^2, L\}\) under the following conditions: (i) \({\mathsf A}\) is compact, (ii) b is continuous and bounded, (iii) g is continuous, and f is bounded and continuous, (iv) \(\sup _{u \in {\mathsf A}} g^2(s,u) \le L s^2\) for some \(L>0\), (v) h is continuous and bounded. Note that \(\Vert f\Vert \) is defined as:

$$\begin{aligned} \Vert f\Vert :=\sup _{(s,u,s') \in {\mathsf S}\times {\mathsf A}\times {\mathsf S}} |f(s,u,s')|. \end{aligned}$$

Moreover, Assumption 2-(a) holds under the following conditions: (vi) \(b(s,u,s')\) is (uniformly) Lipschitz in \(s'\), (vii) \(f(s,u,s')\) is (uniformly) Lipschitz in \(s'\), and (viii) g is bounded and \(\inf _{(s,u) \in {\mathsf S}\times {\mathsf A}} |g(s,u)| > 0\). For the proofs of these facts, we refer the reader to [41, Section 7].

In order to have Assumption 2-(b), we need to assume that \({\mathsf A}\) is convex. In addition, suppose that \(q({\text {d}}s'|s,a,\mu ) = \varrho (s'|s,a,\mu ) \nu ({\text {d}}s')\) and \(l({\text {d}}y|s) = \zeta (y|s) \nu ({\text {d}}y)\), where \(\nu \) denotes the Lebesgue measure. Assume that both \(\varrho \) and \(\zeta \) are continuous and bounded, and that \(\varrho \) and m are strictly convex in a. For the justification of Assumption 2-(b) in this case, we refer the reader to Appendix 1.
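As a small illustration of this model, the sketch below simulates the N-agent dynamics and forms a crude Monte Carlo estimate of the risk-sensitive cost of a generic agent. The specific choices of f, g, h, b, the policy, and the constants are hypothetical placeholders; they are not taken from the paper and are not claimed to verify every condition (i)-(viii) above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical placeholder components (bounded and continuous), for illustration only.
f = lambda s, u, sp: np.tanh(s + u + 0.5 * np.tanh(sp))
g = lambda s, u: 0.5 + 0.1 * np.tanh(s * u)
h = np.tanh
b = lambda s, u, sp: 0.5 * (np.tanh(s - sp) ** 2 + u ** 2)

N, T, beta, lam = 200, 10, 0.9, 0.5
s = rng.normal(size=N)                               # s_i^N(0)
accumulated = np.zeros(N)                            # sum_t beta^t m(s_i, u_i, d_t^(N))
for t in range(T + 1):
    y = h(s) + rng.normal(size=N)                    # y_i^N(t) = h(s_i^N(t)) + v_i^N(t)
    u = np.clip(-y, -1.0, 1.0)                       # placeholder memoryless policy, A = [-1, 1]
    m = np.array([b(s[i], u[i], s).mean() for i in range(N)])     # m(s_i, u_i, d_t^(N))
    accumulated += beta ** t * m
    drift = np.array([f(s[i], u[i], s).mean() for i in range(N)]) # empirical mean-field coupling
    s = drift + g(s, u) * rng.normal(size=N)         # s_i^N(t+1)

# Monte Carlo estimate (using exchangeability of the agents) of the
# risk-sensitive cost E[exp(lam * sum_t beta^t m)] under this placeholder policy.
risk_sensitive_cost = np.exp(lam * accumulated).mean()
```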

Remark 4

Note that Assumption 1 also holds for finite models (i.e., \({\mathsf S}\), \({\mathsf A}\), and \({\mathsf Y}\) are finite) without any structure on the dynamics of the state and observation, provided the transition probability and the one-stage cost function are continuous with respect to the mean-field term. Moreover, Assumption 2-(a) holds if the transition probability and the one-stage cost function are Lipschitz continuous with respect to the mean-field term. In finite models, the only missing condition is the existence of a deterministic policy in the mean-field equilibrium. This can be established if we have the uniqueness condition in (17).

9 Conclusion

This paper has considered discrete-time finite-horizon partially observed risk-sensitive mean-field games. We have first constructed an equivalent game model whose states are the state of the original model plus the one-stage costs incurred up to that time. In this new model, the finite-horizon risk-sensitive cost function can be written in an additive form, as in the risk-neutral case. Then, letting the number of agents go to infinity, we have established the existence of a mean-field equilibrium in the limiting mean-field game problem. We have then shown that the policy in the mean-field equilibrium constitutes an approximate Nash equilibrium for similarly structured games with a sufficiently large number of agents. Finally, we have extended our results to the case of infinite-horizon cost functions.