
1 Introduction

In spite of decades-long research activity on stochastic differential games, there still remain some outstanding fundamental questions on existence, uniqueness, and characterization of non-cooperative equilibria when players have access to noisy state information. Even in zero-sum games, and with a common measurement channel that feeds noisy state information to both players, derivation of saddle-point policies is quite an intricate task, as first identified in Başar (1981). That paper also addressed the issue of whether saddle-point equilibria (SPE) in such games are of the certainty-equivalent (CE) type (Witsenhausen 1971a), that is, whether the solution of a similarly structured game but with perfect state measurements for both players can be used in the construction of SPE for the stochastic game with noisy measurements, by simply replacing the state with an appropriately constructed conditional estimate. The answer was a “cautious conditional yes,” in the sense that not all SPE are of the CE type, and when they are, many perils arise both in the construction of the conditional estimate and in the derivation of conditions for existence. This chapter picks up where Başar (1981) left off, and develops further insights into the intricacies and pitfalls in the derivation of SPE of the CE as well as non-CE types. It also provides a complete solution to a two-stage stochastic game of the linear-quadratic-Gaussian (LQG) type where the common measurement channel is not only noisy but also fails intermittently.

Research on stochastic differential games with noisy state measurements goes back to the 1960s, when two-person zero-sum games with linear dynamics and measurement equations, Gaussian statistics, and quadratic cost functions (that is, LQG games) were addressed for players with access to different measurements, albeit under some specific information structures (Behn and Ho 1968; Rhodes and Luenberger 1969; Willman 1969). A zero-sum differential game where one player’s information is nested in the other player’s was considered in Ho (1974), and a class of zero-sum dynamic games where one player has noisy state information while the other plays open loop was considered in Başar and Mintz (1973), which showed that the open-loop player’s saddle-point strategy is mixed. A class of zero-sum stochastic games where the information structure is of the nonclassical type was considered in Başar and Mintz (1972), which showed that some zero-sum games could be tractable even though their team counterparts, as in Witsenhausen (1968), Bansal and Başar (1987), and Ho (1980), are not; see also Başar (2008).

When a game is not of the zero-sum type, derivation of equilibria (which in this case would be Nash equilibria) is even more challenging, even when players have access to common noisy measurements, with or without delay, as discussed in Başar (1978a), where an indirect approach of the backward-forward type was developed and employed; see also Başar (1978b) for a different formulation and approach to the derivation. Recently, a new class of discrete-time nonzero-sum games with asymmetric information was introduced in Nayyar and Başar (2012), where the evolution of the local state processes depends only on the global state and control actions, and not on the current or past values of local states. For this class of games, it was possible to obtain a characterization of some Nash equilibria by lifting the game and converting it to a symmetric one, solving the symmetric one in terms of Markov equilibria, and then converting it back. Among many others, two other papers of relevance to stochastic nonzero-sum dynamic games are Altman et al. (2009) and Hespanha and Prandini (2001), and one of relevance to teams with delayed sharing patterns is Nayyar et al. (2011).

The chapter is organized as follows. In the next section, we introduce LQG zero-sum stochastic differential/dynamic games (ZSDGs) with common noisy measurements, first in continuous time and then in discrete time, and for the latter we also include the possibility of intermittent failure of the measurement channel (modeled through a Bernoulli process), leading to occasionally missing measurements. In that section we also introduce the concept of certainty equivalence, first in the context of the classical LQG optimal control problem and then generalized (in various ways) to the two classes of games formulated. In Sect. 3, we introduce a two-stage stochastic dynamic game, as a special case of the general discrete-time LQG game of Sect. 2, which is solved completely for its SPE in both pure and mixed strategies, some of the CE type and others non-CE (see Theorem 1 for the complete solution). Analysis of the two-stage game allows us to develop insight into the intricate role information structures play in the characterization and existence of SPE for the more general ZSDGs of Sect. 2, and into what CE means in a game context. This insight is used in Sect. 4 in the derivation of generalized CE SPE for the continuous-time LQG ZSDG with noisy state measurements (see Theorem 2 for the penultimate result), as well as for the discrete-time LQG ZS dynamic game with noisy state measurements and with perfect state measurements subject to intermittent losses. The chapter ends with a recap of the results and a discussion of extensions and open problems, in Sect. 5.

2 Zero-Sum Stochastic Differential and Discrete-Time Games with a Common Measurement Channel and Issue of Certainty Equivalence

2.1 Formulation of the Zero-Sum Stochastic Differential Game

We first consider the class of so-called Linear-Quadratic-Gaussian zero-sum differential games (LQG ZSDGs), where the two players’ actions are inputs to a linear system driven also by a Wiener process, and the players have access to the system state through a common noisy measurement channel which is also linear in the state and the driving Wiener noise process. The objective function, to be minimized by one player and maximized by the other, is quadratic in the state and the actions of the two players.

For a precise mathematical formulation, let \(\{x_t, y_t;\ t\geq0\}\) be, respectively, the n-dimensional state and m-dimensional measurement processes, generated by

$$\begin{aligned} dx_t =& (Ax_t + Bu_t + Dv_t)dt + F dw_t, \quad t\geq0, \end{aligned}$$
(1)
$$\begin{aligned} dy_t =&H x_t dt + G dw_t, \quad t\geq0, \end{aligned}$$
(2)

where \(\{u_t, t\geq0\}\) and \(\{v_t, t\geq0\}\) are respectively Player 1’s and Player 2’s controls (of dimensions \(r_1\) and \(r_2\), respectively), nonanticipative with respect to the measurement process, and generated by measurable control policies \(\{\gamma_t\}\) and \(\{\mu_t\}\), respectively, that is

$$ u_t = \gamma_t(y_{[0,t)}), \qquad v_t = \mu_t(y_{[0,t)}), \quad t \geq0. $$
(3)

In (1) and (2), \(x_0\) is a zero-mean Gaussian random vector with covariance \(\varLambda_0\) (that is, \(x_0\sim N(0,\varLambda_0)\)), \(\{w_t, t\geq0\}\) is a vector-valued standard Wiener process independent of \(x_0\), and A, B, D, F, H, G are constant matrices of appropriate dimensions, with (to avoid singularity) \(FF^T>0\), \(GG^T>0\), and \(FG^T=0\), where the last condition assures that system and channel noises are independent. We let Γ and \(\mathcal{M}\) denote the classes of admissible control policies for Player 1 and Player 2, respectively, with elements \(\gamma:=\{\gamma_t\}\) and \(\mu:=\{\mu_t\}\), as introduced earlier. The only restriction on these policies is that when (3) is used in (1), we have unique second-order stochastic process solutions to (1) and (2), with almost surely continuous sample paths. Measurability and uniform Lipschitz continuity will be sufficient for this purpose.

To complete the formulation of the differential game, we now introduce a quadratic performance index over a finite interval [0,t f ]:

$$ J(\gamma, \mu) = E \biggl\{ |x_{t_f}|^2_{Q_f} + \int^{t_f}_0 \bigl[ |x_t|^2_Q + \lambda|u_t|^2 - |v_t|^2 \bigr] dt \Big\vert u=\gamma(\cdot), v = \mu(\cdot ) \biggr\} , $$
(4)

where the expectation \(E\{\cdot\}\) is over the statistics of \(x_0\) and \(\{w_t\}\); further, \(|x|^{2}_{Q} := x^{T}Qx\), Q and \(Q_f\) are non-negative definite matrices, and λ>0 is a scalar parameter. Note that any objective function with nonuniform positive weights on u and v can be brought into the form above by a simple rescaling and re-orientation of u and v and a corresponding transformation applied to B and D, and hence the structure in (4) does not entail any loss of generality as a quadratic performance index.

The problem of interest in the context of LQG ZSDGs is to find conditions for existence and characterization of saddle-point strategies, that is \((\gamma^{*}\in \varGamma, \mu^{*}\in\mathcal{M})\) such that

$$ J\bigl(\gamma^*, \mu\bigr) \leq J\bigl(\gamma^*, \mu^*\bigr) \leq J\bigl(\gamma, \mu^*\bigr), \quad\forall \gamma\in\varGamma, \mu\in\mathcal{M}. $$
(5)

A question of particular interest in this case is whether the saddle-point equilibrium (SPE) has the certainty-equivalence property, that is, whether it can be obtained directly from the perfect state-feedback SPE of the corresponding deterministic differential game, by simply replacing the state by its “best estimate,” as in the one-player version, the so-called LQG optimal control problem. This will be discussed later in the section.

If a saddle-point equilibrium (SPE) does not exist, then the next question is whether the upper value of the game is bounded, and whether there exists a control strategy for the minimizer that achieves it, that is, existence of a \(\bar{\gamma}\in\varGamma\) such that

$$ \inf_\gamma \sup_\mu J( \gamma, \mu) = \sup_\mu J(\bar{\gamma}, \mu). $$
(6)

Note that the lower value of the game, \(\sup_\mu \inf_\gamma J(\gamma,\mu)\), is always bounded from below (indeed, it is nonnegative, as can be seen by picking v≡0), and hence its finiteness is not an issue.

2.2 Formulation of the Discrete-Time Zero-Sum Stochastic Dynamic Game with Failing Channels

A variation on the LQG ZSDG is its discrete-time version, which will allow us also to introduce intermittent failure of the common measurement channel. The system equation (1) is now replaced by

$$ x_{t+1}= Ax_t + Bu_t + Dv_t + F w_t, \quad t= 0, 1, \ldots, $$
(7)

and the measurement equation (2) by

$$ y_t= \beta_t (Hx_t + G w_t), \quad t= 0, 1, \ldots, $$
(8)

where \(x_0\sim N(0,\varLambda_0)\); \(\{w_t\}\) is a zero-mean Gaussian process, independent across time and of \(x_0\), with \(E\{w_{t}w_{t}^{T}\} = I\), ∀t∈[0,T−1]:={0,1,…,T−1}; and \(\{\beta_t\}\) is a Bernoulli process, independent across time and of \(x_0\) and \(\{w_t\}\), with \(\operatorname{Probability}(\beta_{t} = 0) = p\), ∀t. This means that the channel that carries information on the state to the players fails with probability p at each stage, and these failures are statistically independent. A different expression for (8), which captures essentially the same situation, would be

$$ y_t= \beta_t Hx_t + G w_t, \quad t= 0, 1, \ldots, $$
(9)

where what fails is the sensor that carries the state information to the channel, and not the channel itself. In this case, when \(\beta_t=0\), the channel carries only pure noise, which of course is of no use to the controllers.

Now, if the players are aware of the failure of the channel or the sensor when it happens (which we assume to be the case), then what replaces (3) is

$$ u_t = \gamma_t(y_{[0,t]}, \beta_{[0,t]}), \qquad v_t = \mu_t(y_{[0,t]}, \beta_{[0,t]}), \quad t = 0, 1, \ldots, $$
(10)

where \(\{\gamma_t\}\) and \(\{\mu_t\}\) are measurable control policies; let us again denote the spaces to which they belong by \(\varGamma\) and \(\mathcal{M}\), respectively.

The performance index replacing (4) for the discrete-time game, over the interval {0,1,…,T−1}, is

$$ J(\gamma, \mu) = E \Biggl\{ \sum_{t=0}^{T-1} \bigl[ |x_{t+1}|^2_Q + \lambda|u_t|^2 - |v_t|^2 \bigr] \Big\vert u=\gamma(\cdot), v = \mu ( \cdot) \Biggr\} , $$
(11)

where the expectation is over the statistics of \(x_0\), \(\{w_t\}\), and \(\{\beta_t\}\).
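
Before proceeding, a minimal simulation sketch of the dynamics (7), the failing channel (8), and the information pattern (10) may help fix ideas. Everything in it (the function name, the policy callables gamma and mu, and any parameter values) is an illustrative assumption, not an object defined in the text.

```python
# Minimal simulation sketch (assumed parameters) of (7)-(8) and (10).
import numpy as np

def simulate(A, B, D, F, H, G, Lam0, p, T, gamma, mu, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal(np.zeros(A.shape[0]), Lam0)  # x_0 ~ N(0, Lam0)
    ys, betas, traj = [], [], []
    for t in range(T):
        w = rng.standard_normal(F.shape[1])
        beta = 0 if rng.random() < p else 1        # channel fails w.p. p
        y = beta * (H @ x + G @ w)                 # measurement (8)
        ys.append(y); betas.append(beta)
        u = gamma(t, ys, betas)                    # policies use the full
        v = mu(t, ys, betas)                       # histories, as in (10)
        x = A @ x + B @ u + D @ v + F @ w          # state update (7)
        traj.append((x, u, v))                     # for evaluating (11)
    return traj
```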

The goal is again the one specified for the LQG ZSDG—to study existence and characterization of SPE (defined again by (5), appropriately interpreted), boundedness of the upper value when a SPE does not exist, and the certainty-equivalence property of the SPE. We first recall below the certainty-equivalence property of the standard LQG optimal control problem, which is a special case of the LQG ZSDG obtained by leaving out the maximizer, that is, by setting D=0. We discuss only the continuous-time case; an analogous result holds for the discrete-time case (Witsenhausen 1971a; Yüksel and Başar 2013).

2.3 The LQG Optimal Control Problem and Certainty Equivalence

Consider the LQG optimal control problem, described by the linear state and measurement equations

$$\begin{aligned} dx_t =& (Ax_t + Bu_t)dt + F dw_t, \quad t\geq0, \end{aligned}$$
(12)
$$\begin{aligned} dy_t =&H x_t dt + G dw_t, \quad t\geq0, \end{aligned}$$
(13)

and the quadratic cost function

$$ J(\gamma) = E \biggl\{ |x_{t_f}|^2_{Q_f} + \int^{t_f}_0 \bigl[ |x_t|^2_Q + \lambda|u_t|^2 \bigr] dt \Big\vert u=\gamma(\cdot) \biggr\} , $$
(14)

where F and G satisfy the earlier conditions, and as before \(\gamma\in\varGamma\).

It is a standard result in stochastic control (Fleming and Soner 1993) that there exists a unique \(\gamma^*\in\varGamma\) that minimizes J(γ) defined by (14), and \(\gamma_{t}^{*}(y_{[0,t)})\) is linear in \(y_{[0,t)}\). Specifically,

$$ u^*(t) = \gamma_t^*(y_{[0,t)}) = {\tilde{\gamma}}_t(\hat{x}_t) = -{1\over\lambda} B^T P(t)\hat{x}_t, \quad t\geq0, $$
(15)

where P is the unique non-negative definite solution of the retrograde Riccati differential equation

$$ \dot{P} + PA + A^TP - {1\over\lambda} PBB^TP + Q =0, \quad P(t_f) = Q_f, $$
(16)

and \(\{\hat{x}_{t}\}\) is generated by the Kalman filter:

$$\begin{aligned} d\hat{x}_t =& (A\hat{x}_t + Bu_t)dt + K(t) (dy_t - H\hat{x}_tdt), \quad {\hat{x}}_0 = 0, t\geq0, \end{aligned}$$
(17)
$$\begin{aligned} K(t) =& \varLambda(t) H^T \bigl[GG^T \bigr]^{-1} \end{aligned}$$
(18)

with Λ being the unique non-negative definite solution of the forward Riccati differential equation

$$ \dot{\varLambda} - A\varLambda- \varLambda A^T + \varLambda H^T \bigl[GG^T\bigr]^{-1}H \varLambda- FF^T =0, \quad \varLambda(0) = \varLambda_0. $$
(19)

Note that this is a certainty-equivalent (CE) controller, because it has the structure of the optimal controller for the deterministic problem, that is \(-{1\over \lambda} B^{T} P(t) x_{t}\), with the state \(x_t\) replaced by its conditional mean, \(E[x_t | y_{[0,t)}, u_{[0,t)}]\), which is given by (17). The controller gain \(-{1\over\lambda} B^{T} P(t)\) is constructed independently of what the estimator does, while the estimation (filtering) is likewise an essentially independent process, except that the past values of the control are taken as input to the Kalman filter. Hence, in a sense we have a separation of estimation and control, but not complete decoupling. In that sense, we can say that the controller has to cooperate with the estimator, as the latter needs access to the output of the control box for the construction of the conditional mean. Of course, an alternative representation for (17) would be the one where the optimal controller is substituted in place of u:

$$ d\hat{x}_t = \biggl(\biggl(A - {1\over\lambda}BB^TP(t) \biggr)\hat{x}_t\biggr)dt + K(t) (dy_t - H \hat{x}_t dt), \quad {\hat{x}}_0 = 0, t\geq0, $$
(20)

but in this representation also there is a need for collaboration or sharing of information, since the estimator has to have access to P(⋅) or the cost parameters that generate it. Hence, the solution to the LQG problem has an implicit cooperation built into it, but this does not create any problem or difficulty in this case, since the estimator and the controller are essentially a single unit.
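
As an illustration of this separation, the following numerical sketch (ours, with assumed illustrative parameters) computes the two decoupled ingredients of the CE controller: the control gain of (15)–(16) by a backward Euler pass, and the Kalman gain of (18)–(19) by a forward pass.

```python
# Sketch: Euler integration of the control Riccati equation (16) backward
# and of the filter covariance equation (19) forward; assumed parameters.
import numpy as np

def lqg_ce_gains(A, B, F, H, G, Q, Qf, lam, Lam0, tf, n=1000):
    dt = tf / n
    Rinv = np.linalg.inv(G @ G.T)
    # Backward pass for P(t) of (16); control gain -(1/lam) B^T P(t) of (15).
    P = Qf.astype(float)
    ctrl = [None] * (n + 1)
    ctrl[n] = -(1.0 / lam) * B.T @ P
    for k in range(n, 0, -1):
        P = P + dt * (P @ A + A.T @ P - (1.0 / lam) * P @ B @ B.T @ P + Q)
        ctrl[k - 1] = -(1.0 / lam) * B.T @ P
    # Forward pass for Lambda(t) of (19); Kalman gain K(t) of (18).
    Lam = Lam0.astype(float)
    kal = []
    for k in range(n + 1):
        kal.append(Lam @ H.T @ Rinv)
        Lam = Lam + dt * (A @ Lam + Lam @ A.T
                          - Lam @ H.T @ Rinv @ H @ Lam + F @ F.T)
    return ctrl, kal   # ctrl[k], kal[k] approximate the gains at t = k*dt
```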

2.4 The LQG ZSDG and Certainty Equivalence

Now we move on to the continuous-time (CT) LQG ZSDG, to obtain a CE SPE, along the lines of the LQG control problem discussed above. The corresponding deterministic LQ ZSDG, where both players have access to perfect state measurements, admits the state-feedback SPE (Başar and Olsder 1999):

$$\begin{aligned} u^*(t) =& \gamma_t^*(x_{[0,t]}) = {\tilde{\gamma}}_t({x}_t) = -{1\over \lambda} B^T Z(t) x_t, \quad t\geq0, \end{aligned}$$
(21)
$$\begin{aligned} v^*(t) =& \mu _t^*(x_{[0,t]}) = {\tilde{\mu}}_t({x}_t) = D^T Z(t) x_t, \quad t\geq0, \end{aligned}$$
(22)

where Z is the unique non-negative definite continuously differentiable solution of the following Riccati differential equation (RDE) over the interval [0,t f ]:

$$ \dot{Z} + A^TZ + Z A - Z \biggl({1\over\lambda}BB^T - DD^T\biggr) Z +Q =0, \quad Z(t_f) = Q_f. $$
(23)

Existence of such a solution (equivalently, nonexistence of a conjugate point in the interval \((0,t_f)\), or no finite escape) to the RDE (23) is also a necessary condition for existence of any SPE (Başar and Bernhard 1995), in the sense that even if either (or both) of the players use memory on the state, the condition above cannot be relaxed further. This conjugate-point condition translates, in this case, into a condition on λ, in the sense that there exists a critical value of λ, say \(\lambda^*\) (which will depend on the parameters of the game and the length of the time horizon, and could actually be any value in (0,∞)), such that for each \(\lambda\in(0,\lambda^*)\), the pair (21)–(22) provides a SPE to the corresponding deterministic ZSDG.
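
The conjugate-point condition can be probed numerically. The sketch below (our own construction, with assumed parameters) integrates (23) backward, declares a conjugate point upon finite escape, and bisects on λ to approximate the critical value \(\lambda^*\).

```python
# Sketch: finite-escape detection for the game RDE (23), plus bisection
# on lambda for the critical value lambda^*; illustrative parameters.
import numpy as np

def conjugate_point_free(A, B, D, Q, Qf, lam, tf, n=2000, blowup=1e9):
    dt = tf / n
    Z = Qf.astype(float)
    M = (1.0 / lam) * B @ B.T - D @ D.T
    for _ in range(n):
        Z = Z + dt * (A.T @ Z + Z @ A - Z @ M @ Z + Q)   # backward Euler step
        if not np.all(np.isfinite(Z)) or np.linalg.norm(Z) > blowup:
            return False          # finite escape: conjugate point in (0, tf)
    return True

def critical_lambda(A, B, D, Q, Qf, tf, lo=1e-6, hi=1e3, iters=60):
    # A SPE exists for lam < lambda^*; bisect on that threshold structure.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if conjugate_point_free(A, B, D, Q, Qf, mid, tf):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```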

Now, if a natural counterpart of the CE property of the solution to the LQG optimal control problem would hold for the LQG ZSDG, we would have as SPE:

$$\begin{aligned} u^*(t) =& \gamma_t^*(y_{[0,t)}) = {\tilde{\gamma}}_t({\hat{x}}_t) = -{1\over\lambda} B^T Z(t) {\hat{x}}_t, \quad t\geq0, \end{aligned}$$
(24)
$$\begin{aligned} v^*(t) =& \mu_t^*(y_{[0,t)}) = {\tilde{\mu}}_t({\hat{x}}_t) = D^T Z(t) {\hat{x}}_t, \quad t\geq0, \end{aligned}$$
(25)

where

$${\hat{x}}_t := E\bigl[x_t | y_{[0,t)}, \bigl\{ u_s = \gamma^*_s(y_{[0,s)}), v_s = \mu^*_s(y_{[0,s)}), 0\leq s < t\bigr\} \bigr] $$

is generated by (as counterpart of (20)):

$$ d\hat{x}_t = {\hat{A}}\hat{x}_tdt + K(t) (dy_t - H\hat{x}_t dt),\quad {\hat{x}}_0 = 0, t\geq0, $$
(26)

where

$$ {\hat{A}} := A - \biggl({1\over\lambda}BB^T-DD^T \biggr)Z(t) $$
(27)

and K is the Kalman gain, given by (18), with Λ now solving

$$ \dot{\varLambda} - {\hat{A}}\varLambda- \varLambda{\hat{A}}^T + \varLambda H^T \bigl[GG^T \bigr]^{-1}H \varLambda- FF^T =0, \quad \varLambda(0) = \varLambda_0. $$
(28)

The question now is whether the strategy pair (γ ,μ ) above constitutes a SPE for the LQG ZSDG, that is whether it satisfies the pair of inequalities (5). We will address this question in Sect. 4, after discussing in the next section some of the intricacies certainty equivalence entails, within the context of a two-stage discrete-time stochastic dynamic game. But first, we provide in the subsection below the counterpart of the main result of this subsection for the discrete-time dynamic game.

2.5 The LQG Discrete-Time ZS Dynamic Game and Certainty Equivalence

Consider the discrete-time (DT) LQG ZS dynamic game (DG) formulated in Sect. 2.2, but with non-failing channels (that is, with p=0). We provide here a candidate CE SPE for this game, by following the lines of the previous subsection, but in discrete time. First, the corresponding deterministic LQ ZSDG, where both players have access to perfect state measurements admits the state-feedback SPE (Başar and Olsder 1999) (as counterpart of (21)–(22))

$$\begin{aligned} u^*_t =& \gamma_t^*(x_{[0,t]}) = {\tilde{\gamma}}_t({x}_t) = -{1\over\lambda} B^T Z_{t+1}\bigl(N_t^{-1} \bigr)^{T} A x_t, \quad t= 0, 1, \ldots, \end{aligned}$$
(29)
$$\begin{aligned} v^*_t =& \mu_t^*(x_{[0,t]}) = { \tilde{\mu}}_t({x}_t) = D^T Z_{t+1} \bigl(N_t^{-1}\bigr)^{T} A x_t, \quad t= 0, 1, \ldots, \end{aligned}$$
(30)

where

$$ N_t=I+ \biggl({1\over\lambda} BB^T - DD^T \biggr) Z_{t+1}, \quad t=0, 1, \ldots, $$
(31)

and Z t is a non-negative definite matrix, generated by the following discrete-time game Riccati equation (DTGRE):

$$ Z_t = Q + A^T Z_{t+1} \bigl(N_t^{-1}\bigr)^T A, \quad Z_T = Q. $$
(32)

Under the additional condition

$$ I - D^T Z_{t+1}D > 0, \quad t=0, 1, \ldots, T-1, $$
(33)

which also guarantees invertibility of \(N_t\), the pair (29)–(30) constitutes a SPE. If, on the other hand, the matrix on the left-hand side of (33) has a negative eigenvalue for some t, then the upper value of the game is unbounded (Başar and Bernhard 1995). As with the CT conjugate-point condition, the condition (33) translates into a condition on λ, in the sense that there exists a critical value of λ, say \(\lambda_c\) (which will depend on the parameters of the game and the number of stages in the game), such that for each \(\lambda\in(0,\lambda_c)\), the pair (29)–(30) provides a SPE to the corresponding deterministic ZSDG.
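
The recursion (31)–(32), together with the check (33), is straightforward to implement. The following sketch (our naming, assumed parameters) returns the SPE gains of (29)–(30) when (33) holds at every stage, and None otherwise.

```python
# Sketch of the backward recursion (31)-(32) with the existence check (33).
import numpy as np

def dt_game_spe(A, B, D, Q, lam, T):
    n = A.shape[0]
    Z = Q.astype(float)                           # Z_T = Q, cf. (32)
    gains = []
    for t in range(T - 1, -1, -1):
        if np.min(np.linalg.eigvalsh(np.eye(D.shape[1]) - D.T @ Z @ D)) <= 0:
            return None                           # condition (33) fails
        N = np.eye(n) + ((1.0 / lam) * B @ B.T - D @ D.T) @ Z    # (31)
        NinvT = np.linalg.inv(N).T
        Ku = -(1.0 / lam) * B.T @ Z @ NinvT @ A   # u_t = Ku x_t, cf. (29)
        Kv = D.T @ Z @ NinvT @ A                  # v_t = Kv x_t, cf. (30)
        gains.append((t, Ku, Kv))
        Z = Q + A.T @ Z @ NinvT @ A               # cf. (32)
    return gains[::-1]
```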

Now, the counterpart of (24)–(25), as a candidate CE SPE, would be

$$\begin{aligned} u^*_t =& \gamma_t^*(y_{[0,t]}) = {\tilde{\gamma}}_t({\hat{x}}_{t|t} ) = -{1\over\lambda} B^T Z_{t+1}\bigl(N_t^{-1} \bigr)^T A {\hat{x}}_{t|t}, \quad t= 0, 1, \ldots, \end{aligned}$$
(34)
$$\begin{aligned} v^*_t =& \mu_t^*(y_{[0,t]}) = { \tilde{\mu}}_t({\hat{x}}_{t|t}) = D^T Z_{t+1} \bigl(N_t^{-1}\bigr)^T A {\hat{x}}_{t|t}, \quad t= 0, 1, \ldots, \end{aligned}$$
(35)

where

$${\hat{x}}_{t|t} := E \bigl[x_t | y_{[0,t]}, \bigl\{ u_s = \gamma^*_s(y_{[0,s]}), v_s = \mu^*_s(y_{[0,s]}), s = 0, \ldots, t-1\bigr\} \bigr] $$

is generated by, with \({\hat{x}}_{0|-1} = 0\):

$$ \begin{aligned} {\hat{x}}_{t|t} ={} &{\hat{x}}_{t|t-1} + \varLambda_t H^T \bigl(H\varLambda_tH^T + GG^T \bigr)^{-1} (y_t - H {\hat{x}}_{t|t-1} ) \\ {\hat{x}}_{t+1|t} ={}& (N_t)^{-1} A {\hat{x}}_{t|t-1} \\ &{}+ (N_t)^{-1} A\varLambda_t H^T \bigl(H\varLambda_tH^T + GG^T \bigr)^{-1} (y_t - H {\hat{x}}_{t|t-1} ), \end{aligned} $$
(36)

where the sequence \(\{\varLambda_t, t=1,2,\ldots,T\}\) is generated by

$$ \begin{aligned} \varLambda_{t+1} ={}& (N_t)^{-1} A \varLambda_t \bigl[ I - H^T \bigl(H\varLambda_tH^T + GG^T \bigr)^{-1} H\varLambda_t \bigr] A^T \bigl((N_t)^{-1} \bigr)^T \\ &{}+ FF^T, \end{aligned} $$
(37)

with the initial condition \(\varLambda_0\) being the covariance of \(x_0\).
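
For concreteness, one step of the recursions (36)–(37) may be coded as follows (a sketch; the inputs Ninv = \(N_t^{-1}\) and FFt = \(FF^T\) are assumed precomputed):

```python
# One step of the filter (36) and covariance recursion (37); a sketch.
import numpy as np

def ce_filter_step(xpred, Lam, y, A, H, GGt, FFt, Ninv):
    """Return (xhat_{t|t}, xhat_{t+1|t}, Lambda_{t+1})."""
    Sinv = np.linalg.inv(H @ Lam @ H.T + GGt)
    xfilt = xpred + Lam @ H.T @ Sinv @ (y - H @ xpred)   # first line of (36)
    xnext = Ninv @ A @ xfilt                             # second line of (36)
    Lnext = (Ninv @ A @ Lam @ (np.eye(len(Lam)) - H.T @ Sinv @ H @ Lam)
             @ A.T @ Ninv.T + FFt)                       # (37)
    return xfilt, xnext, Lnext
```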

Now, if instead of the noisy channel, we have intermittent failure of a channel which otherwise carries perfect state information, modeled as in Sect. 2.2 but with clean transmission when the channel operates (failure being according to an independent Bernoulli process as before), then as a candidate CE SPE, the pair (34)–(35) is replaced by

$$\begin{aligned} u^*_t =& \gamma_t^*(y_{[0,t]}) = {\tilde{\gamma}}_t({\zeta}_{t}) = -{1\over\lambda} B^T Z_{t+1}\bigl(N_t^{-1} \bigr)^T A \zeta_t, \quad t= 0, 1, \ldots, \end{aligned}$$
(38)
$$\begin{aligned} v^*_t =& \mu_t^*(y_{[0,t]}) = { \tilde{\mu}}_t(\zeta_t) = D^T Z_{t+1} \bigl(N_t^{-1}\bigr)^T A \zeta_t, \quad t= 0, 1, \ldots, \end{aligned}$$
(39)

where the stochastic sequence \(\{\zeta_t, t=0,1,\ldots\}\) is generated by

$$ \zeta_{t} = \beta_{t} y_{t} + (1-\beta_t) \biggl( I-\biggl({1\over\lambda}B B^T -DD^T\biggr)Z_{t+1} \bigl(N_t^{-1} \bigr)^T \biggr)A\zeta_{t-1}, \quad \zeta_0 = y_0. $$
(40)
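
A single step of (40) can be written down directly. In the sketch below (our naming), Z_next = \(Z_{t+1}\) and Ninv = \(N_t^{-1}\) are assumed precomputed; the bracketed matrix is the closed-loop map obtained by substituting (38)–(39) into (7).

```python
# One step of the estimator recursion (40); a sketch with assumed inputs.
import numpy as np

def zeta_step(zeta_prev, y, beta, A, B, D, Z_next, Ninv, lam):
    closed_loop = (np.eye(A.shape[0])
                   - ((1.0 / lam) * B @ B.T - D @ D.T) @ Z_next @ Ninv.T) @ A
    return beta * y + (1 - beta) * closed_loop @ zeta_prev   # cf. (40)
```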

We will explore in Sect. 4 whether these CE policies are in fact SPE policies, and under what conditions.

3 A Two-Stage Discrete-Time Game with Common Measurement

To demonstrate explicitly that certainty equivalence generally fails in games (but holds in a restricted sense), we consider here a specific 2-stage version of the formulation (7), (8), (11):

$$\begin{aligned} x_2 =&x_1-u+w_1 ;\qquad x_1=x_0+2v+w_0, \end{aligned}$$
(41)
$$\begin{aligned} y_1 =&\beta_1 (x_1 + r_1) ;\qquad y_0 = \beta_0 (x_0 + r_0), \end{aligned}$$
(42)
$$\begin{aligned} J(\gamma,\mu) =& E\bigl\{ (x_2)^2 + \lambda u^2 - v^2 | u=\gamma(\cdot), v=\mu(\cdot)\bigr\} , \end{aligned}$$
(43)

with \(u=\gamma(y_1,y_0;\beta_1,\beta_0)\), \(v=\mu(y_0;\beta_0)\), where the random variables \(x_0, w_0, w_1, r_0, r_1\) are independent Gaussian with zero mean and unit variance, and \(\beta_1, \beta_0\) are independent Bernoulli random variables with \(\operatorname{Probability}(\beta_t = 0) = p\), for t=0,1.

3.1 Certainty-Equivalent SPE

The deterministic version of the game above, with \(u=\gamma(x_1,x_0)\), \(v=\mu(x_0)\), admits a unique saddle-point solution (Başar and Olsder 1999), given by

$$ \gamma^*(x_1, x_0) = {1\over1+\lambda} x_1, \qquad \mu^*(x_0) = {2\lambda\over1-3\lambda} x_0, $$
(44)

whenever

$$ 0< \lambda< {1\over3}, $$
(45)

and for λ>1/3, the upper value is unbounded.

Now, a certainty-equivalent (CE) SPE for the original stochastic game, if it exists, would be one obtained from the SPE of the related deterministic dynamic game, as above, by replacing \(x_0\) and \(x_1\) by their conditional means, which in the case of \(x_1\) requires the SP policy at the earlier stage (that is, stage 0). Carrying this out, we have

$$ \mu^*(y_0; \beta_0) = {2\lambda\over1-3\lambda} E[x_0|y_0; \beta_0] = {\lambda\over1-3\lambda} y_0, $$
(46)

and

$$\begin{aligned} \gamma^*(y_{[0, 1]}; \beta_{[0,1]}) =& {1\over1+\lambda} E \bigl[x_1|y_1, y_0; \beta_1, \beta_0; v=\mu^*(y_0; \beta_0) \bigr] \\ =& {1\over1+\lambda} \biggl[\beta_1 \biggl( {2\over3}y_1 - {3\over10} y_0 - {6\over5} \mu^*(y_0; \beta_0) \biggr) \\ & {}+ \beta_0 \biggl( -{1\over15}y_1 + {1\over2} y_0 +2 \mu^*(y_0; \beta_0) \biggr) \biggr]. \end{aligned}$$
(47)

Note that if the channel does not fail at all (that is, \(\beta_0=\beta_1=1\)), then one has the simpler expression for (47):

$$ \gamma^*(y_1, y_0)= {3\over5(1+\lambda)} y_1 + {1\over5(1-3\lambda )} y_0. $$
(48)

3.2 Analysis for p=0 for CE SPE and Beyond

We assume in this subsection that p=0, in which case the CE SPE (whose SPE property is yet to be verified) is given by (46) and (48). It is easy to see that J(γ*,v), with γ* as in (48), is unbounded in v unless λ<3/25, which means that the CE pair (46), (48) cannot be a SPE for λ∈[3/25,1/3), even though the pair (44) was one for the deterministic game. For the interval λ∈(0,3/25), however, the CE pair (46), (48) is a SPE for the stochastic game, as it can easily be shown to satisfy the pair of inequalities (5). Further, for this case, since μ* is the unique maximizer of J(γ*,μ), and γ* is the unique minimizer of J(γ,μ*), it follows from the ordered interchangeability property of multiple SP equilibria (Başar and Olsder 1999) that the SPE is unique. Hence, for the parametrized stochastic dynamic game, a “restricted” CE property holds—restricted to only some values of the parameter.
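
The threshold λ=3/25 can be confirmed symbolically. The following sketch (ours, using sympy) computes the coefficient of \(v^2\) in J(γ*,v), with γ* as in (48), and recovers λ=3/25 as the point where strict concavity in v is lost.

```python
# Symbolic check: coefficient of v^2 in J(gamma*, v) vanishes at lam = 3/25.
import sympy as sp

lam, v = sp.symbols('lambda v', positive=True)
noises = x0, w0, w1, r0, r1 = sp.symbols('x0 w0 w1 r0 r1')

x1 = x0 + 2*v + w0                                            # cf. (41)
u = 3/(5*(1 + lam))*(x1 + r1) + 1/(5*(1 - 3*lam))*(x0 + r0)   # policy (48)
x2 = x1 - u + w1
poly = sp.Poly(sp.expand(x2**2 + lam*u**2 - v**2), *noises)
# Expectation over iid N(0,1): keep the constant monomial and each s^2 term.
J = poly.coeff_monomial(1) + sum(poly.coeff_monomial(s**2) for s in noises)
c2 = sp.simplify(sp.diff(J, v, 2) / 2)                 # coefficient of v^2
print(sp.solve(sp.Eq(c2, 0), lam))                     # expect [3/25]
```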

Now the question is whether the parametrized stochastic game admits a SPE for λ∈[3/25,1/3). Clearly, it cannot be a CE SPE; that is, the SPE of the deterministic version cannot be used to obtain it. Note that, for λ∈[3/25,1/3), if one picks \(\gamma(y_1,y_0)=[1/(1+\lambda)]y_1\) in J(γ,μ), then the maximum of this function with respect to μ exists and is bounded, which means that the upper value of the stochastic game is bounded. Its lower value is also clearly bounded (simply pick v=0).

Again working with the special case p=0 (that is, no failure of the noisy channels), we now claim that there indeed exists a SPE for λ∈[3/25,1/3), but it entails a mixed strategy for the maximizer (Player 2) and still a pure strategy for the minimizer (Player 1). These are:

$$ \begin{aligned} v&= \mu^*(y_0) = {2\lambda\over1-3\lambda} E[x_0|y_0] + \xi={\lambda\over1-3\lambda} y_0 + \xi, \\ \xi&\sim N\bigl(0, \sigma^2\bigr),\quad \sigma^2 = {4-5\sqrt{1-3\lambda} \over8 \sqrt{1-3\lambda}}, \end{aligned} $$
(49)

and

$$\begin{aligned} u =&\gamma^*(y_1, y_0) = {1\over1+\lambda} E\bigl[x_1|y_1, y_0, v=\mu^*(y_0)\bigr] \\ =& {2-\sqrt{1-3\lambda} \over2(1+\lambda)} y_1 + {1\over4\sqrt {1-3\lambda}} y_0. \end{aligned}$$
(50)

First note that \(\sigma^2>0\) for λ∈(3/25,1/3) and \(\sigma^2=0\) at λ=3/25, and further that the policies (49)–(50) agree with (46)–(47) at λ=3/25; hence the transition from the CE SPE to the non-CE one is continuous at λ=3/25. Now, the derivation of (49)–(50) as a SPE uses the conditional equalizer property of the minimizer’s policy (that is, (50)). One constructs a policy γ for the minimizer, under which (that is, with \(u=\gamma(y_0,y_1)\)) the conditional cost

$$E \bigl\{ (x_2)^2 + \lambda u^2 - v^2 | y_0 \bigr\} $$

becomes independent of v, and (50) is such a policy; it is in fact the unique such policy in the linear class. Hence, any choice of μ, broadened to include also mixed policies, would be a maximizing policy for Player 2, and (49) is one such policy. This establishes the left-hand-side inequality in (5). For the right-hand-side inequality, it suffices to show that (50) minimizes J(γ,μ ); this is in fact a strictly convex LQG optimization problem, whose unique solution is (50). Because of this uniqueness, and ordered interchangeability of multiple SPE (Başar and Olsder 1999), the SPE (49)–(50) is unique.
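
The equalizer property of (50) can likewise be verified symbolically. The sketch below (ours) computes the cost conditioned on \(y_0\), using \(x_0\,|\,y_0\sim N(y_0/2, 1/2)\), and checks that the result does not depend on v.

```python
# Symbolic check of the conditional-equalizer property of (50) at p = 0.
import sympy as sp

lam = sp.symbols('lambda', positive=True)
v, Y0 = sp.symbols('v y0')
e, w0, w1, r1 = sp.symbols('e w0 w1 r1')            # conditional noises
var = {e: sp.Rational(1, 2), w0: 1, w1: 1, r1: 1}   # conditional variances

s = sp.sqrt(1 - 3*lam)
a1, a0 = (2 - s)/(2*(1 + lam)), 1/(4*s)             # the gains in (50)
x1 = Y0/2 + e + 2*v + w0                            # x0 = y0/2 + e given y0
u = a1*(x1 + r1) + a0*Y0
x2 = x1 - u + w1
poly = sp.Poly(sp.expand(x2**2 + lam*u**2 - v**2), e, w0, w1, r1)
cost = poly.coeff_monomial(1) + sum(var[n]*poly.coeff_monomial(n**2)
                                    for n in var)   # E{ . | y0 }
print(sp.simplify(sp.diff(cost, v)))                # expect 0: v annihilated
```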

3.3 The Case p>0

We now turn to the case where the channels fail with positive probability, for which a candidate pair of SPE policies, based on CE, was given by (46)–(47). Their SPE property, as well as the range of values of λ for which they constitute a SPE, is yet to be established; we address this in this subsection. Toward that end, let us first consider, as a benchmark, the special case with a noise-free (but still failure-prone) measurement channel. This corresponds to the formulation where (42) is replaced by

$$y_1 = \beta_1 x_1 ;\qquad y_0 = \beta_0 x_0. $$

The counterpart of (46)–(47) in this case would be (this can also be obtained directly from the perfect-state SPE):

$$\begin{aligned} v =& \mu^*(y_0; \beta_0) = {2\lambda\over1-3\lambda} y_0, \end{aligned}$$
(51)
$$\begin{aligned} u =&\gamma^*(y_{[0, 1]}; \beta_{[0,1]}) = \beta_1{1\over1+\lambda} y_1 + (1- \beta_1) {1\over1+\lambda} \bigl[y_0+2 \mu^*(y_0) \bigr]. \end{aligned}$$
(52)

Now, if Player 2 employs (51), then the unique response of Player 1 will be \((1/(1+\lambda))x_1\) for \(\beta_1=1\), and \((1/(1+\lambda))E[x_1|x_0]=(1/(1+\lambda))(x_0+2\mu^*(x_0))\) if \(\beta_1=0\) and \(\beta_0=1\), which agrees with (52); if both β’s are zero, then clearly Player 1’s response will also be zero. Note that the responses by Player 1 in all these cases are unique.

If Player 1 employs (52), then the conditional cost (conditioned on \(y_0, \beta_0\)) seen and to be maximized by Player 2 is:

For \(\beta_0=1\) (after some simplifications):

$$\begin{aligned} &1+p + (1-p){\lambda\over1+\lambda}\bigl[1+(x_0+2v)^2 \bigr] + p \biggl[{\lambda\over1+\lambda} x_0 + 2v - {2\over1+\lambda}\mu ^*(y_0) \biggr]^2 \\ &\quad{}+ {\lambda\over(1+\lambda)^2} \bigl[x_0 + 2 \mu^*(y_0) \bigr]^2 p - v^2 \end{aligned}$$

and for \(\beta_0=0\) (after some simplifications):

$$ 1+p + {\lambda\over 1+\lambda} (2-p) + {3\lambda+ 4p -1\over 1+\lambda} v^2. $$
(54)

Both (53) and (54) are strictly concave in v if and only if

$$ p < {1\over4} \quad\mbox{and}\quad \lambda< {1-4p \over3}, $$
(55)

in which case the unique maximizing solution to (54) is \(v^*=0\), and likewise to (53) is \(v^*=\mu^*(x_0)=(2\lambda/(1-3\lambda))x_0\), both of which agree with (51). Hence, the CE pair (51)–(52) indeed provides a SPE if the condition (55) holds; that is, the failure probability should be less than 1/4, and the parameter λ should be less than a specific threshold, which decreases with increasing p. Note that if p=0, we recover the earlier condition (45) of the deterministic game, where we know that if λ>1/3, then the upper value is unbounded. The question is whether the same applies here, for λ>(1−4p)/3. This is indeed the case, as one can argue that the concavity condition for (54) cannot be improved upon: the universally optimal choice of γ when Player 1 has access to \(x_1\) but not \(x_0\) is what led to that conditional cost. Hence, the pair (51)–(52) is the complete set of SPE for the game of this subsection (with channel failure but no noise in the channel), and the condition (55) is tight.

We are now in a position to discuss the SPE of the original stochastic game of this section, to find the conditions (if any) under which the CE policies (46)–(47) are in SPE, and whether those conditions can be relaxed by employing structurally different policies (as in Sect. 3.2).

3.4 CE SPE and Beyond for the 2-Stage Game

To obtain the complete set of SPE for the original stochastic 2-stage game, our starting point will be the pair of CE policies \((\gamma^*,\mu^*)\) given by (47) and (46), the aim being to find the region in the λ–p space for which these policies constitute a SPE. Clearly, we would expect that region (if it exists) to be no larger than the one described by (55), since the latter corresponded to the noise-free channel.

Let us first consider the right-hand inequality of (5) for this game, with \(\mu^*(y_0;\beta_0)\) given by (46). In terms of γ this is a strictly convex quadratic optimization problem, which one minimizes with respect to u after conditioning the cost on \(y_{[0,1]}\) and \(\beta_{[0,1]}\); the result is the unique solution given by (47). This part of the inequality does not bring in any additional restriction on λ and p, other than the condition λ<1/3 needed in the expression for \(\mu^*\).

The left-hand inequality of (5) for this game is a bit more involved. We now pick \(\gamma^*\) as given by (47), and maximize the resulting cost over μ, which is equivalent to maximizing the conditional cost with respect to v, where the conditioning is on \(y_0\) and \(\beta_0\). Even though this is also a quadratic optimization problem, existence and uniqueness of the maximum are not guaranteed for all values of λ and p, and we have to find (necessary and sufficient) conditions for strict concavity (in v). Now, the conditional cost (conditioned on \((y_0,\beta_0)\), and with \(v=\mu(y_0,\beta_0)\)) is:

For \(\beta_0=1\) (after some simplifications):

$$\begin{aligned} &p \biggl[x_0+2v-{1\over2(1-3\lambda)}y_0 \biggr]^2 + (1-p) \biggl[ {2+5\lambda\over 5(1+\lambda)} (x_0 + 2v) - {1\over5(1-3\lambda)}y_0 \biggr]^2 \\ &\quad{} + \lambda(1-p) \biggl[ {3 \over5(1+\lambda)} (x_0 + 2v) + {1\over 5(1-3\lambda)}y_0 \biggr]^2 -v^2 \\ &\quad{} +2p + (1-p) {50\lambda^2 + 88\lambda+ 38\over25(1+\lambda)^2} + {\lambda p\over4(1-3\lambda)^2} (y_0)^2, \end{aligned}$$
(56)

and for \(\beta_0=0\) (after some simplifications):

$$\begin{aligned} &(1-p) \biggl[ {2+5\lambda\over5(1+\lambda)} (x_0 + 2v) \biggr]^2 + \lambda(1-p) \biggl[ {3 \over5(1+\lambda)} (x_0 + 2v) \biggr]^2 -v^2 \\ &\quad{} + p [x_0+2v]^2 +2p + (1-p) {50\lambda^2 + 88\lambda+ 38\over25(1+ \lambda)^2}. \end{aligned}$$
(57)

Both (56) and (57) are strictly concave in v if and only if the coefficients of the quadratic terms in v (identical in the two cases) are negative, that is

$$(1-p) \biggl[ {4(2+5\lambda)^2 \over25(1+\lambda)^2} + {36\lambda\over25(1+\lambda)^2} \biggr] +4p-1 < 0, $$

which is equivalent to

$$ p < {3\over28} \quad\mbox{and}\quad \lambda< {3-28p \over25}. $$
(58)

We note that the upper bound on the failure probability p is precisely the condition that makes the upper bound on λ in (58) positive. Another point worth making is that we naturally would expect the conditions on p and λ as given above in (58) to be more stringent than the ones in (55), for the noise-free case. Clearly the condition on p is more restrictive, as 3/28<1/4. For the bound on λ, it again immediately follows that

$$ {3-28p \over25} < {1- 4p\over3}, $$
(59)

whenever p<1.

Now, to complete the verification of the SPE property of the pair (47) and (46), we still have to show that the unique maximizers of the strictly concave (under (58)) quadratic conditional costs (57) and (56) are given by (46). For the former, the result follows readily, since its maximizer is v=0. For the latter, a simple differentiation with respect to v, using \(E[x_0|y_0,\beta_0=1]=(1/2)y_0\), leads after some extensive calculations and simplifications to \(v=[\lambda/(1-3\lambda)]y_0\), which is the same as (46).

Hence, the SPE for the 2-stage game of this section (with noisy channels and nonzero failure probabilities) is a CE SPE, but for a more restrictive set of values of p and λ (compare (58) with (55), as noted earlier). The question now is whether the gap can be closed by using non-CE policies, as was done in the failure-free case (p=0). Clearly, the upper value of the game is finite for the entire set of values of p and λ in (55): simply substitute (52) for u into the cost, now with additive noise in \(y_1\) and \(y_0\), and note that the presence of additive noise in the channels does not alter the required concavity condition; hence we have a well-defined strictly concave quadratic maximization problem for v under the same condition (55).

As already mentioned, the region in the parameter space λ–p that corresponds to the CE SPE (47) and (46) is smaller than the region corresponding to the SPE of the noise-free channel case, and the question now is whether the region of existence of a SPE can be enlarged by transitioning to a pair of non-CE policies, as was done in Sect. 3.2 for the case p=0. We will see that this is indeed the case, and an equalizer policy for Player 1 does the job. Its derivation, however, is a bit more complicated than that of Sect. 3.2, because the possibility of channel failures brings in an additional element of complexity (even though the basic idea is still the same). Let us first assume that \(\beta_0=1\), and start with a general linear policy for Player 1:

$$ u= \hat{\gamma}(y_1, y_0, \beta_1) = \alpha_1 y_1 + \alpha_0(\beta_1) y_0, $$
(60)

where \(\alpha_1\), \(\alpha_0(\beta_1=1)=:\alpha_0^1\), and \(\alpha_0(\beta_1=0)=:\alpha_0^0\) are constant parameters yet to be determined. They will be chosen such that, with (60) used in J(γ,v), the latter expression becomes independent of v (when conditioned on \(y_0\)). Skipping the details, the expression for

$$ J(\hat{\gamma}, v) = E \bigl\{ \bigl(x_1 - \hat{\gamma}(y_1, y_0, \beta_1) + w_1 \bigr)^2 + \lambda \bigl( \hat{\gamma}(y_1, y_0, \beta_1) \bigr)^2 - v^2 | y_0, \beta_0=1 \bigr\} $$
(61)

is

$$\begin{aligned} & p \bigl[x_0+2v-\alpha^0_0y_0 \bigr]^2 + (1-p) \bigl[ (1-\alpha_1) (x_0 + 2v) - \alpha_0^1y_0 \bigr]^2 \\ &\quad {}+ \lambda(1-p) \bigl[ \alpha_1 (x_0 + 2v) + \alpha_0^1y_0 \bigr]^2 -v^2 + (1-p)\lambda\bigl(2(\alpha_1)^2 + \bigl(\alpha_0^1\bigr)^2 + 1\bigr) \\ &\quad {} +2p + (1-p) \bigl( (\alpha_1)^2 + (1-\alpha_1)^2 + 1\bigr) + \lambda p \bigl( \alpha_0^0\bigr)^2(y_0)^2 + 1 . \end{aligned}$$
(62)

This is a quadratic function of v; the coefficient of \(v^2\) can be annihilated by choosing

$$ \alpha_1 = {1\over1+\lambda} \biggl[ 1 - {\sqrt{(1-4p-3\lambda )(1-p)}\over2 (1-p)} \biggr], $$
(63)

which is a well-defined expression provided that 4p+3λ<1 and, naturally (since λ>0), also p<1/4, which together are identical to (55). For annihilation of the coefficient of v, on the other hand, we need the following relationship between \(\alpha_0^1\) and \(\alpha_0^0\):

$$ 2p\alpha_0^0+ \alpha_0^1 \sqrt{(1-4p-3\lambda) (1-p)}= {1\over4}. $$
(64)

Now, we have to show that these are best responses to some policy of Player 2, which will necessarily be a mixed strategy, as in Sect. 3.2. The process now is to assume that v has the form

$$ v= {\hat{\mu}}(y_0) = k_0 y_0 + \xi,\quad \xi\sim N\bigl(0, \sigma^2\bigr), $$
(65)

for some \(k_0\) and \(\sigma^2\); find the best response of Player 1 to this by minimizing \(J(\gamma,\hat{\mu})\) with respect to γ (which, by strict convexity, will clearly be unique, and will be in the structural form (60)); require consistency with (63)–(64); and solve for \(k_0\), \(\sigma^2\), \(\alpha_0^1\), and \(\alpha_0^0\). The outcome is the following set of unique expressions:

$$\begin{aligned} k_0 =& {\lambda\over1-3\lambda}, \qquad \sigma^2 = {\sqrt{(1-p)} \over2\sqrt{(1-4p-3\lambda)}} - {5\over8} \end{aligned}$$
(66)
$$\begin{aligned} \alpha_0^0 =& {1\over2(1-3\lambda)}, \qquad \alpha^1_0 = {\sqrt{(1-4p-3\lambda)(1-p)} \over4(1-p)(1-3\lambda)}. \end{aligned}$$
(67)

To complete the construction of the SPE, we still have to find the conditions under which \(\sigma^2\) as given above is well defined (that is, positive). The required condition (both necessary and sufficient, provided that (55) holds, which is a natural condition) is

$$ 4(1-p) > 5\sqrt{(1-4p-3\lambda) (1-p)} \quad\Leftrightarrow\quad \lambda> \max \biggl(0, {3-28p \over25} \biggr). $$
(68)

Note that the (lower) bound on λ matches exactly the upper bound in (58), and hence non-CE SPE policies make up for the restriction brought in by the CE SPE.

To gain further insight (for purposes of establishing continuity) we can look at two limiting cases: (i) For p=0, the non-CE solution matches exactly the one given in Sect. 3.2 for the failure-free case. (ii) At λ=(3−28p)/25, with p<3/28, which is the boundary between the two regions corresponding to CE and non-CE SPE, \(\alpha_1\) in (63) is 3/[5(1+λ)], which is exactly the coefficient of \(y_1\) in (47) with \(\beta_1=\beta_0=1\); \(\alpha_0^1\) in (67) is 1/[5(1−3λ)], which is exactly the coefficient of \(y_0\) in (47) with \(\beta_1=\beta_0=1\); and finally, \(\alpha_0^0\) in (67) (which does not depend on p) is exactly the coefficient of \(y_0\) in (47) with \(\beta_1=0\), \(\beta_0=1\), and this holds for all λ satisfying the other conditions, not only at the boundary.

The remaining case to analyze is \(\beta_0=0\). The CE SPE in this case would be (from (46)–(47)):

$$ \gamma^*(y_1; \beta_1) = {2\over3(1+\lambda)} y_1, \qquad v^*=0, $$
(69)

which is a valid one (that is, the cost under \(\gamma^*\) is strictly concave in v) if and only if the coefficient of \(v^2\) is negative, that is

$$(1-p) \biggl[ {4(3\lambda+ 1)^2 \over9(1+\lambda)^2} + {16\lambda \over9(1+\lambda)^2}\biggr] -1 + 4p < 0, $$

which simplifies to λ<(5−32p)/27, for which we need p<5/32 (for positivity). To extend the solution to a larger region, we again have to look for an equalizer policy that annihilates v and is also a best response to \(v=\xi\sim N(0,\sigma^2)\) for some \(\sigma^2\). Following the same process as earlier, we start with \(u=\alpha_1 y_1\), and compute the cost faced by Player 2, where the coefficient of \(v^2\) is:

$$4(1-p) \bigl[(1-\alpha_1)^2 + \lambda( \alpha_1)^2 \bigr] -1 + 4p. $$

Setting this equal to zero and solving for \(\alpha_1\), we obtain the expression given by (63). Now, the best response by Player 1 to v=ξ is

$$u={1\over1+\lambda} E[x_0+2\xi+ w_0 | y_1] = {1\over\lambda} \biggl( 1- {1\over2} \sqrt{1-3\lambda} \biggr) y_1, $$

where we then require the multiplying coefficient above to equal \(\alpha_1\) given by (63), which leads to the following unique expression for \(\sigma^2\):

$$ \sigma^2 = {\sqrt{1-p} \over2\sqrt{1-4p-3\lambda}} - {3\over4}, $$
(70)

which is well-defined and positive provided that

$$ \max \biggl(0, {5-32p\over27} \biggr) < \lambda< {1-4p\over3}. $$
(71)

Note that the lower bound on λ matches the upper bound in the case of the CE SPE, and that the SPE policy of Player 1 is continuous across the boundary λ=(5−32p)/27.

We now collect all this in the following theorem, which is the main result of this section.

Theorem 1

The two stage discrete-time stochastic game formulated in this section admits a saddle-point equilibrium (SPE) provided that

$$ 0\leq p < {1\over4} \quad\textit{and}\quad 0< \lambda< {1-4p \over3} ; $$

otherwise, the upper value of the game is unbounded. The SPE policies of the players, \((\gamma^*,\mu^*)\), admit different characterizations in two different regions of the parameter space, and also depending on whether \(\beta_0=1\) or 0:

  • For λ≤(5−32p)/27 and p<5/32 when \(\beta_0=0\), and λ≤(3−28p)/25 and p<3/28 when \(\beta_0=1\), \(\gamma^*\) and \(\mu^*\) are given by (47) and (46), respectively; this constitutes a certainty-equivalent (CE) SPE.

  • For max(0,(5−32p)/27)<λ<(1−4p)/3 and p<1/4 when \(\beta_0=0\), and max(0,(3−28p)/25)<λ<(1−4p)/3 and p<1/4 when \(\beta_0=1\), the SPE policies are of the non-CE type, given by

    $$\begin{aligned} &\gamma^*(y_1, y_0; \beta_1, \beta_0) = \alpha_1 y_1 + \bigl(\beta _1\alpha_0^1 + (1-\beta_1) \alpha_0^0 \bigr) y_0 \end{aligned}$$
    (72)
    $$\begin{aligned} &\qquad\,\,\,\,\,\begin{aligned} \mu^*(y_0; \beta_0) &= k_0 y_0 + \xi, \\ \xi&\sim N\bigl(0, \sigma^2\bigr), \quad\sigma^2 = {\sqrt{1-p} \over2\sqrt{1-4p-3\lambda}} - {3\over4} + {1\over 8}\beta_0, \end{aligned} \end{aligned}$$
    (73)

    where \(\alpha_1\) is given by (63), and the pair \((\alpha_0^0, \alpha_0^1)\) is given by (67).
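
The characterization above is easy to package computationally. The sketch below (our naming and packaging, not from the text) evaluates (63), (66), (67), and the mixing variances of (73) for a given (λ, p); a nonpositive variance signals that the pure CE branch (46)–(47) applies for that value of \(\beta_0\).

```python
# Sketch assembling the SPE characterization of Theorem 1 for given (lam, p).
import math

def two_stage_spe(lam, p):
    if not (0 <= p < 0.25 and 0 < lam < (1 - 4*p) / 3):
        return 'no SPE: upper value unbounded'
    root = math.sqrt((1 - 4*p - 3*lam) * (1 - p))
    alpha1 = (1 - root / (2 * (1 - p))) / (1 + lam)           # (63)
    k0 = lam / (1 - 3*lam)                                    # (66)
    alpha0_0 = 1 / (2 * (1 - 3*lam))                          # (67)
    alpha0_1 = root / (4 * (1 - p) * (1 - 3*lam))             # (67)
    # Mixing variance in (73); sigma2[b0] <= 0 means the CE (pure) SPE
    # (46)-(47) applies on that beta_0 branch.
    sigma2 = {b0: math.sqrt(1 - p) / (2 * math.sqrt(1 - 4*p - 3*lam))
                  - 0.75 + 0.125 * b0 for b0 in (0, 1)}
    return dict(alpha1=alpha1, k0=k0, alpha0_0=alpha0_0,
                alpha0_1=alpha0_1, sigma2=sigma2)
```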

4 CE SPE of the LQG ZSDG in Continuous and Discrete Time

4.1 Various Approaches Toward Construction of SPE

For a two-person ZSDG (in normal or strategic form), with strategy spaces Γ (for Player 1, the minimizer) and \(\mathcal{M}\) (for Player 2, the maximizer), with (expected) cost function J, defined on \(\varGamma\times\mathcal{M}\), let us recall from (5) that a pair \((\gamma^*\in\varGamma, \mu^*\in\mathcal{M})\) is in SPE if

$$J\bigl(\gamma^*, \mu\bigr) \leq J\bigl(\gamma^*, \mu^*\bigr) \leq J\bigl(\gamma, \mu^*\bigr), \quad \forall \gamma\in\varGamma, \mu\in\mathcal{M}. $$

The general direct approach toward derivation of a SPE would be:

  • Fix \(\mu\in\mathcal{M}\) as an arbitrary policy for Player 2, and minimize J(γ,μ) with respect to γ on Γ.

  • Fix γΓ as an arbitrary policy for Player 1, and maximize J(γ,μ) with respect to μ on \(\mathcal{M}\).

  • Look for a fixed point, which would then be a SPE.

Even though direct, this approach entails a very complex process for dynamic games (in continuous or discrete time), even if they are of the linear-quadratic type. Unless the information structure is static, the optimization problems involved depend structurally on the arbitrarily fixed policies, rendering them unwieldy.

An alternative (still direct) approach would be a recursive (backward-forward) one, applicable to discrete-time dynamic games with particular information structures, and possibly extendable to some classes of continuous-time ZSDGs:

  • Proceed recursively at t=T−1,T−2,… .

  • At t, solve for the SPE (if it exists) of the 1-stage game by fixing in J the policies for t+1,…,T−1 at their optimum choices and those for 0,…,t−1 arbitrarily, with the former possibly depending on the optimum policies (yet to be determined) at 0,1,…,t.

Such a construction is doable, but it is quite tedious (and depends on the specific information structure, and applies primarily to discrete-time games); for such a derivation, in a broader Nash equilibrium context, see Başar (1978a).

A third, indirect approach, entails expansion of information structures of the players, obtaining a SPE in the induced expanded (richer) policy spaces, and then projecting the solution (contracting it) back to the original policy spaces. Such an approach works when the SPE values of the two games (one on original policy spaces and the other one on the expanded ones) are the same, and this is generally the case if the expansion involves only past actions of the players. Hence, we have the following process:

  • Endow both players with past actions, assuming that they already have access to the past measurements in terms of which the actions were generated.

  • Any SPE to the original stochastic dynamic game (SDG) is also a SPE to the new one (but naturally not vice versa).

  • Any two SPE of the new SDG are ordered interchangeable.

  • Solve for some (conveniently constructed) SP policies for the new SDG, and find representations (Başar and Olsder 1999) in the original policy spaces.

  • Verify for the original SDG that the policies arrived at are indeed in SPE (this step is a verification of existence, which is much simpler than verifying characterization).

A further justification of this indirect approach can be found in Başar (1981). In the next two subsections, we illustrate the approach on the two LQ ZSDGs introduced and discussed in Sects. 2.1, 2.2, 2.4, and 2.5. While doing this, we have to keep in mind the features we have observed within the context of the 2-stage ZS SDG of Sect. 3.

4.2 SPE Property of CE Policies of the LQG ZSDG

We turn here to the continuous-time LQG ZSDG of Sect. 2.1, for which the CE policies (24)–(25) were offered as a candidate SPE for the original SDG with noise in the common channel. We now investigate whether these policies are indeed in SPE for at least some region of the parameter space (as was the case for the 2-stage game of Sect. 3). Toward this end, we first enlarge the policy spaces of the players to include also past actions; that is, the players now have access to \((y_{[0,t)},u_{[0,t)},v_{[0,t)})\) at time t. Denote the corresponding expanded policy spaces for Players 1 and 2 by \(\tilde{\varGamma}\) and \(\tilde{\mathcal{M}}\), respectively. If \(y_{[0,t)}\) were replaced by \(x_t\) (that is, the perfect state measurement case), still allowing players to have access to past actions, the pair of policies (21)–(22) would still be in SPE (Başar and Olsder 1999), whose CE counterpart in \(\tilde{\varGamma}\times\tilde{\mathcal{M}}\) would still be of the form (24)–(25), with however \(\{\hat{x}_{t}, t\geq0\}\) replaced by \(\{\zeta_t, t\geq0\}\), generated by

$$ d\zeta_t = (A \zeta_t + Bu_t + Dv_t)dt + K(t) (dy_t - H \zeta_t dt),\quad {\zeta}_0 = 0, t\geq0. $$
(74)

Note that the above is still the Kalman filter equation, but driven not only by the measurement but also by the past actions. Now, one can show using the ordered interchangeability property of multiple SPE policies that any pair of SP policies in \(\varGamma\times {\mathcal{M}}\) also constitute a SPE in the expanded policy spaces \(\tilde{\varGamma}\times\tilde{\mathcal {M}}\) (but not vice versa) (Başar and Olsder 1999; Başar 1981), and further that by some standard properties of the LQG control problem discussed in Sect. 2.3, the pair (24)–(25) indeed constitutes a SPE for the new SDG with expanded policy spaces, provided that the RDE (23) does not have a conjugate point in the interval [0,t f ], which is exactly the condition of existence of SPE to the LQG ZSDG with perfect-state measurements.

Clearly, however, the SP policies above for the SDG with expanded policy spaces are not implementable even for that game, because they require either cooperation in generating the conditional mean of x, or that estimate ζ (as in (74)) to be generated by a third party and supplied to the two antagonistic players, neither of which is realistic. To make the solution implementable in real time, and in line with the adversarial aspect of the game, we have to replace these policies with ones that allow the players to run their own filters, driven by the common measurement (but not by the actions of the players), as given below:

$$\begin{aligned} u^*(t) =& \gamma_t^*(y_{[0,t)}) = { \gamma}^{\mathrm{CE}}_t({z}_t) = -{1\over\lambda} B^T Z(t) {z}_t, \quad t\geq0, \end{aligned}$$
(75)
$$\begin{aligned} v^*(t) =& \mu_t^*(y_{[0,t)}) = { \mu}^{\mathrm{CE}}_t({\eta}_t) = D^T Z(t) { \eta}_t, \quad t\geq0, \end{aligned}$$
(76)

where z and η are generated by (as counterpart of (74)):

$$\begin{aligned} dz_t =& \hat{A}z_tdt + K(t) (dy_t - Hz_t dt), \quad z_0 = 0, t\geq0, \end{aligned}$$
(77)
$$\begin{aligned} d\eta_t =& \hat{A}\eta_tdt + K(t) (dy_t - H\eta_t dt), \quad \eta_0 = 0, t\geq0, \end{aligned}$$
(78)

where

$$\hat{A}:= A - \biggl({1\over\lambda}BB^T-DD^T \biggr)Z(t), $$

K is the Kalman gain, given again by (18), with Λ solving (28).

The policies \((\gamma^{\mathrm{CE}},\mu^{\mathrm{CE}})\) given by (75)–(76) constitute representations of the SP policies in the expanded policy spaces and now belong to \(\varGamma\times\mathcal{M}\), and as such also constitute a SPE for the original SDG, as argued earlier, provided that the response of Player 1 to (76) and that of Player 2 to (75) are well defined, leading to bounded costs. For the former, it can be shown easily (and in fact argued without any explicit computation) that

$$\min_{\gamma\in\varGamma} J\bigl(\gamma, {\mu}^{\mathrm{CE}}\bigr) = J \bigl({\gamma}^{\mathrm{CE}}, {\mu}^{\mathrm{CE}}\bigr), $$

and in particular that the quadratic function \(J(u,\mu^{\mathrm{CE}})\) is strictly convex in u. This establishes the right-hand side of the SP inequality (5). For the left-hand-side inequality, on the other hand, we have the LQG optimal control problem

$$\max_{\mu\in\mathcal{M}} J\bigl({\gamma}^{\mathrm{CE}}, \mu\bigr), $$

with 2n-dimensional differential constraints:

$$\begin{aligned} dx_t =& (Ax_t+Dv_t)dt -{1\over\lambda} BB^T Z(t) {z}_t dt + F dw_t, \quad t\geq 0, \\ dz_t =& \hat{A}z_tdt + K(t) (dy_t - Hz_t dt), \quad {z}_0 = 0, t\geq0. \end{aligned}$$

The conjugate-point condition on (23) is not sufficient for this LQG optimal control problem to be well defined, as the cost J(γ CE,v) could be non-concave in v. Strict concavity here is in fact the only condition that would be needed for the pair (γ CE,μ CE) in (75)–(76) to constitute a SPE. Now note that J(γ CE,v) can be written as

$$\begin{aligned} J\bigl({\gamma}^{\mathrm{CE}}, v\bigr) =& E \biggl\{ |x_{t_f}|^2_{Q_f} + \int^{t_f}_0 \biggl[ |x_t|^2_Q + {1\over\lambda} \bigl|B^T Z(t) {z}_t\bigr|^2 - |v_t|^2 \biggr] dt \biggr\} \\ =:& E \biggl\{ |m_{t_f}|^2_{{\tilde{Q}}_f} + \int^{t_f}_0 \bigl[ |m_t|^2_{\tilde{Q}} - |v_t|^2 \bigr] dt \biggr\} , \end{aligned}$$

where \(m:=(x^T\ z^T)^T\), \({\tilde{Q}}_{f} := \operatorname{block\ diag} (Q_{f}, 0 )\), and \({\tilde{Q}} := \operatorname{block\ diag} (Q, (1/ \lambda) ZBB^{T}Z )\). Further, m evolves according to

$$dm_t = \tilde{A} m_t dt + \tilde{D} v_t dt + {\tilde{F}} dw_t, $$

where \(\tilde{D} := [D^{T}, 0]^{T}\), \({\tilde{F}}:= [F^{T}, G^{T} K^{T}]^{T}\), and \(\tilde{A}\) is a 2n×2n matrix, whose ij-th block is, for i,j=1,2: \([\tilde {A}]_{11} := A\), \([\tilde{A}]_{21} := KH\), \([\tilde{A}]_{12} := -(1/ \lambda) BB^{T}Z\), and \([\tilde{A}]_{22} := {\hat{A}} - KH\).

The condition for strict concavity for this optimization problem, regardless of the nature of the information available to Player 2, is (Başar and Bernhard 1995) nonexistence of a conjugate point to the RDE below on the interval \([0,t_f]\):

$$ \dot{S} + S\tilde{A} + \tilde{A}^TS + S\tilde{D} \tilde{D}^T S + \tilde {Q}=0, \quad S(t_f) = {\tilde{Q}}_f. $$
(79)
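
Checking the two conditions (no conjugate point for (23), and none for (79)) amounts to three integration passes. The sketch below (our construction, with assumed parameters) integrates (23) backward, (28) forward for the filter gain, and then (79) backward on the augmented system, reporting failure upon finite escape in either Riccati equation.

```python
# Numerical sketch of the two existence conditions: RDEs (23) and (79).
import numpy as np

def pure_spe_exists(A, B, D, F, G, H, Q, Qf, Lam0, lam, tf, n=400):
    dt = tf / n; d = A.shape[0]
    Rinv = np.linalg.inv(G @ G.T)
    M = (1.0/lam) * B @ B.T - D @ D.T
    big = lambda X: (not np.all(np.isfinite(X))) or np.linalg.norm(X) > 1e9
    # 1) Z(t) of (23), backward; Zs[k] approximates Z(k*dt) after reversal.
    Z = Qf.astype(float); Zs = [Z]
    for _ in range(n):
        Z = Z + dt * (A.T @ Z + Z @ A - Z @ M @ Z + Q)
        if big(Z): return False              # conjugate point of (23)
        Zs.append(Z)
    Zs = Zs[::-1]
    # 2) Lambda(t) of (28), forward, with Ahat(t) = A - M Z(t); gain (18).
    Lam = Lam0.astype(float); Ks = []
    for k in range(n + 1):
        Ks.append(Lam @ H.T @ Rinv)
        Ahat = A - M @ Zs[k]
        Lam = Lam + dt * (Ahat @ Lam + Lam @ Ahat.T
                          - Lam @ H.T @ Rinv @ H @ Lam + F @ F.T)
    # 3) S(t) of (79) on the augmented 2n-dim system, backward.
    S = np.block([[Qf, np.zeros((d, d))], [np.zeros((d, d)), np.zeros((d, d))]])
    Dt = np.vstack([D, np.zeros((d, D.shape[1]))])
    for k in range(n, 0, -1):
        Zk, Kk = Zs[k], Ks[k]
        At = np.block([[A, -(1.0/lam) * B @ B.T @ Zk],
                       [Kk @ H, A - M @ Zk - Kk @ H]])
        Qt = np.block([[Q, np.zeros((d, d))],
                       [np.zeros((d, d)), (1.0/lam) * Zk @ B @ B.T @ Zk]])
        S = S + dt * (S @ At + At.T @ S + S @ Dt @ Dt.T @ S + Qt)
        if big(S): return False              # conjugate point of (79)
    return True
```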

We now collect all this in the theorem below.

Theorem 2

The continuous-time LQG ZSDG of Sect. 2.1 admits a pure-strategy SPE provided that the RDEs (23) and (79) have well-defined nonnegative-definite solutions on the interval \([0,t_f]\), in which case the corresponding policies for the players, in SPE, are given by (75)–(76). These feature a restricted certainty equivalence property.

Remark 1

A number of observations are in order here. First, the policies in SPE for the SDG are not simple CE versions of the SPE of the deterministic game; that is, they are not the pair (24)–(25), even though they can be derived from the SPE of the deterministic game, by endowing the players with two separate filter equations in spite of the fact that the players have access to a common measurement channel. Second, the condition for existence of a pure-strategy SPE for the SDG is more restrictive than its counterpart for the perfect-state version (or, essentially equivalently, the deterministic game). This should not be surprising in view of the results of Sect. 3 for perhaps the simplest stochastic dynamic game, where the gap between the two conditions (for existence of pure-strategy SPE in the games with perfect state and noisy state information) was completely closed by allowing for mixed strategies (for the maximizing player). It is quite plausible that the same would hold here, but derivation of such a mixed-strategy SPE for the continuous-time LQG ZSDG still remains a challenging task.

4.3 SPE Property of CE Policies of the LQG Discrete-Time ZSDG

We now proceed with an analysis that is the counterpart of the one above (in Sect. 4.2) for the discrete-time game of Sect. 2.2 and Sect. 2.5, not for the most general case, but for two scenarios: (i) when there is no failure of channels (that is, p=0, as in Sect. 3.2), and (ii) when the channel provides perfect state measurements but intermittently fails (as in Sect. 3.3). In both cases, we obtain restricted CE SPE. The derivation is a direct counterpart of the one in Sect. 4.2, and hence to avoid duplication we will just provide the basic results without details of the reasoning and the pathway.

Let us first discuss case (i). In Sect. 2.5, we had offered (34)–(35) as a candidate SPE pair for this scenario, but as we have argued in the previous subsection, having a single filter to be shared by both players is not a realistic situation, and hence we will have to introduce individualized compensators. In view of this, (34)–(35) will have to be modified as follows:

$$\begin{aligned} u^*_t =& \gamma_t^*(y_{[0,t]}) = {\gamma}^{\mathrm{CE}}_t({z}_{t|t} ) = - {1\over\lambda} B^T Z_{t+1}\bigl(N_t^{-1} \bigr)^T A {z}_{t|t}, \quad t= 0, 1, \ldots, \end{aligned}$$
(80)
$$\begin{aligned} v^*_t =& \mu_t^*(y_{[0,t]}) = {\mu}^{\mathrm{CE}}_t({\eta}_{t|t}) = D^T Z_{t+1} \bigl(N_t^{-1}\bigr)^T A { \eta}_{t|t}, \quad t= 0, 1, \ldots, \end{aligned}$$
(81)

where z t|t and η t|t are generated by, with z 0|−1=0:

$$ \begin{aligned} {z}_{t|t} ={}& {z}_{t|t-1} + \varLambda_t H^T \bigl(H\varLambda_tH^T + GG^T \bigr)^{-1} (y_t - H {z}_{t|t-1} ) \\ {z}_{t+1|t} ={}& (N_t)^{-1} A {z}_{t|t-1} \\ &{}+ (N_t)^{-1} A\varLambda_t H^T \bigl(H\varLambda_tH^T + GG^T \bigr)^{-1} (y_t - H {z}_{t|t-1} ), \end{aligned} $$
(82)

and, with η 0|−1=0,

$$ \begin{aligned} {\eta}_{t|t} ={}& {\eta}_{t|t-1} + \varLambda_t H^T \bigl(H\varLambda_tH^T + GG^T \bigr)^{-1} (y_t - H { \eta}_{t|t-1} ) \\ {\eta}_{t+1|t} ={}& (N_t)^{-1} A { \eta}_{t|t-1} \\ &{}+ (N_t)^{-1} A\varLambda_t H^T \bigl(H\varLambda_tH^T + GG^T \bigr)^{-1} (y_t - H {\eta}_{t|t-1} ), \end{aligned} $$
(83)

and the sequence \(\{\varLambda_t, t=1,2,\ldots,T\}\) is as in (37). By arguments similar to those in the previous subsection, the pair (80)–(81) provides an SPE, provided that (33) holds and the quadratic function \(J(\gamma^{\mathrm{CE}},v)\) is strictly concave in \(v\). An explicit condition for the latter can be obtained in terms of a \(2n\times2n\) discrete-time Riccati equation, which involves a recursive verification as in (33).
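To make the mechanics of (80) and (82) concrete, the following minimal Python sketch implements one step of the minimizer's filter and the corresponding CE control; the maximizer's recursion (83) is identical, with \(\eta\) in place of \(z\). The names `A, B, H, G, Z_next, N, Lam, lam` are hypothetical handles for the chapter's data \(A, B, H, G, Z_{t+1}, N_t, \varLambda_t, \lambda\), assumed available as NumPy arrays.

```python
import numpy as np

def filter_step(z_pred, y, A, H, G, Lam, N):
    """One step of the filter recursion (82).

    z_pred is the prediction z_{t|t-1}, Lam is Lambda_t (as in (37)),
    and N is N_t. Returns the filtered estimate z_{t|t} and the next
    prediction z_{t+1|t}.
    """
    S = H @ Lam @ H.T + G @ G.T          # innovation covariance
    K = Lam @ H.T @ np.linalg.inv(S)     # filter gain
    z_filt = z_pred + K @ (y - H @ z_pred)      # z_{t|t}
    z_next = np.linalg.solve(N, A @ z_filt)     # z_{t+1|t} = N_t^{-1} A z_{t|t}
    return z_filt, z_next

def ce_control(z_filt, A, B, Z_next, N, lam):
    """CE control (80): u_t = -(1/lam) B^T Z_{t+1} (N_t^{-1})^T A z_{t|t}."""
    return -(1.0 / lam) * B.T @ Z_next @ np.linalg.inv(N).T @ A @ z_filt
```

The sketch uses the identity \(z_{t+1|t} = N_t^{-1} A z_{t|t}\), obtained by substituting the first line of (82) into the second.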

For case (ii), that is, when \(y_t = \beta_t x_t\), \(t=0,1,\ldots,T-1\), the starting point is the pair of policies (38)–(39), where, as before, we endow the players with two separate compensators, with states \(\zeta^1\) and \(\zeta^2\) generated by, for \(i=1,2\),

$$\begin{aligned} \zeta^i_{t} = \beta_{t} y_{t} + (1-\beta_t) \biggl( I-\biggl( {1\over\lambda}B B^T +DD^T\biggr)Z_{t+1} \bigl(N_t^{-1}\bigr)^T \biggr)A \zeta^i_{t-1}, \quad \zeta^i_0 = y_0. \end{aligned}$$
(84)

Hence, the players’ CE policies become

$$\begin{aligned} u^*_t =& \gamma_t^*(y_{[0,t]}) = {\tilde{\gamma}}^{\mathrm{CE}}_t\bigl(\zeta^1_t\bigr) = - {1\over\lambda} B^T Z_{t+1}\bigl(N_t^{-1} \bigr)^T A \zeta^1_t, \quad t= 0, 1, \ldots, \end{aligned}$$
(85)
$$\begin{aligned} v^*_t =& \mu_t^*(y_{[0,t]}) = {\tilde{\mu}}^{\mathrm{CE}}_t\bigl(\zeta^2_t \bigr) = D^T Z_{t+1} \bigl(N_t^{-1} \bigr)^T A \zeta^2_t, \quad t= 0, 1, \ldots, \end{aligned}$$
(86)

which, by an argument similar to that in case (i), are in SPE provided that (33) holds and the quadratic function \(J({\tilde{\gamma}}^{\mathrm{CE}}, v)\) is strictly concave in \(v\). As before, an explicit condition for the latter can be obtained in terms of a \(2n\times2n\) discrete-time Riccati equation, which involves a recursive verification as in (33).
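Under the same hypothetical naming conventions, a minimal sketch of one stage of case (ii) runs the compensator (84) for each player and evaluates the CE policies (85)–(86) at the resulting states (with the exponent on \(N_t^{-1}\) read as a transpose, consistent with (80)–(81)):

```python
import numpy as np

def compensator_step(zeta_prev, y, beta, A, B, D, Z_next, N, lam):
    """One step of the compensator (84) for case (ii).

    When the channel works (beta = 1), the state is observed exactly
    (y_t = x_t); when it fails (beta = 0), the compensator propagates
    the saddle-point closed-loop dynamics.
    """
    Ninv_T = np.linalg.inv(N).T
    M = (1.0 / lam) * B @ B.T + D @ D.T
    closed_loop = (np.eye(A.shape[0]) - M @ Z_next @ Ninv_T) @ A
    return beta * y + (1 - beta) * closed_loop @ zeta_prev

def ce_policies(zeta1, zeta2, A, B, D, Z_next, N, lam):
    """CE policies (85)-(86), evaluated at each player's own state."""
    Ninv_T = np.linalg.inv(N).T
    u = -(1.0 / lam) * B.T @ Z_next @ Ninv_T @ A @ zeta1
    v = D.T @ Z_next @ Ninv_T @ A @ zeta2
    return u, v
```

Note that (84) is driven only by the common channel output, so each player can compute its own copy from the shared measurements alone.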

In view of the complete set of results of Sect. 3 for a two-stage version of this game, for case (ii) we would not expect a less stringent condition to be obtainable (that is, there would be no need to expand the policy spaces to include mixed strategies), whereas for case (i) the extra condition, strict concavity of \(J(\gamma^{\mathrm{CE}},v)\) in \(v\), can be dispensed with by the inclusion of mixed strategies. We do not pursue this any further here.

In the more general case, however, when the channel is noisy and the failure probability is \(p>0\), a restricted CE still holds, with \(z\) and \(\eta\) in (82)–(83) now incorporating the possibility of failures, as in the derivation of Kalman filters with missing measurements (Shi et al. 2010). Here, too, a strict concavity condition, in addition to the one for \(p=0\), is needed for the existence of a pure-strategy SPE; it can, however, be dispensed with by the inclusion of mixed strategies.
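For orientation, the standard missing-measurement modification of the Kalman filter alluded to above (cf. Shi et al. 2010) can be sketched as follows; `Q` and `R` denote hypothetical process- and measurement-noise covariances, and this generic step is not the chapter's specific recursion.

```python
import numpy as np

def intermittent_kf_step(x_hat, P, y, beta, A, H, Q, R):
    """Kalman filter step with a Bernoulli measurement channel.

    The time update always runs; the measurement update runs only
    when the channel delivers (beta = 1). A generic sketch of the
    missing-measurement construction, with assumed covariances Q, R.
    """
    # time update
    x_pred = A @ x_hat
    P_pred = A @ P @ A.T + Q
    if beta == 1:                         # measurement received
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)
        x_hat = x_pred + K @ (y - H @ x_pred)
        P = (np.eye(P.shape[0]) - K @ H) @ P_pred
    else:                                 # channel failed
        x_hat, P = x_pred, P_pred
    return x_hat, P
```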

5 Discussion, Extensions, and Conclusions

One important message that this chapter conveys (and which applies to more general differential/dynamic games with similar information structures) is that in zero-sum stochastic differential/dynamic games (ZS SDGs) a restricted certainty equivalence (CE) applies if the players have a common measurement channel, but the adversarial nature of the problem creates several caveats that prevent the standard notions of certainty equivalence or separation, prominent in stochastic control problems (Witsenhausen 1971a, 1971b; Fleming and Soner 1993; Yüksel and Başar 2013), from finding an immediate extension. Expansion of information structures to include also action information compatible with the original information, without increasing payoff-relevant information, appears to be a versatile tool in an indirect derivation of pure-strategy saddle-point equilibria (SPE); it does not, however, apply to the derivation of mixed-strategy SPE, as it relies heavily on the ordered interchangeability property of multiple pure-strategy SPE. For the same reason, the indirect approach does not apply to nonzero-sum dynamic games; in fact, Nash equilibria of genuinely nonzero-sum stochastic games (unless they are strategically equivalent to zero-sum games or team problems) never satisfy CE (Başar 1978b). Returning to ZS SDGs, when a generalized CE SPE exists in some region of the parameter space, this is not the full story, because the game may also admit mixed-strategy SPE outside that region, which, however, have to be obtained using a different approach, based on the notions of annihilation and conditional equalization, as demonstrated in Sect. 3. Hence, expansion of strategy (policy) spaces from pure to mixed helps to recover the missing SPE.

We have deliberately confined our treatment in this chapter to ZS SDGs with symmetric information, in order to introduce a restricted (and in some sense generalized) notion of CE and to show that any attempt at directly extending CE from stochastic optimal control to games is a path full of pitfalls, even though the problem (of deriving SPE) is still tractable using an indirect approach (which makes use of expansion of strategy spaces and the ordered interchangeability property of multiple pure-strategy SPE). As indicated earlier, this approach does not extend to nonzero-sum games (NZSGs), because expansion of strategy spaces (through actions) leads to multiplicity of Nash equilibria, in fact a continuum of them (Başar and Olsder 1999), and multiple Nash equilibria (NE) are not ordered interchangeable. Still, there is another approach to the derivation of NE with nonredundant information, as briefly discussed in Sect. 1, provided that the game is in discrete time, with complete sharing of information (that is, a common measurement channel) or sharing of observations with a one-step delay (Başar 1978a). The same approach would of course apply to ZSDGs as well (with one-step delayed sharing), but then the SPE will not be of the CE type. If there is no sharing of information (or sharing with a delay of two units or more), and the players receive noisy state information through separate channels, then the problem remains challenging in both ZS and NZS settings, unless the system dynamics and the information structure have a specific joint structure, as in Nayyar and Başar (2012).

Several fairly direct extensions of the results of this chapter are possible, all in the ZS setting. First, it is possible to introduce intermittent failure of the common measurement channel (2) in the continuous-time case, by mimicking (8):

$$dy_t=\beta_t (H x_t dt + G dw_t) \quad\mbox{or}\quad dy_t=\beta_t H x_t dt + G dw_t, \quad t\geq0, $$

where \(\beta_t\) is an independent two-state Markov jump process (or a piecewise deterministic process) with a given rate (jumps are between \(\beta_t=1\) and \(\beta_t=0\)), and both players observe the realization of this process. The counterpart of the analysis for the discrete-time case could be carried over to this setting as well (for a related framework, see Pan and Başar 1995). A variant of this, in both discrete and continuous time, is the more challenging class of problems where the failure of the transmission of the common noisy state measurement to the players is governed by two independent Bernoulli processes with possibly different rates. Such ZS SDGs involve primarily two scenarios: (i) the players are not aware of the failure of each other's links, and (ii) this information is available (that is, the players share the failure information explicitly or implicitly), but with a one-step delay. Further extensions to (i) multi-player ZS SDGs (with teams playing against teams, where agents within each team do not have identical information), and (ii) nonzero-sum stochastic differential games (with particular types of asymmetric information among the players) constitute two further classes of challenging problems. In all these problems, including the ones discussed in Sect. 4, the characterization of mixed-strategy SPE (as an extension of the analysis of Sect. 3) or NE stands out as a challenging but tractable avenue for future research.
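For concreteness, the first variant of the failing channel above can be simulated by an Euler–Maruyama discretization, with \(\beta_t\) realized as a two-state Markov chain; the jump rates, step size, and state path below are hypothetical inputs.

```python
import numpy as np

def simulate_channel(x_path, H, G, rate_fail, rate_repair, dt, rng):
    """Euler-Maruyama sketch of dy_t = beta_t (H x_t dt + G dw_t),
    with beta_t a two-state Markov jump process; rate_fail and
    rate_repair are hypothetical jump rates. Use, e.g.,
    rng = np.random.default_rng(0).
    """
    y = np.zeros(H.shape[0])
    beta = 1                                  # channel starts up
    ys, betas = [], []
    for x_t in x_path:
        # jump of the channel state over [t, t + dt)
        if beta == 1 and rng.random() < rate_fail * dt:
            beta = 0
        elif beta == 0 and rng.random() < rate_repair * dt:
            beta = 1
        dw = rng.standard_normal(G.shape[1]) * np.sqrt(dt)
        y = y + beta * (H @ x_t * dt + G @ dw)
        ys.append(y.copy())
        betas.append(beta)
    return np.array(ys), np.array(betas)
```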