1 Introduction

1.1 Entropy Games and Matrix Multiplication Games

Entropy games were introduced by Asarin et al. [2]. They model the situation in which two players with conflicting interests, called “Despot” and “Tribune”, wish to minimize or to maximize a topological entropy representing the freedom of a half-player, “People”. Entropy games are special “matrix multiplication games”, in which two players alternately choose matrices in certain prescribed sets; the first player wishes to minimize the growth rate of the infinite matrix product obtained in this way, whereas the second player wishes to maximize it. Whereas matrix multiplication games are hard in general (computing joint spectral radii is a special case), entropy games correspond to a tractable subclass of multiplication games, in which the matrix sets have the property of being invariant under row interchange, the so-called independent row uncertainty (IRU) assumption, sometimes also called the row-wise or rectangularity assumption. In particular, Asarin et al. showed in [2] that the problem of comparing the value of an entropy game to a given rational number is in NP ∩ coNP, giving entropy games a status comparable to that of other important classes of games with an unsettled complexity, including mean payoff games, simple stochastic games, and stochastic mean payoff games; see [6] for background.

Another motivation to study entropy games arises from risk sensitive control [1, 16, 17]: as we shall see, essentially the same class of operators arises in the latter setting. A recent application of entropy games to the approximation of the joint spectral radius of nonnegative matrices (without making the IRU assumption) can be found in [21]. Other motivations originate from symbolic dynamics [32, Chapter 1.8.4].

1.2 Contribution

We first show that entropy games, which were introduced as a new class of games, are equivalent to a class of zero-sum mean payoff stochastic games with perfect information, in which some action spaces are simplices, and the instantaneous payments are given by a Kullback-Leibler entropy. Hence, entropy games fit in a classical class of games, with a “nice” payment function over infinite action spaces.

To do so, we introduce a slightly more expressive variant of the model of Asarin et al. [2], in which the initial state is prescribed (in the original model, the initial state is chosen by a half-player, People). This may look like a relatively minor extension, so we keep the name “entropy game” for it, but this extension is essential to develop an operator approach and derive consequences from it. We show that the main results known for stochastic mean payoff games with finite action spaces and perfect information, namely the existence of the value and the existence of optimal positional strategies, are still valid for entropy games (Theorems 10 and 9). This is derived from the model-theoretic approach of Bolte, Gaubert, and Vigeral [9], together with the observation that the dynamic programming operators of entropy games are definable in the real exponential field. Then, a key ingredient is the proof of existence of Blackwell optimal policies, as a consequence of o-minimality, see Theorem 8. Another consequence of the operator approach is the existence of Collatz-Wielandt optimality certificates for entropy games, Theorem 13. When specialized to the one-player case, this leads to a convex programming characterization of the value, Corollary 14, which can also be recovered from a characterization of Anantharam and Borkar [1].

Our main result, Theorem 16, shows that entropy games in which Despot has a fixed number of significant states (states with a nontrivial choice) can be solved strategically in polynomial time, meaning that optimal (stationary) strategies can be found in polynomial time. Thus, entropy games are somewhat similar to stochastic mean payoff games, for which an analogous fixed-parameter tractability result holds (by reducing the one-player case to a linear program). This approach also reveals a fundamental asymmetry between the players Despot and Tribune: our approach does not lead to a polynomial bound if one fixes the number of states of Tribune. In our proof, o-minimality arguments allow a reduction from the two-player to the one-player case (Theorem 9). Then, the one-player case is dealt with using several ingredients: the ellipsoid method, separation bounds between algebraic numbers, and results from Perron-Frobenius theory.

The operator approach also allows one to obtain practically efficient algorithms to solve entropy games. In this way, the classical policy iteration of Hoffman and Karp [23] can be adapted to entropy games. We report experiments showing that when specialized to one-player problems, policy iteration yields a speedup of one order of magnitude in comparison with the “spectral simplex” method recently introduced by Protasov [38].

Let us finally complete the discussion of related works. The formulation of entropy games in terms of “classical” mean payoff games in which the payments are given by a Kullback-Leibler entropy builds on known principles in risk sensitive control [1, 17]. It can be thought of as a two-player version of the Donsker-Varadhan characterization of the Perron eigenvalue [15]. The latter is closely related to the log-convexity property of the spectral radius established by Kingman [27]. A Donsker-Varadhan type formula for risk sensitive problems, which can be applied in particular to Despot-free entropy games, has recently been obtained by Anantharam and Borkar, in a wider setting allowing an infinite state space [1]. In a nutshell, for Despot-free problems, the Donsker-Varadhan formula appears to be the (convex-analytic) dual of the Collatz-Wielandt formula. Chen and Han [13] developed a related convex programming approach to solve the entropy maximization problem for Markov chains with uncertain parameters. We also note that the present Collatz-Wielandt approach, building on [5], yields an alternative to the approach of [2] using the “hourglass alternative” of [28] to produce concise certificates allowing one to bound the value of entropy games. By comparison with [2], an essential difference is the use of o-minimality arguments: these are needed because we study the more precise version of the game, in which the initial state is fixed. Indeed, a counterexample of Vigeral shows that the mean payoff may not exist in such cases without an o-minimality assumption [46], whereas the existence of the mean payoff holds universally (without restrictions of an algebraic nature on the Shapley operator) if one allows one player to choose the initial state, see e.g. Proposition 2.12 of [3]. Finally, the identification of tractable subclasses of matrix multiplication games can be traced back at least to the work of Blondel and Nesterov [11].

2 Entropy Games

An entropy game \(\mathcal {E}\) is a perfect information game played on a finite directed weighted graph G. There are 2 players, “Despot”, “Tribune”, and a half-player with a nondeterministic behavior, “People”. The set of nodes of the graph is written as the disjoint union D ∪ T ∪ P, where D, T and P represent sets of states in which Despot, Tribune, and People play. We assume that the set of arcs E is included in (D × T) ∪ (T × P) ∪ (P × D), meaning that Despot, Tribune, and People alternate their actions. A weight \(m_{pd}\), which is a positive real number, is attached to every arc (p,d) ∈ P × D. All the other arcs in E have weight 1. An initial state, \(\bar {d}\in D\), is known to the players. A token, initially in node \(\bar {d}\), is moved in the graph according to the following rule. If the token is currently in a node d belonging to D, then Despot chooses an arc (d,t) ∈ E and moves the token to the node t. Similarly, if the token is currently in a node t ∈ T, Tribune chooses an arc (t,p) ∈ E and moves the token to the node p. Finally, if the token is in a node p ∈ P, People chooses an arc (p,d) ∈ E and moves the token to the node d ∈ D. We will assume that every player has at least one possible action in each state in which it is his or her turn to play. In other words, for all d ∈ D, the set of actions {(d,t) ∈ E} must be nonempty, and similar conditions apply to t ∈ T and p ∈ P.

A history of the game consists of a finite path in the directed graph G, starting from the initial node \(\bar {d}\). The number of turns of this history is defined to be the length of this path, each arc counting for a length of one third. The weight of a history is defined to be the product of the weights of the arcs arising on this path. For instance, a history (d0,t0,p0,d1,t1,p1,d2,t2), where di ∈ D, ti ∈ T and pi ∈ P, makes 2 and 1/3 turns, and its weight is \(m_{p_{0}d_{1}}m_{p_{1}d_{2}}\).

A strategy of Player Despot is a map δ which assigns to every history ending in some node d in D an arc of the form (d,t) ∈ E. Similarly, a strategy of Player Tribune is a map τ which assigns an arc (t,p) ∈ E to every history ending with a node t in T. The strategy δ is said to be positional if it only depends on the last node d which has been visited and possibly on the number of turns. Similarly, the strategy τ is said to be positional if it only depends on t and possibly on the number of turns. These strategies are, in addition, stationary if they do not depend on the number of turns.

For every integer k, we define the game in horizon k with initial state \(\bar {d}\), denoted \(\mathcal {E}^k_{\bar {d}}\), as follows. We assume that Despot and Tribune play according to the strategies δ,τ. Then, People plays in a nondeterministic way. Therefore, the pair of strategies δ,τ allows for different histories. The payment received by Tribune, in k turns, is denoted by \(R_{\bar {d}}^{k}(\delta ,\tau )\). It is defined as the sum of the weights of all the paths of length k of the directed graph G with initial node \(\bar {d}\) that are determined by the strategies δ and τ: each of these paths corresponds to different successive choices of People, leading to different histories allowed by the strategies δ,τ. The payment received by Despot is defined to be the opposite of \(R_{\bar {d}}^{k}(\delta ,\tau )\), so that the game in horizon k is zero-sum. In that way, the payment \(R_{\bar {d}}^{k}\) measures the “freedom” of People: Despot wishes to minimize it whereas Tribune wishes to maximize it.

We say that the game \(\mathcal {E}^k_{\bar {d}}\) in horizon k with initial state \(\bar {d}\) has the value \(V^{k}_{\bar {d}}\) if for all 𝜖 > 0, there is a strategy \(\delta ^{*}_{\epsilon }\) of Despot such that for all strategies τ of Tribune,

$$ \begin{array}{@{}rcl@{}} \epsilon+ V^{k}_{\bar{d}} \geqslant R^{k}_{\bar{d}}(\delta^{*}_{\epsilon},\tau) \enspace, \end{array} $$
(1)

and similarly, there is a strategy \(\tau ^{*}_{\epsilon }\) of Tribune such that for all strategies δ of Despot,

$$ \begin{array}{@{}rcl@{}} R^{k}_{\bar{d}}(\delta,\tau^{*}_{\epsilon}) \geqslant V^{k}_{\bar{d}}-\epsilon \enspace . \end{array} $$
(2)

The strategies \(\delta ^{*}_{\epsilon }\) and \(\tau ^{*}_{\epsilon }\) are said to be 𝜖-optimal. In other words, Despot can make sure his loss will not exceed \(V^{k}_{\bar {d}}+\epsilon \) by playing \(\delta ^{*}_{\epsilon }\), and Tribune can make sure to win at least \(V^{k}_{\bar {d}}-\epsilon \) by playing \(\tau ^{*}_{\epsilon }\). The strategies δ∗ and τ∗ are optimal if they are 0-optimal, i.e., if we have the saddle point property:

$$ \begin{array}{@{}rcl@{}} R^{k}_{\bar{d}}(\delta,\tau^{*}) \geqslant R^{k}_{\bar{d}}(\delta^{*},\tau^{*}) = V^{k}_{\bar{d}} \geqslant R^{k}_{\bar{d}}(\delta^{*},\tau) \enspace , \end{array} $$
(3)

for all strategies δ,τ of Despot and Tribune. If the value \(V^{k}_{\bar {d}}\) exists for all choices of the initial state \(\bar {d}\), we define the value vector of the family of games \(({\mathcal {E}^{k}_{d}})_{d\in D}\) in horizon k, to be \(V^{k}:=({V^{k}_{d}})_{d\in D}\in \mathbb {R}^{D}\).

We now define the infinite horizon game \(\mathcal {E}^{\infty }_{\bar {d}}\), in which the payment received by Tribune is given by

$$ R_{\bar{d}}^{\infty}(\delta,\tau):= \limsup_{k\to \infty} (R_{\bar{d}}^{k}(\delta,\tau))^{1/k} $$

and the payment received by Despot is the opposite of the latter payment. (The choice of limsup is somewhat arbitrary; we could choose liminf instead without affecting the results which follow.) The value \(V^{\infty }_{\bar {d}}\) of the infinite horizon game \(\mathcal {E}^{\infty }_{\bar {d}}\), and the optimal strategies in this game, are still defined by a saddle point condition, as in (1), (2) and (3), the payment \(R^{k}_{\bar {d}}(\delta ,\tau )\) being now replaced by \(R_{\bar {d}}^{\infty }(\delta ,\tau )\).

We denote by \(V^{\infty }=(V^{\infty }_{d})_{d\in D}\in \mathbb {R}^{D}\) the value vector of the infinite horizon games \((\mathcal {E}^{\infty }_{d})_{d\in D}\).

We associate to the latter games the dynamic programming operator \(F: \mathbb {R}^{D}\to \mathbb {R}^{D}\), such that, for all \(X\in \mathbb {R}^{D}\), and dD,

$$ \begin{array}{@{}rcl@{}} F_{d}(X) = \min_{(d,t)\in E} \max_{(t,p)\in E} \sum\limits_{(p,d^{\prime}) \in E} m_{pd^{\prime}}X_{d^{\prime}}\enspace . \end{array} $$
(4)

To relate this operator with the value of the above finite or infinite horizon games, we shall interpret these games as zero-sum stochastic games with expected multiplicative criteria. The one-player case was studied in particular by Howard and Matheson under the name of risk-sensitive Markov decision processes [24] and by Rothblum under the name of multiplicative Markov decision processes, see for instance [40].

For any node p ∈ P, we denote by Ep := {(p,d) ∈ E} the set of actions available to People in state p, and we denote by qp the probability measure on Ep obtained by normalizing the restriction of the weight function m to Ep: qpd = mpd/γ(p) with \(\gamma (p)={\sum }_{(p,d^{\prime })\in E_{p}} m_{pd^{\prime }}\). Then, F can be rewritten as

$$ F_{d}(X) = \min_{(d,t)\in E} \max_{(t,p)\in E} \left( \gamma(p) \sum\limits_{(p,d^{\prime}) \in E} q_{pd^{\prime}}X_{d^{\prime}}\right) \enspace . $$

A pair of strategies δ and τ of both players determines the stochastic process \((D_{k},T_{k},P_{k})_{k\geqslant 0}\) with values in D × T × P, such that \(P(D_{k+1}=d^{\prime }\mid H)=q_{pd^{\prime }}\) for all d′ ∈ D and all histories H having k − 1/3 turns and ending in p ∈ P, and such that the transitions from D to T and from T to P are deterministically determined by the strategies δ and τ, respectively, as in the above description of the entropy game \(\mathcal {E}\). Then, the payoff of the entropy game with horizon k starting in \(\bar {d}\), \(\mathcal {E}^k_{\bar {d}}\), is equal to the following expected multiplicative (risk-sensitive) criterion:

$$ R_{\bar{d}}^{k}(\delta, \tau)= \mathbb{E}\left( \gamma(P_{0}){\cdots} \gamma(P_{k-1})\mid D_{0}= \bar{d}\right)\enspace .$$

Proposition 1

The value of the entropy game in horizon k with initial state d, \({\mathcal {E}^{k}_{d}}\), does exist. The value vector Vk of this game is determined by the relations V0 = e, Vk = F(Vk−1), k = 1,2,…, where e is the unit vector (1,…,1) of \(\mathbb {R}^{D}\). Moreover, there exist optimal strategies for Despot and Tribune that are positional.

Proof

This result follows from a classical dynamic programming argument. Indeed, in the one-player case, that is, when there is only one choice of δ or one choice of τ, so that the operator F contains only a “min” or a “max”, the game is in the class of Markov decision problems with multiplicative criterion, and the dynamic programming principle has already been proved in this setting in [24, 40], see also [47, Th. 1.1, Chap 11]. This shows that the game has a value which satisfies Vk = F(Vk−1) and V0 = e, and that an optimal strategy is obtained using these equations. For instance, for a “max” (when Despot has only one choice), Tribune chooses any action (t,p) attaining the maximum in

$$ \max_{(t,p)\in E} \sum\limits_{(p,d^{\prime}) \in E} m_{pd^{\prime}}V^{k-1}_{d^{\prime}} = \max_{(t,p)\in E} \left( \gamma(p) \sum\limits_{(p,d^{\prime}) \in E}q_{pd^{\prime}}V^{k-1}_{d^{\prime}} \right) \enspace . $$

The resulting strategy τ is positional and it is optimal among all strategies τ. A similar result holds for a “min”, leading to a positional strategy δ for Despot.

Let us now consider the general two-player case. Define the sequence of vectors Vk by

$$ \begin{array}{@{}rcl@{}} {V^{k}_{d}} = \min_{(d,t)\in E} \max_{(t,p)\in E} \sum\limits_{(p,d^{\prime}) \in E} m_{pd^{\prime}}V^{k-1}_{d^{\prime}}\enspace . \end{array} $$
(5)

with \({V^{0}_{d}}=1\), for all d ∈ D. We construct candidate strategies δ∗ and τ∗, depending on the current position and number of turns, as follows. In state d, if there remain k turns to be played, Despot selects an action (d,t) achieving the minimum in (5). We denote by δ∗(d,k) the value of t such that (d,t) is selected. In state t, if there remain k − 1/3 turns to be played, Tribune chooses any action (t,p) attaining the maximum in

$$ \max_{(t,p)\in E} \sum\limits_{(p,d^{\prime}) \in E} m_{pd^{\prime}}V^{k-1}_{d^{\prime}} \enspace . $$

Now, if Player Despot plays according to δ∗, we obtain a reduced one-player game. It follows from the same dynamic programming principle as above (applied here to time dependent transition probabilities q and factors γ(⋅)) that the value vector \(V^{\delta ^{*},k}\) of this reduced game in horizon k does exist and satisfies the recursion

$$ V^{\delta^{*},k}_{d} = \max_{(\delta^{*}(d,k),p)\in E} \left( \gamma(p)\sum\limits_{(p,d^{\prime}) \in E} q_{pd^{\prime}}V^{\delta^{*},k-1}_{d^{\prime}}\right)\enspace , $$

with \(V^{\delta ^{*},0}_{d}=1\), for all d ∈ D. Since \(V^{\delta ^{*},k}\) is the value, we have \(V^{\delta ^{*},k}_{d}\geqslant {R_{d}^{k}}(\delta ^{*}, \tau )\) for all strategies τ of Tribune. Noting that \(V^{\delta ^{*},k}_{d}={V^{k}_{d}}\) by definition of δ∗, we deduce that Despot, by playing δ∗, can guarantee that his loss in the horizon k game starting from state d will not exceed \({V^{k}_{d}}\). A dual argument shows that by playing τ∗, Tribune can guarantee that his win will be at least \({V^{k}_{d}}\). □

Example 1

Consider the entropy game whose graph and dynamic programming operator are given by:

For readability, the states of Despot are shown twice on the picture. Here, D = {d1,d2,d3}, T = {t1,t2,t3,t4}, P = {a,b,c,d}, E = {(d1,t1), (d1,t2), (d2,t2), (d3,t3), (d3,t4), (t1,a), (t2,a), (t2,b), (t3,c), (t3,d), (t4,c), (a,d1), (b,d2), (b,d3), (c,d2), (d,d2), (d,d3)}, and all the weights are equal to 1, i.e., \(m_{p d_{i}}=1\) for all p ∈ P and 1 ≤ i ≤ 3 such that (p,di) ∈ E.

One can check that Vk = (1, ϕk+1, ϕk), where ϕ0 = ϕ1 = 1 and ϕk+2 = ϕk + ϕk+1 is the Fibonacci sequence. As an application of Theorem 10 below, it can be checked that the value vector of this entropy game is V∞ = (1,φ,φ), where \(\varphi :=(1+\sqrt {5})/2\) is the golden mean.
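As a sanity check (a minimal sketch, not part of the original argument, assuming a standard Python interpreter), one can run the recursion Vk = F(Vk−1) of Proposition 1 directly on this example; the graph data below is transcribed from the description above, and the ratio of successive iterates at d2 and d3 converges to the golden mean.

```python
# Value iteration V^k = F(V^{k-1}) for the entropy game of Example 1.
# All weights m_{pd'} are equal to 1; the graph is transcribed from the text above.
despot = {'d1': ['t1', 't2'], 'd2': ['t2'], 'd3': ['t3', 't4']}           # arcs (d,t)
tribune = {'t1': ['a'], 't2': ['a', 'b'], 't3': ['c', 'd'], 't4': ['c']}  # arcs (t,p)
people = {'a': ['d1'], 'b': ['d2', 'd3'], 'c': ['d2'], 'd': ['d2', 'd3']} # arcs (p,d')
m = {(p, d): 1.0 for p, succ in people.items() for d in succ}             # weights m_{pd'}

def F(X):
    """Dynamic programming operator (4): F_d(X) = min_t max_p sum_{d'} m_{pd'} X_{d'}."""
    return {d: min(max(sum(m[p, dp] * X[dp] for dp in people[p]) for p in tribune[t])
                   for t in despot[d])
            for d in despot}

V = {d: 1.0 for d in despot}                  # V^0 = e
for _ in range(30):
    V_prev, V = V, F(V)                       # here V^k = (1, phi_{k+1}, phi_k)

print(V['d1'], V['d2'] / V_prev['d2'], V['d3'] / V_prev['d3'])
# -> 1.0 1.618... 1.618..., consistent with the value vector (1, phi, phi)
```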

3 Stochastic Mean Payoff Games with Kullback-Leibler Payments

We next show that entropy games are equivalent to a class of stochastic mean payoff games in which some action spaces are simplices, and payments are given by a Kullback-Leibler divergence.

To the entropy game \(\mathcal {E}\), we associate a stochastic zero-sum game with Kullback-Leibler payments, denoted \(\mathcal {K}\mathcal {L}\) and defined as follows, referred to as the “Kullback-Leibler game” for brevity. This new game is played by the same players, Despot and Tribune, on the same weighted directed graph G (so with the same sets E, P, D and the same weight function m). The nondeterministic half-player, People, will be replaced by a standard probabilistic half-player, Nature.

For any node p ∈ P, recalling that Ep := {(p,d) ∈ E} is the set of actions available to People in state p, we denote by Δp the set of probability measures on Ep. Therefore, an element of Δp can be identified with a vector \(\vartheta =(\vartheta _{pd})_{(p,d)\in E_{p}}\) with nonnegative entries and sum 1. The admissible actions of Despot and Tribune in the states d ∈ D and t ∈ T are the same in the game \(\mathcal {K}\mathcal {L}\) and in the entropy game \(\mathcal {E}\). However, the two games have different rules when the state p ∈ P belongs to the set of People’s states. Then, Tribune is allowed to play again, by selecting a probability measure 𝜗 ∈ Δp; in other words, Tribune plays twice in a row, selecting first an arc (t,p) ∈ E, and then a measure 𝜗 ∈ Δp. Then, Nature chooses the next state d according to the probability 𝜗pd, and Tribune receives the payment − Sp(𝜗;m), where Sp(𝜗;m) is the relative entropy or Kullback-Leibler divergence between 𝜗 and the measure obtained by restricting the weight function m to Ep:

$$ S_{p}(\vartheta; m) := \sum\limits_{(p,d)\in E_{p}} \vartheta_{pd}\log (\vartheta_{pd}/m_{pd}) \enspace . $$
(6)

Therefore, using the notations of Section 2, we get that

$$ S_{p}(\vartheta; m) = -\log\gamma(p)+\sum\limits_{(p,d)\in E_{p}} \vartheta_{pd}\log (\vartheta_{pd}/q_{pd}) $$

is minimal when the chosen probability distribution 𝜗 on Ep is equal to the probability distribution qp of the transitions from state p in the stochastic game defined in Section 2. Recall that relative entropy is a fundamental notion in information theory and statistics [30]. An interesting special case arises when m ≡ 1, as in [2]; then qp is the uniform distribution on Ep, and \(S_{p}(\vartheta ;m)=S_{p}(\vartheta ):= {\sum }_{(p,d)\in E_{p}} \vartheta _{pd} \log \vartheta _{pd}\) is nothing but the opposite of the Shannon entropy of 𝜗.
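As a small numerical illustration (a sketch with arbitrary, hypothetical data, not taken from the text), one can check the identity displayed above, namely that Sp(𝜗;m) equals −log γ(p) plus the Kullback-Leibler divergence between 𝜗 and qp:

```python
import math

# Hypothetical data: three arcs leaving a People state p, with weights m_{pd'}.
m = [2.0, 1.0, 1.0]
gamma = sum(m)                              # gamma(p) = sum of the weights
q = [w / gamma for w in m]                  # normalized transition probabilities q_{pd'}
theta = [0.5, 0.3, 0.2]                     # an arbitrary probability vector in Delta_p

def S(theta, weights):
    """Relative entropy sum_i theta_i log(theta_i / weights_i), as in (6)."""
    return sum(t * math.log(t / w) for t, w in zip(theta, weights))

# The two sides of the identity S_p(theta; m) = -log(gamma(p)) + S_p(theta; q) agree.
print(S(theta, m), -math.log(gamma) + S(theta, q))
```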

A history in the game \(\mathcal {K}\mathcal {L}\) now consists of a finite sequence (d0,t0,p0,𝜗0,d1, t1, p1,… ), which encodes both the states and the actions which have been chosen. A strategy δ of Despot is still a function which associates to a history ending in a state d ∈ D an arc (d,t) in E. A strategy of Tribune now has two components (τ,π): τ is a map which assigns to a history ending in a state t ∈ T an arc (t,p) ∈ E, as before, whereas π assigns to the same history and to the next state p chosen according to τ a probability measure in Δp.

To each history corresponds a path in G, obtained by ignoring the occurrences of probability measures. For instance, the path corresponding to the history h = (d0,t0,p0,𝜗0,d1,t1,p1) is (d0,t0,p0,d1,t1,p1). Again, the number of turns of a history is defined as the length of this path, each arc counting for 1/3. So the number of turns of h is 1 and 2/3. Choosing strategies δ and (τ,π) of both players and fixing the initial state \(d_{0}=\bar {d}\) determines a probability measure on the space of histories h. We denote by

$$ r^{k}_{\bar{d}}(\delta,(\tau,\pi)):= -\mathbb{E}\left( S_{p_{0}}(\vartheta_{0};m)+{\dots} + S_{p_{k-1}}(\vartheta_{k-1};m)\right) $$

the expectation of the payment received by Tribune, in k turns, with respect to this measure, where Sp is as in (6) and m is the weight function of the graph of the game. We denote by \(v_{\!\bar {d}}^{k}\) the value of the game in horizon k, with initial state \(\bar {d}\), and we denote by \(v^{k}=(v^{k}_{d})_{d\in D}\) the value vector. As in the case of entropy games, we shall use subscripts and superscripts to indicate special versions of the game, e.g., \({\mathcal {K}\mathcal {L}^{k}_{d}}\) refers to the game in horizon k with initial state d. Note also our convention to use lowercase letters (as in \(v_{\!{d}}^{k}\)) to refer to the game with Kullback-Leibler payments, whereas we used uppercase letters (as in \(V^{k}_{{d}}\)) to refer to the entropy game.

It will be convenient to consider more special games in which the actions of one of the players are restricted. We will call a policy of Despot a stationary positional strategy of this player, i.e., a map which assigns to every node d ∈ D a node δ(d) = t ∈ T such that (d,t) ∈ E. Similarly, we will call a policy of Tribune a map which assigns to every node t ∈ T a node τ(t) = p ∈ P such that (t,p) ∈ E. Observe, in this definition of policy, the symmetry between Despot and Tribune, whereas the game itself is asymmetric: the policy τ is not enough to determine a positional strategy of Tribune, because the probability distribution at every state p ∈ P is not specified by the policy τ. The sets of policies of Despot and Tribune are denoted by \(\mathcal {P}_{D}\) and \(\mathcal {P}_{T}\), respectively.

If one fixes a policy δ of Despot, we end up with a reduced game \(\mathcal {K}\mathcal {L}^{k}(\delta ,\ast )\) in which only Tribune has actions. We denote by \(v^{k}(\delta ,\ast )=(v^{k}_{d}(\delta ,\ast ))_{d\in D}\in \mathbb {R}^{D}\) the value vector of this game in horizon k. Similarly, if one fixes a policy τ of Tribune, we obtain a reduced game denoted by \(\mathcal {K}\mathcal {L}^{k}(\ast ,(\tau ,\ast ))\), in which Despot plays when the state is in D, Tribune selects an action according to the policy τ when the state is in T, and Tribune plays when the state is in P. The value vector of this reduced game is denoted by \(v^{k}(\ast ,(\tau ,\ast ))=(v^{k}_{d}(\ast ,(\tau ,\ast )))_{d\in D}\in \mathbb {R}^{D}\). We also denote by \(v^{k}(\delta ,(\tau ,\ast ))=(v^{k}_{d}(\delta ,(\tau ,\ast )))_{d\in D}\in \mathbb {R}^{D}\) the value of the reduced game in which both the policy δ of Despot and the policy τ of Tribune are fixed, which means that only Tribune plays when the state is in P. The systematic character of the notation used here should be self-explanatory: the symbol ∗ refers to the actions which are not fixed by the policy.

We also consider the infinite horizon or mean payoff game \(\mathcal {K}\mathcal {L}^{\infty }\), in which the payment of Tribune is now

$$ r^{\infty}_{\bar{d}}(\delta,(\tau,\pi)) := \limsup_{k\to \infty} k^{-1} r^{k}_{\bar{d}}(\delta,(\tau,\pi))\enspace . $$

For 0 < α < 1, we also consider the discounted game \(^{\alpha }\mathcal {K}\mathcal {L}\) with a discount factor α, in which the payment of Tribune is

$$ {~}^{\alpha}r_{\bar{d}}(\delta,(\tau,\pi)) := -\mathbb{E}\left( S_{p_{0}}(\vartheta_{0};m)+ \alpha S_{p_{1}}(\vartheta_{1};m)+ \alpha^{2} S_{p_{2}}(\vartheta_{2};m)+ {\cdots} \right) $$

The value of the mean payoff game is denoted by \(v^{\infty }_{\bar {d}}\), whereas the value of the discounted game is denoted by \({~}^{\alpha }v_{\bar {d}}\). As above, we denote by \(\mathcal {K}\mathcal {L}^{\infty }(\delta ,\ast )\) and \(\mathcal {K}\mathcal {L}^{\infty }(\ast ,(\tau ,\ast ))\) the games restricted by the choice of policies δ,τ, and use an analogous notation for the corresponding value vectors. For instance, αv(∗,(τ,∗)) refers to the value vector of the game \({~}^{\alpha }\mathcal {K}\mathcal {L}(\ast ,(\tau ,\ast ))\) with a discount factor α. We define the notion of value, as well as the notion of optimal strategies, by saddle point conditions, as in Section 2.

The following dynamic programming principle entails that the value of the stochastic game with Kullback-Leibler payments in horizon k is the log of the value of the entropy game.

Proposition 2

The value vector \(v^{k}=({v^{k}_{d}})_{d\in D}\) of the Kullback-Leibler game in horizon k does exist. It is determined by the relations v0 = 0, vk = f(vk−1), k = 1,2,…, where

$$ \begin{array}{@{}rcl@{}} f_{d}(x) = \min_{(d,t)\in E} \max_{(t,p)\in E} \log\left( \sum\limits_{(p,d^{\prime}) \in E} m_{pd^{\prime}}\exp(x_{d^{\prime}})\right) \enspace , \end{array} $$
(7)

and we have \({v_{d}^{k}} = \log {V_{d}^{k}}\).

In order to prove Proposition 2, we recall the following classical result in convex analysis, showing that the “log-exp” function is the Legendre-Fenchel transform of the Shannon entropy.

Lemma 3

The function \(x\mapsto \log ({\sum }_{1\leqslant i\leqslant n} e^{x_{i}})\) is convex and it satisfies

$$ \log\left( \sum\limits_{1\leqslant i\leqslant n} e^{x_{i}}\right) = \max \sum\limits_{1\leqslant i\leqslant n} \vartheta_{i} (x_{i} - \log \vartheta_{i}); \quad \vartheta_{i} \geqslant 0 , 1\leqslant i\leqslant n, \sum\limits_{1\leqslant i\leqslant n} \vartheta_{i} =1 \enspace . $$

This result is mentioned in [42, Example 11.12]. The convexity property is a special instance of the general fact that the log of the Laplace transform of a positive measure is convex (which follows from the Cauchy-Schwarz inequality), whereas the explicit expression as a maximum follows from a straightforward computation (apply the Lagrange multiplier rule).
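The maximum in Lemma 3 is attained at the Gibbs distribution ϑi = exi/∑j exj. The following small numerical sketch (with arbitrary illustrative data, added here only as an aside) checks the identity and the fact that any other probability vector gives a smaller value:

```python
import math, random

x = [0.3, -1.2, 2.0, 0.7]                           # arbitrary illustrative vector

def dual(theta):
    """sum_i theta_i (x_i - log theta_i), the quantity maximized in Lemma 3."""
    return sum(t * (xi - math.log(t)) for t, xi in zip(theta, x) if t > 0)

lse = math.log(sum(math.exp(xi) for xi in x))       # log-sum-exp of x
gibbs = [math.exp(xi - lse) for xi in x]            # maximizing (Gibbs) distribution

assert abs(dual(gibbs) - lse) < 1e-12               # equality at the Gibbs distribution
w = [random.random() for _ in x]
theta = [wi / sum(w) for wi in w]                   # a random probability vector
assert dual(theta) <= lse + 1e-12                   # never exceeds the log-sum-exp value
print(lse, dual(gibbs), dual(theta))
```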

Proof of Proposition 2

For a zero-sum game with finite horizon and additive criterion, the existence of the value is a standard fact, proved in a way similar to Proposition 1. The value vector vk satisfies the following dynamic programming equation

$$ \begin{array}{@{}rcl@{}} {v^{k}_{d}} &= \min\limits_{(d,t)\in E} \max\limits_{(t,p)\in E} \max\limits_{\vartheta\in {\Delta}_{p}} \left( -S_{p}(\vartheta;m) + \langle \vartheta, v^{k-1}\rangle \right) \enspace , \end{array} $$
(8)

where \(\langle \vartheta ,x\rangle = {\sum }_{(p,d^{\prime })\in E_{p}} \vartheta _{pd^{\prime }}x_{d^{\prime }}\) for \(x\in \mathbb {R}^{D}\), and \({v^{0}_{d}}=0\). By Lemma 3,

$$ \begin{array}{@{}rcl@{}} \log\left( \sum\limits_{(p,d^{\prime}) \in E} m_{pd^{\prime}}\exp(x_{d^{\prime}})\right) &=& \log\left( \sum\limits_{(p,d^{\prime}) \in E_{p}} \exp(x_{d^{\prime}}+\log m_{pd^{\prime}})\right) \\ & =& \max\limits_{\vartheta\in {\Delta}_{p}} \sum\limits_{(p,d^{\prime})\in E_{p}} \vartheta_{pd^{\prime}}(x_{d^{\prime}}+\log m_{pd^{\prime}}- \log\vartheta_{pd^{\prime}})\\ &=& \max\limits_{\vartheta\in {\Delta}_{p}} \left( -S_{p}(\vartheta;m) + \langle \vartheta, x\rangle \right) \end{array} $$

and so, (8) can be rewritten as vk = f(vk−1) where f is given by (7). Observe that the operator f is the conjugate of the operator F of the original entropy game: f = log ∘F ∘ exp. It follows that vk = fk(v0) = log Fk(V0) = log Vk, where for a vector \(Y \in (\mathbb {R}^{*}_{+})^{D}\) the notation log(Y ) denotes the vector (log(Yd))d∈D, and exp := log−1. □
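In practice, the conjugation f = log ∘F ∘ exp is also numerically convenient: the inner terms of (7) can be evaluated on log-scale vectors by a log-sum-exp routine, avoiding the overflow that the products in F would cause for long horizons. A minimal sketch (an observation we add here, with hypothetical data for a single People state):

```python
import math

def logsumexp(terms):
    """Stable evaluation of log(sum_i exp(terms_i))."""
    c = max(terms)
    return c + math.log(sum(math.exp(t - c) for t in terms))

# One People state p with hypothetical weights m_{pd'} and a log-scale vector x = log X.
m = [3.0, 5.0]
x = [200.0, 201.0]                           # exp(x_d') would overflow a double

# Inner term of f_d in (7): log(sum_{d'} m_{pd'} exp(x_{d'})), i.e. the log of the
# corresponding term of F applied to exp(x), computed without forming exp(x).
print(logsumexp([math.log(w) + xi for w, xi in zip(m, x)]))
```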

The map f arising in (7) is obviously order preserving and it commutes with the addition of a constant, meaning that f(x + λe) = f(x) + λe where e is the unit vector (1,…,1) of \(\mathbb {R}^{D}\), and \(\lambda \in \mathbb {R}\). Any map with these two properties is nonexpansive in the sup-norm, meaning that ∥f(x) − f(y)∥∞ ≤ ∥x − y∥∞, see [14]. Hence, for 0 < α < 1, the map x ↦ f(αx) is a contraction of rate α in the sup-norm, and so it has a unique fixed point. For discounted games, the existence of the value and of optimal positional strategies is a known fact:

Proposition 4

The discounted game \(^{\alpha }\mathcal {K}\mathcal {L}\) with discount factor 0 < α < 1 has a value and it admits optimal strategies that are positional and stationary. The value vector \({~}^{\alpha }v\) is the unique solution of \({~}^{\alpha }v = f(\alpha \,{~}^{\alpha }v)\).

Proof

The existence and the characterization of the value are standard results, see e.g. the discussion in [36]. It is also known that the optimal strategies are obtained by selecting actions of the players attaining the minimum and maximum when evaluating every coordinate of \(f(\alpha \,{~}^{\alpha }v)\), in a way similar to the proof of Proposition 1, Vk−1 there being replaced by \(\alpha \,{~}^{\alpha }v\). Since \({~}^{\alpha }v\) does not depend on the number of turns, the optimal strategies are also stationary. □
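Since x ↦ f(αx) is a sup-norm contraction of rate α, the fixed point αv can be computed by straightforward iteration. A minimal sketch on a toy one-state operator f(x) = log(2ex) = x + log 2 (a hypothetical instance introduced here only for illustration); it also illustrates the quantity (1 − α)(αv) considered in the next paragraph, which in this toy case equals log 2 for every α:

```python
import math

# Toy one-state operator f(x) = log(2 e^x) = x + log 2  (hypothetical illustration).
f = lambda x: x + math.log(2.0)

def discounted_value(alpha, tol=1e-12):
    """Fixed point of x -> f(alpha * x), computed by contraction iteration."""
    x, prev = 0.0, float('inf')
    while abs(x - prev) > tol:
        prev, x = x, f(alpha * x)
    return x

for alpha in (0.9, 0.99, 0.999):
    v = discounted_value(alpha)
    print(alpha, (1 - alpha) * v)    # -> 0.693... = log 2 for every alpha
```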

Nonexpansive maps can be considered more generally with respect to an arbitrary norm. In this setting, the issue of the existence of the limit of vk/k = fk(v0)/k as k → ∞, and of the limit of (1 − α)(αv), as α → 1, where αv is the unique fixed point of x ↦ f(αx), has received much attention. The former limit is sometimes called the escape rate vector. Nonexpansiveness implies that the set of accumulation points of the sequence vk/k is independent of the choice of v0, but it does not suffice to establish the existence of the limit; some additional “tameness” condition on the map f is needed. Indeed, a result of Neyman [36], using a technique of Bewley and Kohlberg [10], shows that the two limits \(\lim _{k\to \infty } f^{k}(v^{0})/k\) and \(\lim _{\alpha \to 1^{-}} (1-\alpha ){~}^{\alpha }v\) do exist and coincide if f is semi-algebraic. More generally, Bolte, Gaubert and Vigeral [9] showed that the same limits still exist and coincide if the nonexpansive mapping f is definable in an o-minimal structure. A counterexample of Vigeral shows that the latter limit may not exist, even if the action spaces are compact and the payment and transition probability functions are continuous, so the o-minimality assumption is essential in what follows [46].

In order to apply this result, let us recall the needed definitions, referring to [44, 45] for background. An o-minimal structure consists, for each integer n, of a family of subsets of \(\mathbb {R}^{n}\). A subset of \(\mathbb {R}^{n}\) is said to be definable with respect to this structure if it belongs to this family. It is required that definable sets are closed under the Boolean operations, under every projection map (elimination of one variable) from \(\mathbb {R}^{n}\) to \(\mathbb {R}^{n-1}\), and under the lift, meaning if \(A\subset \mathbb {R}^{n}\) is definable, then \(A\times \mathbb {R}\subset \mathbb {R}^{n+1}\) and \(\mathbb {R}\times A\subset \mathbb {R}^{n+1}\) are also definable. It is finally required that when n = 1, definable subsets are precisely finite unions of intervals. A function f from \(\mathbb {R}^{n}\) to \(\mathbb {R}^{k}\) is said to be definable if its graph is definable.

An important example of an o-minimal structure is the real exponential field \(\mathbb {R}_{\text {alg},\exp }\). The definable sets in this structure are the subexponential sets [45], i.e., the images under the projection maps \(\mathbb {R}^{n+k}\to \mathbb {R}^{n}\) of the exponential sets of \(\mathbb {R}^{n+k}\), the latter being sets of the form \(\{x\mid P(x_{1},\dots ,x_{n+k},e^{x_{1}},\dots ,e^{x_{n+k}})=0\}\) where P is a real polynomial. A theorem of Wilkie [48] implies that \(\mathbb {R}_{\text {alg},\exp }\) is o-minimal, see [45]. Observe in particular that the set \(\{x\in \mathbb {R}^{2}\mid x_{1}\leqslant x_{2}\}\) is definable in this structure, being the projection of \(\{x\in \mathbb {R}^{3}\mid x_{2}-x_{1}={x_{3}^{2}}\}\). Together with the closure properties of the structure, this implies that definable maps are stable under the operations of pointwise maximum and minimum. We deduce the following key fact.

Fact 5

The dynamic programming operator f of the Kullback-Leibler game, defined by (7), is definable in the real exponential field.

Theorem 6 ([9])

Let \(f:\mathbb {R}^{n} \to \mathbb {R}^{n}\) be nonexpansive in any norm, and suppose that f is definable in an o-minimal structure. Then,

$$ \lim_{k\to \infty} f^{k}(0)/k $$

does exist, and it coincides with

$$ \lim_{\alpha \to 1^{-}} (1-\alpha) {~}^{\alpha} v \enspace . $$

Corollary 7

Let \(v^{k}=({v^{k}_{d}})_{d\in D}\) be the value vector in horizon k of the stochastic game with Kullback-Leibler payments, \(\mathcal {K}\mathcal {L}^{k}\), and for 0 < α < 1, let αv denote the value vector of the discounted game \(^{\alpha }\mathcal {K}\mathcal {L}\) with discount factor α. Then \(\lim _{k\to \infty } v^{k}/k\) does exist and it coincides with \(\lim _{\alpha \to 1^{-}} (1-\alpha )({~}^{\alpha }v)\).

Proof

We already noted that the map f in (7) is nonexpansive in the sup-norm. It is definable in the real exponential field. So Theorem 6 can be applied to it. □

Corollary 7 will allow us to establish the existence of the value of the mean payoff game, and to obtain optimal strategies, by considering the discounted game, for which, as noted in Proposition 4, the existence of the value and of optimal policies are already known.

Let us recall that a strategy in a discounted game is said to be Blackwell optimal if it is optimal for all discount factors sufficiently close to one. The existence of Blackwell optimal positional strategies is a basic feature of perfect information zero-sum stochastic games with finite action spaces (see [39, Chap. 10] for the one-player case; the two-player case builds on similar ideas, e.g. [18, Lemma 26]). We next show that this result has an analogue for entropy games. To get a Blackwell type optimality result, we need to restrict to a setting with finitely many positional strategies. Recall that \(\mathcal {P}_{D}\) (resp. \(\mathcal {P}_{T}\)) denotes the set of policies of Despot (resp. Tribune). We also recall our notation v∞(δ,∗) for the value of the mean payoff game \(\mathcal {K}\mathcal {L}^{\infty }(\delta ,\ast )\) in which Despot plays according to the policy δ.

We define the projection of a pair of strategies (δ,(τ,π)) in the game \(\mathcal {K}\mathcal {L}\) to be the strategy (δ,τ) in the game \(\mathcal {E}\). In the present setting, it is appropriate to say that a pair of policies \((\delta ,\tau )\in \mathcal {P}_{D}\times \mathcal {P}_{T}\) is Blackwell optimal if there is a real number 0 < α0 < 1 such that, for all α ∈ (α0,1), (δ,τ) is the projection of a pair of optimal strategies (δ,(τ,π)) in the discounted game \({~}^{\alpha }\mathcal {K}\mathcal {L}\).

Theorem 8

The family of discounted Kullback-Leibler games \((^{\alpha }\mathcal {K}\mathcal {L})_{\alpha \in (0,1)}\) has positional Blackwell optimal strategies.

Proof

For all α ∈ (0,1), the discounted game has positional optimal strategies δ∗,(τ∗,π∗). This follows from the standard dynamic programming argument mentioned in the proofs of Propositions 1 and 4, noting that δ∗(d) is obtained by choosing any t ∈ T such that (d,t) ∈ E attains the minimum in the expression

$$ {~}^{\alpha}v_{d} = \min_{(d,t)\in E} \max_{(t,p)\in E} \max_{\vartheta\in {\Delta}_{p}} \left( -S_{p}(\vartheta;m) + \langle \vartheta, \alpha ({~}^{\alpha} v)\rangle \right) \enspace . $$

Similarly, τ∗(t) is chosen to be any p ∈ P such that (t,p) ∈ E attains the maximum in

$$ \max_{(t,p)\in E} \max_{\vartheta\in {\Delta}_{p}} \left( -S_{p}(\vartheta;m) + \langle \vartheta, \alpha ({~}^{\alpha} v)\rangle \right) \enspace , $$

and π∗(p) is chosen to be the unique action 𝜗 attaining the maximum in

$$ \max_{\vartheta\in {\Delta}_{p}} \left( -S_{p}(\vartheta;m) + \langle \vartheta, \alpha ({~}^{\alpha} v)\rangle \right) \enspace $$

(observe that the function to be maximized is strictly concave and continuous on Δp, and that Δp is compact and convex, so the maximum is achieved at a unique point).

By definition of the value and of optimal strategies, we have, for all strategies δ and (τ,π) of Despot and Tribune respectively,

$$ \begin{array}{@{}rcl@{}} {~}^{\alpha}r_{d}(\delta^{*},(\tau,\pi)) \leqslant {~}^{\alpha}v_{d}= {~}^{\alpha}r_{d}(\delta^{*},(\tau^{*},\pi^{*})) \leqslant {~}^{\alpha}r_{d}(\delta,(\tau^{*},\pi^{*}))\enspace , \end{array} $$
(9)

which is equivalent to

$$ \begin{array}{@{}rcl@{}} {~}^{\alpha}v_{d}= {~}^{\alpha}r_{d}(\delta^{*},(\tau^{*},\pi^{*}))= {~}^{\alpha}v_{d}(\delta^{*},\ast)={~}^{\alpha}v_{d}(\ast,(\tau^{*},\pi^{*}))\enspace . \end{array} $$
(10)

Specializing the first inequality in (9) to τ = τ∗, and bounding the last term from above, we deduce that, for all strategies δ and τ of Despot and Tribune respectively, we have

$$ {~}^{\alpha}v(\delta^{*},(\tau,\ast)) \leqslant {~}^{\alpha}v_{d}={~}^{\alpha}v(\delta^{*},(\tau^{*},\ast)) \leqslant {~}^{\alpha}v(\delta,(\tau^{*},\ast))\enspace , $$
(11)

where αvd(δ,(τ,∗)) is the value of the reduced discounted one-player game \(^{\alpha }\mathcal {K}\mathcal {L}(\delta ,(\tau ,\ast ))\) starting at d ∈ D, in which the (not necessarily positional) strategies δ of Despot and τ of Tribune are fixed. The inequalities (11) can be specialized in particular to policies \(\delta \in \mathcal {P}_{D}\) and \(\tau \in \mathcal {P}_{T}\). Then, by Proposition 4, αv(δ,(τ,∗)) is the unique fixed point of the self-map \(x\mapsto {~}^{\tau }f^{\delta }(\alpha x)\) of \(\mathbb {R}^{D}\), where τfδ is the dynamic programming operator given by

$$ \begin{array}{@{}rcl@{}} {~}^{\tau}f^{\delta}_{d}(x) = \log\left( \sum\limits_{(\tau\circ \delta(d),d^{\prime}) \in E} m_{\tau\circ \delta(d),d^{\prime}}\exp(x_{d^{\prime}})\right) \enspace . \end{array} $$
(12)

It follows that the map α ↦ αv(δ,(τ,∗)) is definable in the real exponential field \(\mathbb {R}_{\text {alg},\exp }\). (To see this, observe that, by Fact 5, the set \(\{(x,y)\mid x={~}^{\tau }f^{\delta }(y)\}\times \mathbb {R}\) is definable in this structure; then, taking the intersection of this set with the definable sets {(x,y,α)∣yd = xdα}, for d ∈ D, and projecting the intersection keeping only the x and α variables, we obtain a definable set which is precisely the graph of the map α ↦ αv(δ,(τ,∗)).)

For all \((\bar {\delta },\bar {\tau })\in \mathcal {P}_{D}\times \mathcal {P}_{T}\), let \(I(\bar {\delta },\bar {\tau })\) denote the set of α ∈ (0,1) such that

$$ {~}^{\alpha}v(\bar{\delta},(\tau,\ast)) \leqslant {~}^{\alpha}v(\bar{\delta},(\bar{\tau},\ast)) \leqslant {~}^{\alpha}v(\delta,(\bar{\tau},\ast)) $$
(13)

holds for all \((\delta ,\tau )\in \mathcal {P}_{D}\times \mathcal {P}_{T}\). Since the saddle point property (11) holds for all α (δ∗ and τ∗ depend on α, of course), we have

$$ \begin{array}{@{}rcl@{}} \cup_{(\bar{\delta},\bar{\tau})} I(\bar{\delta},\bar{\tau})= (0,1)\enspace . \end{array} $$
(14)

Observe that the set \(I(\bar {\delta },\bar {\tau })\) is a subset of \(\mathbb {R}\) definable in the real exponential field, which is o-minimal. It follows that \(I(\bar {\delta },\bar {\tau })\) is a finite union of intervals. Hence, (14) provides a covering of (0,1) by finitely many intervals, and so, one of the sets \(I(\bar {\delta },\bar {\tau })\) must include an interval of the form (1 − 𝜖,1).

To show that the policies \(\bar {\delta },\bar {\tau }\) obtained in this way are Blackwell optimal, it remains to show that if \((\bar {\delta },\bar {\tau })\) satisfies (13) for some α, then it is the projection of a pair of optimal strategies \((\bar {\delta },(\bar {\tau },\bar {\pi }))\) in the discounted game \({~}^{\alpha }\mathcal {K}\mathcal {L}\). For this, we shall apply the existence of optimal strategies that are positional and the resulting (10) and (11) to the reduced games \(^{\alpha }\mathcal {K}\mathcal {L}(\bar {\delta },\ast )\) and \(^{\alpha }\mathcal {K}\mathcal {L}(\ast , (\bar {\tau },\ast ))\), respectively.

The first game leads to the existence of positional stationary strategies τ1,π1 of Tribune such that, for all d ∈ D,

$$ {~}^{\alpha}v_{d}(\bar{\delta},\ast)= {~}^{\alpha}r_{d}(\bar{\delta},(\tau^{1},\pi^{1})) ={~}^{\alpha}v(\bar{\delta},(\tau^{1},\ast))\enspace . $$

Then, using (13), we get that \({~}^{\alpha }v(\bar {\delta },(\bar {\tau },\ast ))\geqslant {~}^{\alpha }v(\bar {\delta },(\tau ^{1},\ast ))={~}^{\alpha }v_{d}(\bar {\delta },\ast )\geqslant {~}^{\alpha }v(\bar {\delta },(\bar {\tau },\ast ))\), hence the equality \({~}^{\alpha }v(\bar {\delta },(\bar {\tau },\ast ))={~}^{\alpha }v_{d}(\bar {\delta },\ast )\).

The second one leads to the existence of positional stationary strategies δ2,π2 of Despot and Tribune respectively such that, for all d ∈ D,

$$ {~}^{\alpha}v_{d}(\ast, (\bar{\tau},\ast))= {~}^{\alpha}r_{d}(\delta^{2},(\bar{\tau},\pi^{2}))= {~}^{\alpha}v_{d}(\delta^{2},(\bar{\tau},\ast))={~}^{\alpha}v_{d}(\ast,(\bar{\tau},\pi^{2})) \enspace . $$

Then, using (13), we deduce that \({~}^{\alpha }v_{d}(\bar {\delta },(\bar {\tau },\ast ))\leqslant {~}^{\alpha }v_{d}(\delta ^{2},(\bar {\tau },\ast ))={~}^{\alpha }v_{d}(\ast , (\bar {\tau },\pi ^{2})) \leqslant {~}^{\alpha }v_{d}(\bar {\delta },(\bar {\tau },\pi ^{2})) \leqslant {~}^{\alpha }v_{d}(\bar {\delta },(\bar {\tau },\ast ))\), hence the equality \({~}^{\alpha }v_{d}(\bar {\delta },(\bar {\tau },\ast )) ={~}^{\alpha }v_{d}(\ast , (\bar {\tau },\pi ^{2})) ={~}^{\alpha }v_{d}(\bar {\delta },(\bar {\tau },\pi ^{2}))\). With the equality proved with the first game, this leads to \({~}^{\alpha }v_{d}(\bar {\delta },\ast )={~}^{\alpha }v(\bar {\delta },(\bar {\tau },\ast )) ={~}^{\alpha }v_{d}(\bar {\delta },(\bar {\tau },\pi ^{2})) ={~}^{\alpha }v_{d}(\ast , (\bar {\tau },\pi ^{2}))\). This shows that \((\bar {\delta },(\bar {\tau },\pi ^{2}))\) is a pair of optimal strategies for the discounted game \({~}^{\alpha }\mathcal {K}\mathcal {L}\). Since \((\bar {\delta },\bar {\tau })\) is its projection, we get that it is Blackwell optimal. □

Theorem 9

The value v∞ of the stochastic mean payoff game with Kullback-Leibler payments does exist, and it coincides with \(\lim _{k\to \infty } v^{k}/k = \lim _{\alpha \to 1^{-}} (1-\alpha )({~}^{\alpha }v)\). For all \((\delta ,\tau )\in \mathcal {P}_{D}\times \mathcal {P}_{T}\), the same properties hold for the values v∞(δ,∗) and v∞(∗,(τ,∗)) of the reduced games in which Despot plays according to δ when the state is in D and Tribune plays according to τ when the state is in T, respectively. Moreover, the Blackwell optimal strategies \((\delta ^{*},\tau ^{*})\in \mathcal {P}_{D}\times \mathcal {P}_{T}\) of Theorem 8 satisfy, for all d ∈ D,

$$ \begin{array}{@{}rcl@{}} v^{\infty}_{d}=v^{\infty}_{d}(\delta^{*},(\tau^{*},\ast)) = v^{\infty}_{d}(\delta^{*},\ast) = v^{\infty}_{d}(\ast,(\tau^{*},\ast)) \enspace .\end{array} $$
(15)

In particular,

$$ \begin{array}{@{}rcl@{}} v^{\infty}_{d} = \min_{\delta \in \mathcal{P}_{D}} v^{\infty}_{d}(\delta,\ast) = \max_{\tau\in \mathcal{P}_{T}} v^{\infty}_{d}(\ast,(\tau,\ast)) \enspace. \end{array} $$

Proof

We already noted in Corollary 7 that

$$ \begin{array}{@{}rcl@{}} \lim_{k\to\infty} v^{k}/k = \lim_{\alpha \to 1^{-}} (1-\alpha)({~}^{\alpha}v)\enspace . \end{array} $$
(16)

Moreover, it is shown in [9, Corollary 2, (iii)], as a consequence of a theorem of Mertens and Neyman [34], that this limit coincides with the value of the game. These results rely on the definable and sup-norm nonexpansive character of the dynamic programming operator f. The dynamic programming operator fδ, associated to the reduced game determined by the strategy \(\delta \in \mathcal {P}_{D}\) can be written as

$$ \begin{array}{@{}rcl@{}} f^{\delta}_{d}(x):= \max_{(\delta(d),p)\in E} \log\left( \sum\limits_{(p,d^{\prime}) \in E} m_{pd^{\prime}}\exp(x_{d^{\prime}})\right) \enspace . \end{array} $$
(17)

It is definable and sup-norm nonexpansive, hence the same conclusions apply to the game \(\mathcal {K}\mathcal {L}(\delta ,\ast )\), i.e.,

$$ \begin{array}{@{}rcl@{}} \lim_{k\to\infty} v^{k}(\delta,\ast)/k = \lim_{\alpha \to 1^{-}} (1-\alpha)({~}^{\alpha}v(\delta,\ast)) \end{array} $$
(18)

is the value of the game \(\mathcal {K}\mathcal {L}^{\infty }(\delta ,\ast )\). We argue in the same way for the game \(\mathcal {K}\mathcal {L}(\ast ,(\tau ,\ast ))\), noting that the associated dynamic programming operator is now

$$ \begin{array}{@{}rcl@{}} {~}^{\tau}f_{d}(x):= \min_{(d,t)\in E} \log\left( \sum\limits_{(\tau(t),d^{\prime}) \in E} m_{\tau(t)d^{\prime}}\exp(x_{d^{\prime}})\right) \enspace . \end{array} $$
(19)

This operator is still definable and sup-norm nonexpansive. Hence,

$$ \begin{array}{@{}rcl@{}} \lim_{k\to\infty} v^{k}(\ast,(\tau,\ast))/k = \lim_{\alpha \to 1^{-}} (1-\alpha)({~}^{\alpha}v(\ast,(\tau,\ast))) \end{array} $$
(20)

is the value of the game \(\mathcal {K}\mathcal {L}^{\infty }(\ast ,(\tau ,\ast ))\).

Let δ∗,τ∗ denote the positional Blackwell optimal strategies constructed in Theorem 8. By definition, there is an interval (α0,1) such that for all α ∈ (α0,1), there exists π∗ depending on α such that

$$ \begin{array}{@{}rcl@{}} {~}^{\alpha}v={~}^{\alpha}v(\delta^{*},(\tau^{*},\pi^{*}))={~}^{\alpha}v(\delta^{*},\ast) = {~}^{\alpha}v(\ast,(\tau^{*},\pi^{*})) \end{array} $$

which by (11) leads to

$$ \begin{array}{@{}rcl@{}} {~}^{\alpha}v={~}^{\alpha}v(\delta^{*},\ast) = {~}^{\alpha}v(\ast,(\tau^{*},\ast)) ={~}^{\alpha}v(\delta^{*},(\tau^{*},\ast))\enspace . \end{array} $$
(21)

Multiplying these expressions by (1 − α), passing to the limit, and using (16), (18) and (20), we obtain (15). □

Remark 1

Theorem 9 shows that Player Despot has an optimal positional strategy δ∗ in the mean payoff Kullback-Leibler game. It also shows that in the same game, the actions of Player Tribune at states t ∈ T can be chosen according to the optimal positional strategy τ∗. This theorem does not imply, however, that at every state p ∈ P, the optimal action 𝜗 ∈ Δp can be chosen according to a positional strategy π. Indeed, the proof by an o-minimality argument uses in an essential way the fact that there are finite action spaces at every state d and t, whereas Δp is infinite. We leave for further investigation the question of the existence of such a positional strategy, noting that it is not needed in the application to entropy games.

Remark 2

It is shown in [9, Corollary 2, (iii)], as a consequence of a theorem of Mertens and Neyman [34], that a stochastic game with a definable Shapley operator has a uniform value, a property which is stronger than the mere existence of the value. Loosely speaking, a stochastic game with initial state d is said to have a uniform value \(v^{\infty }_{d}\) if both players can almost guarantee \(v^{\infty }_{d}\) provided that the length of the k-stage game is large enough. In the present setting, we get the following property: for any 𝜖 > 0, there is a pair of strategies (δ,(τ,π)) and a time K such that, for every \(k\geqslant K\), every starting state d and all strategies δ′ and (τ′,π′),

$$ \begin{array}{@{}rcl@{}} {r^{k}_{d}}(\delta,\tau^{\prime},\pi^{\prime})/k\leqslant v^{\infty}_{d}+\epsilon, \qquad {r^{k}_{d}}(\delta^{\prime},\tau,\pi)/k \geqslant v^{\infty}_{d}-\epsilon \enspace . \end{array} $$

4 Application to the Entropy Game Model

4.1 Existence of Optimal Positional Strategies in the Entropy Game

As an application of Theorem 9, we obtain the existence of optimal positional strategies in the entropy game model of Section 2.

Theorem 10

The infinite horizon entropy game has a value and it has optimal positional strategies, namely the Blackwell optimal strategies \((\delta ^{*},\tau ^{*})\in \mathcal {P}_{D}\times \mathcal {P}_{T}\) of Theorem 8. Moreover, for all initial states d,

$$ V_{d}^{\infty} = \lim_{k\to\infty} \left( {V_{d}^{k}}\right)^{1/k}\enspace . $$

Proof

By Proposition 1, Vk = F(Vk−1) where F is as in (4). Moreover, F = exp ∘f ∘ log is the conjugate of the dynamic programming operator f of the Kullback-Leibler game introduced in (7). Corollary 7 shows that \(v^{\infty }=\lim _{k\to \infty } v^{k}/k\) does exist. It follows that \(V^{\infty }_{d}:=\lim _{k\to \infty } ({V_{d}^{k}})^{1/k} = \exp (v^{\infty }_{d})\) does exist for all d ∈ D.

Let (δ∗,τ∗) denote the Blackwell optimal strategies given by Theorem 8. We showed in Theorem 9 that v∞ = v∞(δ∗,∗), and by Corollary 7, we have

$$ v^{\infty}(\delta^{*},\ast)= \lim_{k\to\infty} v^{k}(\delta^{*},\ast)/k\enspace . $$

Using the dynamic programming principle for finite horizon 1-player games, or equivalently, by applying Proposition 2 to the reduced finite horizon game \(\mathcal {K}\mathcal {L}^{k}(\delta ^{*},\ast )\), we obtain that

$$ v^{k}(\delta^{*},\ast)/k= (f^{\delta^{*}})^{k}(0)/k \enspace , $$

where for all δ, fδ is the dynamic programming operator defined in (17) associated to the reduced game \(\mathcal {K}\mathcal {L}(\delta ,\ast )\). Consider now the conjugate Fδ := exp∘fδ ∘ log, so that

$$ F^{\delta}_{d}(X):= \max_{(\delta(d),p)\in E} \left( \sum\limits_{(p,d^{\prime}) \in E} m_{pd^{\prime}}X_{d^{\prime}}\right) \enspace . $$
(22)

By Proposition 1, for all strategies τ of Tribune, not necessarily positional, and all initial states d ∈ D, we have

$$ {R^{k}_{d}}(\delta^{*},\tau) \leqslant [(F^{\delta^{*}})^{k}(e)]_{d} \enspace . $$

Applying all the above equalities and inequalities, we deduce that

$$ \limsup_{k\to\infty} ({R^{k}_{d}}(\delta^{*},\tau))^{1/k} \leqslant \lim_{k\to\infty} [(F^{\delta^{*}})^{k}(e)]_{d}^{1/k}=\exp(v^{\infty}_{d})= V_{d}^{\infty} \enspace, $$

so Player Despot can guarantee that his loss does not exceed \(V_{d}^{\infty }\) by playing δ∗.

Let us now consider the reduced infinite horizon or finite horizon entropy and Kullback-Leibler games in which the strategy of Tribune is fixed and equal to τ∗. By the same arguments as above, we show that the positional strategy τ∗ guarantees that Player Tribune wins at least \(V_{d}^{\infty }\) in the entropy game. Indeed, applying successively Theorem 9 and Proposition 2, we deduce

$$ v^{\infty}= v^{\infty}(\ast,(\tau^{*},\ast))=\lim_{k\to\infty} v^{k}(\ast,(\tau^{*},\ast))/k = \lim_{k\to\infty} ({~}^{\tau^{*}}f)^{k}(0)/k\enspace , $$

where, for all τ, τf is as in (19). Then, considering the conjugate τF := exp∘ τf ∘ log, so that, for all τ,

$$ {~}^{\tau}F_{d}(X):= \min_{(d,t)\in E} \left( \sum\limits_{(\tau(t),d^{\prime}) \in E} m_{\tau(t)d^{\prime}}X_{d^{\prime}}\right) \enspace , $$
(23)

and applying Propositions 2 and 1, we deduce that

$$ V^{\infty}_{d}=\lim_{k\to\infty} [({~}^{\tau^{*}}F)^{k}(e)]_{d}^{1/k} \leqslant\liminf_{k\to\infty} {R^{k}_{d}}(\delta,\tau^{*})^{1/k} \enspace ,$$

for all strategies δ of Despot, not necessarily positional, and all initial states d ∈ D. So Player Tribune can win at least \(V_{d}^{\infty }\) by playing τ∗. □

4.2 Comparison with the Original Entropy Game Model

The original entropy game model of Asarin et al. [2] is a zero-sum game defined in a way similar to Section 2, up to a technical difference: in their model, the initial state is not prescribed. The payment of Tribune in horizon k, instead of being \(R_{\bar {d}}^{k}(\delta ,\tau )\), is the quantity \(\bar {R}^{k}(\delta ,\tau )\), defined now as the sum of the weights of all paths of length k starting at a node in D and ending at a node in D. Hence, \(\bar {R}^{k}(\delta ,\tau ) = {\sum }_{d\in D} R_{d}^{k}(\delta ,\tau )\). The payment of Tribune in their game is then defined as \(\bar {R}^{\infty }(\delta ,\tau ) = \limsup _{k\to \infty } (\bar {R}^{k}(\delta ,\tau ))^{1/k}\). This game is denoted by \(\mathcal {E}^{\text {orig},\infty }\), and we denote by \(\bar {V}^{\infty }\) the value of this game, which is shown to exist in [2].

Note that in the initial model in [2], the weights \(m_{pd^{\prime }}\) are equal to 1. The generalization to weighted entropy games, in which the weights \(m_{pd^{\prime }}\) are integers is discussed in Section 6 of [2]. The case in which the weights \(m_{pd^{\prime }}\) take rational values can be reduced to the latter case by multiplying all the weights by an integer factor. Therefore, we will ignore the restriction that \(m_{pd^{\prime }}=1\) in our definition of \(\mathcal {E}^{\text {orig},\infty }\) and will refer to the entropy game model with rational weights as the entropy game model. The following observation follows readily from Theorem 10.

Proposition 11

The value of the original entropy game \(\mathcal {E}^{\text {orig},\infty }\) considered by Asarin et al. [2] (with a free initial state) coincides with the maximum of the values of the games \(\mathcal {E}^{\infty }_{d}\), taken over all initial states d ∈ D:

$$ \bar{V}^{\infty} = \max_{d\in D} V_{d}^{\infty} \enspace . $$

Example 2

Proposition 11 is illustrated by the game of Example 1. In the original model of [2], the value, defined independently of the initial state, is \((1+\sqrt {5})/2\), whereas our model associates to the initial state d1 a value 1 which differs from the values of d2 and d3.

In [2], entropy games were compared with matrix multiplication games. We present here this correspondence in the case of general weights \(m_{pd^{\prime }}\). Given policies \(\delta \in \mathcal {P}_{D}\) and \(\tau \in \mathcal {P}_{T}\), let \(A(\delta )\in \mathbb {R}^{D\times T}\) and \(B(\tau )\in \mathbb {R}^{T\times D}\) be such that A(δ)dt = 1 if t = δ(d) and 0 otherwise, and B(τ)td = mτ(t)d if (τ(t),d) ∈ E and 0 otherwise, for all (d,t) ∈ D × T. We shall think of A(δ) and B(τ) as rectangular matrices. Then \(\bar {R}^{k}(\delta ,\tau ) = \|(A(\delta )B(\tau ))^{k}\|_{1}\), where for any \(A\in \mathbb {R}^{D\times D}\), Ak denotes its k-th power and \(\|A\|_{1}={\sum }_{dd^{\prime }}|A_{dd^{\prime }}|\) its ℓ1 norm. From this, one deduces that \(\bar {R}^{\infty }(\delta ,\tau ) = \rho (A(\delta )B(\tau ))\), where ρ(A) denotes the spectral radius of the matrix A. Moreover, let \(\mathcal {A}\) and \(\mathcal {B}\) denote the sets of all matrices of the form A(δ) and B(τ) respectively, and let \(\mathcal {A}\mathcal {B}\) be the set of all matrices AB with \(A\in \mathcal {A}\) and \(B\in \mathcal {B}\). The sets \(\mathcal {A}\), \(\mathcal {B}\) and \(\mathcal {A}\mathcal {B}\) are examples of sets of matrices \(\mathcal {M}\) satisfying the following property: all elements of \(\mathcal {M}\) have the same dimension, and if \(\mathcal {M}_{i}\) denotes the set of i-th rows of the elements of \(\mathcal {M}\), then \(\mathcal {M}\) is exactly the set of matrices whose i-th row belongs to \(\mathcal {M}_{i}\) for every i. Such a property defines the notion of IRU matrix sets (independent row uncertainty sets) in [2]. The following property, proved in [2], is the analogue of Theorem 9, \(V_{d}^{\infty }\) being replaced by \(\bar {V}^{\infty }\):

$$ \bar{V}^{\infty}= \min_{A\in \mathcal{A}} \max_{B\in\mathcal{B}} \rho(AB)= \max_{B\in\mathcal{B}}\min_{A\in \mathcal{A}} \rho(AB)\enspace. $$
(24)

A more general property is proved in [5, Section 8], as a consequence of the Collatz-Wielandt theorem (see Corollary 13 below).
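Since formula (24) ranges over the finite sets \(\mathcal {A}\) and \(\mathcal {B}\), it can be checked by brute-force enumeration on small instances. Here is a minimal sketch (our own illustration, assuming numpy is available), applied to the game of Example 1, for which both sides of (24) equal the golden mean:

```python
import itertools
import numpy as np

# Graph of Example 1; all weights m_{pd'} are equal to 1.
D, T = ['d1', 'd2', 'd3'], ['t1', 't2', 't3', 't4']
despot = {'d1': ['t1', 't2'], 'd2': ['t2'], 'd3': ['t3', 't4']}
tribune = {'t1': ['a'], 't2': ['a', 'b'], 't3': ['c', 'd'], 't4': ['c']}
people = {'a': ['d1'], 'b': ['d2', 'd3'], 'c': ['d2'], 'd': ['d2', 'd3']}

def A(delta):                      # D x T 0/1 matrix selecting Despot's policy
    return np.array([[1.0 if t == delta[d] else 0.0 for t in T] for d in D])

def B(tau):                        # T x D matrix of weights for Tribune's policy
    return np.array([[1.0 if d in people[tau[t]] else 0.0 for d in D] for t in T])

def rho(M):                        # spectral radius
    return max(abs(np.linalg.eigvals(M)))

deltas = [dict(zip(D, c)) for c in itertools.product(*(despot[d] for d in D))]
taus = [dict(zip(T, c)) for c in itertools.product(*(tribune[t] for t in T))]

minmax = min(max(rho(A(dl) @ B(tu)) for tu in taus) for dl in deltas)
maxmin = max(min(rho(A(dl) @ B(tu)) for dl in deltas) for tu in taus)
print(minmax, maxmin)              # both equal the golden mean 1.618..., cf. Example 2
```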

Remark 3

Our approach shows that entropy games reduce to Kullback-Leibler games, which are stochastic mean payoff games (with compact action spaces). Asarin et al. [2] remarked that the special deterministic entropy games, in which People has only one possible action in each state, can be re-encoded as deterministic mean payoff games. This can also be recovered from our approach: in this deterministic case, the simplices Δp are singletons in the Kullback-Leibler game, and the entropy function vanishes, so the Kullback-Leibler game degenerates to a deterministic mean payoff game.

5 Applying the Collatz-Wielandt Theorem to Entropy Games

The classical Collatz-Wielandt formulæ provide variational characterizations of the spectral radius ρ(M) of a nonnegative matrix \(M\in \mathbb {R}^{D\times D}\):

$$ \begin{array}{@{}rcl@{}} \rho(M)&=&\inf \{\lambda >0\mid \exists X\in \text{int}~ \mathbb{R}_{+}^{D}, MX \leqslant \lambda X\}\\ &=& \max\{\lambda \geqslant 0\mid \exists X\in \mathbb{R}_{+}^{D}\setminus\{0\}, MX = \lambda X \},\\ &=& \max\{\lambda \geqslant 0\mid \exists X\in \mathbb{R}_{+}^{D}\setminus\{0\}, MX \geqslant \lambda X \}, \end{array} $$

where \(\mathbb {R}_{+}^{D}\) denotes the nonnegative orthant of \(\mathbb {R}^{D}\), and \(\operatorname {int}\mathbb {R}_{+}^{D}\) its interior, i.e., the set of positive vectors. The infimum in the first line is not always attained, whereas by writing “max” we mean that the suprema in the last two lines are always attained.
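As a quick numerical sanity check of the first formula, the following Python sketch (with a made-up irreducible 3×3 matrix) verifies that max_d (MX)_d/X_d is an upper bound on ρ(M) for every positive X, and that a Perron eigenvector attains it.

import numpy as np

# Made-up irreducible nonnegative matrix, used only to illustrate the first
# Collatz-Wielandt formula: rho(M) = inf over positive X of max_d (MX)_d / X_d.
M = np.array([[0.0, 2.0, 1.0],
              [1.0, 0.0, 3.0],
              [2.0, 1.0, 0.0]])

rho = max(abs(np.linalg.eigvals(M)))

# Any positive X gives an upper bound on rho(M) ...
rng = np.random.default_rng(0)
bound = min(max((M @ X) / X) for X in rng.uniform(0.1, 1.0, size=(1000, 3)))

# ... and a Perron eigenvector attains it.
vals, vecs = np.linalg.eig(M)
X_perron = np.abs(vecs[:, np.argmax(vals.real)].real)
print(rho, bound, max((M @ X_perron) / X_perron))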

This has been extended to non-linear, order preserving and continuous self-maps of the standard positive cone by Nussbaum [37], see also [3, 5, 19, 22, 31]. Recall that a self-map F of \(\mathbb {R}^{D}_{+}\) is said to be order preserving if u ⩽ v implies F(u) ⩽ F(v) for all \(u,v \in \mathbb {R}^{D}_{+}\), the relation ⩽ being understood entrywise. This map is positively homogeneous of degree 1 if F(αu) = αF(u), for all α > 0 and \(u \in \mathbb {R}^{D}_{+}\).

Theorem 12

Let F be a continuous, order preserving, and positively homogeneous of degree 1 self-map of \(\mathbb {R}_{+}^{D}\). Then the following quantities coincide:

$$ \begin{array}{@{}rcl@{}} &&\lim_{k\to\infty} \max_{d\in D}[F^{k}(e)]_{d}^{1/k} \end{array} $$
(25)
$$ \begin{array}{@{}rcl@{}} && \inf \{\lambda >0\mid \exists X\in \text{int}\ \mathbb{R}_{+}^{D}, F(X) \leqslant \lambda X \} \end{array} $$
(26)
$$ \begin{array}{@{}rcl@{}} && \max\{\lambda \geqslant 0\mid \exists X\in \mathbb{R}_{+}^{D}\setminus\{0\}, F(X) = \lambda X \}\enspace, \end{array} $$
(27)
$$ \begin{array}{@{}rcl@{}} && \max\{\lambda \geqslant 0\mid \exists X\in \mathbb{R}_{+}^{D}\setminus\{0\}, F(X) \geqslant \lambda X \}\enspace, \end{array} $$
(28)

Proof

The existence of the limit (25) and the fact that it coincides with (26) are proved in [19, Prop. 1]. The fact that (26) and (27) coincide is proved in [37, Theorem 3.1]. The fact that (27) and (28) coincide is proved in [3, Lemma 2.8]. □

The common value of the expressions in Theorem 12 is called the non-linear spectral radius of F.

We showed in Proposition 11 that the value of the original entropy game \(\mathcal {E}^{\text {orig}}\) is precisely

$$ \bar{V}^{\infty} = \max_{d\in D} V_{d}^{\infty} = \lim_{k\to\infty} \max_{d\in D} [F^{k}(e)]_{d}^{1/k} $$

where F is the dynamic programming operator defined in (4). This operator is continuous, order preserving, and homogeneous of degree one. Hence, we get as an immediate corollary of Theorem 12:

Corollary 13

The value \(\bar {V}^{\infty }\) of the original entropy game \(\mathcal {E}^{\text {orig}}\) (with a free initial state) coincides with any of the following expressions

$$ \begin{array}{@{}rcl@{}} &&\inf\{\lambda >0\mid \exists X\in \text{int}\ \mathbb{R}_{+}^{D}, F(X) \leqslant \lambda X \} \end{array} $$
(29)
$$ \begin{array}{@{}rcl@{}} &&\max\{\lambda >0\mid \exists X\in \mathbb{R}_{+}^{D}\setminus\{0\}, F(X) = \lambda X \} \end{array} $$
(30)
$$ \begin{array}{@{}rcl@{}} && \max\{\lambda >0\mid \exists X\in \mathbb{R}_{+}^{D}\setminus\{0\}, F(X) \geqslant \lambda X \} \enspace , \end{array} $$
(31)

where F is the dynamic programming operator (4).

The Collatz-Wielandt formulæ of Corollary 13 are helpful to establish strong duality results, like (24). Note that (24) is weaker than Corollary 13, since it does not imply the existence of a non-linear eigenvector, whereas (30) of Corollary 13 does. See also [3] for an application to mean payoff games and tropical geometry. Our main interest here lies in the following application of (29). We say that a state d of Despot is significant if the set of actions of Despot in this state, {t ∈ T ∣ (d,t) ∈ E}, has at least two elements (i.e., Despot has to make a choice in this state). We say that an entropy game is Despot-free if the Despot player does not have any significant state. A Despot-free game is essentially a one-and-a-half player problem, since the minimum term in the corresponding dynamic programming operator (4) vanishes. Indeed, for each d ∈ D, there is a unique node t such that (d,t) ∈ E, and we define the map σ : D → T by σ(d) = t. The following corollary, which follows from Corollary 13 by making the change of variables μ = log λ and x = log X, is also a special case of a result of Anantharam and Borkar [1].

Corollary 14

The logarithm of the value of a Despot-free original entropy game is given by the value of the optimization problem

$$ \begin{array}{@{}rcl@{}} &&\inf \mu \\ &&(\mu,x) \in \mathbb{R}\times\mathbb{R}^{D}, \text{satisfying}\\ && \mu + x_{d} \geqslant \log\left( \sum\limits_{d^{\prime}\in D} m_{p,d^{\prime}}e^{x_{d^{\prime}}}\right) \text{ for all}\ d \in D, p\in P \text{ such that} \ (\sigma(d),p)\in E \enspace. \\ \end{array} $$
(32)

Observe that the latter expression is the value of an optimization problem in which the variables are μ and \(x = (x_{d})_{d\in D}\), the objective function is the linear form (μ,x) ↦ μ, and the feasible set is convex. Hence, this will lead us to a polynomial time decision procedure in the Despot-free case, which we develop in the next section.
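To make this concrete, here is a minimal sketch of the convex program (32) for a made-up Despot-free instance, written with cvxpy; all numerical data below are hypothetical, and in the Despot-free case the instance is entirely described by the weight vectors of the People states available at σ(d), for each d.

import numpy as np
import cvxpy as cp

# Hypothetical Despot-free instance with n = 3 Despot states.  For each d, the
# unique Tribune state sigma(d) offers one or more People states p; each p is
# encoded by its weight vector (m_{p,d'})_{d' in D}, with zeros where (p,d')
# is not an arc.  All numbers below are made up for illustration.
n = 3
actions = {
    0: [np.array([1, 1, 0]), np.array([0, 2, 1])],
    1: [np.array([1, 0, 1])],
    2: [np.array([0, 1, 1]), np.array([1, 1, 1])],
}

mu = cp.Variable()
x = cp.Variable(n)
constraints = []
for d, weight_vectors in actions.items():
    for m in weight_vectors:
        idx = np.flatnonzero(m)
        # constraint of (32):  mu + x_d >= log( sum_{d'} m_{p,d'} exp(x_{d'}) )
        constraints.append(cp.log_sum_exp(x[idx] + np.log(m[idx])) <= mu + x[d])

prob = cp.Problem(cp.Minimize(mu), constraints)
prob.solve()
print("value of the Despot-free game ~", np.exp(mu.value))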

6 Polynomial Time Solvability of Entropy Games with a Few Significant Despot Positions

By solving an entropy game strategically, we mean finding a pair of optimal policies. We assume from now on that the weights \(m_{pd^{\prime}}\) are integers. Since policies are combinatorial objects, solving the game strategically is a well-posed problem in the Turing (bit) model of computation. Once optimal policies are known, the value of the game, which is an algebraic number, can be obtained as the Perron root of an associated integer matrix. Our first main result is the following.

Theorem 15

Despot-free entropy games can be solved strategically in polynomial time.

This will be proved in Section 7, by combining several ingredients: a reduction to the irreducible case, an application of the ellipsoid method, and separation bounds for algebraic numbers.

We will also show the following generalization of Theorem 15.

Theorem 16

Entropy games in which Despot has a fixed number of significant states can be solved strategically in polynomial time.

7 Proof of Theorem 15 and Theorem 16

We start by considering a Despot-free game. We decompose the proof of Theorem 15 in several steps, corresponding to different subsections.

7.1 Reduction to the Irreducible Case

First, we associate to a Despot-free entropy game a projected directed graph \(\bar G\), with node set D and an arc d → d′ if there is a path (d,t,p,d′) in the original directed graph G. We say that the game is irreducible if \(\bar G\) is strongly connected. Recall that we assumed that any node has a successor, so that strongly connected components are all non-trivial (not reduced to a node with no edge).
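As an illustration, the following sketch builds the projected graph \(\bar G\) and tests irreducibility with networkx; the state names and the adjacency-list encoding of G are made up for the example.

import networkx as nx

# Hypothetical adjacency lists for the three kinds of arcs of G
# (Despot -> Tribune, Tribune -> People, People -> Despot).
despot_to_tribune = {"d1": ["t1"], "d2": ["t1"]}      # Despot-free: one arc per d
tribune_to_people = {"t1": ["p1", "p2"]}
people_to_despot = {"p1": ["d1", "d2"], "p2": ["d2"]}

# Projected graph Gbar on Despot states: arc d -> d' iff some path (d,t,p,d') exists.
Gbar = nx.DiGraph()
Gbar.add_nodes_from(despot_to_tribune)
for d, ts in despot_to_tribune.items():
    for t in ts:
        for p in tribune_to_people.get(t, []):
            for d2 in people_to_despot.get(p, []):
                Gbar.add_edge(d, d2)

print("irreducible:", nx.is_strongly_connected(Gbar))
print("SCCs:", list(nx.strongly_connected_components(Gbar)))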

Lemma 17

The value of an irreducible Despot-free entropy game is independent of the initial state. Moreover, there is a vector \(U\in \operatorname {int}\mathbb {R}_{+}^{D}\) and a scalar \(\lambda^{*} > 0\) such that \(F(U) = \lambda^{*}U\), and \(\lambda^{*}\) coincides with the value of any initial state in this game.

Proof

The non-linear Perron-Frobenius theorem in [19] provides a sufficient condition for the existence of a positive eigenvector of an order preserving, positively homogeneous self-map F of the interior of the cone. It suffices to check that a certain directed graph G(F) associated to F is strongly connected. Specialized to the present setting, this directed graph is defined as follows: the node set is D, and there is an arc from d to d′ if

$$ \lim_{s\to\infty}F_{d}(se_{d^{\prime}})=+\infty \enspace , $$

where \(e_{d^{\prime }}=(0,\dots ,0,1,0,\dots ,0)^{\top }\) is the d′-th vector of the canonical basis of \(\mathbb {R}^{D}\). By considering the explicit form of F, we see that G(F) is precisely the directed graph \(\bar {G}\). Hence, by Theorem 2 of [19], there exists a vector \(U\in \operatorname {int}\mathbb {R}_{+}^{D}\) and a scalar \(\lambda^{*} > 0\) such that \(F(U) = \lambda^{*}U\). It follows from Corollary 13 and from the fact that the value of the original entropy game is \(\max _{d\in D}V^{\infty }_{d}\) that \(V^{\infty }_{d}\leqslant \lambda ^{*}\) holds for all d ∈ D.

We next show that the other inequality holds. Recall that \(V^{\infty }_{d} = \lim _{k\to \infty }[F^{k}(e)]_{d}^{1/k}\) is the value of the entropy game with initial state d, where \(e=(1,\dots,1)^{\top } \in \mathbb {R}^{D}_{+}\).

Since D is finite, there is a positive scalar β > 0 such that \(e\geqslant \beta U\) (indeed, take \(\beta := (\max_{d\in D} U_{d})^{-1}\)). Using the order preserving and positively homogeneous character of the dynamic programming operator F, we get

$$ F^{k}(e) \geqslant F^{k}(\beta U) = \beta F^{k}(U)=\beta (\lambda^{*})^{k} U $$

and so, using that U is positive,

$$ V^{\infty}_{d}=\lim_{k\to\infty}[F^{k}(e) ]_{d}^{1/k}\geqslant \lim_{k\to\infty} (\beta (\lambda^{*})^{k} U_{d})^{1/k} = \lambda^{*} \enspace. $$

Hence, \(V^{\infty }_{d}= \lambda ^{*}\) holds for all d ∈ D. □

Thanks to this lemma, we will speak of “value”, without mentioning the initial state, when the entropy game is irreducible.

Every strongly connected component C of \(\bar G\) with set of nodes \(D_{C}\subseteq D\) yields a reduced game, in which the set of states of Despot is \(D_{C}\), and Tribune only selects actions such that the next state of Despot will remain in \(D_{C}\) for at least one action chosen by People. Moreover, People chooses only actions so that the next state remains in \(D_{C}\). By definition of \(\bar G\), this reduced game is irreducible. We denote it by \(\mathcal {E}[C]\). The following elementary observation allows us to reduce the general Despot-free case to the irreducible Despot-free case.

Lemma 18

In a Despot-free entropy game, the value of a state d is the maximum of the values of the irreducible games \(\mathcal {E}[C]\) corresponding to the different strongly connected components C of \(\bar G\) to which d has access.

Proof

This follows from a more precise result of Zijm, Theorem 5.1 in [50], which determines the asymptotic expansion of \(F^{k}(e)\) as k → ∞. However, the present lemma is more elementary. Alternatively, one can note that the operator f := log ∘ F ∘ exp is precisely in the class of operators considered in [18, Section 4]. The lemma is a special case of Theorem 29 there. □

Therefore, from now on, we make the following assumption.

Assumption 19

The game is Despot-free and irreducible.

We also make the following assumption.

Assumption 20

The weights \(m_{pd^{\prime}}\) are integers.

The case in which the weights are rational numbers reduces to this one (multiplying all weights by a common denominator does not change optimal positional strategies).

7.2 Reduction to a Well Conditioned Convex Programming Problem

Our strategy to prove Theorem 15 when the game is irreducible is to apply the ellipsoid method to the convex programming formulation (32). To do so, we must replace this formulation by another convex program whose feasible set is included in a ball \(B_{2}(a,R)\) (the Euclidean ball with center a and radius R), and contains a Euclidean ball \(B_{2}(a,r)\), where log(R/r) is polynomially bounded in the size of the input. The following key lemma allows us to do so. It bounds the non-linear eigenvalue and eigenvector of the dynamic programming operator F, which have been shown to exist in Lemma 17. We set \(W := \max_{(p,d^{\prime})\in E}m_{pd^{\prime}}\) and n := |D|.

Lemma 21

Suppose the game is Despot-free and irreducible. Then, the value λ of the game is such that 1 ⩽ λ ⩽ nW. Moreover, there exists a vector \(U\in \operatorname {int}\mathbb {R}_{+}^{n}\) such that F(U) = λU, and

$$ \begin{array}{@{}rcl@{}} 1\leqslant U_{d}\leqslant \lambda^{n-1} \enspace , \forall d\in D \enspace . \end{array} $$
(33)

Proof

(1) The fact that λ ⩽ nW follows from the first Collatz-Wielandt formula (29), which implies that \(\lambda \leqslant \max_{d}F_{d}(e)\), where e is the unit vector of \(\mathbb {R}^{D}\). We have \(\max_{d}F_{d}(e) \leqslant nW\). Similarly, the last Collatz-Wielandt formula (31) implies that \(\lambda \geqslant \min _{d} F_{d}(e)\). Since we assumed that every node in G has at least one successor, we have \(F_{d}(e)\geqslant 1\) for all d ∈ D, and so \(\lambda \geqslant 1\).

(2) Let (d,d′) be an arc in \(\bar {G}\), corresponding to a path of the form (d,t,p,d′) in G. Then, \(m_{pd^{\prime }}U_{d^{\prime }}\leqslant F_{d}(U) = \lambda U_{d}\) holds, with λ ⩽ nW and \(m_{pd^{\prime }}\) a positive integer. In particular, \(U_{d^{\prime }}\leqslant \lambda U_{d}\) holds. Since the game is irreducible, any two vertices of \(\bar {G}\) are connected by a path of length at most n − 1. It follows that \(U_{d^{\prime }}/U_{d}\leqslant \lambda ^{n-1}\) holds for all d,d′ ∈ D. We may assume that the minimal entry of U is equal to 1, by dividing U by this minimal entry. Then, \(U_{d}\leqslant \lambda^{n-1}\) holds for all d. □

We denote by \(\mathscr{K}\) the set of pairs \((u,\mu )\in \mathbb {R}^{D}\times \mathbb {R}\simeq \mathbb {R}^{n+1}\), such that

$$ \begin{array}{@{}rcl@{}} f(u) & \leqslant &\mu {e} + u \end{array} $$
(34a)
$$ \begin{array}{@{}rcl@{}} 0 & \leqslant &u_{d} \leqslant (n-1)\lceil \log(nW) \rceil, \forall d \in D, \end{array} $$
(34b)
$$ \begin{array}{@{}rcl@{}} 0 & \leqslant &\mu \leqslant \lceil \log(nW) \rceil+2 \enspace, \end{array} $$
(34c)

where ⌈t⌉ denotes the smallest integer greater than or equal to t, and f is given by (7), recalling that e denotes the unit vector of \(\mathbb {R}^{D}\). Recall that \(W\geqslant 1\) since it is a positive integer, and that if n ≠ 1, then \((n-1)\lceil \log (nW) \rceil \geqslant 1\). By combining Corollary 14, Lemma 17 and Lemma 21, we arrive at the following result.

Proposition 22

The value of a Despot-free irreducible entropy game coincides with the exponential of the value of the convex program:

$$ \begin{array}{@{}rcl@{}} \min \mu, (u,\mu)\in \mathscr{K} \enspace , \end{array} $$
(35)

where \(\mathscr{K}\) is defined by (34).

Proof

If \((u,\mu )\in \mathscr{K}\), then it satisfies (32), so by Corollary 14, μ is not smaller than the logarithm of the value of the game. Hence, the value of (35) is an upper bound for the logarithm of the value of the game. Now, if we take λ to be equal to the value of the game, we know by Lemma 21 and Lemma 17 that 1 ≤ λ ≤ nW and that we can find a vector \(U\in \operatorname {int}\mathbb {R}_{+}^{n}\) such that F(U) = λU, and the bounds (33) on U hold. Then, setting u := log(U) and μ := log λ, we get that \((u,\mu )\in \mathscr{K}\). It follows that the exponential of the value of (35) coincides with the value of the game.

Finally, the convexity of \(\mathscr{K}\) follows from the convexity of every coordinate map of f, which is an immediate consequence of Lemma 3. □

We denote by \(B_{2}(a,r)\) the Euclidean ball with center a and radius r. The sup-norm ball with the same radius and center is denoted by \(B_{\infty}(a,r)\). We have the following lemma.

Lemma 23

Let \(a=((1/2) e,\lceil \log (nW) \rceil +3/2)\in \mathbb {R}^{D}\times \mathbb {R}\) , and let

$$ \begin{array}{@{}rcl@{}} r:= 1/3, \qquad R := \sqrt{n+1}((n-1)\log (nW)+n+1) \enspace . \end{array} $$
(36)

Then,

$$ B_{2}(a,r) \subset \mathscr{K} \subset B_{2}(a,R) \enspace . $$

Proof

Any point (u,μ) in \(B_{\infty}(a,r)\) satisfies \((1/2 - r)e \leqslant u \leqslant (1/2 + r)e\) and \(\lceil \log(nW)\rceil + 3/2 - r \leqslant \mu \leqslant \lceil \log(nW)\rceil + 3/2 + r\). Since n ≠ 1, we get that \((n-1)\lceil \log (nW) \rceil \geqslant 1\), and since r ≤ 1/2, we obtain that (u,μ) satisfies the box constraints (34b) and (34c) defining \(\mathscr{K}\). Since f is order preserving and commutes with the addition of a constant,

$$ \begin{array}{@{}rcl@{}} f(u) & \leqslant& f((1/2+r)e) = (1/2+r) e + f(0) \\ & \leqslant& (1/2+r + \lceil \log(nW) \rceil)e \leqslant (1/2+r + \lceil \log(nW) \rceil)e + (u+(r-1/2)e)\\ &=& (2r+\lceil \log(nW) \rceil )e+u \leqslant (3r -1 +\mu)e +u \end{array} $$

and so f(u) ≤ μe + u as soon as r ≤ 1/3. We deduce that \(B_{2}(a,1/3)\subset B_{\infty }(a,1/3)\subset \mathscr{K}\).

Moreover, since \(\mathscr{K}\) is included in a box of width \(\ell := (n-1)\lceil \log(nW) \rceil + 2\), for any choice of \(a^{\prime }\in \mathscr{K}\), \(\mathscr{K}\) is included in the sup-norm ball \(B_{\infty}(a^{\prime},\ell)\), and so, in the Euclidean ball \(B_{2}(a^{\prime },\ell \sqrt {n+1})\). It follows that \(\mathscr{K}\subset B_{2}(a,R)\). □

7.3 Construction of a Polynomial Time Weak Separation Oracle

We shall solve Problem (35) by the ellipsoid method [20]. The latter needs the following notions.

Definition 1

Let \(\mathscr{K}\) denote a convex body in \(\mathbb {R}^{q}\). A weak separation oracle for \(\mathscr{K}\) is a procedure, taking as input a rational number ν > 0 and a rational vector \(y\in \mathbb {R}^{q}\), which concludes one of the following: (i) asserting that y is at Euclidean distance at most ν from \(\mathscr{K}\); (ii) finding an approximate separating half-space of precision ν, i.e., a linear form \(\phi : x \mapsto c\cdot x\), with \(c\in \mathbb {R}^{q}\) of Euclidean norm at least 1, such that for every \(x\in \mathscr{K}\),

$$ \phi(x) \leqslant \phi(y) + \nu \enspace . $$

Let us now recall the main complexity result about the ellipsoid method [20]. To do so, we denote by 〈r〉 the number of bits needed to code an object r, under the standard binary encoding. For instance, if r is an integer, 〈r〉 := ⌈log2(r)⌉ + 1, if r = p/q is a rational, 〈r〉 := 〈p〉 + 〈q〉, if r = (ri) is a rational vector, \(\langle {r}\rangle := {\sum }_{i} \langle {r_{i}}\rangle \), and if ψ is a linear form with rational coefficients over \(\mathbb {R}^{q}\), \(\psi (x)={\sum }_{i} r_{i} x_{i}\), for \(x\in \mathbb {R}^{q}\), then \(\langle \psi \rangle ={\sum }_{i} \langle r_{i}\rangle \). Here and after, the notion of length of an input refers to the binary encoding.

The ellipsoid method can be applied to solve the following problem, consisting in finding an approximate minimum of precision 𝜖 of a linear form ψ with rational coefficients over a convex body \(\mathscr{K} \subset \mathbb {R}^{q}\). This means looking for a vector x∗ such that \(d_{2}(x^{*},\mathscr{K})\leqslant \epsilon \) and \(\psi (x^{*}) \leqslant \min _{x\in \mathscr{K}} \psi (x) + \epsilon \), where d2 denotes the Euclidean distance. We assume that we know a vector \(a\in \mathscr{K}\) with rational coordinates, and rational numbers 0 < r < R such that

$$ B_{2}(a,r)\subset \mathscr{K} \subset B_{2}(a,R) \enspace . $$

In that case, the size of the input of the approximate minimization problem is measured by 〈ψ〉 + 〈a〉 + 〈r〉 + 〈R〉 + 〈𝜖〉.

It is shown in [20] that if the convex set \(\mathscr{K}\) admits a polynomial time weak separation oracle, the ellipsoid method computes an 𝜖-approximate solution of the minimization problem in a time polynomial in the size of the input. Specialized to the present setting, and taking into account the polynomial estimates for log r and log R in Lemma 23, we get the following result.

Theorem 24 (Corollary of [20, Th. 3.1])

Suppose that the set \(\mathscr{K}\) defined by (34) admits a weak separation oracle which runs in polynomial time in the bitsize of the input and in the bitsize of the game. Then, the ellipsoid method returns an approximate optimal solution of precision 𝜖 of Problem (35), i.e., a vector (u,μ) such that \(d_{2}((u,\mu ),\mathscr{K})\leqslant \epsilon \) and μ does not exceed the value of Problem (35) by more than 𝜖, in a time that is polynomial in 〈𝜖〉 + |E| + log W.

Recall that |⋅| denotes the cardinality of a set, in particular |E| denotes the number of arcs of G.

The following result allows us to apply Theorem 24 to Problem (35).

Proposition 25

The convex set \(\mathscr{K}\) defined by (34) admits a weak separation oracle which runs in polynomial time.

To show this proposition, we need a series of arguments. Some of these arguments, like the next lemma, are standard, whereas other arguments require some transparent but rather technical bookkeeping, exploiting the non-expansive character of f to control the approximation errors.

Lemma 26

Let 𝜖 > 0 and t be rational numbers, and assume first that t ≤ 0. Then, a rational approximation of absolute precision 𝜖 of exp(t) can be computed in a time that is polynomial in 〈t〉 and 〈𝜖〉. Assume now that t > 0. Then, a rational approximation of absolute precision 𝜖 of log(t) can be computed in a time that is polynomial in 〈t〉 and 〈𝜖〉.

Proof

It is shown in [7] that the conclusion is true when the input belongs to a fixed compact subset of the interval (−∞,0], in the case of exp, or of (0,∞), in the case of log. The fact that the same property still holds for the whole intervals (−∞,0] and (0,∞) follows from the range reduction techniques [35]. □

Lemma 27

Let x be a vector in \(\mathbb {R}^{D}\) with rational entries, and let 𝜖 > 0 be a rational number. An approximation of f(x) with a sup-norm error not exceeding 𝜖 can be obtained in polynomial time in 〈x〉 + 〈𝜖〉 + |E| + log W.

Proof

We have

$$ f_{d}(x) = \max_{(\sigma(d),p)\in E} \log(\sum\limits_{(p,d^{\prime})\in E}m_{pd^{\prime}}\exp(x_{d^{\prime}})) \enspace. $$

Hence, it suffices to check that for every pP, the value

$$ h(x):=\log\left( \sum\limits_{(p,d^{\prime})\in E}m_{pd^{\prime}}\exp(x_{d^{\prime}})\right) $$

can be approximated with a precision 𝜖 within a polynomial time. We set \(\bar x:= \max _{d^{\prime }\in D} x_{d^{\prime }}\), and make the change of variables \(x_{d^{\prime }} = \bar x+ \tilde {x}_{d^{\prime }}\), so that

$$ h(x) = \bar x + \log t, t:= \sum\limits_{(p,d^{\prime})\in E} m_{pd^{\prime}}\exp(\tilde{x}_{d^{\prime}}) $$

We observe that 1 ≤ t and that log has Lipschitz constant 1 over [1,∞). Hence, to evaluate h(x) with a precision 𝜖, it suffices to compute an approximation \(\tilde {t}\) of t with precision 𝜖/2, which can be done in polynomial time thanks to Lemma 26, and then to approximate \(\log \tilde {t}\) with precision 𝜖/2, which can also be done in polynomial time by the same lemma. □
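The shift by \(\bar x\) used in this proof is the standard way to avoid overflow when evaluating such expressions; the following sketch shows it in plain floating point (rather than with the rational approximations of Lemma 26).

import numpy as np

def h(weights, x):
    """log( sum_i m_i * exp(x_i) ), computed with the max-shift of Lemma 27.

    Plain floating-point stand-in for the rational-arithmetic evaluation used
    in the proof; `weights` are the m_{p,d'} and `x` the current point.
    """
    x = np.asarray(x, dtype=float)
    xbar = x.max()
    # exp(x - xbar) lies in (0, 1], so no overflow; log is 1-Lipschitz on [1, inf).
    return xbar + np.log(np.dot(weights, np.exp(x - xbar)))

print(h([2.0, 3.0], [1000.0, 999.0]))   # naive evaluation would overflow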

Proof of Proposition 25

Let ν > 0. Our aim is to check whether a given pair \((\bar v,\bar \mu )\) is at distance at most ν from \(\mathscr{K}\). Since we already showed that we can get an approximation of f in polynomial time, the proof will be a matter of routine bookkeeping (except perhaps the use of the subdifferential of f to construct a separating halfspace).

We denote by 𝜖 > 0 a rational number, 𝜖 ≤ 1, which we shall fix in the course of the proof.

We provide the announced separation oracle. We first check that every box constraint, as well as the non-linear constraints \(f_{d}(\bar v)\leqslant \bar v_{d}+ \bar \mu \), for d ∈ D, are satisfied up to a precision 𝜖, which can be done in polynomial time in \(\langle \bar {v}\rangle +\langle \epsilon \rangle +|E|+\log W\) thanks to Lemma 27.

(i) If these constraints are satisfied up to a precision 𝜖, we have \(-\epsilon \leqslant \bar v_{d} \leqslant (n-1)\lceil \log (nW) \rceil +\epsilon \), \(-\epsilon \leqslant \bar \mu \leqslant \lceil \log (nW) \rceil +2 + \epsilon \), and \(f(\bar v) \leqslant (\bar \mu + \epsilon )e + \bar v\). Setting \(\tilde {v}_{d}= \min (\max (\bar v_{d},0),(n-1)\lceil \log (nW) \rceil )\), we get that \(\|\bar v-\tilde {v}\|_{\infty }\leqslant \epsilon \) with \(\tilde {v}\) satisfying the constraint (34b). Using the fact that f is nonexpansive in the sup-norm, we deduce that \(f(\tilde {v})\leqslant \tilde {\mu }e+ \tilde {v}\), where \(\tilde {\mu } =\bar \mu + 3\epsilon \). Moreover, \(0\leqslant \tilde \mu \leqslant \lceil \log (nW) \rceil +2 + 4\epsilon \). Now, since f is also convex, for all t ∈ [0,1], we have

$$ f(t \tilde{v})\leqslant t f(\tilde{v})+(1-t) f(0) \leqslant \tilde{\mu}^{\prime} e+ t \tilde{v}\enspace ,$$

with \(\tilde {\mu }^{\prime }=t\tilde {\mu }+(1-t) \lceil \log (nW) \rceil \). Hence, taking t = 1/(1 + 2𝜖), we get that \(\tilde {\mu }^{\prime }\) satisfies the constraint (34c), whereas \(\tilde {v}^{\prime }=t\tilde {v}\) still satisfies the constraint (34b), so that \((\tilde {v}^{\prime },\tilde {\mu }^{\prime })\) belongs to \(\mathscr{K}\). Using 1 − t ≤ 2𝜖, we also have \(\|(\bar v,\bar \mu )-(\tilde {v}^{\prime },\tilde {\mu }^{\prime })\|_{\infty } \leqslant \epsilon L\), with L = 5 + 2(n − 1)⌈log(nW)⌉, implying that \(d_{2}((\bar v,\bar \mu ),\mathscr{K}) \leqslant L\sqrt {n+1}\epsilon \). Hence, we shall require that

$$ \epsilon \leqslant \epsilon_{1}:= \frac{\nu}{(5+2 (n-1)\lceil \log (nW) \rceil) \sqrt{n+1}}\enspace , $$

to make sure that \(d_{2}((\bar v,\bar \mu ),\mathscr{K})\leqslant \nu \).

(ii) Assume now that one of the box constraints is violated by more than 𝜖. Then, one of the linear forms (v,μ)↦ ± vd or (v,μ)↦ ± μ provides a separating half-space, and the norm of this linear form is 1. Assume finally that all the box constraints are satisfied up to 𝜖, and that one of the non-linear constraints is violated by more than 𝜖. Let us write this constraint as

$$ g(v,\mu) := \log\left( \sum\limits_{d^{\prime}\in D} m_{pd^{\prime}} \exp(v_{d^{\prime}})\right) - v_{d} - \mu\leqslant 0 $$

for some d ∈ D and (σ(d),p) ∈ E, so that \(g(\bar v, \bar \mu )> \epsilon \). Since g is convex, the differential ϕ of g at point \((\bar v, \bar \mu )\) satisfies

$$ \phi(v-\bar v, \mu - \bar\mu) \leqslant g(v,\mu) - g(\bar v,\bar\mu)\leqslant - \epsilon $$

for all \((v,\mu ) \in \mathscr{K}\), i.e.,

$$ \phi(v,\mu) \leqslant \phi(\bar v,\bar\mu)-\epsilon, \forall (v,\mu)\in \mathscr{K} $$

showing that ϕ is a separating half-space. However, we need an approximate half-space given by a linear form with rational coefficients, which we next construct by approximating ϕ.

To do so, we first compute the differential of g at point \((\bar v,\bar \mu )\). This is the linear form

$$ \phi: (x,y)\in \mathbb{R}^{D}\times \mathbb{R} \mapsto \sum\limits_{d^{\prime}} x_{d^{\prime}} m_{pd^{\prime}} \exp(\bar v_{d^{\prime}})/\left( \sum\limits_{d^{\prime\prime}} m_{pd^{\prime\prime}}\exp(\bar v_{d^{\prime\prime}})\right)-x_{d}-y\enspace . $$

The maximum of \(\bar v\) can be subtracted from every coordinate of \(\bar v\) without changing this linear form. Then, by Lemma 26, the coefficients of this linear form can be approximated in polynomial time. It follows that we can compute an approximation \(\tilde {\phi }\) of ϕ of precision 𝜖 in polynomial time in \(\langle \bar v\rangle +\langle \epsilon \rangle \). Observe that the coefficient of the variable y in the linear form ϕ is always equal to − 1. Hence, the approximate linear form \(\tilde {\phi }\) can be chosen with the same coefficient, and then, \(\tilde {\phi }\) is of norm at least 1.

Since any element (v,μ) of \(\mathscr{K}\) satisfies the box constraints (34b) and (34c), whereas \((\bar v, \bar \mu )\) satisfies these constraints up to 𝜖 ≤ 1, we get that

$$ \|(v,\mu)- (\bar v,\bar \mu)\|_{\infty} \leqslant M:=(n-1)\lceil \log (nW) \rceil+3\enspace ,$$

hence

$$ |\tilde{\phi}(v-\bar v,\mu-\bar \mu )- \phi(v-\bar v,\mu-\bar \mu)| \leqslant M (n+1)\epsilon, \forall (v,\mu)\in \mathscr{K} \enspace . $$

So it suffices that

$$ \epsilon \leqslant \epsilon_{2} := \frac{\nu}{M(n+1)} $$

to make sure that \(\tilde {\phi }\) defines an approximate separating half-space of precision ν.

To summarize, it suffices to take 𝜖 = min(𝜖1,𝜖2) in the previous analysis, so that the conditions of Definition 1 are satisfied. Moreover, for this choice, all the computations take a polynomial time in \(\langle \nu \rangle +\langle \bar v\rangle +\langle \bar \mu \rangle \), and the size |E| + log W of the description of the game. □
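The following Python sketch summarizes the structure of this oracle in plain floating point, omitting the rational-arithmetic and precision bookkeeping of the proof; the data layout (a list of weight vectors per state) is hypothetical.

import numpy as np

def weak_separation_oracle(vbar, mubar, eps, actions, n, W):
    """Schematic sketch of the oracle of Proposition 25 (floating point only).

    actions[d] lists the weight vectors (m_{p,.}) available at sigma(d).
    Returns ("near", None) if all constraints of (34) hold up to eps, and
    otherwise ("separate", (c, c_mu)) with the coefficients of a separating
    linear form in (v, mu) of norm at least 1.
    """
    box_v = (n - 1) * np.ceil(np.log(n * W))
    box_mu = np.ceil(np.log(n * W)) + 2

    for d in range(n):                                  # box constraints (34b)
        if vbar[d] < -eps:
            return "separate", (-np.eye(n)[d], 0.0)
        if vbar[d] > box_v + eps:
            return "separate", (np.eye(n)[d], 0.0)
    if mubar < -eps:                                    # box constraints (34c)
        return "separate", (np.zeros(n), -1.0)
    if mubar > box_mu + eps:
        return "separate", (np.zeros(n), 1.0)

    vbar = np.asarray(vbar, dtype=float)
    for d in range(n):                                  # non-linear constraints (34a)
        for m in actions[d]:
            z = np.asarray(m, dtype=float) * np.exp(vbar)   # max-shift of Lemma 27 omitted
            g = np.log(z.sum()) - vbar[d] - mubar
            if g > eps:                                 # violated: use the gradient of g
                c = z / z.sum()
                c[d] -= 1.0
                return "separate", (c, -1.0)
    return "near", None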

7.4 Using Separation Bounds Between Algebraic Numbers to Compare Policies

It follows from Theorem 24 that we can compute in polynomial time an approximate solution of Problem (35) with precision 𝜖. We next show that it is possible to choose 𝜖 with a polynomial number of bits, in such a way that this approximate solution allows us to identify an optimal policy. We shall actually prove a version of this result in the more general two-player case. This extended version, stated as Corollary 29, will apply both to the Despot-free case, and to the case of entropy games with a fixed number of significant states, see Section 7.6. To prove it, we rely on separation bounds for algebraic numbers.

Theorem 28

[41] Let p be a univariate polynomial of degree n with integer coefficients, possibly with multiple roots. Let S be the sum of the absolute values of its coefficients. Then, the distance between any two distinct roots of p is at least

$$ (2 n^{\frac n2+2}(S+1)^{n})^{-1} \enspace . $$

To any given pair (δ,τ) of policies, one can associate a directed sub-graph G(δ,τ) of G, obtained by erasing, for all d ∈ D, every arc in {(d,t) ∈ E} except (d,δ(d)), and similarly, by erasing, for all t ∈ T, every arc in {(t,p) ∈ E} except (t,τ(t)). The dynamic programming operator of the game associated to this sub-graph G(δ,τ) coincides with the conjugate \({~}^{\tau}F^{\delta} := \exp\circ{~}^{\tau}f^{\delta}\circ\log\) of (12) and is equal to the linear operator with matrix \({~}^{\tau }\!M^{\delta }\in \mathbb {N}^{D \times D}\):

$$ {~}^{\tau}F^{\delta}:\mathbb{R}^{D}\to\mathbb{R}^{D}, X\mapsto {~}^{\tau}\!M^{\delta} X \enspace, $$
(37)

where the entry (d,d′) of \({~}^{\tau}\!M^{\delta}\) is equal to \(m_{\tau \circ \delta (d),d^{\prime }}\) when (τ∘δ(d),d′) ∈ E and to zero otherwise. Then, the value of the entropy game starting in state d, \(R^{\infty }_{d}(\delta ,\tau )\), coincides with the maximum of the Perron roots (that is, the positive eigenvalues which coincide with the spectral radii) of the submatrices of \({~}^{\tau}\!M^{\delta}\) with nodes in a strongly connected component to which d has access in the graph G(δ,τ). The value of the original entropy game coincides with the Perron root of \({~}^{\tau}\!M^{\delta}\).
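As an illustration, the following sketch (with a hypothetical encoding of the policies and weights) builds the matrix \({~}^{\tau}\!M^{\delta}\) of (37) and reads off the Perron root for this pair of policies.

import numpy as np

# Hypothetical data layout: fixed policies delta: D -> T and tau: T -> P,
# and the weights m_{p,d'} attached to the arcs (p,d').
D = ["d1", "d2"]
delta = {"d1": "t1", "d2": "t2"}
tau = {"t1": "p1", "t2": "p2"}
m = {("p1", "d1"): 1, ("p1", "d2"): 1,
     ("p2", "d1"): 1}

M = np.zeros((len(D), len(D)))
for i, d in enumerate(D):
    p = tau[delta[d]]                 # People state reached through delta, tau
    for j, d2 in enumerate(D):
        M[i, j] = m.get((p, d2), 0)   # entry (d,d') of tau_M_delta

print("Perron root of tau_M_delta:", max(abs(np.linalg.eigvals(M))))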

Corollary 29

There exists a rational function \((n,W)\mapsto \eta_{\text{sep}}(n,W) > 0\) such that for every two different pairs (δ,τ) and (δ′,τ′) which yield different values of the entropy game, these two values differ by at least \(\eta_{\text{sep}}(n,W)\), and

$$ \eta_{\text{sep}}(n,W) \geqslant \exp(-\text{poly}(n+\log W)) $$

where the polynomial inside the exponential is independent of the input.

Proof

Given a pair (δ,τ) of policies, the values of the entropy game are eigenvalues of the matrix \(A={~}^{\tau }M^{\delta }\in \mathbb {N}^{D \times D}\). Observe that the entries of A are integers bounded by W. Let fA be the characteristic polynomial of A. The coefficient of the monomial of degree n − k in fA is, up to sign, the sum of the \({C_{n}^{k}}\) principal minors of A of size k. By Hadamard's inequality, the absolute value of each of these minors is at most \((\sqrt {n}W)^{k}\), and so, every coefficient of fA has an absolute value bounded by \({C_{n}^{k}}(\sqrt {n}W)^{k}\) and their sum is \(\leqslant (2\sqrt {n}W)^{n}\). Two different pairs of strategies yield two characteristic polynomials, fA and fB, whose product is of degree 2n and whose sum of absolute values of coefficients is bounded by the product of such bounds for fA and fB, so by \((2\sqrt {n}W)^{2n}\). Therefore, the size S appearing in Theorem 28 is bounded by \((2\sqrt {n}W)^{2n}\). We deduce that if the two pairs of strategies yield distinct values, the distance between these values is at least

$$ \eta_{\text{sep}}(n,W) :=(2 (2n)^{n+2}((2\lceil \sqrt{n} \rceil W)^{2n}+1)^{2n})^{-1} \enspace . $$

This number is rational and satisfies

$$ \eta_{\text{sep}}(n,W) \geqslant \exp(-\text{poly}(n+\log W)), $$

for some polynomial function poly. Since the above lower bound is true for every two pairs of different policies (δ,τ) and (δ′,τ′), we obtain the result. □

Hence, if two policies of Tribune yield different values λ and λ′, then |λ − λ′| is bounded below by the rational number \(\eta_{\text{sep}} > 0\), whose number of bits is polynomially bounded in the size of the input.

7.5 Synthesis of an Optimal Strategy of Tribune from an Approximate Solution of the Convex Program in Proposition 22

To any policy τ of Tribune, we associate a dynamic programming operator \({~}^{\tau}F\), which is the specialization of the map \({~}^{\tau}F^{\delta}\) of the previous section to the case where δ = σ. This is the self-map of \(\mathbb {R}^{D}\) defined by

$$ {~}^{\tau}F_{d}(X) = \sum\limits_{(\tau(\sigma(d)),d^{\prime}) \in E} m_{\tau(\sigma(d))d^{\prime}}X_{d^{\prime}} \enspace. $$

So \({~}^{\tau}F(X) = {~}^{\tau}MX\), where \({~}^{\tau}M = {~}^{\tau}M^{\sigma}\) is the |D|×|D| matrix with nonnegative entries equal to \(m_{\tau (\sigma (d))d^{\prime }}\) when (τ(σ(d)),d′) ∈ E and zero otherwise.

7.5.1 The Simpler Situation in Which Every Policy τ of Tribune Yields an Irreducible Matrix

To explain our method, we first make the restrictive assumption that for every policy τ of Tribune, the matrix \({~}^{\tau}M\) is irreducible. In particular, we can take an optimal policy τ∗. By a standard result of Perron-Frobenius theory [12], \({~}^{\tau ^{*}}\!M\) has a left eigenvector π with positive entries, associated to the spectral radius \(\lambda ^{\tau ^{*}}:=\rho ({~}^{\tau ^{*}}\!M)\), called Perron root. Hence, \(\pi {~}^{\tau ^{*}}\!M=\lambda ^{\tau ^{*}}\pi \). Since τ∗ is optimal, \(\lambda ^{\tau ^{*}}=\lambda ^{*}\), where λ∗ is the value of the entropy game starting from any node d ∈ D, see Lemma 17. Moreover, by applying Lemma 21 to the linear map \(U \mapsto ({~}^{\tau ^{*}}M)^T U\), where T denotes the transposition, we deduce that \(\pi _d/\pi _{d^{\prime }}\leqslant (nW)^{n-1}\).

For any rational number 𝜖 > 0, the ellipsoid algorithm, applied to the optimization problem of Proposition 22, yields in polynomial time a vector u and a scalar μ such that μ ≤ log λ∗ + 𝜖 and \(d_2((u,\mu ),\mathscr{K})\leqslant \epsilon \). So there exists \((\tilde u,\tilde \mu )\in \mathscr{K}\) such that \(\|u-\tilde {u}\|_{\infty } \leqslant \epsilon \) and \(|\mu -\tilde \mu |\leqslant \epsilon \). Since \((\tilde u,\tilde \mu )\in \mathscr{K}\), and log λ∗ is the value of (35), we deduce that \(\log \lambda ^{*} \leqslant \tilde \mu \leqslant \mu +\epsilon \), so λ∗exp(−𝜖) ≤ exp(μ) ≤ λ∗exp(𝜖). Using (34), and assuming that 𝜖 ≤ 1, we deduce that \(u_d-u_{d^{\prime }}\leqslant (n-1)\lceil \log (nW) \rceil +2\epsilon \leqslant (n-1)(\log (nW)+1)+2\), for all d,d′ ∈ D. Using the nonexpansivity of f, we also obtain that \(f(u)\leqslant f(\tilde {u})+\epsilon {e} \leqslant \tilde u +(\tilde \mu +\epsilon ){e} \leqslant (\mu +3\epsilon ){e} + u \leqslant (\log \lambda ^{*} +4\epsilon ){e} + u\). Taking U := (U_d)_{d∈D} with U_d := exp(u_d), we get F(U) ≤ λ∗exp(4𝜖)U and \(U_d/U_{d^{\prime }}\leqslant (enW)^{n-1}e^2\).

We choose any policy \(\underline {\tau }\) such that \(F(U)={~}^{\underline {\tau }}\!MU\). Therefore,

$$ \underline{\tau}(\sigma(d)) \in \mathop{\text{argmax}}_{\tau \in \mathcal{P}_{T}} \sum\limits_{(\tau(\sigma(d)),d^{\prime}) \in E} m_{\tau(\sigma(d))d^{\prime}}U_{d^{\prime}} \enspace. $$

We claim that \(\underline {\tau }\) is optimal if 𝜖 is sufficiently small.

To show the latter claim, we observe that \({~}^{\tau ^{*}}\!M U \leqslant F(U)\). For all d ∈ D,

$$ \begin{array}{@{}rcl@{}} 0 &\leqslant& \pi_{d} (\lambda^{*}\exp(4\epsilon)U_{d} - F_{d}(U))\leqslant \pi_{d} (\lambda^{*}\exp(4\epsilon)U_{d} - ({~}^{\tau^{*}}\!MU)_{d})\\ &\leqslant& \sum\limits_{d^{\prime}\in D} \pi_{d^{\prime}} (\lambda^{*}\exp(4\epsilon) U_{d^{\prime}} - ({~}^{\tau^{*}}\!MU)_{d^{\prime}}) \\ &=& \pi (\lambda^{*}\exp(4\epsilon) U - {~}^{\tau^{*}}\!MU) \\ &=& \lambda^{*}(\exp(4\epsilon)-1) \pi U\enspace. \end{array} $$
(38)

Using \(\pi _d/\pi _{d^{\prime }}\leqslant (nW)^{n-1}\) and \(U_d/U_{d^{\prime }}\leqslant (enW)^{n-1}e^2\), we deduce that \(\pi U\leqslant \pi_{d}U_{d}(1 + (n-1)(enW)^{2(n-1)}e)\), so \(F(U) \geqslant \underline {\lambda } U\), where \(\underline {\lambda }:= \lambda ^{*}[1- (\exp (4\epsilon ) -1) (n-1) e (enW)^{2(n-1)}]\). In view of the formula of \(\underline {\lambda }\), we can choose 𝜖 > 0, with a polynomially bounded number of bits, such that \(\underline {\lambda }> \lambda ^{*} - \eta_{\text{sep}}\). Since \({~}^{\underline {\tau }}MU\geqslant \underline {\lambda } U\), we have \(\rho ({~}^{\underline {\tau }}M)\geqslant \underline {\lambda }\) and so \(\rho ({~}^{\underline {\tau }}M)> \lambda ^{*} - \eta _{\text {sep}}\). Since λ∗ is the maximum of the values of all the policies, \(\rho ({~}^{\underline {\tau }}M)\leqslant \lambda ^{*}\). By definition of the separation parameter \(\eta_{\text{sep}}\) given in Corollary 29, this implies that \(\rho ({~}^{\underline {\tau }}M)=\lambda ^{*}\), and so the policy \(\underline {\tau }\) of Tribune which we just constructed is optimal, showing the claim.

In the preceding argument, the computation (38) may look a bit magical at first sight; it should become intuitive if one interprets it as an approximate complementary slackness condition for the semi-infinite program of Proposition 22, the invariant measure π playing the role of a Lagrange multiplier.

When some policies τ yield a reducible matrix \({~}^{\tau}M\), the synthesis of the optimal policy \(\underline {\tau }\) still exploits the same idea, with an additional technicality, since we can only guarantee that the inequality \(F_d(U) \geqslant \underline {\lambda } U_d\) is valid for every state d such that \(\pi_{d} > 0\). We explain the more technical argument in the next section.

7.5.2 Synthesis of an Optimal Strategy of Tribune, in General

Recall that if M is a reducible nonnegative matrix, a class of M is a strongly connected component of the directed graph of M, and that this class is basic if the B × B submatrix of M, denoted by \(M_{BB}\), has Perron root ρ(M). It is known [12] that M always has a basic class, and a nonnegative left eigenvector associated with ρ(M). Moreover, choosing a basic class B which is final among the basic classes of M, that is, such that the set S of nodes d ∈ D that are reachable in the directed graph of M starting from some node in the basic class B does not contain any node of another basic class, there exists a nonnegative left eigenvector π whose support \(\{d \mid \pi_{d}\neq 0\}\) coincides with S. We shall assume that π, S and B satisfy these properties, for \(M={~}^{\tau ^{*}}\!M\) corresponding to an optimal policy τ∗. We set N := D ∖ B, and for any D × D matrix M, any vector \(v\in \mathbb {R}^D\), and any subsets F and G of D, we denote by \(M_{FG}\) the F × G submatrix of M and by \(v_{F}\) the vector of \(\mathbb {R}^F\) given by \(v_{F} := (v_{d})_{d\in F}\). Since B is a basic class and B has access to any element of S, we get that no element of S ∖ B has access to an element of B, and since π equals zero outside S, we get that the restriction of \(\pi {~}^{\tau ^{*}}\!M\) to B equals \(\pi _B {~}^{\tau ^{*}}\!M_{BB}\) and so \(\pi _B {~}^{\tau ^{*}}\!M_{BB}= \lambda ^{*} \pi _B \). The same computation as in Section 7.5.1 restricted to the elements d ∈ B now gives

$$ \begin{array}{@{}rcl@{}} 0 &\leqslant& \pi_{d} (\lambda^{*}\exp(4\epsilon)U_{d} - F_{d}(U)) \leqslant \pi_{d} (\lambda^{*}\exp(4\epsilon)U_{d} - ({~}^{\tau^{*}}\!MU)_{d})\\ & \leqslant& \pi_{B}(\lambda^{*}\exp(4\epsilon)U_{B} - ({~}^{\tau^{*}}\!M_{BB}U_{B}+{~}^{\tau^{*}}\!M_{BN}U_{N}))\\ &=& \lambda^{*}(\exp(4\epsilon)-1)\pi_{B} U_{B} - \pi_{B}{~}^{\tau^{*}}\!M_{BN}U_{N} \enspace . \end{array} $$

The bounds on U obtained in Section 7.5.1 are still valid, and the ones on \(\pi _d/\pi _{d^{\prime }}\) are valid only for d,d′ ∈ B, using \(\pi _B {~}^{\tau ^{*}}\!M_{BB}=\lambda ^{*} \pi _B \) and the irreducibility of \({~}^{\tau ^{*}}\!M_{BB}\). We deduce that \(F_d(U) \geqslant \underline {\lambda } U_d\) for all d ∈ B, for the same \( \underline {\lambda } \) as in Section 7.5.1. Moreover, \(\pi _B {~}^{\tau ^{*}}\!M_{BN}U_N \leqslant \lambda ^{*}(\exp (4\epsilon )-1)\pi _BU_B\). We define \(\epsilon^{\prime} := \lambda^{*}(\exp(4\epsilon) - 1)en(enW)^{2(n-1)}\), so that \({~}^{\tau ^{*}}\!M_{dN}U_N\leqslant \epsilon ^{\prime } U_d\) for all d ∈ B, where \({~}^{\tau ^{*}}\!M_{dN}\) is the d-th row of the matrix \({~}^{\tau ^{*}}\!M_{BN}\).

We first choose any policy \(\underline {\tau }\) and a set B′ such that \({~}^{\underline {\tau }}\!M_{dD} U \geqslant \underline {\lambda } U_d\) and \({~}^{\underline {\tau }}\!M_{dN^{\prime }}U_{N^{\prime }} \leqslant \epsilon ^{\prime } U_d\), for all d ∈ B′, with N′ = D ∖ B′. We know from the above analysis that there is always at least one policy \(\underline {\tau }\) and set B′ with this property (namely τ∗ and B′ = B). Moreover, such a policy and set can be obtained by the following algorithm. Indeed, let us start from any policy \(\underline {\tau }\) such that \(F(U)={~}^{\underline {\tau }}\!MU\), and choose B′ as the set of d ∈ D such that \({~}^{\underline {\tau }}\!M_{dD} U \geqslant \underline {\lambda } U_d\). Then, B ⊆ B′ since \({~}^{\underline {\tau }}\!M_{dD} U=F_d(U)\). At each step of the algorithm, one applies the following operations to each d ∈ B′ in some order: set N′ = D ∖ B′ and check whether \({~}^{\underline {\tau }}\!M_{dN^{\prime}}U_{N^{\prime}} \leqslant \epsilon ^{\prime } U_d\). If this does not hold, change \(\underline{\tau}(\sigma(d))\) to any action so that \({~}^{\underline {\tau }}\!M_{dD} U \geqslant \underline {\lambda } U_d\) and \({~}^{\underline {\tau }}\!M_{dN^{\prime}}U_{N^{\prime}} \leqslant \epsilon ^{\prime } U_d\). If this is impossible, then eliminate d from B′ and continue. Then stop at any step at which B′ does not change. Since the cardinality of B′ decreases by at least one at each step at which one does not stop, and B ⊆ B′, the algorithm stops after at most n iterations and needs at most n² products of a matrix by a vector, so it takes polynomial time. Moreover, at each step, and so at the end of the algorithm, we have B ⊆ B′ and N′ ⊆ N.

We deduce that \({~}^{\underline {\tau }}\!M_{B^{\prime }B^{\prime }}U_{B^{\prime }} \geqslant (\underline {\lambda }-\epsilon ^{\prime })U_{B^{\prime }}\), showing that \(\rho ({~}^{\underline {\tau }}\!M)\geqslant \rho ({~}^{\underline {\tau }}\!M_{B^{\prime }B^{\prime }}) \geqslant \underline {\lambda }-\epsilon ^{\prime }\). In view of the formulæ of \(\underline {\lambda }\) and \(\epsilon^{\prime}\), we can always choose 𝜖 > 0, with a polynomially bounded number of bits, such that \(\underline {\lambda }-\epsilon ^{\prime }> \lambda ^{*}-\eta _{\text {sep}}\). Hence, \(\rho ({~}^{\underline {\tau }}\!M)=\rho ({~}^{\underline {\tau }}\!M_{B^{\prime }B^{\prime }})= \lambda ^{*}\), since \(\rho ({~}^{\underline {\tau }}\!M)\) is an eigenvalue of \({~}^{\underline {\tau }}\!M\). We also deduce from \({~}^{\underline {\tau }}\!M_{B^{\prime }B^{\prime }}U_{B^{\prime }} \geqslant (\underline {\lambda }-\epsilon ^{\prime })U_{B^{\prime }}\) that every state d ∈ B′ has value λ∗. Finally, since the game is irreducible, we can always replace \(\underline {\tau }(\sigma (d))\) for d ∉ B′ to make B′ accessible from any initial state, so that the policy \(\underline {\tau }\) is optimal. This concludes the proof of Theorem 15.

7.6 Derivation of Theorem 16 from Theorem 15

By Corollary 9,

$$ V^{\infty}_{d} = \min_{\delta \in \mathcal{P}_{D}} V^{\infty}_{d}(\delta,\ast) $$

Observe that \(|\mathcal {P}_D|\leqslant |E|^s\), where s is the number of significant states of Despot; hence, if s is fixed, this minimum involves a polynomial number of terms. Thanks to the separation bound given in Corollary 29, it suffices to compute an approximation of each \(V^{\infty}_{d}(\delta,\ast)\) with some precision 𝜖 > 0 such that 〈𝜖〉 is polynomially bounded in the size of the input, to make sure that a policy δ achieving the minimum in the above expression, in which every term \(V^{\infty}_{d}(\delta,\ast)\) is replaced by its approximate value, is an optimal policy.
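The resulting enumeration can be phrased as a short loop; in the sketch below, solve_despot_free is a placeholder for any solver of the Despot-free game with Despot's choices frozen (for instance, the method of Section 7), and the data layout is hypothetical.

from itertools import product

def solve_with_few_significant_states(significant_actions, solve_despot_free):
    """Sketch of the enumeration behind Theorem 16 (all names hypothetical).

    significant_actions[d]: Despot's possible choices at each significant state d
    (there are only s of them, with s fixed); solve_despot_free(choice) is assumed
    to return the vector of values of the Despot-free game obtained by freezing
    Despot's choices.
    """
    states = list(significant_actions)
    best = None
    for choice in product(*(significant_actions[d] for d in states)):
        values = solve_despot_free(dict(zip(states, choice)))
        best = list(values) if best is None else [min(v, b) for v, b in zip(values, best)]
    return best   # componentwise minimum, as in V_d = min_delta V_d(delta, *)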

8 Multiplicative Policy Iteration Algorithm and Comparison with the Spectral Simplex Method of Protasov

We now consider the question of solving entropy games in practice.

8.1 Algorithms

The equivalence between entropy games and some special class of stochastic mean payoff games, through logarithmic glasses (see Section 3), will allow us to adapt classical algorithms for one or two player zero-sum games, such as the value iteration and the policy iteration algorithms. We next present a multiplicative version of the policy iteration algorithm, which is obtained by combining policy iteration ideas for two player games by Hoffman and Karp [23] with “multiplicative” policy iteration techniques of Howard and Matheson [24], Rothblum [40] and Sladky [43]. The latter “multiplicative” policy iteration techniques apply to the Despot-free case. For clarity, we shall explain first policy iteration in the special Despot-free case: this is more transparent, and this will also allow us to interpret Protasov's spectral simplex method [38] as a variant of policy iteration. The newer part here is the two player case, which is dealt with in Section 9.

We assume that D = T = {1,…,|T|} and σ is the identity in (32). Let \({~}^{\tau}F\) and \({~}^{\tau}M\), \(\tau \in \mathcal {P}_{T}\), be defined as in the previous section. If \({~}^{\tau}M\) is irreducible, in particular if all its entries are positive, \({~}^{\tau}M\) has an eigenvector \(X^{\tau} > 0\), associated to the Perron root \(\lambda^{\tau} := \rho({~}^{\tau}M)\). Moreover, \(X^{\tau}\) is unique up to a multiplicative constant and is called a Perron eigenvector. If all the matrices \({~}^{\tau}M\), \(\tau \in \mathcal {P}_{T}\), are irreducible, one can construct a multiplicative version of policy iteration, Algorithm 1.

[Algorithm 1: multiplicative policy iteration (pseudocode figure not reproduced).]
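Since the listing of Algorithm 1 is only available as a figure, here is a minimal floating-point sketch of the multiplicative policy iteration it describes, for the Despot-free (maximizing) case; the data layout (a list of candidate weight rows per state) is hypothetical, and the exact arithmetic assumed in Proposition 30 is replaced by numpy eigensolves.

import numpy as np

def multiplicative_policy_iteration(actions, n, tol=1e-12):
    """Floating-point sketch of Algorithm 1 (Despot-free, maximizing player).

    actions[t]: list of candidate rows (m_{p,d'})_{d'} available at state t;
    every policy matrix is assumed irreducible, as in Proposition 30.
    """
    tau = [0] * n                                        # initial policy: first action
    while True:
        M = np.array([actions[t][tau[t]] for t in range(n)], dtype=float)
        vals, vecs = np.linalg.eig(M)
        k = int(np.argmax(vals.real))
        lam, X = vals[k].real, np.abs(vecs[:, k].real)   # Perron root and eigenvector
        new_tau = []
        for t in range(n):                               # policy improvement step
            scores = [np.dot(row, X) for row in actions[t]]
            best = int(np.argmax(scores))
            # keep the current action unless another row is strictly better at X
            new_tau.append(best if scores[best] > scores[tau[t]] + tol else tau[t])
        if new_tau == tau:
            return lam, tau                              # F(X) = lam X: value reached
        tau = new_tau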

The following result shows that Algorithm 1 does terminate. The proof relies on a multiplicative version of the classical strict monotonicity argument in policy iteration, which was already used by Howard and Matheson [24], Rothblum [40] and Sladky [43]. We reproduce the short proof for completeness.

Proposition 30

Consider Algorithm 1, where the computations are performed in exact arithmetic and all the matrices \({~}^{\tau}M\), \(\tau \in \mathcal {P}_{T}\), are supposed to be irreducible. Then, the sequence \(\lambda ^{\tau ^k}\) is increasing as long as \(\tau^{k}\neq \tau^{k-1}\). Moreover, the algorithm ends after a finite number of iterations k, and \(\lambda ^{\tau ^k}\) is the value of the game (at any initial state).

Proof

The property that \(\lambda ^{\tau ^k}\) is increasing uses the general property that for an irreducible nonnegative matrix M and a positive vector u, \(Mu\geqslant \lambda u\) (component-wise) together with \(Mu\neq \lambda u\) implies ρ(M) > λ. Then, the algorithm terminates since the set of policies τ is finite, and so \(\{ \lambda ^{\tau } \mid \tau \in \mathcal {P}_{T} \}\) is finite. When \(\tau^{k} = \tau^{k-1}\), we get \(F(X^{\tau ^{k}})=\lambda ^{\tau ^k} X^{\tau ^{k}}\), and Lemma 17 shows that \(\lambda ^{\tau ^k}\) is the value of the game (at any initial state). □

Algorithm 1 has a dual version, in which maximization is replaced by minimization, in order to solve the Tribune-free setting of entropy games. For this dual version of Algorithm 1, the sequence \(\lambda ^{\tau ^k}\) is decreasing as long as \(\tau^{k}\neq \tau^{k-1}\). This uses the property (dual to the previous one) that for an irreducible nonnegative matrix M and a positive vector u, \(Mu\leqslant \lambda u\) together with \(Mu\neq \lambda u\) implies ρ(M) < λ. Then the algorithm terminates as for the primal version.

In practice, Algorithm 1 can only be implemented in an approximate way. A bottleneck in this algorithm is the computation of the Perron root and Perron eigenvector. These can be computed by standard double precision algorithms, like the QR method. The latter method requires O(|D|³) flops, where |D| is the size of the matrix \({~}^{\tau}M\) associated to a policy τ. (Note that such complexity estimates in an informal "floating point" arithmetic model are meaningful only for well conditioned instances, in contrast with the Turing-model complexity estimates that we derived unconditionally in Section 7.) One may also use a more scalable algorithm, like the power algorithm. In fact, we shall see in Algorithm 5 that the power idea can be applied directly and in a simpler way to solve the non-linear eigenvalue equation, avoiding the recourse to policy iteration. So it is not clear that the power algorithm is the best choice to compute the eigenpair in situations in which Algorithm 1 is competitive. In the experiments which follow, we used the QR method in Algorithm 1.

In [38], Protasov introduced the Spectral Simplex Algorithm. His algorithm is a variant of Algorithm 1 in which, at every iteration, the policy is improved at only one state, namely the first state t such that \(F_t(X^{\tau ^k})> \lambda ^{\tau ^k}X^{\tau ^k}_t\). We shall also consider another version of Algorithm 1, in which we also change the policy at only one state t, but we choose it in order to maximize the expression \(F_t(X^{\tau ^k})- \lambda ^{\tau ^k}X^{\tau ^k}_t\). We shall refer to this algorithm as “Spectral Simplex-D”, since this is analogous to Dantzig's pivot rule in the original simplex method [49].

The Spectral Simplex Algorithm introduced by Protasov in [38] is described in the Despot-free setting in Algorithm 2.

[Algorithm 2: Protasov's spectral simplex algorithm, Despot-free setting (pseudocode figure not reproduced).]

The Spectral Simplex Algorithm with Dantzig update is shown in Algorithm 3, again in the Despot-free setting.

[Algorithm 3: spectral simplex with Dantzig update, Despot-free setting (pseudocode figure not reproduced).]
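Here is a common floating-point sketch of the single-state update rule behind Algorithms 2 and 3, with the same hypothetical data layout as in the Algorithm 1 sketch above; the flag dantzig selects either Protasov's rule or the Dantzig-like rule.

import numpy as np

def spectral_simplex_step(actions, tau, dantzig=False):
    """One single-state update of Algorithms 2 and 3 (floating-point sketch).

    actions[t] is the list of candidate rows at state t, tau the current policy
    (row indices).  Returns (new policy, current Perron root, done flag).
    """
    n = len(tau)
    M = np.array([actions[t][tau[t]] for t in range(n)], dtype=float)
    vals, vecs = np.linalg.eig(M)
    k = int(np.argmax(vals.real))
    lam, X = vals[k].real, np.abs(vecs[:, k].real)

    F = np.array([max(np.dot(row, X) for row in actions[t]) for t in range(n)])
    slack = F - lam * X
    candidates = np.flatnonzero(slack > 1e-12)
    if candidates.size == 0:
        return tau, lam, True                            # F(X) = lam X: optimal policy
    # Protasov's rule: first improvable state; Dantzig-like rule: largest slack
    t = int(candidates[np.argmax(slack[candidates])]) if dantzig else int(candidates[0])
    new_tau = list(tau)
    new_tau[t] = int(np.argmax([np.dot(row, X) for row in actions[t]]))
    return new_tau, lam, False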

One can also adapt the Algorithms 2 and 3 to the Tribune-free setting, again using minimization instead of maximization and replacing τF by Fδ.

8.2 Numerical Experiments

We next report numerical experiments in the Despot-free case, in order to compare Protasov's spectral simplex algorithm (with and without the improvement of Dantzig's pivot rule) with the multiplicative policy iteration algorithm (Algorithm 1). In the log-log Figs. 1 and 2, these algorithms are respectively named “Policy Iteration”, “Spectral Simplex” and “Spectral Simplex-D”.

Fig. 1 Performance of Algorithms 1, 2, and 3 for different n = 10,...,500 (plot not reproduced)

Fig. 2 Performance of Algorithms 1, 2, and 3 for different m = 10,...,500 (plot not reproduced)

We constructed random Despot-free instances in which D = T has cardinality n, and every coordinate of the operator is of the form \(F_t(X)= \max _{1\leqslant p \leqslant {m}} {\sum }_{t^{\prime }}A^p_{tt^{\prime }} X_{t^{\prime }}\), where \((A^p_{tt^{\prime }})\) is a 3-dimensional tensor whose entries are independent random variables drawn with the uniform law on {1,…,15}. Remember that m is an integer which represents the number of possible different actions per state t. All the results shown in the figures are averages over 30 simulations; they concern the situation in which one of the two parameters m, n is kept constant while the other one increases. The time is given in seconds. The computations were performed in Matlab R2016a, on an Intel(R) Core(TM) i7-6500 CPU @ 2.59GHz processor with 12.0 GB of RAM.

In both figures, Spectral Simplex-D appears to be more efficient than the Spectral Simplex algorithm with its original rule. However, both algorithms are experimentally outperformed by policy iteration, by one to two orders of magnitude, as n grows, whereas as m grows (for a constant parameter n), the performances of the three algorithms seem to deteriorate at the same rate.

9 Two-player Entropy Games: Multiplicative Policy Iteration and Power Algorithm

Let us now consider the general two-player case. For \(\delta \in \mathcal {P}_D\) and \(\tau \in \mathcal {P}_T\), let Fδ (resp. τFδ) be the dynamic programming operator of the game in which the strategy of Despot, δ, is fixed (resp. the strategies of Despot and of Tribune, δ and τ are fixed), see (22) (resp. (37)). We assume here that the matrices τMδ of the linear operators τFδ, \(\delta \in \mathcal {P}_D, \tau \in \mathcal {P}_{T}\), see (37), are irreducible.

Then, the Hoffman-Karp’s idea [23] is readily adapted to the multiplicative setting: in the following algorithm, a sequence δk is constructed in a similar way as τk in the dual version of Algorithm 1, except that in Step 3, \(\lambda ^{\delta ^{k}}\) and \(X^{\delta ^{k}}\) are computed by applying Algorithm 1 to the dynamic programming operator \(F^{\delta ^{k}}\) in which the strategy of Despot is fixed to δk. We call this the multiplicative Hoffman-Karp algorithm. ta It can also be viewed as an “exact” version of the policy iteration algorithm of Hoffman and Karp [23] for f.

A proof similar to that of Proposition 30 shows that Algorithm 4, implemented in exact arithmetic, terminates and is correct under the previous assumption that for any pair of policies of the two players, the associated transition matrix is irreducible. Indeed, as for the dual version of Algorithm 1, \(\lambda ^{\delta ^k}\) is decreasing as long as \(\delta^{k}\neq \delta^{k-1}\). Then, since again the set of policies of Despot is finite, the algorithm ends.

[Algorithm 4: multiplicative Hoffman-Karp algorithm (pseudocode figure not reproduced).]
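Here is a floating-point sketch of Algorithm 4, reusing the multiplicative_policy_iteration sketch given after Algorithm 1 for the inner solve; the data layout is hypothetical and every policy matrix is assumed irreducible.

import numpy as np

def multiplicative_hoffman_karp(despot_actions, tribune_actions, n, tol=1e-12):
    """Floating-point sketch of Algorithm 4 (multiplicative Hoffman-Karp).

    despot_actions[d]:  list of Tribune states t with (d,t) in E;
    tribune_actions[t]: list of candidate rows (m_{p,d'})_{d'} at Tribune state t.
    """
    delta = [acts[0] for acts in despot_actions]          # initial Despot policy
    while True:
        fixed = [tribune_actions[delta[d]] for d in range(n)]
        lam, tau = multiplicative_policy_iteration(fixed, n)      # solve F^delta
        M = np.array([fixed[d][tau[d]] for d in range(n)], dtype=float)
        vals, vecs = np.linalg.eig(M)
        X = np.abs(vecs[:, int(np.argmax(vals.real))].real)       # eigenvector of F^delta
        new_delta = []
        for d in range(n):                                # Despot improvement (minimization)
            scores = [max(np.dot(row, X) for row in tribune_actions[t])
                      for t in despot_actions[d]]
            best = int(np.argmin(scores))
            cur = despot_actions[d].index(delta[d])
            new_delta.append(despot_actions[d][best]
                             if scores[best] < scores[cur] - tol else delta[d])
        if new_delta == delta:
            return lam, delta, tau
        delta = new_delta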

To have an additional point of comparison, we used a power-type algorithm, more precisely the projective version of the Krasnoselski-Mann iteration proposed in [21]. The original Krasnoselski-Mann iteration was developed in [29, 33]; its analysis was extended and refined by Ishikawa [26] and Baillon and Bruck [8]. Recall that

$$F_{d}(X) = \min_{(d,t)\in E} \max_{(t,p)\in E} \sum\limits_{(p,d^{\prime}) \in E} m_{pd^{\prime}}X_{d^{\prime}}\enspace ,$$

and let us define

$$ H_{d}(X) = (X_{d}G_{d}(X))^{1/2}, \text{ where } G_{d}(X)=\frac{F_{d}(X)}{({\prod}_{d^{\prime}\in D}F_{d^{\prime}}(X))^{1/|D|}}.$$

Every fixed point of H is an eigenvector \(X \in \mathbb{R}_{+}^{D}\) of F such that \({\prod}_{d^{\prime}\in D}X_{d^{\prime}}=1\). Indeed,

$$ \begin{array}{@{}rcl@{}} H(X)=X & \iff& (X_{d}G_{d}(X))^{1/2}=X_{d}, \forall d \in D \\ & \iff& G_{d}(X)=X_{d} , \forall d \in D \\ & \iff& \frac{F_{d}(X)}{({\prod}_{d^{\prime}}F_{d^{\prime}}(X))^{1/|D|}} = X_{d}, \forall d \in D \\ &\iff& F_{d}(X) = \lambda X_{d},\forall d \in D, \text{ and } \prod\limits_{d^{\prime}\in D}X_{d^{\prime}}=1 \enspace, \end{array} $$

for some λ > 0 (since then \(\lambda = ({\prod }_{d^{\prime }}F_{d^{\prime }}(X))^{1/|D|}\)). It can be shown, as a corollary of a general result of Ishikawa [26] concerning nonexpansive mappings in Banach spaces, that Algorithm 5 does converge if F has an eigenvector in the interior of the cone. We refer the reader to [21] for more details on the analysis of the projective Krasnoselski-Mann iteration. We use Hilbert's projective metric \(d_{H}(X,Y) = \|\log(X) - \log(Y)\|_{H}\) (with \(\|X\|_{H} = \max_{d}X_{d} - \min_{d}X_{d}\)) to test the approximate termination.

[Algorithm 5: projective Krasnoselski-Mann iteration (pseudocode figure not reproduced).]
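Here is a floating-point sketch of Algorithm 5; the operator F is passed as a callable (for instance a closure over hypothetical game data, as indicated in the final comment), and the termination test uses Hilbert's projective metric as described above.

import numpy as np

def projective_krasnoselski_mann(F, X0, tol=1e-10, max_iter=100000):
    """Sketch of Algorithm 5: projective Krasnoselski-Mann iteration.

    F: callable evaluating the two-player dynamic programming operator on a
    positive vector; X0: positive starting point.  Each step applies the map H,
    i.e. the coordinatewise geometric mean of X and the normalized image G(X).
    """
    def hilbert(x, y):
        z = np.log(x) - np.log(y)
        return z.max() - z.min()

    X = np.asarray(X0, dtype=float)
    for _ in range(max_iter):
        FX = F(X)
        G = FX / np.exp(np.mean(np.log(FX)))    # G_d(X) = F_d(X) / (prod_d' F_d'(X))^{1/|D|}
        X_new = np.sqrt(X * G)                  # H_d(X) = (X_d G_d(X))^{1/2}
        if hilbert(X_new, X) < tol:
            lam = np.exp(np.mean(np.log(F(X_new) / X_new)))   # approximate eigenvalue
            return lam, X_new
        X = X_new
    raise RuntimeError("no convergence within max_iter")

# F can be supplied as a closure over (hypothetical) game data, e.g.
# F = lambda X: np.array([min(max(np.dot(row, X) for row in tribune_actions[t])
#                             for t in despot_actions[d]) for d in range(n)])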

The log-log Figs. 3 and 4 show the performances of the power algorithm and the two-player policy iteration algorithm. The computations were performed on the same computer as in the previous section; the time is still given in seconds. The policy iteration algorithm outperforms the power algorithm asymptotically for a large number of actions m, whereas for a large number of states n, the power algorithm is more efficient. The experimentally observed efficiency of the (naive) power algorithm actually reveals that the random instances we considered are relatively “easy”.

Fig. 3 Performance of Algorithms 4 and 5 for n = 10,...,2000 (plot not reproduced)

Fig. 4 Performance of Algorithms 4 and 5 for m = 10,...,200 (plot not reproduced)

10 Concluding Remarks

We developed an operator approach for entropy games, relating them to risk-sensitive control via non-linear Perron-Frobenius theory. This leads to a theoretical result (polynomial time solvability of the Despot-free case), and this allows one to adapt policy iteration to these games. Several issues concerning policy iteration in the spectral setting remain unsolved. A first issue is to understand what kind of approximate eigenvalue algorithms are best suited. A second issue is to identify significant classes of entropy games on which the Hoffman-Karp type policy iteration algorithm can be shown to run in polynomial time (compare with [25, 49] in the case of Markov decision processes). In view of the asymmetry between Despot and Tribune, one may expect that Tribune-free entropy games are at least as hard as deterministic mean payoff games; it would be interesting to confirm that this is the case.