Introduction

In a Nash equilibrium, each player’s strategy is optimal with respect to the strategies of other players. Accordingly, Nash equilibrium offers a predictive model of the outcome of a game. That is, given the basic elements of a game – (i) a set of players; (ii) for each player, a set of strategies; and (iii) for each player, a utility function that captures preferences over strategies – one can model/assert that the strategies selected by the players constitute a Nash equilibrium.

In making this assertion, there is no suggestion of how players may come to reach a Nash equilibrium. Two motivating quotations in this regard are:

The attainment of equilibrium requires a disequilibrium process (Arrow 1986).

and

The explanatory significance of the equilibrium concept depends on the underlying dynamics (Skyrms 1992).

These quotations reflect the view that a foundation for Nash equilibrium as a predictive model is a dynamic process that leads to equilibrium. Motivated by these considerations, the topic of “learning in games” shifts the attention away from equilibrium and towards underlying dynamic processes and their long-run behavior. The intent is to understand how players may reach an equilibrium, as well as the possible barriers to reaching Nash equilibrium.

In the setup of learning in games, players repetitively play a game over a sequence of stages. At each stage, players use past experiences/observations to select a strategy for the current stage. Once player strategies are selected, the game is played, information is updated, and the process is repeated. The question is then to understand the long-run behavior, e.g., whether or not player strategies converge to Nash equilibrium.

Traditionally, the dynamic processes considered under learning in games have players selecting strategies based on a myopic desire to optimize for the current stage. That is, players do not consider long-run effects in updating their strategies. Accordingly, while players are engaged in repetitive play, the dynamic processes generally are not optimal in the long run (as they would be in the setting of “repeated games”). Indeed, the survey article of Hart (2005) refers to the dynamic processes of learning in games as “adaptive heuristics.” This distinction is important in that an implicit concern in learning in games is to understand how “low rationality” (i.e., suboptimal and heuristic) processes can lead to the “high rationality” (i.e., mutually optimal) notion of Nash equilibrium.

This entry presents a sampling of results from the learning in games literature through a selection of illustrative dynamic processes, a review of their long-run behaviors relevant to Nash equilibrium, and pointers to further work.

Illustration: Commuting Game

We begin with a description of learning in games in the specific setting of the commuting game, which is a special case of so-called congestion games (cf., Roughgarden 2005). The setup is as follows. Each player seeks to plan a path from an origin to a destination. The origins and destinations can differ from player to player. Players seek to minimize their own travel times. These travel times depend both on the chosen path (distance traveled) and the paths of other players (road congestion). Every day, a player uses past information and observations to select that day’s path according to some selection rule, and this process is repeated day after day.

In game-theoretic terms, player “strategies” are paths linking their origins to destinations, and player “utility functions” reflect travel times. At a Nash equilibrium, players have selected paths such that no individual player can find a shorter travel time given the chosen paths of others. The learning in games question is then whether player paths indeed converge to Nash equilibrium in the long run. Not surprisingly, the answer depends on the specific process that players use to select paths and possible additional structure of the commuting game.

Suppose that one of the players, say “Alice,” is choosing among a collection of paths. For the sake of illustration, let us give Alice the following capabilities: (i) Alice can observe the paths chosen by all other players and (ii) Alice can compute off-line her travel time as a function of her path and the paths of others.

With these capabilities, Alice can compute running averages of the travel times along all available paths. Note that the assumed capabilities allow Alice to compute the travel time of a path and hence its running average, whether or not she took the path on that day. With average travel time values in hand, two possible learning rules are:

  • Exploitation: Choose the path with the lowest average travel time.

  • Exploitation with Exploration: With high probability, choose the path with the lowest average travel time, and with low probability, choose a path at random.

Assuming that all players implement the same learning rule, each case induces a dynamic process that governs the daily selection of paths and determines the resulting long-run behavior. We will revisit these processes in a more formal setting in the next section.
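
As a concrete illustration, the following minimal Python sketch implements the two selection rules above for a single player. The path names, travel-time values, and the `choose_path` helper are hypothetical constructions for illustration only.

```python
import random

def choose_path(avg_times, explore_prob=0.0):
    """Select a path from a dict mapping path name -> running-average travel time.

    explore_prob = 0.0 recovers the pure "Exploitation" rule; a small positive
    value gives "Exploitation with Exploration".
    """
    if random.random() < explore_prob:
        # Exploration: choose a path uniformly at random.
        return random.choice(list(avg_times))
    # Exploitation: choose the path with the lowest average travel time.
    return min(avg_times, key=avg_times.get)

# Hypothetical running averages for Alice (minutes).
alice_avgs = {"highway": 32.5, "back_roads": 28.0, "downtown": 41.2}
print(choose_path(alice_avgs))                    # always "back_roads"
print(choose_path(alice_avgs, explore_prob=0.1))  # usually "back_roads", occasionally random
```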

A noteworthy feature of these learning rules is that they do not explicitly depend on the utility functions of other players. For example, suppose one of the other players is willing to trade off travel time for more scenic routes. Similarly, suppose one of the other players prefers to travel on high-congestion paths, e.g., a rolling billboard seeking to maximize exposure. The aforementioned learning rules for Alice remain unchanged. Of course, Alice’s actions implicitly depend on the utility functions of other players, but only indirectly through their selected paths. This characteristic of no explicit dependence on the utility functions of others is known as “uncoupled” learning, and it can have major implications for the achievable long-run behavior (Hart and Mas-Colell 2003a).

In assuming the ability to observe the paths of other players and to compute off-line travel times as a function of these paths, these learning rules impose severe requirements on the information available to each player. Less restrictive are learning rules that are “payoff based” (Young 2005). A simple modification that leads to payoff-based learning is as follows. Alice maintains an empirical average of the travel times of a path using only the days that she took that path. Note the distinction – on any given day, Alice remains unaware of travel times for the routes not selected. Using these empirical average travel times, Alice can then mimic any of the aforementioned learning rules. As intended, she does not directly observe the paths of others, nor does she have a closed-form expression for travel times as a function of player paths. Rather, she only can select a path and measure the consequences. As before, all players implementing such a learning rule induce a dynamic process, but the ensuing analysis in payoff-based learning can be more subtle.
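
The payoff-based variant changes only how the averages are maintained: Alice updates the estimate of a path only on days she actually travels it. The following bookkeeping sketch is illustrative; the class name and interface are hypothetical.

```python
class PayoffBasedAverages:
    """Running-average travel times updated only from measured consequences."""

    def __init__(self, paths):
        self.totals = {p: 0.0 for p in paths}
        self.counts = {p: 0 for p in paths}

    def record(self, path, observed_time):
        # Only today's chosen path yields an observation; travel times on the
        # other paths remain unknown for today.
        self.totals[path] += observed_time
        self.counts[path] += 1

    def average(self, path):
        # Paths never taken have no estimate yet (returned here as None).
        if self.counts[path] == 0:
            return None
        return self.totals[path] / self.counts[path]
```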

Learning Dynamics

We now give a more formal presentation of selected learning rules and results concerning their long-run behavior.

Preliminaries

We begin with the basic setup of games with a finite set of players, \(\left \{1,2,\ldots,N\right \}\), and for each player i, a finite set of strategies, \(\mathcal{A}_{i}\). Let

$$\displaystyle{\mathcal{A} = \mathcal{A}_{1} \times \ldots \times \mathcal{A}_{N}}$$

denote the set of strategy profiles. Each player, i, is endowed with a utility function

$$\displaystyle{u_{i} : \mathcal{A}\rightarrow \mathbb{R}.}$$

Utility functions capture player preferences over strategy profiles. Accordingly, for any \(a,a\prime \in \mathcal{A}\), the condition

$$\displaystyle{u_{i}(a)> u_{i}(a\prime)}$$

indicates that player i prefers the strategy profile a over a′.

The notation −i indicates the set of players other than player i. Accordingly, we sometimes write \(a \in \mathcal{A}\) as \((a_{i},a_{-i})\) to isolate \(a_{i}\), the strategy of player i, from \(a_{-i}\), the strategies of other players. The notation −i is used in other settings as well.

Utility functions induce best-response sets. For \(a_{-i} \in \mathcal{A}_{-i}\), define

$$\displaystyle\begin{array}{lll} \mathcal{B}_{i}(a_{-i}) =& \left \{a_{i} : u_{i}(a_{i},a_{-i}) \geq u_{i}(a_{i}\prime,a_{-i})\right. {}\\ & \left.\text{ for all }a_{i}\prime \in \mathcal{A}_{i}\right \}. {}\\ \end{array}$$

In words, \(\mathcal{B}_{i}(a_{-i})\) denotes the set of strategies that are optimal for player i in response to the strategies of other players, \(a_{-i}\).

A strategy profile \(a^{{\ast}}\in \mathcal{A}\) is a Nash equilibrium if for any player i and any \(a_{i}\prime \in \mathcal{A}_{i}\),

$$\displaystyle{u_{i}(a_{i}^{{\ast}},a_{ -i}^{{\ast}}) \geq u_{ i}(a\prime_{i},a_{-i}^{{\ast}}).}$$

In words, at a Nash equilibrium, no player can achieve greater utility by unilaterally changing strategies. Stated in terms of best-response sets, a strategy profile, \(a^{{\ast}}\), is a Nash equilibrium if for every player i,

$$\displaystyle{a_{i}^{{\ast}}\in \mathcal{B}_{ i}(a_{-i}^{{\ast}}).}$$
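
The following sketch makes these definitions computational for finite games. The game representation used here (a list of strategy lists plus one utility function per player) is a modeling choice for illustration, not a standard interface.

```python
def best_response_set(strategy_sets, utils, i, profile):
    """B_i(a_{-i}): player i's utility-maximizing strategies against the other
    players' entries of `profile` (player i's own entry is ignored)."""
    def with_ai(ai):
        return tuple(ai if j == i else profile[j] for j in range(len(strategy_sets)))
    best = max(utils[i](with_ai(ai)) for ai in strategy_sets[i])
    return {ai for ai in strategy_sets[i] if utils[i](with_ai(ai)) == best}

def is_nash_equilibrium(strategy_sets, utils, profile):
    """A profile is a (pure) Nash equilibrium iff every player's strategy lies
    in that player's best-response set."""
    return all(profile[i] in best_response_set(strategy_sets, utils, i, profile)
               for i in range(len(strategy_sets)))

# Example: a 2 x 2 coordination game with hypothetical payoffs.
strategies = [["A", "B"], ["A", "B"]]
u = lambda a: 1.0 if a[0] == a[1] else 0.0
print(is_nash_equilibrium(strategies, [u, u], ("A", "A")))  # True
print(is_nash_equilibrium(strategies, [u, u], ("A", "B")))  # False
```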

We will also need the notions of mixed strategies and mixed strategy Nash equilibrium. Let \(\Delta (\mathcal{A}_{i})\) denote probability distributions (i.e., nonnegative vectors that sum to one) over the set \(\mathcal{A}_{i}\). A mixed strategy profile is a collection of probability distributions, \(\alpha = (\alpha _{1},\ldots,\alpha _{N})\), with \(\alpha _{i} \in \Delta (\mathcal{A}_{i})\) for each i. Let us assume that players choose a strategy randomly and independently according to these mixed strategies. Accordingly, define \(\mathbf{Pr}[a;\alpha ]\) to be the probability of the strategy profile a under the mixed strategy profile α, and define the expected utility of player i as

$$\displaystyle{U_{i}(\alpha ) =\sum _{a\in \mathcal{A}}u_{i}(a) \cdot \mathbf{Pr}[a;\alpha ].}$$

A mixed strategy Nash equilibrium is a mixed strategy profile, \(\alpha ^{{\ast}}\), such that for any player i and any \(\alpha _{i}\prime \in \Delta (\mathcal{A}_{i})\),

$$\displaystyle{U_{i}(\alpha _{i}^{{\ast}},\alpha _{ -i}^{{\ast}}) \geq U_{ i}(\alpha _{i}\prime,\alpha _{-i}^{{\ast}}).}$$
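
As a sketch of the expected-utility computation, the snippet below forms Pr[a; α] as the product of independent marginals and sums over all strategy profiles; matching pennies is used as an illustrative example.

```python
from itertools import product

def expected_utility(strategy_sets, u_i, alpha):
    """U_i(alpha) for a mixed profile alpha, given as one dict per player
    mapping each strategy to its probability."""
    total = 0.0
    for profile in product(*strategy_sets):
        prob = 1.0
        for j, aj in enumerate(profile):
            prob *= alpha[j][aj]      # independence across players
        total += u_i(profile) * prob
    return total

# Matching pennies: uniform mixing by both players is the unique equilibrium.
strategies = [["H", "T"], ["H", "T"]]
u0 = lambda a: 1.0 if a[0] == a[1] else -1.0
alpha = [{"H": 0.5, "T": 0.5}, {"H": 0.5, "T": 0.5}]
print(expected_utility(strategies, u0, alpha))  # 0.0
```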

Special Classes of Games

We will reference three special classes of games: (i) zero-sum games, (ii) potential games, and (iii) weakly acyclic games.

Zero-sum games: There are only two players (i.e., N = 2), and \(u_{1}(a) = -u_{2}(a)\) for every \(a \in \mathcal{A}\).

Potential games: There exists a (potential) function,

$$\displaystyle{\phi : \mathcal{A}\rightarrow \mathbb{R}}$$

such that for any pair of strategies, \(a = (a_{i},a_{-i})\) and \(a\prime = (a_{i}\prime,a_{-i})\), that differ only in the strategy of player i,

$$\displaystyle{u_{i}(a_{i},a_{-i}) - u_{i}(a_{i}\prime,a_{-i}) =\phi (a_{i},a_{-i}) -\phi (a_{i}\prime,a_{-i}).}$$

Weakly acyclic games: There exists a function

$$\displaystyle{\phi : \mathcal{A}\rightarrow \mathbb{R}}$$

with the following property: if \(a \in \mathcal{A}\) is not a Nash equilibrium, then at least one player, say player i, has an alternative strategy, say \(a_{i}\prime \in \mathcal{A}_{i}\), such that

$$\displaystyle{u_{i}(a_{i}\prime,a_{-i})> u_{i}(a_{i},a_{-i})}$$

and

$$\displaystyle{\phi (a_{i}\prime,a_{-i})>\phi (a_{i},a_{-i}).}$$

Potential games are a special class of games for which various learning dynamics converge to a Nash equilibrium. The aforementioned commuting game constitutes a potential game under certain special assumptions. These are as follows: (i) the delay on a road only depends on the number of users (and not their identities) and (ii) all players measure delay in the same manner (Monderer and Shapley 1996).
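
The sketch below checks the defining identity numerically for a toy two-commuter, two-road instance satisfying assumptions (i) and (ii). The linear delay function and the Rosenthal-style potential used here are illustrative choices, not the only ones possible.

```python
from itertools import product
from collections import Counter

ROADS = ["r1", "r2"]

def delay(road, n_users):
    # Assumption (i): delay depends only on the number of users of the road.
    return n_users

def utility(i, profile):
    # Assumption (ii): every player measures (negative) delay in the same way.
    counts = Counter(profile)
    return -delay(profile[i], counts[profile[i]])

def potential(profile):
    # Rosenthal-style potential: phi(a) = -sum_r sum_{k=1}^{n_r(a)} delay(r, k).
    counts = Counter(profile)
    return -sum(delay(r, k) for r in ROADS for k in range(1, counts[r] + 1))

# Verify u_i(a_i, a_-i) - u_i(a_i', a_-i) = phi(a_i, a_-i) - phi(a_i', a_-i)
# for every profile, player, and unilateral deviation.
ok = True
for profile in product(ROADS, repeat=2):
    for i in range(2):
        for alt in ROADS:
            deviated = tuple(alt if j == i else profile[j] for j in range(2))
            lhs = utility(i, profile) - utility(i, deviated)
            rhs = potential(profile) - potential(deviated)
            ok = ok and abs(lhs - rhs) < 1e-12
print("potential property holds:", ok)  # True
```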

Weakly acyclic games are a generalization of potential games. In potential games, there exists a potential function that captures differences in utility under unilateral (i.e., single player) changes in strategy. In weakly acyclic games (see Young 1998), if a strategy profile is not a Nash equilibrium, then some player has a unilateral strategy change that increases both its own utility and the potential function. The characterization of weakly acyclic games through a potential function herein is not traditional and is borrowed from Marden et al. (2009a).

Forecasted Best-Response Dynamics

One family of learning dynamics involves players formulating a forecast of the strategies of other players based on past observations and then playing a best response to this forecast.

Cournot Best-Response Dynamics

The simplest illustration is Cournot best-response dynamics. Players repetitively play the same game over stages \(t = 0,1,2,\ldots\). At stage t, a player forecasts that the strategies of other players are the strategies played at the previous stage t − 1. The following rules specify Cournot best response with inertia. For each stage t and for each player i:

  • With probability p ∈ (0, 1), \(a_{i}(t) = a_{i}(t - 1)\) (inertia).

  • With probability 1 − p, \(a_{i}(t) \in \mathcal{B}_{i}(a_{-i}(t - 1))\) (best response).

  • If \(a_{i}(t - 1) \in \mathcal{B}_{i}(a_{-i}(t - 1))\), then \(a_{i}(t) = a_{i}(t - 1)\) (continuation).

Proposition 1

For weakly acyclic (and hence potential) games, player strategies under Cournot best-response dynamics with inertia converge to a Nash equilibrium.

Cournot best-response dynamics need not always converge in games with a Nash equilibrium, hence the restriction to weakly acyclic games.
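
A minimal simulation sketch of Cournot best response with inertia is given below, using a generic finite-game representation (strategy lists plus one utility function per player) and the two-commuter game as a hypothetical example.

```python
import random

def cournot_br_with_inertia(strategy_sets, utils, a0, p=0.3, stages=500, seed=0):
    """Simultaneous-update Cournot best response with inertia.

    At each stage, a player keeps its previous strategy if it is already a best
    response (continuation); otherwise it repeats it with probability p
    (inertia) and plays a best response with probability 1 - p.
    """
    rng = random.Random(seed)
    n = len(strategy_sets)
    a = list(a0)
    for _ in range(stages):
        prev = tuple(a)
        for i in range(n):
            def u_of(ai, i=i, prev=prev):
                return utils[i](tuple(ai if j == i else prev[j] for j in range(n)))
            best_val = max(u_of(ai) for ai in strategy_sets[i])
            if u_of(prev[i]) == best_val:
                a[i] = prev[i]                                   # continuation
            elif rng.random() < p:
                a[i] = prev[i]                                   # inertia
            else:
                a[i] = rng.choice([ai for ai in strategy_sets[i]
                                   if u_of(ai) == best_val])     # best response
    return tuple(a)

# Two commuters, two roads; utility is the negative of the congestion on the
# chosen road. Play typically settles on a split profile, a Nash equilibrium.
roads = [["r1", "r2"], ["r1", "r2"]]
def make_u(i):
    return lambda a: -sum(1 for x in a if x == a[i])
print(cournot_br_with_inertia(roads, [make_u(0), make_u(1)], ("r1", "r1")))
```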

Fictitious Play

In fictitious play, introduced in Brown (1951), players also use past observations to construct a forecast of the strategies of other players. Unlike Cournot best-response dynamics, this forecast is probabilistic.

As a simple example, consider the commuting game with two players, Alice and Bob, who both must choose between two paths, A and B. Now suppose that at stage t = 10, Alice has observed that Bob used path A for 6 out of the previous 10 days and path B for the remaining days. Then Alice’s forecast of Bob is that he will choose path A with 60% probability and path B with 40% probability. Alice then chooses between paths A and B in order to optimize her expected utility. Likewise, Bob uses Alice’s empirical frequencies to form a probabilistic forecast of her next choice and selects a path to optimize his expected utility.

More generally, let \(\pi _{j}(t) \in \Delta (\mathcal{A}_{j})\) denote the empirical frequency for player j at stage t. This vector is a probability distribution that indicates the relative frequency of times player j played each strategy in \(\mathcal{A}_{j}\) over stages \(0,1,\ldots,t - 1\). In fictitious play, player i assumes (incorrectly) that at stage t, other players will select their strategies independently and randomly according to their empirical frequency vectors. Let \(\Pi _{-i}(t)\) denote the induced probability distribution over \(\mathcal{A}_{-i}\) at stage t. Under fictitious play, player i selects an action according to

$$\displaystyle\begin{array}{lll} a_{i}(t) & \in \arg \max _{a_{i}\in \mathcal{A}_{i}}\sum _{a_{-i}\in \mathcal{A}_{-i}}u_{i}(a_{i},a_{-i}) {}\\ & \quad \cdot \mathbf{Pr}[a_{-i};\Pi _{-i}(t)]. {}\\ \end{array}$$

In words, player i selects the action that maximizes expected utility assuming that other players select their strategies randomly and independently according to their empirical frequencies.
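
The following sketch implements two-player fictitious play under this update, with ties broken deterministically and each history seeded with one fictitious observation to avoid an empty forecast; these are modeling conveniences for illustration, not part of the original formulation.

```python
from collections import Counter

def fictitious_play_2p(strategy_sets, utils, stages=5000):
    """Two-player fictitious play.

    Each stage, each player best responds to the empirical frequencies of the
    opponent's past strategies. Returns each player's own empirical frequencies.
    """
    # Seed each history with one fictitious observation of the first strategy.
    counts = [Counter({strategy_sets[0][0]: 1}), Counter({strategy_sets[1][0]: 1})]
    for _ in range(stages):
        actions = []
        for i in (0, 1):
            opp = 1 - i
            total = sum(counts[opp].values())

            def exp_u(ai, i=i, opp=opp, total=total):
                # Expected utility against the opponent's empirical frequencies.
                return sum(utils[i]((ai, aj) if i == 0 else (aj, ai)) * c
                           for aj, c in counts[opp].items()) / total

            actions.append(max(strategy_sets[i], key=exp_u))
        for i in (0, 1):
            counts[i][actions[i]] += 1
    totals = [sum(c.values()) for c in counts]
    return [{s: counts[i][s] / totals[i] for s in strategy_sets[i]} for i in (0, 1)]

# Matching pennies (zero-sum): empirical frequencies approach the unique mixed
# equilibrium, in which both players mix 50/50.
strategies = [["H", "T"], ["H", "T"]]
u0 = lambda a: 1.0 if a[0] == a[1] else -1.0
u1 = lambda a: -u0(a)
print(fictitious_play_2p(strategies, [u0, u1]))
```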

Proposition 2

For (i) zero-sum games, (ii) potential games, and (iii) two-player games in which one player has only two actions, player empirical frequencies under fictitious play converge to a mixed strategy Nash equilibrium.

These results are reported in Fudenberg and Levine (1998), Hofbauer and Sandholm (2002), and Berger (2005). Fictitious play need not converge to Nash equilibria in all games. An early counterexample is reported in Shapley (1964), which constructs a two-player game with a unique mixed strategy Nash equilibrium. A weakly acyclic game with multiple pure (i.e., non-mixed) Nash equilibria under which fictitious play does not converge is reported in Foster and Young (1998).

A variant of fictitious play is “joint strategy” fictitious play (Marden et al. 2009b). In this framework, players form forecasts from empirical frequencies of the joint play of other players, in contrast to constructing and combining separate empirical frequencies for each individual player. In the commuting game, it turns out that joint strategy fictitious play is equivalent to the aforementioned “exploitation” rule of selecting the path with the lowest average travel time. Marden et al. (2009b) show that action profiles under joint strategy fictitious play (with inertia) converge to a Nash equilibrium in potential games.

Log-Linear Learning

Under forecasted best-response dynamics, players choose a best response to the forecasted strategies of other players. Log-linear learning, introduced in Blume (1993), allows the possibility of “exploration,” in which players can select nonoptimal strategies but with relatively low probabilities.

Log-linear learning proceeds as follows. First, introduce a “temperature” parameter, T > 0.

  • At stage t, a single player, say player i, is selected at random.

  • For player i, for each \(a_{i}\prime \in \mathcal{A}_{i}\),

    $$\displaystyle{\mathbf{Pr}[a_{i}(t) = a_{i}\prime] ={ 1 \over Z}e^{u_{i}(a_{i}\prime,a_{-i}(t-1))/T}.}$$

  • For all other players, \(j \neq i\),

    $$\displaystyle{a_{j}(t) = a_{j}(t - 1).}$$

In words, under log-linear learning, only a single player performs a strategy update at each stage. The probability of selecting a strategy is exponentially proportional to the utility garnered from that strategy (with other players repeating their previous strategies). In the above description, Z is a normalizing constant that makes the probabilities sum to one. In fact, the specific probability distribution for strategy selection is a Gibbs distribution with temperature parameter, T. For very large T, strategies are chosen approximately uniformly at random. However, for small T, the selected strategy is a best response (i.e., \(a_{i}(t) \in \mathcal{B}_{i}(a_{-i}(t - 1))\)) with high probability, and an alternative strategy is selected with low probability.

Because of the inherent randomness, strategy profiles under log-linear learning never converge. Nonetheless, the long-run behavior can be characterized probabilistically as follows.

Proposition 3

For potential games with potential function ϕ(⋅) under log-linear learning, for any \(a \in \mathcal{A}\) ,

$$\displaystyle{\lim _{t\rightarrow \infty }\mathbf{Pr}[a(t) = a] ={ 1 \over Z}e^{\phi (a)/T}.}$$

In words, the long-run probabilities of strategy profiles conform to a Gibbs distribution constructed from the underlying potential function. This characterization has the important implication of (probabilistic) equilibrium selection. Prior convergence results stated convergence to Nash equilibria, but did not specify which Nash equilibrium in the case of multiple equilibria. Under log-linear learning, there is a probabilistic preference for the Nash equilibrium that maximizes the underlying potential function.
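
A simulation sketch of log-linear learning is shown below; the returned visit frequencies can be compared against the Gibbs distribution of Proposition 3. The game representation and the two-commuter example are the same hypothetical constructions used earlier.

```python
import math
import random

def log_linear_learning(strategy_sets, utils, a0, T=0.2, stages=20000, seed=0):
    """Simulate log-linear learning with temperature T.

    Each stage, one randomly chosen player revises its strategy, selecting each
    candidate with probability proportional to exp(utility / T) while all other
    players repeat their previous strategies. Returns empirical visit
    frequencies of the joint strategy profiles.
    """
    rng = random.Random(seed)
    n = len(strategy_sets)
    a = list(a0)
    visits = {}
    for _ in range(stages):
        i = rng.randrange(n)
        candidates = strategy_sets[i]
        weights = [math.exp(utils[i](tuple(ai if j == i else a[j] for j in range(n))) / T)
                   for ai in candidates]
        a[i] = rng.choices(candidates, weights=weights, k=1)[0]
        key = tuple(a)
        visits[key] = visits.get(key, 0) + 1
    return {k: v / stages for k, v in visits.items()}

# Two-commuter game: for small T, most of the long-run probability concentrates
# on the two split profiles, which maximize the potential function.
roads = [["r1", "r2"], ["r1", "r2"]]
def make_u(i):
    return lambda a: -sum(1 for x in a if x == a[i])
print(log_linear_learning(roads, [make_u(0), make_u(1)], ("r1", "r1")))
```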

Extensions and Variations

Payoff-based learning. The discussion herein presumed that players can observe the actions of other players and can compute utility functions off-line. Payoff-based algorithms, i.e., algorithms in which players only measure the utility garnered in each stage, impose less restrictive informational requirements. See Young (2005) for a general discussion, as well as Marden et al. (2009c), Marden and Shamma (2012), and Arslan and Shamma (2004) for various payoff-based extensions.

No-regret learning. The broad class of so-called “no-regret” learning rules has the desirable property of converging to broader solution concepts (namely, Hannan consistency sets and correlated equilibria) in general games. See Hart and Mas-Colell (2000, 2001, 2003b) for an extensive discussion.

Calibrated forecasts. Calibrated forecasts are more sophisticated than empirical frequencies in that they satisfy certain long-run consistency properties. Accordingly, forecasted best-response learning using calibrated forecasts has stronger guaranteed convergence properties, such as convergence to correlated equilibria. See Foster and Vohra (1997), Kakade and Foster (2008), and Mannor et al. (2007).

Impossibility results. This entry focused on convergence results in various special cases. There are broad impossibility results establishing that certain families of learning rules cannot converge to Nash equilibria in all games. The focus is on uncoupled learning, i.e., learning dynamics for player i that do not depend explicitly on the utility functions of other players (a property satisfied by all of the learning dynamics presented herein). See Hart and Mas-Colell (2003a, 2006), Hart and Mansour (2007), and Shamma and Arslan (2005). Another type of impossibility result concerns lower bounds on the required rate of convergence to equilibrium (e.g., Hart and Mansour 2010).

Welfare maximization. Of special interest are learning dynamics that select strategy profiles maximizing welfare (i.e., the sum of utilities), whether or not they are Nash equilibria. Recent contributions include Pradelski and Young (2012), Marden et al. (2011), and Arieli and Babichenko (2012).

Summary and Future Directions

We have presented a selection of learning dynamics and their long-run characteristics, specifically in terms of convergence to Nash equilibria. As stated early on, the original motivation of learning in games research has been to add credence to solution concepts such as Nash equilibrium as a model of the outcome of a game. An emerging line of research stems from engineering considerations, in which the objective is to use the framework of learning in games as a design tool for distributed decision architectures such as autonomous vehicle teams, communication networks, or smart grid energy systems. A related emerging direction is social influence, in which the objective is to steer the collective behavior of human decision makers towards a socially desirable outcome through the dispersal of incentives. Accordingly, learning in games can offer baseline models of how individuals update their behaviors, which can guide and inform social influence policies.

Recommended Reading

Monographs on learning in games:

  • Fudenberg D, Levine DK (1998) The theory of learning in games. MIT, Cambridge

  • Hart S, Mas-Colell A (2013) Simple adaptive strategies: from regret-matching to uncoupled dynamics. World Scientific Publishing Company

  • Young HP (1998) Individual strategy and social structure. Princeton University Press, Princeton

  • Young HP (2005) Strategic learning and its limits. Princeton University Press, Princeton

Overview articles on learning in games:

  • Fudenberg D, Levine DK (2009) Learning and equilibrium. Annu Rev Econ 1:385–420

  • Hart S (2005) Adaptive heuristics. Econometrica 73(5):1401–1430

  • Young HP (2007) The possible and the impossible in multi-agent learning. Artif Intell 171:429–433

Articles relating learning in games to distributed control:

  • Li N, Marden JR (2013) Designing games for distributed optimization. IEEE J Sel Top Signal Process 7:230–242

  • Mannor S, Shamma JS (2007) Multi-agent learning for engineers. Artif Intell 171:417–422

  • Marden JR, Arslan G, Shamma JS (2009) Cooperative control and potential games. IEEE Trans Syst Man Cybern B Cybern 6:1393–1407