Multi-agent learning Fictitious Play
Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department,
Faculty of Sciences, Utrecht University, The Netherlands.
Gerard Vreeswijk. Last modified on February 27th, 2012 at 18:35 Slide 1
Fictitious play: motivation
• Rather than considering your
own payoffs, monitor the
behaviour of your opponent(s),
and respond optimally.
• Behaviour of an opponent is
projected on a single mixed
strategy.
• Brown (1951): explanation for
Nash equilibrium play.
In terms of current use, the name
is a bit of a misnomer, since play
actually occurs (Berger, 2005).
• One of the most important, if not
the most important, representative
of a follower strategy.
Plan for today
Part I. Best reply strategy
1. Pure fictitious play.
2. Results that connect pure fictitious play to Nash equilibria.
Part II. Extensions and approximations of fictitious play
1. Smoothed fictitious play.
2. Exponential regret matching.
3. No-regret property of smoothed fictitious play (Fudenberg et al., 1995).
4. Convergence of better reply strategies when players have limited
memory and are inert [tend to stick to their current strategy] (Young,
1998).
Shoham et al. (2009): Multi-agent Systems. Ch. 7: “Learning and Teaching”.
H. Young (2004): Strategic Learning and its Limits, Oxford UP.
D. Fudenberg and D.K. Levine (1998), The Theory of Learning in Games, MIT Press.
Part I:
Pure fictitious play
Repeated Coordination Game
Players receive payoff p > 0 iff they coordinate. This game possesses three
Nash equilibria, viz. (0, 0), (0.5, 0.5), and (1, 1).
Round  A's action  B's action  A's beliefs  B's beliefs
0.                             (0.0, 0.0)   (0.0, 0.0)
1.     L*          R*          (0.0, 1.0)   (1.0, 0.0)
2.     R           L           (1.0, 1.0)   (1.0, 1.0)
3.     L*          R*          (1.0, 2.0)   (2.0, 1.0)
4.     R           L           (2.0, 2.0)   (2.0, 2.0)
5.     R*          R*          (2.0, 3.0)   (2.0, 3.0)
6.     R           R           (2.0, 4.0)   (2.0, 4.0)
7.     R           R           (2.0, 5.0)   (2.0, 5.0)
...
(An asterisk marks a round in which the best response was not unique, so the
tie had to be broken.)
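The dynamics in this table can be simulated in a few lines. The sketch below is not from the slides: belief updates and best responses follow the definitions of fictitious play, but ties are broken at random, so it reproduces the qualitative lock-in on one coordination equilibrium rather than this exact trace (all names and parameters are my own).

```python
import random

def fictitious_play_coordination(rounds=100, seed=0):
    """Fictitious play in the 2-action coordination game:
    both players receive payoff p > 0 iff they choose the same action."""
    rng = random.Random(seed)
    actions = ["L", "R"]
    # belief[i][a] = how often player i has seen its opponent play a
    belief = [{"L": 0.0, "R": 0.0}, {"L": 0.0, "R": 0.0}]
    history = []
    for _ in range(rounds):
        played = []
        for i in (0, 1):
            b = belief[i]
            if b["L"] > b["R"]:
                a = "L"                    # best response: match the more frequent action
            elif b["R"] > b["L"]:
                a = "R"
            else:
                a = rng.choice(actions)    # break ties at random
            played.append(a)
        history.append(tuple(played))
        belief[0][played[1]] += 1          # player 0 observes player 1, and vice versa
        belief[1][played[0]] += 1
    return history

history = fictitious_play_coordination()
# Once the players coordinate, both best responses stay put: a steady state.
```

Once a coordinated round occurs, each player's belief strictly favours that action forever, which is exactly the lock-in visible from round 5 onwards in the table.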
Steady states are pure (but possibly weak) Nash equilibria
Definition (Steady state). An action profile a is a steady state (or absorbing
state) of fictitious play if, whenever a is played at round t, it is inevitably
also played at round t + 1.
Theorem. If a pure strategy profile is a steady state of fictitious play, then it is a
(possibly weak) Nash equilibrium in the stage game.
Proof . Suppose a = (a1, . . . , an) is a steady state. Consequently, i’s opponent
model converges to a−i, for all i. By definition of fictitious play, i plays best
responses to a−i, i.e.,
∀i : ai ∈ BR(a−i).
The latter is precisely the definition of a Nash equilibrium. ∎
Still, the resulting Nash equilibrium is often strict, because for weak equilibria
the process is likely to drift due to alternative best responses.
Pure strict Nash equilibria are steady states
Theorem. If a pure strategy profile is a strict Nash equilibrium of a stage game, then
it is a steady state of fictitious play in the repeated game.
Notice the use of terminology: “pure strategy profile” for Nash equilibria;
“action profile” for steady states.
Proof . Suppose a is a pure Nash equilibrium and ai is played at round t, for
all i. Because a is strict, ai is the unique best response to a−i. Because this
argument holds for each i, action profile a will be played in round t + 1 again.
∎
Summary of the two theorems:
Pure strict Nash ⇒ Steady state ⇒ Pure Nash.
But what if pure Nash equilibria do not exist?
Repeated game of Matching Pennies
Zero-sum game. A's goal is to have the pennies matched; B's goal is the opposite.
Round  A's action  B's action  A's beliefs  B's beliefs
0.                             (1.5, 2.0)   (2.0, 1.5)
1.     T           T           (1.5, 3.0)   (2.0, 2.5)
2.     T           H           (2.5, 3.0)   (2.0, 3.5)
3.     T           H           (3.5, 3.0)   (2.0, 4.5)
4.     H           H           (4.5, 3.0)   (3.0, 4.5)
5.     H           H           (5.5, 3.0)   (4.0, 4.5)
6.     H           H           (6.5, 3.0)   (5.0, 4.5)
7.     H           T           (6.5, 4.0)   (6.0, 4.5)
8.     H           T           (6.5, 5.0)   (7.0, 4.5)
...
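A minimal sketch (not from the slides) that runs this process forward, seeding the beliefs with the (H, T) counts from round 0 of the table. The first rounds reproduce the trace above; in the long run the actions keep switching, yet each player's empirical frequency of Heads tends to 0.5, anticipating the theorem on the next slide.

```python
def fictitious_play_pennies(rounds=50000):
    """Fictitious play in Matching Pennies: A wins if the pennies match,
    B wins if they differ. Beliefs are (H, T) counts, seeded as in the table."""
    belief_a = {"H": 1.5, "T": 2.0}   # A's counts of B's past plays
    belief_b = {"H": 2.0, "T": 1.5}   # B's counts of A's past plays
    heads_a = 0
    for _ in range(rounds):
        # A wants to match B's more frequent action; ties broken towards H.
        a = "H" if belief_a["H"] >= belief_a["T"] else "T"
        # B wants to mismatch A's more frequent action.
        b = "T" if belief_b["H"] >= belief_b["T"] else "H"
        belief_a[b] += 1
        belief_b[a] += 1
        heads_a += (a == "H")
    return heads_a / rounds

freq = fictitious_play_pennies()
# freq is close to 0.5: the empirical distribution approaches the mixed
# Nash equilibrium even though the stage-game play itself keeps cycling.
```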
Convergent empirical distribution of strategies
Theorem. If the empirical distribution of each player’s strategies converges in
fictitious play, then it converges to a Nash equilibrium.
Proof . Same as before. If the empirical distributions converge to q, then i’s
opponent model converges to q−i, for all i. By definition of fictitious play,
qi ∈ BR(q−i). Because of convergence, all such (mixed) best replies remain the
same. By definition we have a Nash equilibrium. ∎
Remarks:
1. The qi may be mixed.
2. It actually suffices that the q−i
converge asymptotically to the
actual distribution (Fudenberg &
Levine, 1998).
3. If empirical distributions
converge (hence, converge to a
Nash equilibrium), the actually
played responses per stage need
not be Nash equilibria of the
stage game.
Empirical distributions converge to Nash ⇏ stage Nash
Repeated Coordination Game. Players receive payoff p > 0 iff they
coordinate.
Round  A's action  B's action  A's beliefs  B's beliefs
0.                             (0.5, 1.0)   (1.0, 0.5)
1.     B           A           (1.5, 1.0)   (1.0, 1.5)
2.     A           B           (1.5, 2.0)   (2.0, 1.5)
3.     B           A           (2.5, 2.0)   (2.0, 2.5)
4.     A           B           (2.5, 3.0)   (3.0, 2.5)
...
• This game possesses three equilibria, viz. (0, 0), (0.5, 0.5), and (1, 1), with
expected payoffs p, p/2, and p, respectively.
• The empirical distribution of play converges to (0.5, 0.5), yet with actual
payoff 0 rather than p/2: the players miscoordinate in every single round.
Empirical distribution of play does not need to converge
Rock-paper-scissors. Winner receives payoff p > 0. Else, payoff zero.
• Rock-paper-scissors with these payoffs is known as the Shapley game.
• The Shapley game possesses one equilibrium, viz. (1/3, 1/3, 1/3), with
expected payoff p/3.
Round  A's action  B's action  A's beliefs      B's beliefs
0.                             (0.0, 0.0, 0.5)  (0.0, 0.5, 0.0)
1.     Rock        Scissors    (0.0, 0.0, 1.5)  (1.0, 0.5, 0.0)
2.     Rock        Paper       (0.0, 1.0, 1.5)  (2.0, 0.5, 0.0)
3.     Rock        Paper       (0.0, 2.0, 1.5)  (3.0, 0.5, 0.0)
4.     Scissors    Paper       (0.0, 3.0, 1.5)  (3.0, 0.5, 1.0)
5.     Scissors    Paper       (0.0, 4.0, 1.5)  (3.0, 0.5, 2.0)
...
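The sketch below (my own, not from the slides) continues this table, seeding the (Rock, Paper, Scissors) belief counts from round 0. The runs of constant joint play grow longer and longer, so the empirical distribution keeps oscillating and never settles, which is Shapley's non-convergence phenomenon.

```python
BEATS = {"Rock": "Scissors", "Paper": "Rock", "Scissors": "Paper"}

def shapley_fictitious_play(rounds=3000):
    """Fictitious play in the Shapley game: the winner of a
    rock-paper-scissors round receives p > 0, everyone else 0.
    Beliefs are (Rock, Paper, Scissors) counts, seeded as in the table."""
    acts = ["Rock", "Paper", "Scissors"]
    bel_a = {"Rock": 0.0, "Paper": 0.0, "Scissors": 0.5}  # A's model of B
    bel_b = {"Rock": 0.0, "Paper": 0.5, "Scissors": 0.0}  # B's model of A

    def best_reply(belief):
        # The expected payoff of playing x is proportional to the belief
        # weight of the action that x beats.
        return max(acts, key=lambda x: belief[BEATS[x]])

    profiles = []
    for _ in range(rounds):
        a, b = best_reply(bel_a), best_reply(bel_b)
        profiles.append((a, b))
        bel_a[b] += 1
        bel_b[a] += 1
    return profiles

profiles = shapley_fictitious_play()

# Lengths of maximal runs of constant joint play: these keep growing, so
# the empirical frequencies cannot converge.
runs = [1]
for prev, cur in zip(profiles, profiles[1:]):
    if cur == prev:
        runs[-1] += 1
    else:
        runs.append(1)
```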
Repeated Shapley Game: Phase Diagram
[Phase diagram omitted: a simplex with corners Rock, Paper and Scissors; the
arrows trace the cycle of best replies around the simplex.]
Part II:
Extensions and approximations of
fictitious play
Proposed extensions to fictitious play
Build forecasts, not on complete history, but on
• Recent data, say on m most recent rounds.
• Discounted data, say with discount factor γ.
• Perturbed data, say with error ε on individual observations.
• Random samples of historical data, say on random samples of size m.
Give not necessarily best responses, but
• ε-greedy.
• Perturbed throughout, with small random shocks.
• Randomly, and proportional to expected payoff.
Framework for predictive learning (like fictitious play)
A forecasting rule for player i is a function that maps a history to a probability
distribution over the opponents’ actions in the next round:
fi : H → ∆(X−i).
A response rule for player i is a function that maps a history to a probability
distribution over i’s own actions in the next round:
gi : H → ∆(Xi).
A predictive learning rule for player i is the combination of a forecasting rule
and a response rule. This is typically written as ( fi, gi).
• This framework can be attributed to J.S. Jordan (1993).
• Forecasting and response functions are deterministic.
• Reinforcement and regret learning do not fit this framework: they do not
involve prediction.
Forecasting and response rules for fictitious play
Let ht ∈ Ht be a history of play up to and including round t, and let
φjt =Def the empirical distribution of j's actions up to and including round t.
Then the fictitious play forecasting rule is given by
fi(ht) =Def ∏_{j ≠ i} φjt.
Let fi be a fictitious play forecasting rule. Then gi is said to be a fictitious
play response rule if all of its values are best responses to the values of fi.
Remarks:
1. Player i attributes a mixed strategy φjt to player j. This strategy reflects
the number of times each action is played by j.
2. The mixed strategies are assumed to be independent.
3. Both (1) and (2) are simplifying assumptions.
Smoothed fictitious play
Notation:
p−i : strategy profile of opponents as predicted by fi in round t.
ui(xi, p−i) : expected utility of action xi, given p−i.
qi : strategy profile of player i in round t + 1. I.e., gi(h).
Task: define qi given p−i and ui(xi, p−i).
Idea:
Respond randomly, but (somehow) proportional to expected payoff.
Elaborations of this idea:
a) Strictly proportional:
qi(xi | p−i) =Def ui(xi, p−i) / ∑_{x′i ∈ Xi} ui(x′i, p−i).
b) Through, what is called, mixed logit:
qi(xi | p−i) =Def e^{ui(xi, p−i)/γi} / ∑_{x′i ∈ Xi} e^{ui(x′i, p−i)/γi}.
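Elaboration (b) can be sketched as follows (my own minimal implementation; the function name, the utility vector and the γ values are illustrative assumptions). Small γ concentrates the probability mass on the best response, large γ spreads it almost evenly:

```python
import math

def logit_response(utilities, gamma):
    """Mixed-logit (quantal) response: the probability of each action is
    proportional to exp(u / gamma).  Small gamma -> near best response,
    large gamma -> near uniform randomisation."""
    weights = [math.exp(u / gamma) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

sharp = logit_response([1.0, 0.5, 0.0], gamma=0.05)  # almost all mass on action 0
soft  = logit_response([1.0, 0.5, 0.0], gamma=50.0)  # close to uniform
```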
Mixed logit, or quantal response function
• Let d1 + · · · + dn = 1 and dj ≥ 0. Then
logit(di) =Def e^{di/γ} / ∑_j e^{dj/γ},
where γ > 0.
• The logit function can be seen as a soft maximum on n variables.
γ ↓ 0 : logit “shares” 1 among all maximal di
γ = 1 : logit is strictly proportional
γ → ∞ : logit “spreads” 1 among all di evenly
Mixed logit can be justified in different ways.
a) On the basis of information and entropy arguments.
b) By assuming the dj are i.i.d. extreme value (a.k.a. log Weibull) distributed.
Anderson et al. (1992): Discrete Choice Theory of Product Differentiation. Sec. 2.6.1: “Derivation of the Logit”.
Evenly (γ → ∞) −→ mixed logit −→ best response only (γ ↓ 0)
As you see, mixed logit respects best replies, but leaves room for
experimentation.
Digression: Coding theory and entropy
This digression tries to answer the following question:
Why does play according to a diversified strategy yield more information
than play according to a strategy where only a few options are played?
• To send 8 different binary
encoded messages would cost 3
bits. Encoded messages are 000,
001, . . . 111.
• To encode 16 different messages,
we would need log2 16 = 4 bits.
• To encode 20 different messages,
we would need ⌈log2 20⌉ = ⌈4.32⌉ = 5 bits.
If some messages are sent more
frequently than others, it pays off to
search for a code such that messages
that occur more frequently are
represented by short code words (at
the expense of messages that are sent
less frequently, which must then be
represented by the remaining longer
code words).
Coding theory and entropy (continued)
Example. Suppose persons A
and B work on a dark terrain.
They are separated, and can only
communicate by Morse code through
a flashlight.
A and B have agreed to send only the
following messages:
m1 Yes
m2 No
m3 All well?
m4 Shall I come over?
A possible encoding could be Code 1:
m1 ↔ 00
m2 ↔ 01
m3 ↔ 10
m4 ↔ 11
Coding theory and entropy (continued)
Another encoding could be Code 2:
m1 ↔ 0
m2 ↔ 10
m3 ↔ 110
m4 ↔ 111
To prevent ambiguity, no code word
may be a prefix of some other code
word.
A useless coding would be Code 3:
m1 ↔ 0
m2 ↔ 1
m3 ↔ 00
m4 ↔ 01
Under Code 3, the sequence 0101
may mean different things, such as
m1, m2, m1, m2, or m1, m2, m4. (There
are still other possibilities.)
• The objective is to search for an
efficient encoding, i.e., an encoding
that minimises the expected number
of bits per message.
• If the relative frequency of messages
is known, we can compute, for every
code, the expected number of bits
per message, hence its efficiency.
Coding theory and entropy [end of digression]
The following would be a plausible
probability distribution:
m1 Yes 1/2
m2 No 1/4
m3 All well? 1/8
m4 Shall I come over? 1/8
For Code 2,
E[number of bits] = 1/2 · 1 + 1/4 · 2 + 1/8 · 3 + 1/8 · 3 = 1.75.
For Code 1, the expected number of
bits is 2.0. Therefore, Code 2 is more
efficient than Code 1.
Theorem (Noiseless Coding Theorem, Shannon).
p1 log2(1/p1) + · · · + pn log2(1/pn)
is a lower bound for the expected
number of bits in an encoding of
n messages with expected occur-
rence (p1, . . . , pn).
This number is called the entropy of
(p1, . . . , pn). Alternatively, entropy is
−[p1 log2(p1) + · · · + pn log2(pn)].
The entropy of this distribution equals
1.75, which Code 2 attains exactly.
Therefore, Code 2 is optimal.
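The numbers above can be checked directly; the sketch below (mine, not the slides') recomputes the expected code lengths and the entropy of the distribution (1/2, 1/4, 1/8, 1/8):

```python
import math

# Message distribution from the slide:
# Yes 1/2, No 1/4, All well? 1/8, Shall I come over? 1/8
probs = [0.5, 0.25, 0.125, 0.125]
code2_lengths = [1, 2, 3, 3]   # Code 2: 0, 10, 110, 111 (prefix-free)
code1_lengths = [2, 2, 2, 2]   # Code 1: 00, 01, 10, 11

expected_bits_code2 = sum(p * l for p, l in zip(probs, code2_lengths))
expected_bits_code1 = sum(p * l for p, l in zip(probs, code1_lengths))
entropy = -sum(p * math.log2(p) for p in probs)

# Shannon's bound: no encoding can beat the entropy (1.75 bits/message),
# and Code 2 attains it exactly, so Code 2 is optimal.
```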
Smoothed fictitious play (Fudenberg & Levine, 1995)
Smoothed fictitious play is a generalisation of mixed logit. Let
wi : ∆i → R
be a function that “grades” i’s probability distributions (over actions) under
the following conditions.
1. Grading is smooth (wi is infinitely often differentiable).
2. Grading is strictly concave (a bump), in such a manner that ‖∇wi(qi)‖ → ∞
(steepness) whenever qi approaches the boundary of ∆i (i.e., whenever the
distribution becomes extremely uneven).
Let
Ui(qi, p−i) =Def ui(qi, p−i) + γi·wi(qi)
Let fi be a fictitious play forecasting rule and let gi correspond to a best
response based on Ui. Then ( fi, gi) is called smoothed fictitious play with
smoothing function wi and smoothing parameter γi.
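A standard smoothing function satisfying these conditions (an assumption of mine; the slide does not name one) is the entropy of $q_i$. Maximising $U_i$ with this choice recovers exactly the mixed-logit response of the earlier slides:

```latex
w_i(q_i) \;=\; -\sum_{x_i \in X_i} q_i(x_i)\,\log q_i(x_i)
```

The unique maximiser of $U_i(q_i, p_{-i}) = u_i(q_i, p_{-i}) + \gamma_i\, w_i(q_i)$ over $\Delta_i$ is

```latex
q_i(x_i) \;=\; \frac{e^{\,u_i(x_i,\,p_{-i})/\gamma_i}}{\sum_{x_i' \in X_i} e^{\,u_i(x_i',\,p_{-i})/\gamma_i}},
```

so mixed logit is the special case of smoothed fictitious play with entropy smoothing.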
Smoothed fictitious play limits regret
Theorem (Fudenberg & Levine, 1995). Let G be a finite game and let ε > 0. If a
given player uses smoothed fictitious play with a sufficiently small smoothing
parameter, then with probability one his regrets are bounded above by ε.
– Young does not reproduce the proof of Fudenberg et al., but shows that in
this case ε-regret can be derived from a later and more general result of
Hart and Mas-Colell in 2001.
– This later result identifies a large family of rules that eliminate regret,
based on an extension of Blackwell’s approachability theorem.
(Roughly, Blackwell’s approachability theorem generalises maxmin
reasoning to vector-valued payoffs.)
Fudenberg & Levine, 1995. “Consistency and cautious fictitious play,” Journal of Economic Dynamics and Control, Vol. 19 (5-7), pp. 1065-1089.
Hart & Mas-Colell, 2001. “A General Class of Adaptive Strategies,” Journal of Economic Theory, Vol. 98 (1), pp. 26-54.
Smoothed fictitious play converges to ε-CCE
Definition. A coarse correlated equilibrium (CCE) is a probability distribu-
tion on strategy profiles, q ∈ ∆(X), such that no player can opt out (to gain
expected utility) before q is made known.
In a coarse correlated ε-equilibrium (ε-CCE), no player can opt out to gain
more in expectation than ε.
Theorem (Fudenberg & Levine, 1995). Let G be a finite game and let ε > 0. If all
players use smoothed fictitious play with sufficiently small smoothing parameters,
then with probability one empirical play will converge to the set of coarse correlated
ε-equilibria.
Summary of the two theorems: smoothed fictitious play limits regret and
converges to ε-CCE.
There is another learning method with no regret and convergence to
zero-CCE . . .
There are more coarse correlated equilibria than correlated equilibria, and
more correlated equilibria than Nash equilibria.

Simple coordination game:

            Other: Left  Right
You: Left   (1, 1)       (0, 0)
You: Right  (0, 0)       (1, 1)

(In this picture, CCE = CE.)
Exponentiated regret matching
Let j : an action, where 1 ≤ j ≤ k
uti : i's realised average payoff up to and including round t
φ−it : the realised joint empirical distribution of i's opponents
ui(j, φ−it) : i's hypothetical average payoff for playing action j against φ−it
rit : player i's regret vector in round t, with entries ui(j, φ−it) − uti.

Exponentiated regret matching (PY, p. 59) is defined as
qi,j(t+1) ∝ ([ri,j(t)]+)^a,
where a > 0. (For a = 1 we have ordinary regret matching.)
An extended theorem on regret matching (Mas-Colell et al., 2001) ensures that
individual players have no regret with probability one, and that the empirical
distribution of play converges to the set of coarse correlated equilibria (PY,
p. 60).
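The update rule above can be sketched in a few lines (a minimal implementation of my own; the sample regret vector and exponents are illustrative):

```python
def exponentiated_regret_matching(regrets, a=2.0):
    """Next-round mixed strategy: the probability of action j is
    proportional to the a-th power of its positive regret, [r_j]_+^a.
    For a = 1 this is ordinary regret matching."""
    weights = [max(r, 0.0) ** a for r in regrets]
    total = sum(weights)
    if total == 0.0:               # no positive regret anywhere: play uniformly
        n = len(regrets)
        return [1.0 / n] * n
    return [w / total for w in weights]

q1 = exponentiated_regret_matching([2.0, -1.0, 1.0], a=1.0)  # [2/3, 0, 1/3]
q2 = exponentiated_regret_matching([2.0, -1.0, 1.0], a=2.0)  # [4/5, 0, 1/5]
```

Raising the exponent a concentrates ever more mass on the largest positive regret, which is why large a approaches fictitious play, as noted on the next slide.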
FP vs. Smoothed FP vs. Exponentiated RM
Fictitious play Plays best responses.
• Does depend on past play of opponent(s).
• Puts zero probabilities on sub-optimal responses.
Smoothed fictitious play Plays sub-optimal responses, e.g.,
softmax-proportional to their estimated payoffs.
• Does depend on past play of opponent(s).
• Puts non-zero probabilities on sub-optimal responses.
• Approaches fictitious play when γi ↓ 0 (PY, p. 84).
Exponentiated regret matching Plays sub-optimal responses, i.e., with
probabilities proportional to a power of positive regret.
• Does depend on own past payoffs.
• Puts non-zero probabilities on sub-optimal responses.
• Approaches fictitious play when exponent a → ∞ (PY, p. 84).
FP vs. Smoothed FP vs. Exponentiated RM
                                    FP    Smoothed FP             Exponentiated RM
Depends on past play of opponents   √     √                       −
Depends on own past payoffs         −     −                       √
Puts zero probabilities on
sub-optimal responses               √     −                       −
Best response                       √     when smoothing          when exponent
                                          parameter γi ↓ 0        a → ∞
Individual no-regret                −     within ε > 0, almost    exact, almost
                                          always (PY, p. 82)      always (PY, p. 60)
Collective convergence to coarse
correlated equilibria               −     within ε > 0, almost    exact, almost
                                          always (PY, p. 83)      always (PY, p. 60)
Part III:
Finite memory and inertia
Finite memory: motivation
• In their basic version, most
learning rules rely on the entire
history of play.
• People, as well as computers,
have a finite memory. (On the
other hand, for average or
discounted payoffs this is no real
problem.)
• Nevertheless: experiences in the
distant past are apt to be less
relevant than more recent ones.
• Idea: let players have a finite
memory of m rounds.
Inertia: motivation
• When players’ strategies are
constantly re-evaluated,
discontinuities in behaviour are
likely to occur.
Example: the asymmetric
coordination game.
• Discontinuities in behaviour are
less likely to lead to equilibria of
any sort.
• Idea: let players play the same
action as in the previous round
with probability λ.
Weakly acyclic games
• Game G with action space X.
• G′ = (V, E) where V = X and
E = { (x, y) | for some i :
y−i = x−i and ui(yi, y−i) >
ui(xi, y−i) }
• For all x ∈ X: x is a sink iff x is a
Nash equilibrium.
• G is said to be weakly acyclic under better replies if every node is
connected to a sink.
• WAuBR ⇒ ∃ Nash equilibrium.
Example (payoff bimatrix of a weakly acyclic game):
(1, 1)  (2, 4)  (4, 2)
(1, 1)  (4, 2)  (2, 4)
(3, 3)  (1, 1)  (1, 1)
Examples of weakly acyclic games
Coordination games Two-person games with
identical actions for all players, where best
responses are formed by the diagonal of
the joint action space.
Potential games (Monderer and Shapley, 1996).
There is a function ρ : X → R, called the potential, such that for every
player i and all action profiles x, y ∈ X:
y−i = x−i ⇒ ui(yi, y−i) − ui(xi, x−i) = ρ(y) − ρ(x)
Example: congestion games.
True: the potential function increases along every path.
⇒ : Paths cannot cycle.
⇒ : In finite graphs, paths must end.
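The defining property of a potential can be verified mechanically. The sketch below (a hypothetical minimal example of mine, not from the slides) uses the two-player coordination game, an identical-interest game, where the common payoff itself serves as the potential ρ:

```python
from itertools import product

ACTIONS = [0, 1]

def u(i, profile):
    """Both players receive 1 iff they coordinate (identical interests)."""
    return 1.0 if profile[0] == profile[1] else 0.0

def rho(profile):
    """Candidate potential: the common payoff."""
    return 1.0 if profile[0] == profile[1] else 0.0

# Verify the defining property: for every unilateral deviation x -> y
# (i.e. y_-i = x_-i), u_i(y) - u_i(x) must equal rho(y) - rho(x).
ok = all(
    abs((u(i, y) - u(i, x)) - (rho(y) - rho(x))) < 1e-12
    for x in product(ACTIONS, repeat=2)
    for i in (0, 1)
    for y in [tuple(yi if j == i else x[j] for j in (0, 1)) for yi in ACTIONS]
)
```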
Weakly acyclic games under finite memory and inertia
Theorem. Let G be a finite weakly acyclic n-person game. Every better-reply process
with finite memory and inertia converges to a pure Nash equilibrium of G.
Proof. (Outline.)
1. Let the state space Z be Xm.
2. A state x ∈ Xm is called homogeneous if it consists of identical action
profiles x. Such a state is denoted by 〈x〉. Z∗ =Def { homogeneous states }.
3. In a moment, it will be shown that the process hits Z∗ infinitely often.
4. In a moment, it will be shown that the overall probability to play any
action is bounded away from zero.
5. It can easily be seen that Absorbing = Z∗ ∩ Pure Nash.
6. In a moment, it will be shown that, due to weak acyclicity, inertia, and
(4), the process eventually lands in an absorbing state which, due to (5),
is a repeated pure Nash equilibrium. ∎
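Before going through the claims, the theorem can be illustrated with a minimal sketch (mine, not from the slides): a better-reply process with inertia λ on the two-action coordination game, a weakly acyclic game, using memory m = 1. The inertia value, seed and round count are illustrative assumptions.

```python
import random

def better_reply_with_inertia(rounds=200, lam=0.5, seed=1):
    """Better-reply process with inertia in the 2-action coordination game:
    with probability lam a player repeats its previous action; otherwise it
    switches whenever switching strictly improves on the last round
    (a memory-1 instance of the theorem's setting)."""
    rng = random.Random(seed)
    prev = (0, 1)                      # start miscoordinated
    history = [prev]
    for _ in range(rounds):
        nxt = []
        for i in (0, 1):
            mine, theirs = prev[i], prev[1 - i]
            if rng.random() < lam:
                nxt.append(mine)       # inertia: stick to the current action
            elif mine != theirs:
                nxt.append(theirs)     # better reply: matching pays strictly more
            else:
                nxt.append(mine)       # coordinated: no better reply exists
        prev = tuple(nxt)
        history.append(prev)
    return history

history = better_reply_with_inertia()
# Once coordinated, no player has a better reply: the process is absorbed
# in a pure Nash equilibrium, as the theorem predicts.
```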
First claim: process hits Z∗ infinitely often
Let inertia be determined by λ > 0.
Pr(all players play their previous action) = λ^n.
Hence,
Pr(all players play their previous action during m subsequent rounds) = λ^{nm}.
If all players play their previous action during m subsequent rounds, then the
process arrives at a homogeneous state. But also conversely. Hence, for all t,
Pr(process arrives at a homogeneous state in round t + m) ≥ λ^{nm}.
Infinitely many disjoint histories of length m occur, hence infinitely many
independent events “homogeneous at t + m” occur.
Apply the (second) Borel-Cantelli lemma: if {En}n are independent events, and
∑∞n=1 Pr(En) is unbounded, then Pr(an infinite number of En's occur) = 1. ∎
Second claim: all actions will be played with probability γ > 0
A better-reply learning method from states (finite histories) to strategies
(probability distributions on actions)
γi : Z → ∆(Xi)
possesses the following important properties:
i) It is deterministic.
ii) Every action is played with positive probability.
1. Let γi = inf{ γi(z)(xi) | z ∈ Z, xi ∈ Xi }.
Since Z and Xi are finite, the “inf” is a “min,” and γi > 0.
2. Similarly, let γ = inf{ γi | 1 ≤ i ≤ n }.
Since there are finitely many players, the “inf” is a “min,” and γ > 0. ∎
Final claim: the overall probability to reach a sink from Z∗ is > 0
Suppose the process is in 〈x〉.
1. If x is pure Nash, we are done, because response functions are
deterministic better replies.
2. If x is not pure Nash, there must be an edge x → y in the better-reply
graph. Suppose this edge concerns action yi of player i. We now know
that yi is played with probability at least γ, irrespective of player and
state.
Further probabilities:
• All other players j ≠ i keep playing the same action: λ^{n−1}.
• Edge x → y is actually traversed: γλ^{n−1}.
• Profile y is maintained for another m − 1 rounds, so as to arrive at
state 〈y〉: λ^{n(m−1)}.
• To traverse from 〈x〉 to 〈y〉: γλ^{n−1} · λ^{n(m−1)} = γλ^{nm−1}.
• The image 〈x(1)〉, . . . , 〈x(l)〉 of a better-reply path x(1), . . . , x(l) is
followed to a sink: ≥ (γλ^{nm−1})^L, where L is the length of a longest
better-reply path.
Since Z∗ is encountered infinitely often, the result follows. ∎
Summary
• With fictitious play, the behaviourof opponents is modelled by (or
represented by, or projected on)
a mixed strategy.
• Fictitious play ignores
sub-optimal actions.
• There is a family of so-calledbetter-reply learning rules, that
i) play sub-optimal actions, and
ii) can be brought arbitrary close
to fictitious play.
• In weakly acyclic n-person
games, every better-reply
process with finite memory and
inertia converges to a pure Nash
equilibrium.
What next?
Bayesian play:
• With fictitious play, the behaviour of opponents is modelled by a single
mixed strategy.
• With Bayesian play, opponents are modelled by a probability distribution
over (a possibly confined set of) mixed strategies.
Gradient dynamics:
• Like fictitious play, players model (or assess) each other through mixed
strategies.
• Strategies are not played, only maintained.
• Due to CKR (common knowledge of rationality, cf. Hargreaves Heap &
Varoufakis, 2004), all models of mixed strategies are correct. (I.e.,
q−i = s−i, for all i.)
• Players gradually adapt their mixed strategies through hill-climbing in
the payoff space.