Multi-agent learning Fictitious Play
Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department,
Faculty of Sciences, Utrecht University, The Netherlands.
Gerard Vreeswijk. Last modified on February 27th, 2012 at 18:35 Slide 1
Fictitious play: motivation
• Rather than considering your
own payoffs, monitor the
behaviour of your opponent(s),
and respond optimally.
• Behaviour of an opponent is
projected on a single mixed
strategy.
• Brown (1951): explanation for
Nash equilibrium play.
In terms of current use, the name
is a bit of a misnomer, since play
actually occurs (Berger, 2005).
• One of the most important, if not
the most important, representative
of a follower strategy.
Plan for today
Part I. Best reply strategy
1. Pure fictitious play.
2. Results that connect pure fictitious play to Nash equilibria.
Part II. Extensions and approximations of fictitious play
1. Smoothed fictitious play.
2. Exponential regret matching.
3. No-regret property of smoothed fictitious play (Fudenberg et al., 1995).
4. Convergence of better reply strategies when players have limited
memory and are inert [tend to stick to their current strategy] (Young,
1998).
Shoham et al. (2009): Multi-agent Systems. Ch. 7: “Learning and Teaching”.
H. Young (2004): Strategic Learning and its Limits, Oxford UP.
D. Fudenberg and D.K. Levine (1998), The Theory of Learning in Games, MIT Press.
Part I:
Pure fictitious play
Repeated Coordination Game
Players receive payoff p > 0 iff they coordinate. This game possesses three
Nash equilibria, viz. (0, 0), (0.5, 0.5), and (1, 1).
Round  A's action  B's action  A's beliefs  B's beliefs
0.                             (0.0, 0.0)   (0.0, 0.0)
1.     L*          R*          (0.0, 1.0)   (1.0, 0.0)
2.     R           L           (1.0, 1.0)   (1.0, 1.0)
3.     L*          R*          (1.0, 2.0)   (2.0, 1.0)
4.     R           L           (2.0, 2.0)   (2.0, 2.0)
5.     R*          R*          (2.0, 3.0)   (2.0, 3.0)
6.     R           R           (2.0, 4.0)   (2.0, 4.0)
7.     R           R           (2.0, 5.0)   (2.0, 5.0)
...
(An asterisk marks a round in which the best response was not unique, so the
tie had to be broken.)
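The dynamics in this table can be simulated in a few lines. The sketch below is not from the slides: belief updates and best responses follow the definitions of fictitious play, but ties are broken at random, so it reproduces the qualitative lock-in on one coordination equilibrium rather than this exact trace (all names and parameters are my own).

```python
import random

def fictitious_play_coordination(rounds=100, seed=0):
    """Fictitious play in the 2-action coordination game:
    both players receive payoff p > 0 iff they choose the same action."""
    rng = random.Random(seed)
    actions = ["L", "R"]
    # belief[i][a] = how often player i has seen its opponent play a
    belief = [{"L": 0.0, "R": 0.0}, {"L": 0.0, "R": 0.0}]
    history = []
    for _ in range(rounds):
        played = []
        for i in (0, 1):
            b = belief[i]
            if b["L"] > b["R"]:
                a = "L"                    # best response: match the more frequent action
            elif b["R"] > b["L"]:
                a = "R"
            else:
                a = rng.choice(actions)    # break ties at random
            played.append(a)
        history.append(tuple(played))
        belief[0][played[1]] += 1          # player 0 observes player 1, and vice versa
        belief[1][played[0]] += 1
    return history

history = fictitious_play_coordination()
# Once the players coordinate, both best responses stay put: a steady state.
```

Once a coordinated round occurs, each player's belief strictly favours that action forever, which is exactly the lock-in visible from round 5 onwards in the table.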
Steady states are pure (but possibly weak) Nash equilibria
Definition (Steady state). An action profile a is a steady state (or absorbing
state) of fictitious play if, whenever a is played at round t, it is inevitably
also played at round t + 1.
Theorem. If a pure strategy profile is a steady state of fictitious play, then it is a
(possibly weak) Nash equilibrium in the stage game.
Proof . Suppose a = (a1, . . . , an) is a steady state. Consequently, i’s opponent
model converges to a−i, for all i. By definition of fictitious play, i plays best
responses to a−i, i.e.,
∀i : ai ∈ BR(a−i).
The latter is precisely the definition of a Nash equilibrium. ∎
Still, the resulting Nash equilibrium is often strict, because for weak equilibria
the process is likely to drift due to alternative best responses.
Pure strict Nash equilibria are steady states
Theorem. If a pure strategy profile is a strict Nash equilibrium of a stage game, then
it is a steady state of fictitious play in the repeated game.
Notice the use of terminology: “pure strategy profile” for Nash equilibria;
“action profile” for steady states.
Proof . Suppose a is a pure Nash equilibrium and ai is played at round t, for
all i. Because a is strict, ai is the unique best response to a−i. Because this
argument holds for each i, action profile a will be played in round t + 1 again.
∎
Summary of the two theorems:
Pure strict Nash ⇒ Steady state ⇒ Pure Nash.
But what if pure Nash equilibria do not exist?
Repeated game of Matching Pennies
Zero-sum game. A's goal is to have the pennies matched; B's goal is the opposite.
Round  A's action  B's action  A's beliefs  B's beliefs
0.                             (1.5, 2.0)   (2.0, 1.5)
1.     T           T           (1.5, 3.0)   (2.0, 2.5)
2.     T           H           (2.5, 3.0)   (2.0, 3.5)
3.     T           H           (3.5, 3.0)   (2.0, 4.5)
4.     H           H           (4.5, 3.0)   (3.0, 4.5)
5.     H           H           (5.5, 3.0)   (4.0, 4.5)
6.     H           H           (6.5, 3.0)   (5.0, 4.5)
7.     H           T           (6.5, 4.0)   (6.0, 4.5)
8.     H           T           (6.5, 5.0)   (7.0, 4.5)
...
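A minimal sketch (not from the slides) that runs this process forward, seeding the beliefs with the (H, T) counts from round 0 of the table. The first rounds reproduce the trace above; in the long run the actions keep switching, yet each player's empirical frequency of Heads tends to 0.5, anticipating the theorem on the next slide.

```python
def fictitious_play_pennies(rounds=50000):
    """Fictitious play in Matching Pennies: A wins if the pennies match,
    B wins if they differ. Beliefs are (H, T) counts, seeded as in the table."""
    belief_a = {"H": 1.5, "T": 2.0}   # A's counts of B's past plays
    belief_b = {"H": 2.0, "T": 1.5}   # B's counts of A's past plays
    heads_a = 0
    for _ in range(rounds):
        # A wants to match B's more frequent action; ties broken towards H.
        a = "H" if belief_a["H"] >= belief_a["T"] else "T"
        # B wants to mismatch A's more frequent action.
        b = "T" if belief_b["H"] >= belief_b["T"] else "H"
        belief_a[b] += 1
        belief_b[a] += 1
        heads_a += (a == "H")
    return heads_a / rounds

freq = fictitious_play_pennies()
# freq is close to 0.5: the empirical distribution approaches the mixed
# Nash equilibrium even though the stage-game play itself keeps cycling.
```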
Convergent empirical distribution of strategies
Theorem. If the empirical distribution of each player’s strategies converges in
fictitious play, then it converges to a Nash equilibrium.
Proof . Same as before. If the empirical distributions converge to q, then i’s
opponent model converges to q−i, for all i. By definition of fictitious play,
qi ∈ BR(q−i). Because of convergence, all such (mixed) best replies remain the
same. By definition we have a Nash equilibrium. ∎
Remarks:
1. The qi may be mixed.
2. It actually suffices that the q−i
converge asymptotically to the
actual distribution (Fudenberg &
Levine, 1998).
3. If empirical distributions
converge (hence, converge to a
Nash equilibrium), the actually
played responses per stage need
not be Nash equilibria of the
stage game.
Empirical distributions converge to Nash ⇏ stage Nash
Repeated Coordination Game. Players receive payoff p > 0 iff they
coordinate.
Round  A's action  B's action  A's beliefs  B's beliefs
0.                             (0.5, 1.0)   (1.0, 0.5)
1.     B           A           (1.5, 1.0)   (1.0, 1.5)
2.     A           B           (1.5, 2.0)   (2.0, 1.5)
3.     B           A           (2.5, 2.0)   (2.0, 2.5)
4.     A           B           (2.5, 3.0)   (3.0, 2.5)
...
• This game possesses three equilibria, viz. (0, 0), (0.5, 0.5), and (1, 1), with
expected payoffs p, p/2, and p, respectively.
• The empirical distribution of play converges to (0.5, 0.5), yet with actual
payoff 0 rather than p/2: the players miscoordinate in every single round.
Empirical distribution of play does not need to converge
Rock-paper-scissors. Winner receives payoff p > 0. Else, payoff zero.
• Rock-paper-scissors with these payoffs is known as the Shapley game.
• The Shapley game possesses one equilibrium, viz. (1/3, 1/3, 1/3), with
expected payoff p/3.
Round  A's action  B's action  A's beliefs      B's beliefs
0.                             (0.0, 0.0, 0.5)  (0.0, 0.5, 0.0)
1.     Rock        Scissors    (0.0, 0.0, 1.5)  (1.0, 0.5, 0.0)
2.     Rock        Paper       (0.0, 1.0, 1.5)  (2.0, 0.5, 0.0)
3.     Rock        Paper       (0.0, 2.0, 1.5)  (3.0, 0.5, 0.0)
4.     Scissors    Paper       (0.0, 3.0, 1.5)  (3.0, 0.5, 1.0)
5.     Scissors    Paper       (0.0, 4.0, 1.5)  (3.0, 0.5, 2.0)
...
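The sketch below (my own, not from the slides) continues this table, seeding the (Rock, Paper, Scissors) belief counts from round 0. The runs of constant joint play grow longer and longer, so the empirical distribution keeps oscillating and never settles, which is Shapley's non-convergence phenomenon.

```python
BEATS = {"Rock": "Scissors", "Paper": "Rock", "Scissors": "Paper"}

def shapley_fictitious_play(rounds=3000):
    """Fictitious play in the Shapley game: the winner of a
    rock-paper-scissors round receives p > 0, everyone else 0.
    Beliefs are (Rock, Paper, Scissors) counts, seeded as in the table."""
    acts = ["Rock", "Paper", "Scissors"]
    bel_a = {"Rock": 0.0, "Paper": 0.0, "Scissors": 0.5}  # A's model of B
    bel_b = {"Rock": 0.0, "Paper": 0.5, "Scissors": 0.0}  # B's model of A

    def best_reply(belief):
        # The expected payoff of playing x is proportional to the belief
        # weight of the action that x beats.
        return max(acts, key=lambda x: belief[BEATS[x]])

    profiles = []
    for _ in range(rounds):
        a, b = best_reply(bel_a), best_reply(bel_b)
        profiles.append((a, b))
        bel_a[b] += 1
        bel_b[a] += 1
    return profiles

profiles = shapley_fictitious_play()

# Lengths of maximal runs of constant joint play: these keep growing, so
# the empirical frequencies cannot converge.
runs = [1]
for prev, cur in zip(profiles, profiles[1:]):
    if cur == prev:
        runs[-1] += 1
    else:
        runs.append(1)
```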
Repeated Shapley Game: Phase Diagram
[Phase diagram omitted: a simplex with corners Rock, Paper and Scissors; the
arrows trace the cycle of best replies around the simplex.]
Part II:
Extensions and approximations of
fictitious play
Proposed extensions to fictitious play
Build forecasts, not on complete history, but on
• Recent data, say on m most recent rounds.
• Discounted data, say with discount factor γ.
• Perturbed data, say with error ε on individual observations.
• Random samples of historical data, say on random samples of size m.
Give not necessarily best responses, but
• ε-greedy.
• Perturbed throughout, with small random shocks.
• Randomly, and proportional to expected payoff.
Framework for predictive learning (like fictitious play)
A forecasting rule for player i is a function that maps a history to a probability
distribution over the opponents’ actions in the next round:
fi : H → ∆(X−i).
A response rule for player i is a function that maps a history to a probability
distribution over i’s own actions in the next round:
gi : H → ∆(Xi).
A predictive learning rule for player i is the combination of a forecasting rule
and a response rule. This is typically written as ( fi, gi).
• This framework can be attributed to J.S. Jordan (1993).
• Forecasting and response functions are deterministic.
• Reinforcement and regret learning do not fit this framework: they do not
involve prediction.
Forecasting and response rules for fictitious play
Let ht ∈ Ht be a history of play up to and including round t, and let
φjt =Def the empirical distribution of j's actions up to and including round t.
Then the fictitious play forecasting rule is given by
fi(ht) =Def ∏_{j ≠ i} φjt.
Let fi be a fictitious play forecasting rule. Then gi is said to be a fictitious
play response rule if all of its values are best responses to the values of fi.
Remarks:
1. Player i attributes a mixed strategy φjt to player j. This strategy reflects
the number of times each action is played by j.
2. The mixed strategies are assumed to be independent.
3. Both (1) and (2) are simplifying assumptions.
Smoothed fictitious play
Notation:
p−i : strategy profile of opponents as predicted by fi in round t.
ui(xi, p−i) : expected utility of action xi, given p−i.
qi : strategy profile of player i in round t + 1. I.e., gi(h).
Task: define qi given p−i and ui(xi, p−i).
Idea:
Respond randomly, but (somehow) proportional to expected payoff.
Elaborations of this idea:
a) Strictly proportional:
qi(xi | p−i) =Def ui(xi, p−i) / ∑_{x′i ∈ Xi} ui(x′i, p−i).
b) Through, what is called, mixed logit:
qi(xi | p−i) =Def e^{ui(xi, p−i)/γi} / ∑_{x′i ∈ Xi} e^{ui(x′i, p−i)/γi}.
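Elaboration (b) can be sketched as follows (my own minimal implementation; the function name, the utility vector and the γ values are illustrative assumptions). Small γ concentrates the probability mass on the best response, large γ spreads it almost evenly:

```python
import math

def logit_response(utilities, gamma):
    """Mixed-logit (quantal) response: the probability of each action is
    proportional to exp(u / gamma).  Small gamma -> near best response,
    large gamma -> near uniform randomisation."""
    weights = [math.exp(u / gamma) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

sharp = logit_response([1.0, 0.5, 0.0], gamma=0.05)  # almost all mass on action 0
soft  = logit_response([1.0, 0.5, 0.0], gamma=50.0)  # close to uniform
```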
Mixed logit, or quantal response function
• Let d1 + · · · + dn = 1 and dj ≥ 0. Then
logit(di) =Def e^{di/γ} / ∑_j e^{dj/γ},
where γ > 0.
• The logit function can be seen as a soft maximum on n variables.
γ ↓ 0 : logit “shares” 1 among all maximal di
γ = 1 : logit is strictly proportional
γ → ∞ : logit “spreads” 1 among all di evenly
Mixed logit can be justified in different ways.
a) On the basis of information and entropy arguments.
b) By assuming the dj are i.i.d. extreme value (a.k.a. log Weibull) distributed.
Anderson et al. (1992): Discrete Choice Theory of Product Differentiation. Sec. 2.6.1: “Derivation of the Logit”.
Evenly (γ → ∞) −→ mixed logit −→ best response only (γ ↓ 0)
As you see, mixed logit respects best replies, but leaves room for
experimentation.
Digression: Coding theory and entropy
This digression tries to answer the following question:
Why does play according to a diversified strategy yield more information
than play according to a strategy where only a few options are played?
• To send 8 different binary
encoded messages would cost 3
bits. Encoded messages are 000,
001, . . . 111.
• To encode 16 different messages,
we would need log2 16 = 4 bits.
• To encode 20 different messages,
we would need ⌈log2 20⌉ = ⌈4.32⌉ = 5 bits.
If some messages are sent more
frequently than others, it pays off to
search for a code such that messages
that occur more frequently are
represented by short code words (at
the expense of messages that are sent
less frequently, which must then be
represented by the remaining longer
code words).
Coding theory and entropy (continued)
Example. Suppose persons A
and B work on a dark terrain.
They are separated, and can only
communicate by Morse code through
a flashlight.
A and B have agreed to send only the
following messages:
m1 Yes
m2 No
m3 All well?
m4 Shall I come over?
A possible encoding could be Code 1:
m1 ↔ 00
m2 ↔ 01
m3 ↔ 10
m4 ↔ 11
Coding theory and entropy (continued)
Another encoding could be Code 2:
m1 ↔ 0
m2 ↔ 10
m3 ↔ 110
m4 ↔ 111
To prevent ambiguity, no code word
may be a prefix of some other code
word.
A useless coding would be Code 3:
m1 ↔ 0
m2 ↔ 1
m3 ↔ 00
m4 ↔ 01
Under Code 3, the sequence 0101
may mean different things, such as
m1, m2, m1, m2, or m1, m2, m4. (There
are still other possibilities.)
• The objective is to search for an
efficient encoding, i.e., an encoding
that minimises the expected number
of bits per message.
• If the relative frequency of messages
is known, we can compute, for every
code, the expected number of bits
per message, hence its efficiency.
Coding theory and entropy [end of digression]
The following would be a plausible
probability distribution:
m1 Yes 1/2
m2 No 1/4
m3 All well? 1/8
m4 Shall I come over? 1/8
For Code 2,
E[number of bits] = 1/2 · 1 + 1/4 · 2 + 1/8 · 3 + 1/8 · 3 = 1.75.
For Code 1, the expected number of
bits is 2.0. Therefore, Code 2 is more
efficient than Code 1.
Theorem (Noiseless Coding Theorem, Shannon).
p1 log2(1/p1) + · · · + pn log2(1/pn)
is a lower bound for the expected
number of bits in an encoding of
n messages with expected occur-
rence (p1, . . . , pn).
This number is called the entropy of
(p1, . . . , pn). Alternatively, entropy is
−[p1 log2(p1) + · · · + pn log2(pn)].
The entropy of this distribution equals
1.75, which Code 2 attains exactly.
Therefore, Code 2 is optimal.
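The numbers above can be checked directly; the sketch below (mine, not the slides') recomputes the expected code lengths and the entropy of the distribution (1/2, 1/4, 1/8, 1/8):

```python
import math

# Message distribution from the slide:
# Yes 1/2, No 1/4, All well? 1/8, Shall I come over? 1/8
probs = [0.5, 0.25, 0.125, 0.125]
code2_lengths = [1, 2, 3, 3]   # Code 2: 0, 10, 110, 111 (prefix-free)
code1_lengths = [2, 2, 2, 2]   # Code 1: 00, 01, 10, 11

expected_bits_code2 = sum(p * l for p, l in zip(probs, code2_lengths))
expected_bits_code1 = sum(p * l for p, l in zip(probs, code1_lengths))
entropy = -sum(p * math.log2(p) for p in probs)

# Shannon's bound: no encoding can beat the entropy (1.75 bits/message),
# and Code 2 attains it exactly, so Code 2 is optimal.
```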
Smoothed fictitious play (Fudenberg & Levine, 1995)
Smoothed fictitious play is a generalisation of mixed logit. Let
wi : ∆i → R
be a function that “grades” i’s probability distributions (over actions) under
the following conditions.
1. Grading is smooth (wi is infinitely often differentiable).
2. Grading is strictly concave (a bump), in such a manner that ‖∇wi(qi)‖ → ∞
(steepness) whenever qi approaches the boundary of ∆i (i.e., whenever the
distribution becomes extremely uneven).
Let
Ui(qi, p−i) =Def ui(qi, p−i) + γi·wi(qi)
Let fi be a fictitious play forecasting rule and let gi correspond to a best
response based on Ui. Then ( fi, gi) is called smoothed fictitious play with
smoothing function wi and smoothing parameter γi.
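A standard smoothing function satisfying these conditions (an assumption of mine; the slide does not name one) is the entropy of $q_i$. Maximising $U_i$ with this choice recovers exactly the mixed-logit response of the earlier slides:

```latex
w_i(q_i) \;=\; -\sum_{x_i \in X_i} q_i(x_i)\,\log q_i(x_i)
```

The unique maximiser of $U_i(q_i, p_{-i}) = u_i(q_i, p_{-i}) + \gamma_i\, w_i(q_i)$ over $\Delta_i$ is

```latex
q_i(x_i) \;=\; \frac{e^{\,u_i(x_i,\,p_{-i})/\gamma_i}}{\sum_{x_i' \in X_i} e^{\,u_i(x_i',\,p_{-i})/\gamma_i}},
```

so mixed logit is the special case of smoothed fictitious play with entropy smoothing.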
Smoothed fictitious play limits regret
Theorem (Fudenberg & Levine, 1995). Let G be a finite game and let ε > 0. If a
given player uses smoothed fictitious play with a sufficiently small smoothing
parameter, then with probability one his regrets are bounded above by ε.
– Young does not reproduce the proof of Fudenberg et al., but shows that in
this case ε-regret can be derived from a later and more general result of
Hart and Mas-Colell in 2001.
– This later result identifies a large family of rules that eliminate regret,
based on an extension of Blackwell’s approachability theorem.
(Roughly, Blackwell’s approachability theorem generalises maxmin
reasoning to vector-valued payoffs.)
Fudenberg & Levine, 1995. “Consistency and cautious fictitious play,” Journal of Economic Dynamics and Control, Vol. 19 (5-7), pp. 1065-1089.
Hart & Mas-Colell, 2001. “A General Class of Adaptive Strategies,” Journal of Economic Theory, Vol. 98 (1), pp. 26-54.
Smoothed fictitious play converges to ε-CCE
Definition. A coarse correlated equilibrium (CCE) is a probability distribu-
tion on strategy profiles, q ∈ ∆(X), such that no player can opt out (to gain
expected utility) before q is made known.
In a coarse correlated ε-equilibrium (ε-CCE), no player can opt out to gain
more in expectation than ε.
Theorem (Fudenberg & Levine, 1995). Let G be a finite game and let ε > 0. If all
players use smoothed fictitious play with sufficiently small smoothing parameters,
then with probability one empirical play will converge to the set of coarse correlated
ε-equilibria.
Summary of the two theorems: smoothed fictitious play limits regret and
converges to ε-CCE.
There is another learning method with no regret and convergence to
zero-CCE . . .
There are more coarse correlated equilibria than correlated equilibria, and
more correlated equilibria than Nash equilibria.

Simple coordination game:

            Other: Left  Right
You: Left   (1, 1)       (0, 0)
You: Right  (0, 0)       (1, 1)

(In this picture, CCE = CE.)
Exponentiated regret matching
Let j : an action, where 1 ≤ j ≤ k
uti : i's realised average payoff up to and including round t
φ−it : the realised joint empirical distribution of i's opponents
ui(j, φ−it) : i's hypothetical average payoff for playing action j against φ−it
rit : player i's regret vector in round t, with entries ui(j, φ−it) − uti.

Exponentiated regret matching (PY, p. 59) is defined as
qi,j(t+1) ∝ ([ri,j(t)]+)^a,
where a > 0. (For a = 1 we have ordinary regret matching.)
An extended theorem on regret matching (Mas-Colell et al., 2001) ensures that
individual players have no regret with probability one, and that the empirical
distribution of play converges to the set of coarse correlated equilibria (PY,
p. 60).
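The update rule above can be sketched in a few lines (a minimal implementation of my own; the sample regret vector and exponents are illustrative):

```python
def exponentiated_regret_matching(regrets, a=2.0):
    """Next-round mixed strategy: the probability of action j is
    proportional to the a-th power of its positive regret, [r_j]_+^a.
    For a = 1 this is ordinary regret matching."""
    weights = [max(r, 0.0) ** a for r in regrets]
    total = sum(weights)
    if total == 0.0:               # no positive regret anywhere: play uniformly
        n = len(regrets)
        return [1.0 / n] * n
    return [w / total for w in weights]

q1 = exponentiated_regret_matching([2.0, -1.0, 1.0], a=1.0)  # [2/3, 0, 1/3]
q2 = exponentiated_regret_matching([2.0, -1.0, 1.0], a=2.0)  # [4/5, 0, 1/5]
```

Raising the exponent a concentrates ever more mass on the largest positive regret, which is why large a approaches fictitious play, as noted on the next slide.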
FP vs. Smoothed FP vs. Exponentiated RM
Fictitious play Plays best responses.
• Does depend on past play of opponent(s).
• Puts zero probabilities on sub-optimal responses.
Smoothed fictitious play Plays sub-optimal responses, e.g.,
softmax-proportional to their estimated payoffs.
• Does depend on past play of opponent(s).
• Puts non-zero probabilities on sub-optimal responses.
• Approaches fictitious play when γi ↓ 0 (PY, p. 84).
Exponentiated regret matching Plays sub-optimal responses, i.e., with
probabilities proportional to a power of positive regret.
• Does depend on own past payoffs.
• Puts non-zero probabilities on sub-optimal responses.
• Approaches fictitious play when exponent a → ∞ (PY, p. 84).
FP vs. Smoothed FP vs. Exponentiated RM
                                    FP    Smoothed FP             Exponentiated RM
Depends on past play of opponents   √     √                       −
Depends on own past payoffs         −     −                       √
Puts zero probabilities on
sub-optimal responses               √     −                       −
Best response                       √     when smoothing          when exponent
                                          parameter γi ↓ 0        a → ∞
Individual no-regret                −     within ε > 0, almost    exact, almost
                                          always (PY, p. 82)      always (PY, p. 60)
Collective convergence to coarse
correlated equilibria               −     within ε > 0, almost    exact, almost
                                          always (PY, p. 83)      always (PY, p. 60)
Part III:
Finite memory and inertia
Finite memory: motivation
• In their basic version, most
learning rules rely on the entire
history of play.
• People, as well as computers,
have a finite memory. (On the
other hand, for average or
discounted payoffs this is no real
problem.)
• Nevertheless: experiences in the
distant past are apt to be less
relevant than more recent ones.
• Idea: let players have a finite
memory of m rounds.
Inertia: motivation
• When players’ strategies are
constantly re-evaluated,
discontinuities in behaviour are
likely to occur.
Example: the asymmetric
coordination game.
• Discontinuities in behaviour are
less likely to lead to equilibria of
any sort.
• Idea: let players play the same
action as in the previous round
with probability λ.
Weakly acyclic games
• Game G with action space X.
• G′ = (V, E) where V = X and
E = { (x, y) | for some i :
y−i = x−i and ui(yi, y−i) >
ui(xi, y−i) }
• For all x ∈ X: x is a sink iff x is a
Nash equilibrium.
• G is said to be weakly acyclic under better replies if every node is
connected to a sink.
• WAuBR ⇒ ∃ Nash equilibrium.
Example (payoff bimatrix of a weakly acyclic game):
(1, 1)  (2, 4)  (4, 2)
(1, 1)  (4, 2)  (2, 4)
(3, 3)  (1, 1)  (1, 1)
Examples of weakly acyclic games
Coordination games Two-person games with
identical actions for all players, where best
responses are formed by the diagonal of
the joint action space.
Potential games (Monderer and Shapley, 1996).
There is a function ρ : X → R, called the potential, such that for every
player i and all action profiles x, y ∈ X:
y−i = x−i ⇒ ui(yi, y−i) − ui(xi, x−i) = ρ(y) − ρ(x)
Example: congestion games.
True: the potential function increases along every path.
⇒ : Paths cannot cycle.
⇒ : In finite graphs, paths must end.
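The defining property of a potential can be verified mechanically. The sketch below (a hypothetical minimal example of mine, not from the slides) uses the two-player coordination game, an identical-interest game, where the common payoff itself serves as the potential ρ:

```python
from itertools import product

ACTIONS = [0, 1]

def u(i, profile):
    """Both players receive 1 iff they coordinate (identical interests)."""
    return 1.0 if profile[0] == profile[1] else 0.0

def rho(profile):
    """Candidate potential: the common payoff."""
    return 1.0 if profile[0] == profile[1] else 0.0

# Verify the defining property: for every unilateral deviation x -> y
# (i.e. y_-i = x_-i), u_i(y) - u_i(x) must equal rho(y) - rho(x).
ok = all(
    abs((u(i, y) - u(i, x)) - (rho(y) - rho(x))) < 1e-12
    for x in product(ACTIONS, repeat=2)
    for i in (0, 1)
    for y in [tuple(yi if j == i else x[j] for j in (0, 1)) for yi in ACTIONS]
)
```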
Weakly acyclic games under finite memory and inertia
Theorem. Let G be a finite weakly acyclic n-person game. Every better-reply process
with finite memory and inertia converges to a pure Nash equilibrium of G.
Proof. (Outline.)
1. Let the state space Z be Xm.
2. A state x ∈ Xm is called homogeneous if it consists of identical action
profiles x. Such a state is denoted by 〈x〉. Z∗ =Def { homogeneous states }.
3. In a moment, it will be shown that the process hits Z∗ infinitely often.
4. In a moment, it will be shown that the overall probability to play any
action is bounded away from zero.
5. It can easily be seen that Absorbing = Z∗ ∩ Pure Nash.
6. In a moment, it will be shown that, due to weak acyclicity, inertia, and
(4), the process eventually lands in an absorbing state which, due to (5),
is a repeated pure Nash equilibrium. ∎
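Before going through the claims, the theorem can be illustrated with a minimal sketch (mine, not from the slides): a better-reply process with inertia λ on the two-action coordination game, a weakly acyclic game, using memory m = 1. The inertia value, seed and round count are illustrative assumptions.

```python
import random

def better_reply_with_inertia(rounds=200, lam=0.5, seed=1):
    """Better-reply process with inertia in the 2-action coordination game:
    with probability lam a player repeats its previous action; otherwise it
    switches whenever switching strictly improves on the last round
    (a memory-1 instance of the theorem's setting)."""
    rng = random.Random(seed)
    prev = (0, 1)                      # start miscoordinated
    history = [prev]
    for _ in range(rounds):
        nxt = []
        for i in (0, 1):
            mine, theirs = prev[i], prev[1 - i]
            if rng.random() < lam:
                nxt.append(mine)       # inertia: stick to the current action
            elif mine != theirs:
                nxt.append(theirs)     # better reply: matching pays strictly more
            else:
                nxt.append(mine)       # coordinated: no better reply exists
        prev = tuple(nxt)
        history.append(prev)
    return history

history = better_reply_with_inertia()
# Once coordinated, no player has a better reply: the process is absorbed
# in a pure Nash equilibrium, as the theorem predicts.
```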
First claim: process hits Z∗ infinitely often
Let inertia be determined by λ > 0.
Pr(all players play their previous action) = λ^n.
Hence,
Pr(all players play their previous action during m subsequent rounds) = λ^{nm}.
If all players play their previous action during m subsequent rounds, then the
process arrives at a homogeneous state. But also conversely. Hence, for all t,
Pr(process arrives at a homogeneous state in round t + m) ≥ λ^{nm}.
Infinitely many disjoint histories of length m occur, hence infinitely many
independent events “homogeneous at t + m” occur.
Apply the (second) Borel-Cantelli lemma: if {En}n are independent events, and
∑∞n=1 Pr(En) is unbounded, then Pr(an infinite number of En's occur) = 1. ∎
Second claim: all actions will be played with probability γ > 0
A better-reply learning method from states (finite histories) to strategies
(probability distributions on actions)
γi : Z → ∆(Xi)
possesses the following important properties:
i) It is deterministic.
ii) Every action is played with positive probability.
1. Let γi = inf{ γi(z)(xi) | z ∈ Z, xi ∈ Xi }.
Since Z and Xi are finite, the “inf” is a “min,” and γi > 0.
2. Similarly, let γ = inf{ γi | 1 ≤ i ≤ n }.
Since there are finitely many players, the “inf” is a “min,” and γ > 0. ∎
Final claim: the overall probability to reach a sink from Z∗ is > 0
Suppose the process is in 〈x〉.
1. If x is pure Nash, we are done, because response functions are
deterministic better replies.
2. If x is not pure Nash, there must be an edge x → y in the better-reply
graph. Suppose this edge concerns action yi of player i. We now know
that yi is played with probability at least γ, irrespective of player and
state.
Further probabilities:
• All other players j ≠ i keep playing the same action: λ^{n−1}.
• Edge x → y is actually traversed: γλ^{n−1}.
• Profile y is maintained for another m − 1 rounds, so as to arrive at
state 〈y〉: λ^{n(m−1)}.
• To traverse from 〈x〉 to 〈y〉: γλ^{n−1} · λ^{n(m−1)} = γλ^{nm−1}.
• The image 〈x(1)〉, . . . , 〈x(l)〉 of a better-reply path x(1), . . . , x(l) is
followed to a sink: ≥ (γλ^{nm−1})^L, where L is the length of a longest
better-reply path.
Since Z∗ is encountered infinitely often, the result follows. ∎
Summary
• With fictitious play, the behaviourof opponents is modelled by (or
represented by, or projected on)
a mixed strategy.
• Fictitious play ignores
sub-optimal actions.
• There is a family of so-calledbetter-reply learning rules, that
i) play sub-optimal actions, and
ii) can be brought arbitrary close
to fictitious play.
• In weakly acyclic n-person
games, every better-reply
process with finite memory and
inertia converges to a pure Nash
equilibrium.
What next?
Bayesian play:
• With fictitious play, the behaviour of opponents is modelled by a single
mixed strategy.
• With Bayesian play, opponents are modelled by a probability distribution
over (a possibly confined set of) mixed strategies.
Gradient dynamics:
• Like fictitious play, players model (or assess) each other through mixed
strategies.
• Strategies are not played, only maintained.
• Due to CKR (common knowledge of rationality, cf. Hargreaves Heap &
Varoufakis, 2004), all models of mixed strategies are correct. (I.e.,
q−i = s−i, for all i.)
• Players gradually adapt their mixed strategies through hill-climbing in
the payoff space.