Multiple timescales for multiagent learning
David Leslie and E. J. Collins
University of Bristol
David Leslie is supported by CASE Research Studentship 00317214 from the UK Engineering and Physical Sciences Research Council in cooperation with BAE SYSTEMS.
NIPS 2002 workshop on multiagent learning
Introduction
Learning in iterated normal form games.
Simple environment.
Theoretical properties of multiagent Q-learning.
Notation
$N$ players.
Player $i$ plays mixed strategy $\pi^i$.
Opponent mixed strategy $\pi^{-i}$.
Expected reward for playing $a$ is $r^i(a, \pi^{-i})$.
$r^i(a, \pi_n^{-i})$ estimated by $Q_n^i(a)$.
Mixed strategies
Mixed equilibria necessary.
Mixed strategies from Q values.
Boltzmann smoothing with fixed temperature parameter $\tau$.
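As a minimal sketch of Boltzmann smoothing, the Python function below maps a vector of Q values to a mixed strategy with probabilities proportional to exp(Q(a) / tau). The function name `boltzmann` and the parameter name `tau` are illustrative assumptions, not names taken from the slides.

```python
import math

def boltzmann(q_values, tau):
    """Map Q values to a mixed strategy: pi(a) proportional to exp(Q(a) / tau)."""
    if tau <= 0:
        raise ValueError("temperature tau must be positive")
    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]
```

A high temperature gives a nearly uniform strategy, while a low temperature approaches the greedy choice and still keeps every action probability positive.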
Fixed temperatures
Nash distribution approximates Nash equilibrium.
No discontinuities.
True convergence to mixed strategies.
Q-learning
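Standard Q-learning, except for division by $\pi_n^i(a)$:
$$Q_{n+1}^i(a) = Q_n^i(a) + \lambda_n^i \, I_{\{a_n^i = a\}} \, \frac{R_n^i - Q_n^i(a)}{\pi_n^i(a)}$$
$I$ is the indicator function, $R_n^i$ is the reward.
Learning parameters satisfy $\sum_n \lambda_n^i = \infty$, $\sum_n (\lambda_n^i)^2 < \infty$.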
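The update on this slide can be sketched for a single player as follows. This is an illustrative sketch only: the function name `q_update` and its argument names are assumptions, and the strategy `pi` is taken as given (for instance from Boltzmann smoothing).

```python
def q_update(q, action, reward, pi, step):
    """Return updated Q values after playing `action` and receiving `reward`.

    q      : list of current Q values, one per action
    pi     : mixed strategy used to select `action` (pi[action] > 0)
    step   : learning parameter for this iteration
    """
    new_q = list(q)
    # Only the played action changes (the indicator function), and the
    # increment is divided by the probability of having played it.
    new_q[action] += step * (reward - q[action]) / pi[action]
    return new_q
```

Dividing by the selection probability compensates for actions being played at different frequencies: in expectation, every action's estimate is updated at the same rate.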
Three player pennies
Player 1: 1 point if choice matches player 2.
Player 2: 1 point if choice matches player 3.
Player 3: 1 point if choice is opposite to player 1.
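The payoff structure of three player pennies can be written as a small reward function. In this sketch, actions are encoded as 0 or 1, and a reward of 0 when no point is scored is an assumption; the slide states only when a point is awarded. The name `pennies_rewards` is hypothetical.

```python
def pennies_rewards(a1, a2, a3):
    """Rewards for three player pennies; each action is 0 or 1."""
    r1 = 1 if a1 == a2 else 0  # player 1 wants to match player 2
    r2 = 1 if a2 == a3 else 0  # player 2 wants to match player 3
    r3 = 1 if a3 != a1 else 0  # player 3 wants to differ from player 1
    return r1, r2, r3
```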
A plot of Q values
Stochastic approximation
Relate to an ODE.
$\lambda_n^i = \lambda_n^j$ for all $i, j$ implies Q values track
$$\frac{\mathrm{d}}{\mathrm{d}t} Q_t^i(a) = r^i(a, \pi_t^{-i}) - Q_t^i(a)$$
Deterministic, continuous time system.
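To make the deterministic system concrete, the sketch below integrates the ODE for three player pennies with a forward-Euler scheme. Several things here are assumptions for illustration: the 0/1 payoff encoding, Boltzmann strategies with temperature `tau`, the step size `dt`, and all function names.

```python
import math

def boltzmann2(q, tau):
    """Boltzmann strategy over two actions for Q values q = [q0, q1]."""
    p0 = 1.0 / (1.0 + math.exp((q[1] - q[0]) / tau))
    return [p0, 1.0 - p0]

def q_derivative(qs, tau):
    """Right-hand side dQ^i(a)/dt = r^i(a, pi^{-i}) - Q^i(a) for all players."""
    p1, p2, p3 = (boltzmann2(q, tau) for q in qs)
    expected = [
        [p2[0], p2[1]],  # player 1 scores when matching player 2
        [p3[0], p3[1]],  # player 2 scores when matching player 3
        [p1[1], p1[0]],  # player 3 scores when opposite to player 1
    ]
    return [[expected[i][a] - qs[i][a] for a in range(2)] for i in range(3)]

def integrate(qs, tau, dt=0.01, steps=1000):
    """Forward-Euler integration of the ODE from initial Q values qs."""
    qs = [list(q) for q in qs]
    for _ in range(steps):
        dq = q_derivative(qs, tau)
        qs = [[qs[i][a] + dt * dq[i][a] for a in range(2)] for i in range(3)]
    return qs
```

Under these assumptions, Q values of 0.5 for every player and action give uniform strategies and a zero derivative, so that point is a fixed point of the system; calling `integrate` from other starting values shows how trajectories behave for a chosen temperature.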
Analysis of the example
Unique fixed point.
Small temperatures make fixed point unstable - a periodic orbit is stable.
Explains cycling of Q values.
Multiple timescales - I
Generalise stochastic approximation.
$\lambda_n^i / \lambda_n^j \to 0$ for $i < j$.
The quicker $\lambda_n^i \to 0$, the slower the process adapts.
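One family of learning parameters with this ordering is polynomial decay with a different exponent for each player. The sketch below is illustrative: the form `(n + c) ** (-rho)`, the exponents in `RHOS`, and the function names are all assumptions. Exponents in (0.5, 1] keep the sum of step sizes infinite and the sum of their squares finite.

```python
def learning_rate(n, rho, c=1.0):
    """Polynomial step size (n + c) ** (-rho); a larger rho decays faster."""
    return (n + c) ** (-rho)

# Hypothetical exponents: player 1 decays fastest, so it adapts slowest.
RHOS = [1.0, 0.8, 0.6]

def rate_ratio(n, i, j):
    """Ratio of player i's step size to player j's (0-based indices)."""
    return learning_rate(n, RHOS[i]) / learning_rate(n, RHOS[j])
```

For a lower-indexed player `i` and a higher-indexed player `j`, `rate_ratio(n, i, j)` shrinks towards zero as `n` grows, which is the separation of timescales described on the slide.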
Multiple timescales - II
Fast processes can fully adapt to slow processes.
Slow processes see fast processes as having completely converged.
Will work if the fast processes converge to a unique value for each fixed value of the slow processes.
Multiple-timescales Q-learning assumption
Assume that for fixed $(Q^1, \ldots, Q^j)$ the values of $(Q_n^{j+1}, \ldots, Q_n^N)$ will converge to a unique value, resulting in joint best response $B^j(Q^1, \ldots, Q^j)$.
For example, this holds for two-player games and for cyclic games.
Convergence of multiple-timescales Q-learning
Behaviour determined by the ODE
$$\frac{\mathrm{d}}{\mathrm{d}t} Q_t^1(a) = r^1(a, B^1(Q_t^1)) - Q_t^1(a)$$
Can prove convergence if player 1 has only two actions.
Hence process converges for three player pennies.
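A rough simulation of multiple-timescales Q-learning on three player pennies might look like the sketch below. It is not the authors' implementation: the 0/1 payoffs, the polynomial step sizes `(n + c) ** (-rho)`, the exponents, the temperature, and the function names are all assumptions. The defaults are chosen so that each step size divided by the smallest possible action probability stays below 1, which keeps every Q value inside [0, 1].

```python
import math
import random

def boltzmann2(q, tau):
    """Boltzmann strategy over two actions for Q values q = [q0, q1]."""
    p0 = 1.0 / (1.0 + math.exp((q[1] - q[0]) / tau))
    return [p0, 1.0 - p0]

def simulate(n_steps=5000, tau=0.5, rhos=(1.0, 0.8, 0.6), c=50.0, seed=0):
    """Run multiple-timescales Q-learning on three player pennies.

    Player 1 uses the fastest-decaying step size (slowest learner),
    player 3 the slowest-decaying one (fastest learner).
    Returns the final Q values, one [Q(0), Q(1)] pair per player.
    """
    rng = random.Random(seed)
    qs = [[rng.random(), rng.random()] for _ in range(3)]
    for n in range(n_steps):
        pis = [boltzmann2(q, tau) for q in qs]
        acts = [0 if rng.random() < pi[0] else 1 for pi in pis]
        a1, a2, a3 = acts
        rewards = [
            1.0 if a1 == a2 else 0.0,  # player 1 matches player 2
            1.0 if a2 == a3 else 0.0,  # player 2 matches player 3
            1.0 if a3 != a1 else 0.0,  # player 3 opposes player 1
        ]
        for i in range(3):
            step = (n + c) ** (-rhos[i])  # player-specific timescale
            a = acts[i]
            # Q-learning update with division by the action probability.
            qs[i][a] += step * (rewards[i] - qs[i][a]) / pis[i][a]
    return qs
```

Calling `simulate()` with different seeds or temperatures gives a quick way to inspect how the Q values evolve; smaller temperatures would need a larger `c` to keep the updates inside [0, 1].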
Another plot of Q values
Conclusion
Theoretical study of multiagent learning.
Fixed temperature parameter to achieve mixed equilibria from Q values.
Multiple timescales assist convergence and enable theoretical study.
Future work
Investigate when the convergence assumption must hold.
Experiments with multiple-timescales learning in Markov games.
Theoretical results for multiple-timescales learning in Markov games.