Nash Q-Learning for General-Sum Stochastic Games
Hu & Wellman
March 6th, 2006
CS286r
Presented by
Ilan Lobel
Outline
– Stochastic Games and Markov Perfect Equilibria
– Bellman's Operator as a Contraction Mapping
– Stochastic Approximation of a Contraction Mapping
– Application to Zero-Sum Markov Games: Minimax-Q Learning
– Theory of Nash-Q Learning
– Empirical Testing of Nash-Q Learning
How do we model games that evolve over time?
Stochastic games! The current game is the state.
Ingredients:
– Agents (N)
– States (S)
– Payoffs (R)
– Transition Probabilities (P)
– Discount Factor (δ)
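As a sketch, these ingredients might be encoded as follows (a minimal Python structure; all names are illustrative, not from the paper):

# A minimal sketch of a stochastic game's ingredients (illustrative names).
from dataclasses import dataclass
from typing import Dict, Tuple

State = int
JointAction = Tuple[int, ...]  # one action index per agent

@dataclass
class StochasticGame:
    n_agents: int                                                     # Agents (N)
    states: Tuple[State, ...]                                         # States (S)
    rewards: Dict[Tuple[State, JointAction], Tuple[float, ...]]       # Payoffs (R), one per agent
    transitions: Dict[Tuple[State, JointAction], Dict[State, float]]  # P(s' | s, joint action)
    delta: float                                                      # Discount Factor (δ)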
Example of a Stochastic Game
The game has two states, each a bimatrix stage game (the row player picks A or B; the column player picks C, D, or, in state 2, E):

State 1:      C      D
        A   1,2    3,4
        B   5,6    7,8

State 2:      C      D       E
        A  -1,2   -3,4     0,0
        B  -5,6   -7,8  -10,10

The state changes with 30% probability when (B,D) is played, and with 50% probability when (A,C) or (A,D) is played.

δ = 0.9
Markov Perfect Equilibrium (MPE)
A strategy maps states into randomized actions: $\pi_i : S \to \Delta(A_i)$
No agent has an incentive to unilaterally change her policy.
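In symbols (notation assumed, with $V_i$ the discounted value to agent i):

$V_i(s;\, \pi_i, \pi_{-i}) \;\ge\; V_i(s;\, \pi_i', \pi_{-i}) \quad \text{for all agents } i, \text{ all states } s, \text{ and all alternative policies } \pi_i'$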
Cons & Pros of MPEs
Cons:
– Can't implement everything described by the Folk Theorems (e.g., no trigger strategies)
Pros:
– MPEs always exist in finite Markov Games (Fink, 64)
– Easier to "search for"
Learning in Stochastic Games
Learning is especially important in Markov Games because MPEs are hard to compute.
Do we know:
– Our own payoffs?
– Others' rewards?
– Transition probabilities?
– Others' strategies?
Learning in Stochastic Games
Adapted from Reinforcement Learning:
– Minimax-Q Learning (zero-sum games)
– Nash-Q Learning
– CE-Q Learning
Zero-Sum Stochastic Games
Nice properties:
– All equilibria have the same value.
– Any equilibrium strategy of player 1 against any equilibrium strategy of player 2 produces an MPE.
– It has a Bellman-type equation.
Bellman’s Equation in DP
$J^*(s) = \max_a \big\{ r(s,a) + \delta \sum_{s'} P(s,a,s')\, J^*(s') \big\}$
Bellman Operator: T
Bellman’s Equation Rewritten:
TJ*J*
)}'()',,(),({max))(('
sJssaPasrsTJs
a
Contraction Mapping
The Bellman Operator has the contraction property:
$\max_s |TJ(s) - TJ'(s)| \;\le\; \delta \max_s |J(s) - J'(s)|$
Bellman's Equation is a direct consequence of the contraction: by the Banach fixed-point theorem, T has a unique fixed point $J^*$.
The Shapley Operator for Zero-Sum Stochastic Games
$(TJ)(s) = \max_{a_1} \min_{a_2} \big\{ r(s,a_1,a_2) + \delta \sum_{s'} P(s,a_1,a_2,s')\, J(s') \big\}$
The Shapley Operator is a contraction mapping. (Shapley, 53)
Hence, it also has a fixed point, which is an MPE:
$TJ^* = J^*$
Value Iteration for Zero-Sum Stochastic Games
Direct consequence of contraction.
Converges to fixed point of operator.
Start with any $J_0$ and iterate $J_{k+1} = T J_k$.
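A minimal sketch of this iteration for the single-agent case (the zero-sum case would replace the max over actions with the max-min value of the stage game; all names are mine):

def value_iteration(states, actions, r, P, delta, tol=1e-8):
    # r[s][a]: immediate reward; P[s][a]: dict mapping s' -> probability
    J = {s: 0.0 for s in states}  # any J_0 works, since T is a contraction
    while True:
        TJ = {s: max(r[s][a] + delta * sum(p * J[s2] for s2, p in P[s][a].items())
                     for a in actions)
              for s in states}
        if max(abs(TJ[s] - J[s]) for s in states) < tol:
            return TJ             # approximately the fixed point J* = TJ*
        J = TJ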
Q-Learning
Another consequence of the contraction property: Q-Learning converges!
Q-Learning can be described as an approximation of value iteration: value iteration with noise.
Q-Learning Convergence
Q-Learning is called a Stochastic Iterative Approximation of Bellman's operator:
– Learning rate of 1/t.
– Noise is zero-mean and has bounded variance.
It converges if all state-action pairs are visited infinitely often.
(Neuro-Dynamic Programming – Bertsekas, Tsitsiklis)
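As a sketch, one update of tabular Q-Learning under these conditions might look like this (names are mine):

from collections import defaultdict

def q_learning_step(Q, visits, s, a, reward, s_next, actions, delta):
    # Stochastic iterative approximation of the Bellman operator:
    # move Q(s,a) a small step toward the noisy Bellman target.
    visits[(s, a)] += 1
    alpha = 1.0 / visits[(s, a)]  # learning rate of 1/t
    target = reward + delta * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

Q = defaultdict(float)  # Q_0 = 0 for all state-action pairs
visits = defaultdict(int)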
Minimax-Q Learning Algorithm For Zero-Sum Stochastic Games
Initialize your $Q_0(s, a_1, a_2)$ for all states and actions.
Update rule:
$Q_{k+1}(s, a_1, a_2) = (1 - \alpha_t)\, Q_k(s, a_1, a_2) + \alpha_t \big[ r(s, a_1, a_2) + \delta \max_{u_1} \min_{u_2} Q_k(s_{k+1}, u_1, u_2) \big]$
Player 1 then chooses action $u_1$ in the next stage $s_{k+1}$.
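A sketch of this update in Python. The stage-game value is computed with the standard linear program for zero-sum matrix games, which gives the mixed-strategy value (the LP formulation is discussed again later); scipy is assumed available, and the function names are mine:

import numpy as np
from scipy.optimize import linprog

def minimax_value(M):
    # Value of the zero-sum matrix game max_p min_j (p^T M)_j, where
    # M[i, j] is player 1's payoff. Variables: mixed strategy p and value v.
    m, n = M.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                               # maximize v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])  # v - (p^T M)_j <= 0 for all j
    b_ub = np.zeros(n)
    A_eq = np.ones((1, m + 1))
    A_eq[0, -1] = 0.0                          # sum(p) = 1
    b_eq = np.ones(1)
    bounds = [(0, None)] * m + [(None, None)]  # p >= 0, v free
    res = linprog(c, A_ub, b_ub, A_eq, b_eq, bounds)
    return res.x[-1]

def minimax_q_update(Q, alpha, delta, s, a1, a2, reward, s_next):
    # Q[s] is an |A1| x |A2| matrix of Q-factors for state s.
    target = reward + delta * minimax_value(Q[s_next])
    Q[s][a1, a2] = (1 - alpha) * Q[s][a1, a2] + alpha * target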
Minimax-Q Learning
It’s a Stochastic Iterative Approximation of Shapley Operator.
It converges to a Nash Equilibrium if all state-action-action triplets are visited infinitely often. (Littman, 96)
Can we extend it to General-Sum Stochastic Games?
Yes & no. Nash-Q Learning is such an extension. However, it has much worse computational and theoretical properties.
Nash-Q Learning Algorithm
Initialize $Q_0^j(s, a_1, a_2)$ for all states and actions, and for every agent j.
– You must simulate everyone's Q-factors.
Update rule:
$Q_{k+1}^j(s, a_1, a_2) = (1 - \alpha_t)\, Q_k^j(s, a_1, a_2) + \alpha_t \big[ r_j(s, a_1, a_2) + \delta\, \mathrm{Nash}_j\{ Q_k(s_{k+1}, u_1, u_2) \} \big]$
Choose the randomized action generated by the Nash operator.
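A sketch for two players. Computing the stage-game Nash is the hard part; for illustration this brute-forces a pure-strategy Nash only, whereas Nash-Q in general needs a (possibly mixed) equilibrium, selected identically by all players. All names are mine:

import numpy as np

def pure_nash_payoffs(Q1, Q2):
    # Brute-force search for a pure Nash of the bimatrix stage game
    # (Q1, Q2): row player's and column player's Q-factor matrices.
    m, n = Q1.shape
    for i in range(m):
        for j in range(n):
            if Q1[i, j] >= Q1[:, j].max() and Q2[i, j] >= Q2[i, :].max():
                return Q1[i, j], Q2[i, j]  # both players' Nash payoffs
    return None                            # no pure equilibrium exists

def nash_q_update(Q1, Q2, alpha, delta, s, a1, a2, r1, r2, s_next):
    # Each agent simulates everyone's Q-factors, so both tables are updated.
    v1, v2 = pure_nash_payoffs(Q1[s_next], Q2[s_next])
    Q1[s][a1, a2] = (1 - alpha) * Q1[s][a1, a2] + alpha * (r1 + delta * v1)
    Q2[s][a1, a2] = (1 - alpha) * Q2[s][a1, a2] + alpha * (r2 + delta * v2)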
The Nash Operator and the Principle of Optimality
The Nash Operator finds the Nash equilibrium of a stage game: find the Nash of the stage game with the Q-factors as your payoffs.
$r_j(s, a_1, a_2) + \delta\, \mathrm{Nash}_j\{ Q_k(s_{k+1}, u_1, u_2) \}$
The first term is the current reward; the Nash term gives the payoffs for the rest of the Markov Game.
The Nash Operator
Unknown complexity even for 2 players. In comparison, the minimax operator can be solved in polynomial time (there's a linear programming formulation).
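The linear program in question, for the stage game with player 1's Q-factors as payoffs (standard formulation; notation mine, matching the Minimax-Q sketch earlier):

$\max_{\pi,\, v} \; v \quad \text{s.t.} \quad \sum_{a_1} \pi(a_1)\, Q(s, a_1, a_2) \ge v \;\; \text{for all } a_2, \qquad \sum_{a_1} \pi(a_1) = 1, \quad \pi \ge 0$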
For convergence, all players must break ties in favor of the same Nash Equilibrium.
Why not go model-based if computation is so expensive?
Convergence Results
If every stage game encountered during learning has a global optimum (a joint action profile that simultaneously maximizes every player's payoff), Nash-Q converges.
If every stage game encountered during learning has a saddle point (an equilibrium in which each player does at least as well when the others deviate), Nash-Q converges.
Both of these are VERY strong assumptions.
Convergence Result Analysis
The global optimum assumption implies full cooperation between agents.
The saddle point assumption implies no cooperation between agents.
Are these equivalent to DP Q-Learning and Minimax-Q Learning, respectively?
Empirical Performance
In very small and simple games, Nash-Q Learning often converged even though the theory did not predict it.
In particular, when all Nash Equilibria have the same value, Nash-Q did better than expected.