Transcript of lecture slides: MDP (brief), Q learning, game theory (brief), Markov games (2-player), Littman's Minimax Q learning, Hu & Wellman's Nash Q learning

Posted: 15-Jan-2016
Outline
• MDP (brief)– Background– Learning MDP
• Q learning
• Game theory (brief)– Background
• Markov games (2-player)– Background– Learning Markov games
• Littman’s Minimax Q learning (zero-sum)• Hu & Wellman’s Nash Q learning (general-sum)
MDP / SG / POSG
• Stochastic games (SG)
• Partially observable stochastic games (POSG)
The value of a state–action pair decomposes into the immediate reward plus the expectation over next states of the value of the next state.
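This decomposition is the Bellman optimality equation; a standard reconstruction in the usual MDP notation (R for the reward function, T for the transition probabilities, γ for the discount factor — symbols assumed, not shown on the slide):

```latex
Q^*(s,a) \;=\; \underbrace{R(s,a)}_{\text{immediate reward}}
\;+\; \gamma \underbrace{\sum_{s'} T(s,a,s')}_{\text{expectation over next states}}
\, \underbrace{v^*(s')}_{\text{value of next state}}
```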
• Model-based reinforcement learning:
  1. Learn the reward function and the state transition function
  2. Solve for the optimal policy
• Model-free reinforcement learning:
  1. Directly learn the optimal policy without knowing the reward function or the state transition function
• Number of times action a has been executed in state s
• Number of times action a caused the transition s → s'
• Total reward accrued when applying a in s
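These three counters are all a model-based learner needs to estimate the transition and reward functions by maximum likelihood. A minimal sketch (the function and dictionary names are illustrative, not from the slides):

```python
from collections import defaultdict

# The three counters from the slide
n_sa = defaultdict(int)      # times action a has been executed in state s
n_sas = defaultdict(int)     # times action a caused the transition s -> s'
reward_sa = defaultdict(float)  # total reward accrued when applying a in s

def record(s, a, r, s_next):
    """Update the counters after one observed transition."""
    n_sa[(s, a)] += 1
    n_sas[(s, a, s_next)] += 1
    reward_sa[(s, a)] += r

def estimates(s, a, s_next):
    """Maximum-likelihood estimates of T(s,a,s') and R(s,a)."""
    T_hat = n_sas[(s, a, s_next)] / n_sa[(s, a)]
    R_hat = reward_sa[(s, a)] / n_sa[(s, a)]
    return T_hat, R_hat

record("s0", "a", 1.0, "s1")
record("s0", "a", 0.0, "s0")
print(estimates("s0", "a", "s1"))  # (0.5, 0.5)
```

With the estimated model in hand, step 2 of the model-based recipe (solving for the optimal policy) can proceed by value iteration.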
1. Start with arbitrary initial values of Q(s, a), for all s ∈ S, a ∈ A
2. At each time t the agent chooses an action and observes its reward r_t
3. The agent then updates its Q-values using the Q-learning update rule
4. The learning rate α_t must decay over time for the learning algorithm to converge
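The four steps above can be sketched as tabular Q-learning. The environment interface `env_step(s, a)` returning `(reward, next_state, done)` is an assumption for illustration, as is the ε-greedy action choice; the `1/visits` learning-rate schedule is one common way to satisfy the decay condition in step 4:

```python
import random

def q_learning(env_step, n_states, n_actions, episodes=500, gamma=0.9, eps=0.1):
    """Tabular Q-learning following steps 1-4 above."""
    Q = [[0.0] * n_actions for _ in range(n_states)]   # step 1: arbitrary init
    visits = [[0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # step 2: choose an action (epsilon-greedy) and observe the reward
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            r, s_next, done = env_step(s, a)
            visits[s][a] += 1
            alpha = 1.0 / visits[s][a]                  # step 4: decaying rate
            # step 3: Q-learning update toward r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

# Toy one-state, one-step environment: action 1 pays 1, action 0 pays 0
random.seed(0)
Q = q_learning(lambda s, a: (1.0 if a == 1 else 0.0, 0, True),
               n_states=1, n_actions=2, episodes=200, eps=0.5)
print(Q[0])  # Q[0][1] converges to 1.0
```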
Famous game theory example
A co-operative game
Mixed strategy
Generalization of MDP
• Stationary: the agent's policy does not change over time
• Deterministic: the same action is always chosen whenever the agent is in state s
Example

 0  1 -1
-1  0  1
 1 -1  0

 1 -1
-1  1

State 1:
 2  1  1
 1  2  1
 1  1  2

State 2:
 1  1
 1  1
v(s, π*) ≥ v(s, π) for all s ∈ S and all policies π
Maximize V
subject to: π(rock) + π(paper) + π(scissors) = 1, each probability nonnegative, and V no greater than the expected payoff of the mixed strategy against each opponent action
v(s) = max_π min_o Σ_a π(a) Q(s, a, o)

The inner minimum is the worst case: the opponent plays its best response against π. The sum is the expectation over all of the agent's own actions under the mixed strategy π.
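The linear program above can be sanity-checked by brute force on the rock–paper–scissors matrix from the example. A small sketch (the payoff matrix is the row player's; the `worst_case` helper is illustrative):

```python
# Row player's rock-paper-scissors payoffs (rows: own action, cols: opponent's)
A = [[0, 1, -1],
     [-1, 0, 1],
     [1, -1, 0]]

def worst_case(p):
    """Minimum over opponent pure actions of the expected payoff under mixed strategy p."""
    return min(sum(p[i] * A[i][j] for i in range(3)) for j in range(3))

# The uniform strategy guarantees V = 0, the minimax value of the game,
# while any pure strategy can be exploited down to -1.
print(worst_case([1/3, 1/3, 1/3]))   # 0.0
print(worst_case([1.0, 0.0, 0.0]))   # -1.0
```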
Q(s, a, o), the quality of a state–action pair against opponent action o, is the immediate reward plus the discounted value of all succeeding states, weighted by their likelihood:

Q(s, a, o) = R(s, a, o) + γ Σ_s' T(s, a, o, s') v(s')

This learning rule converges to the correct values of Q and v.
The parameter explor controls how often the agent deviates from its current policy in order to explore. Q(s, a, o) is the expected reward for taking action a when the opponent chooses o from state s.
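The Minimax-Q update can be sketched as follows. Littman's algorithm solves a linear program for the minimax value; here a coarse grid search over two-action mixed strategies (p, 1 − p) stands in for the LP, so this is an approximation for illustration, and the function names are not from the slides:

```python
def minimax_value(Q_s, grid=101):
    """Approximate v(s) = max_pi min_o sum_a pi(a) Q(s,a,o).
    Q_s[a][o] is the learned quality of own action a vs opponent action o.
    Two own actions assumed; grid search replaces the linear program."""
    best = float("-inf")
    for i in range(grid):
        p = i / (grid - 1)
        worst = min(p * Q_s[0][o] + (1 - p) * Q_s[1][o]
                    for o in range(len(Q_s[0])))
        best = max(best, worst)
    return best

def minimax_q_update(Q, V, s, a, o, r, s_next, alpha=0.1, gamma=0.9):
    """One Minimax-Q update after observing (s, a, o, r, s_next)."""
    Q[s][a][o] = (1 - alpha) * Q[s][a][o] + alpha * (r + gamma * V[s_next])
    V[s] = minimax_value(Q[s])

# Matching-pennies payoffs: the minimax value is 0 at the mixed strategy (1/2, 1/2)
print(minimax_value([[1, -1], [-1, 1]]))  # 0.0
```

In the full algorithm the policy π(s) is re-derived from the same optimization that produces V(s), and α decays over time as in ordinary Q-learning.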
Hu and Wellman: general-sum Markov games as a framework for RL.

Theorem (Nash, 1951). There exists a mixed-strategy Nash equilibrium for any finite bimatrix game.