10/28/2014 S. Joo ([email protected])
CS 4649/7649 Robot Intelligence: Planning
Sungmoon Joo
School of Interactive Computing
College of Computing
Georgia Institute of Technology
MDP solutions
*Slides based in part on Dr. Mike Stilman and Dr. Pieter Abbeel’s slides
Administrative – Final Project

• CS7649
- project proposal: Due Oct. 30 (proposal outline: proposal_outline.pdf)
- project final report: Due Dec. 4, 23:59, conference-style paper
- project presentation: Dec. 11, 11:30am - 2:20pm
• CS4649
- project reviewer assignment: Oct. 28 (2-3 reviewers/project)
- proposal review report: Due Nov. 6
- project review report (for the assigned project): Due Dec. 11, 11:30am
- project presentation review* (for all presentations): Due Dec. 11, 2:20pm
*presentation review sheets will be provided
Probability

Axioms
(1) 0 ≤ P(A) ≤ 1   (2) P(True) = 1   (3) P(False) = 0
(4) P(A or B) = P(A) + P(B) – P(A and B)
(5) P(A = vi ∧ A = vj) = 0 if i ≠ j
(6) P(A = v1 ∨ A = v2 ∨ … ∨ A = vk) = 1

Theorems
P(not A) = P(¬A) = 1 – P(A)
P(A) = P(A ∧ B) + P(A ∧ ¬B)
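The theorems above can be checked numerically against any concrete joint distribution; a minimal sketch in Python, with made-up probabilities:

```python
# A toy joint distribution over two Boolean variables A and B, used to
# check the theorems above numerically. The numbers are made up for
# illustration; any valid joint distribution works.
P = {
    (True, True): 0.30,    # P(A ∧ B)
    (True, False): 0.20,   # P(A ∧ ¬B)
    (False, True): 0.35,   # P(¬A ∧ B)
    (False, False): 0.15,  # P(¬A ∧ ¬B)
}

# Theorem: P(A) = P(A ∧ B) + P(A ∧ ¬B)  (marginalize out B)
P_A = P[(True, True)] + P[(True, False)]

# Theorem: P(¬A) = 1 − P(A)
P_not_A = P[(False, True)] + P[(False, False)]
assert abs(P_not_A - (1 - P_A)) < 1e-12

print(P_A)  # 0.5
```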
Conditional Probability

• Conditional Probability (Definition)
  P(A|B) = P(A ∧ B) / P(B)
• Corollary: the chain rule
  P(A ∧ B) = P(A|B) P(B)
• Bayes Rule
  P(A|B) = P(B|A) P(A) / P(B)
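A quick numeric illustration of Bayes rule; the disease/test numbers below are hypothetical, chosen only to make the computation concrete:

```python
# Bayes rule on a classic diagnostic example (all numbers hypothetical):
# a disease D with prior P(D) = 0.01, a test with sensitivity
# P(+|D) = 0.95 and false-positive rate P(+|¬D) = 0.05.
p_d = 0.01               # P(D)
p_pos_given_d = 0.95     # P(+ | D)
p_pos_given_not_d = 0.05 # P(+ | ¬D)

# Total probability: P(+) = P(+|D)P(D) + P(+|¬D)P(¬D)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes rule: P(D|+) = P(+|D) P(D) / P(+)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))  # the posterior is still small despite a positive test
```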
Markov Property

● “Markov” generally means that, given the present state, the future and the past are independent
● For a Markov Process (MP), “Markov” means
  P(s_{t+1} | s_t, s_{t-1}, …, s_0) = P(s_{t+1} | s_t)
● For an MDP, “Markov” means
  P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0) = P(s_{t+1} | s_t, a_t)
Solving MDP

● In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from the start to a goal
● For an MDP, we want an optimal policy
- A policy gives an action for each state
- An optimal policy maximizes expected utility (e.g. the sum of rewards), if followed
● Two ways to define stationary utilities
- Additive utility: U([s_0, s_1, s_2, …]) = R(s_0) + R(s_1) + R(s_2) + …
- Discounted utility: U([s_0, s_1, s_2, …]) = R(s_0) + γ R(s_1) + γ² R(s_2) + …
Solving MDP

● Problem: infinite state sequences can have infinite additive rewards
● Solutions
- Finite horizon: terminate episodes after a fixed horizon of T steps, yielding non-stationary policies (policies depend on time)
- Discounting: use discounted utility with 0 ≤ γ < 1, which bounds the utility of any sequence by R_max/(1 − γ)
- …
● We’ve discussed the case with an infinite horizon & a discount factor
- the optimal policy is stationary (the same at all times)
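Why discounting bounds the utility can be seen numerically; a minimal sketch assuming a constant reward of 1 at every step (a hypothetical worst case):

```python
# Discounted utility of a long reward sequence, illustrating why
# discounting with 0 <= gamma < 1 keeps the sum bounded.
# Assumes a constant reward R(s_t) = 1 at every step, so R_max = 1.
gamma = 0.9
rewards = [1.0] * 1000          # a long (near-infinite) episode

# U = sum_t gamma^t * R(s_t)
U = sum(gamma ** t * r for t, r in enumerate(rewards))

bound = 1.0 / (1 - gamma)       # R_max / (1 - gamma) = 10 here
print(U, bound)                 # U approaches, but never reaches, the bound
assert U < bound
```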
Robots with Uncertain Actions
Value Iteration
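The value-iteration slides here were figures; as a sketch of the standard algorithm, the Bellman backup V_{k+1}(s) = R(s) + γ max_a Σ_{s'} P(s'|s,a) V_k(s') can be run on a made-up 3-state MDP (the gridworld from the original slides is not reproduced; all numbers below are illustrative):

```python
# Value iteration on a tiny, made-up 3-state MDP: repeatedly apply the
# Bellman backup V(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) V(s')
# until the values stop changing, then read off the greedy policy.
GAMMA = 0.9
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]
R = {0: 0.0, 1: 0.0, 2: 1.0}   # reward for being in each state

# P[(s, a)] = list of (next_state, probability); each row sums to 1
P = {
    (0, "left"):  [(0, 1.0)],
    (0, "right"): [(1, 0.8), (0, 0.2)],
    (1, "left"):  [(0, 0.8), (1, 0.2)],
    (1, "right"): [(2, 0.8), (1, 0.2)],
    (2, "left"):  [(2, 1.0)],           # state 2 is absorbing
    (2, "right"): [(2, 1.0)],
}

def value_iteration(eps=1e-8):
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {
            s: R[s] + GAMMA * max(
                sum(p * V[s2] for s2, p in P[(s, a)]) for a in ACTIONS
            )
            for s in STATES
        }
        if max(abs(V_new[s] - V[s]) for s in STATES) < eps:
            return V_new
        V = V_new

V = value_iteration()
# Greedy policy extraction from the converged values
policy = {
    s: max(ACTIONS, key=lambda a: sum(p * V[s2] for s2, p in P[(s, a)]))
    for s in STATES
}
print(V, policy)
```

Here V[2] converges to R_max/(1 − γ) = 10, and the greedy policy drives states 0 and 1 toward the rewarding absorbing state.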
Effect of Discount & Uncertainty in MDP

Parameters:
- Discount
- Uncertainty/Noise
- Rewards: negative (e.g. cliff) and positive

Example from: http://www.cs.berkeley.edu/~pabbeel/cs287-fa13/
MDP Applications: Passive Dynamic Walking
MDP Applications: Inverted Pendulum
Improving Efficiency
Policy Evaluation

● Our goal is to find a policy, not just to compute values.
● In value iteration, we compute values and then extract a policy
● How do we calculate the V’s for a fixed policy π?
- use Bellman’s equation, with the action fixed by the policy:
  V^π(s) = R(s) + γ Σ_{s'} P(s'|s, π(s)) V^π(s')
- …
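A minimal sketch of evaluating a fixed policy, on a made-up 3-state chain (the transition numbers are illustrative only):

```python
# Iterative policy evaluation for a *fixed* policy pi: repeatedly apply
# the max-free Bellman update
#   V(s) <- R(s) + gamma * sum_s' P(s'|s, pi(s)) * V(s')
# until the values stop changing.
GAMMA = 0.9
R = {0: 0.0, 1: 0.0, 2: 1.0}
# Transitions under the fixed policy: P_pi[s] = [(next_state, prob), ...]
P_pi = {
    0: [(1, 0.8), (0, 0.2)],   # pi(0) = "right"
    1: [(2, 0.8), (1, 0.2)],   # pi(1) = "right"
    2: [(2, 1.0)],             # state 2 is absorbing
}

V = {s: 0.0 for s in R}
while True:
    V_new = {s: R[s] + GAMMA * sum(p * V[s2] for s2, p in P_pi[s])
             for s in R}
    delta = max(abs(V_new[s] - V[s]) for s in R)
    V = V_new
    if delta < 1e-12:
        break

print(V)   # V[2] converges to 1 / (1 - 0.9) = 10
```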
Policy Iteration

- Step 1. Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
- Step 2. Policy improvement: update the policy using one-step look-ahead, with the resulting converged (but not optimal!) utilities as future values
- Repeat the steps until the policy converges

● Properties
- It’s still optimal!
- Can converge faster under some conditions
● Policy Evaluation
- with the current policy π fixed, find the values with (max-free) Bellman updates:
  V^π_{k+1}(s) ← R(s) + γ Σ_{s'} P(s'|s, π(s)) V^π_k(s')
- iterate until the values converge
● Policy Improvement
- with the (converged) values fixed, find the best action by one-step look-ahead:
  π_new(s) = argmax_a Σ_{s'} P(s'|s, a) V^π(s')
● Theorem
Policy iteration is guaranteed to converge, and at convergence the current policy and its value function are the optimal policy and the optimal value function.
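The two steps above can be sketched as follows, again on a made-up 3-state MDP (all numbers illustrative):

```python
# Policy iteration: alternate policy evaluation (to convergence) with
# one-step look-ahead policy improvement until the policy stops changing.
GAMMA = 0.9
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]
R = {0: 0.0, 1: 0.0, 2: 1.0}
P = {
    (0, "left"):  [(0, 1.0)],
    (0, "right"): [(1, 0.8), (0, 0.2)],
    (1, "left"):  [(0, 0.8), (1, 0.2)],
    (1, "right"): [(2, 0.8), (1, 0.2)],
    (2, "left"):  [(2, 1.0)],           # absorbing
    (2, "right"): [(2, 1.0)],
}

def evaluate(pi, eps=1e-10):
    """Step 1: compute V^pi with max-free Bellman updates."""
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {s: R[s] + GAMMA * sum(p * V[s2] for s2, p in P[(s, pi[s])])
                 for s in STATES}
        delta = max(abs(V_new[s] - V[s]) for s in STATES)
        V = V_new
        if delta < eps:
            return V

def improve(V):
    """Step 2: one-step look-ahead with the converged values."""
    return {s: max(ACTIONS,
                   key=lambda a: sum(p * V[s2] for s2, p in P[(s, a)]))
            for s in STATES}

pi = {s: "left" for s in STATES}       # arbitrary initial policy
while True:
    V = evaluate(pi)
    pi_new = improve(V)
    if pi_new == pi:                   # converged: pi is optimal
        break
    pi = pi_new

print(pi)
```

On this toy model the loop converges in a couple of improvement steps, far fewer sweeps than value iteration needs for tight value convergence.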
Comparison

● In value iteration:
- Every pass (or “backup”) updates both the utilities (explicitly, based on the current utilities) and the policy (possibly implicitly, based on the current policy)

● In policy iteration:
- Several passes update the utilities with the policy frozen
- Occasional passes update the policy
Asynchronous Iterations

● Asynchronous policy iteration (we saw this)
- Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often
● Asynchronous value iteration
- In (synchronous) value iteration, we update every state in each iteration
- Actually, any sequence of Bellman updates will converge if every state is visited infinitely often
- In fact, we can update the policy as seldom or often as we like, and we will still converge
- Idea: update the states whose value we expect to change:
  if |V_{k+1}(s) − V_k(s)| is large, then update the predecessors of s
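A minimal sketch of asynchronous (in-place, random-order) value iteration on the same kind of made-up 3-state MDP, showing that single-state Bellman backups in arbitrary order still converge:

```python
# Asynchronous value iteration: instead of sweeping every state each
# iteration, apply a Bellman backup to one randomly chosen state at a
# time, in place. As long as every state keeps being visited, the
# values still converge to the same fixed point. Toy MDP is made up.
import random

GAMMA = 0.9
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]
R = {0: 0.0, 1: 0.0, 2: 1.0}
P = {
    (0, "left"):  [(0, 1.0)],
    (0, "right"): [(1, 0.8), (0, 0.2)],
    (1, "left"):  [(0, 0.8), (1, 0.2)],
    (1, "right"): [(2, 0.8), (1, 0.2)],
    (2, "left"):  [(2, 1.0)],           # absorbing
    (2, "right"): [(2, 1.0)],
}

random.seed(0)
V = {s: 0.0 for s in STATES}
for _ in range(5000):
    s = random.choice(STATES)   # every state visited again and again
    # in-place Bellman backup for just this one state
    V[s] = R[s] + GAMMA * max(
        sum(p * V[s2] for s2, p in P[(s, a)]) for a in ACTIONS)

print(V)   # close to the synchronous fixed point; V[2] near 10
```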
Model availability: what if the transition model P(j|i,a) is not known?
Q-Learning
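The Q-learning slides here were figures; as a sketch of standard tabular Q-learning with epsilon-greedy exploration (the chain environment below is made up), only sampled transitions are needed, with no transition model P(j|i,a):

```python
# Tabular Q-learning: learn Q(s, a) from sampled transitions alone.
# Update rule (off-policy, uses max over next actions):
#   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
import random

random.seed(0)
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.2
STATES, ACTIONS = [0, 1, 2], ["left", "right"]

def step(s, a):
    """Sample (next_state, reward) from the (hidden) environment."""
    if s == 2:
        return 2, 1.0                   # absorbing goal state
    if a == "right":
        s2 = s + 1 if random.random() < 0.8 else s
    else:
        s2 = s - 1 if (s > 0 and random.random() < 0.8) else s
    return s2, 0.0

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
s = 0
for _ in range(20000):
    # epsilon-greedy exploration
    if random.random() < EPS:
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda a: Q[(s, a)])
    s2, r = step(s, a)
    # Q-learning backup
    best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
    s = random.choice(STATES) if s2 == 2 else s2   # restart after goal

policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
print(policy)
```

The learned greedy policy matches what model-based planning finds on the same chain, even though the agent never sees P(j|i,a) directly.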