Transcript of CSE-571 Robotics - University of Washington: Planning and Control: Markov Decision Processes
CSE-571 Robotics
Planning and Control:
Markov Decision Processes
Problem Classes
• Deterministic vs. stochastic actions
• Full vs. partial observability

This gives the problem classes: deterministic, fully observable; stochastic, fully observable; stochastic, partially observable.

Markov Decision Process (MDP)
[Figure: example MDP graph with states s1 through s5, stochastic transitions (probability pairs such as 0.7/0.3, 0.9/0.1, 0.8/0.2, 0.99/0.01, 0.3/0.3/0.4) and state rewards r = -10, 20, 0, 1, 0.]
Markov Decision Process (MDP)
• Given:
  • States x
  • Actions u
  • Transition probabilities p(x'|u,x)
  • Reward / payoff function r(x,u)
• Wanted:
  • Policy π(x) that maximizes the future expected reward
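The MDP tuple above can be written down directly as data. The following is a minimal sketch, not from the slides: a hypothetical two-state, two-action MDP with the transition table P[x][u][x'] for p(x'|u,x) and the reward table R[x][u] for r(x,u).

```python
# Hypothetical two-state MDP, stored as plain dictionaries.
states = ["s1", "s2"]
actions = ["stay", "go"]

# Transition probabilities p(x' | u, x), indexed as P[x][u][x'].
P = {
    "s1": {"stay": {"s1": 1.0, "s2": 0.0},
           "go":   {"s1": 0.3, "s2": 0.7}},   # "go" is a stochastic action
    "s2": {"stay": {"s1": 0.0, "s2": 1.0},
           "go":   {"s1": 0.8, "s2": 0.2}},
}

# Reward / payoff function r(x, u).
R = {
    "s1": {"stay": 0.0,  "go": 1.0},
    "s2": {"stay": 20.0, "go": -10.0},
}

# Sanity check: each p(. | u, x) must be a probability distribution.
for x in states:
    for u in actions:
        assert abs(sum(P[x][u].values()) - 1.0) < 1e-9
```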
Rewards and Policies
• Policy (general case):

  π : z_{1:t-1}, u_{1:t-1} → u_t

• Policy (fully observable case):

  π : x_t → u_t

• Expected cumulative payoff:

  R_T = E[ Σ_{τ=1}^{T} γ^τ r_{t+τ} ]

• T = 1: greedy policy
• T > 1: finite-horizon case, typically no discount
• T = ∞: infinite-horizon case, finite reward if discount γ < 1
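Why the infinite-horizon payoff stays finite for γ < 1 can be seen by bounding the sum with a geometric series, where r_max denotes the largest possible single-step reward:

```latex
R_\infty = E\!\left[\sum_{\tau=1}^{\infty} \gamma^{\tau}\, r_{t+\tau}\right]
\;\le\; \sum_{\tau=1}^{\infty} \gamma^{\tau}\, r_{\max}
\;=\; \frac{\gamma}{1-\gamma}\, r_{\max}
```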
Policies contd.
• Expected cumulative payoff of policy:

  R_T^π(x_t) = E[ Σ_{τ=1}^{T} γ^τ r_{t+τ} | u_{t+τ} = π(z_{1:t+τ-1}, u_{1:t+τ-1}) ]

• Optimal policy:

  π* = argmax_π R_T^π(x_t)

• 1-step optimal policy:

  π_1(x) = argmax_u r(x,u)

• Value function of 1-step optimal policy:

  V_1(x) = γ max_u r(x,u)
2-step Policies
• Optimal policy:

  π_2(x) = argmax_u [ r(x,u) + ∫ V_1(x') p(x'|u,x) dx' ]

• Value function:

  V_2(x) = γ max_u [ r(x,u) + ∫ V_1(x') p(x'|u,x) dx' ]
T-step Policies
• Optimal policy:

  π_T(x) = argmax_u [ r(x,u) + ∫ V_{T-1}(x') p(x'|u,x) dx' ]

• Value function:

  V_T(x) = γ max_u [ r(x,u) + ∫ V_{T-1}(x') p(x'|u,x) dx' ]
Infinite Horizon
• The optimal value function satisfies the Bellman equation:

  V_∞(x) = γ max_u [ r(x,u) + ∫ V_∞(x') p(x'|u,x) dx' ]

• Being a fixed point of the Bellman equation is a necessary and sufficient condition for the optimal value function, from which the optimal policy follows.
Value Iteration
• for all x do

    V̂(x) ← r_min

• endfor
• repeat until convergence
  • for all x do

      V̂(x) ← γ max_u [ r(x,u) + ∫ V̂(x') p(x'|u,x) dx' ]

  • endfor
• endrepeat

• Extract the policy greedily from the converged value function:

  π(x) = argmax_u [ r(x,u) + ∫ V̂(x') p(x'|u,x) dx' ]
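For a finite state space the integral over x' becomes a sum, and the algorithm above fits in a few lines. This is a sketch on the same hypothetical two-state MDP used earlier (not an example from the slides):

```python
# Value iteration for a finite-state MDP; the integral over x' becomes a sum.
gamma = 0.9  # discount factor

states = ["s1", "s2"]
actions = ["stay", "go"]
P = {"s1": {"stay": {"s1": 1.0, "s2": 0.0}, "go": {"s1": 0.3, "s2": 0.7}},
     "s2": {"stay": {"s1": 0.0, "s2": 1.0}, "go": {"s1": 0.8, "s2": 0.2}}}
R = {"s1": {"stay": 0.0,  "go": 1.0},
     "s2": {"stay": 20.0, "go": -10.0}}

def backup(V, x, u):
    """r(x,u) + sum_x' V(x') p(x'|u,x)."""
    return R[x][u] + sum(P[x][u][xp] * V[xp] for xp in states)

# Initialize V_hat(x) <- r_min, then sweep until convergence.
r_min = min(R[x][u] for x in states for u in actions)
V = {x: r_min for x in states}
while True:
    V_new = {x: gamma * max(backup(V, x, u) for u in actions) for x in states}
    delta = max(abs(V_new[x] - V[x]) for x in states)
    V = V_new
    if delta < 1e-6:
        break

# Greedy policy extraction: pi(x) = argmax_u [ r(x,u) + sum_x' V(x') p(x'|u,x) ].
policy = {x: max(actions, key=lambda u: backup(V, x, u)) for x in states}
```

Here the fixed point is V(s2) = γ(20 + V(s2)), i.e. V(s2) = 180, and the greedy policy stays in s2 to collect the r = 20 reward while s1 moves toward it.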
[Figures: grid-world value iteration after k = 0, 1, 2, 3, 4 iterations; noise = 0.2, discount = 0.9, living reward = 0.]
[Figures: grid-world value iteration continued, k = 5 through k = 10.]
[Figures: grid-world value iteration at k = 11, k = 12, and k = 100.]
Value Function and Policy
• Each step takes O(|A| |S|²) time.
• The number of iterations required is polynomial in |S|, |A|, and 1/(1-γ).
Value Iteration for Motion Planning(assumes knowledge of robot’s location)
Frontier-based Exploration
• Every unknown location is a target point.
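The frontier idea can be sketched on a small occupancy grid. The cell encoding below (0 = free, 1 = occupied, -1 = unknown) and the grid itself are assumptions for illustration, not from the slides; a frontier cell is taken to be an unknown cell with at least one known-free 4-neighbor.

```python
# Minimal frontier detection on a hypothetical occupancy grid.
# Encoding (an assumption): 0 = free, 1 = occupied, -1 = unknown.
grid = [
    [0,  0, -1, -1],
    [0,  1, -1, -1],
    [0,  0,  0, -1],
]

def frontier_cells(grid):
    """Return (row, col) of unknown cells adjacent to a known-free cell."""
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != -1:          # only unknown cells qualify
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                    frontiers.append((r, c))
                    break
    return frontiers
```

Each frontier cell then becomes a candidate target point for the planner, e.g. as a goal state for value iteration over the known part of the map.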
Manipulator Control
[Figure: arm with two joints, shown in workspace and configuration space.]

Manipulator Control Path
[Figure: a control path, shown in state space and configuration space.]
POMDPs
• In POMDPs we apply the very same idea as in MDPs.
• Since the state is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states.
• For finite-horizon problems, the resulting value functions are piecewise linear and convex.
• In each iteration the number of linear constraints grows exponentially.
• Full-fledged POMDPs have only been applied to very small state spaces with small numbers of possible observations and actions.
• Approximate solutions are becoming more and more capable.
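The belief state mentioned above can be maintained with a discrete Bayes filter: predict through p(x'|u,x), then correct with an observation model p(z|x) and renormalize. The two-state model and the observation probabilities below are hypothetical, chosen only to make the sketch concrete.

```python
# Discrete belief update for a hypothetical two-state POMDP:
# prediction with p(x'|u,x), then Bayes correction with p(z|x).
states = ["s1", "s2"]
P = {"s1": {"go": {"s1": 0.3, "s2": 0.7}},
     "s2": {"go": {"s1": 0.8, "s2": 0.2}}}
# Observation model p(z | x) for z in {"bright", "dark"} (an assumption).
Z = {"s1": {"bright": 0.9, "dark": 0.1},
     "s2": {"bright": 0.2, "dark": 0.8}}

def belief_update(bel, u, z):
    # Prediction: bel_bar(x') = sum_x p(x'|u,x) bel(x)
    bel_bar = {xp: sum(P[x][u][xp] * bel[x] for x in states) for xp in states}
    # Correction: bel(x') proportional to p(z|x') * bel_bar(x')
    unnorm = {x: Z[x][z] * bel_bar[x] for x in states}
    eta = sum(unnorm.values())
    return {x: unnorm[x] / eta for x in states}

bel = {"s1": 0.5, "s2": 0.5}        # uniform prior over states
bel = belief_update(bel, "go", "dark")
```

A POMDP policy then maps this posterior, rather than the (unobservable) state, to an action, which is why the value function is defined over the belief simplex.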