Markov Decision Processes Chapter 17
![Page 1: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/1.jpg)
Markov Decision Processes (Chapter 17)
Mausam
![Page 2: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/2.jpg)
MDP vs. Decision Theory
• Decision theory: episodic
• MDP: sequential
![Page 3: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/3.jpg)
Markov Decision Process (MDP)
• S: a set of states
• A: a set of actions
• Pr(s'|s,a): transition model
• C(s,a,s'): cost model
• G: set of goals (absorbing or non-absorbing)
• s0: start state
• γ: discount factor
• R(s,a,s'): reward model

(When states and actions are described by features, this is a Factored MDP.)
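As a concrete illustration, here is a minimal sketch of one way such a tuple could be encoded in Python. The dict layout (P[s][a] as a list of (s', probability) pairs, rewards in R[s][a]) is a hypothetical choice for these notes, not something the slides prescribe:

```python
# Hypothetical dict-based MDP encoding (illustration only).
# P[s][a] lists (s', Pr(s'|s,a)); R[s][a] is the reward for taking a in s.
P = {"s0": {"a": [("s1", 0.6), ("g", 0.4)], "b": [("g", 1.0)]},
     "s1": {"a": [("g", 1.0)]}}
R = {"s0": {"a": -1.0, "b": -5.0},   # negated costs, so higher is better
     "s1": {"a": -1.0}}
goals, s0, gamma = {"g"}, "s0", 0.9  # absorbing goal, start state, discount
```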
![Page 4: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/4.jpg)
Objective of an MDP
• Find a policy π: S → A
• which optimizes
  • minimizes expected cost to reach a goal
  • maximizes expected reward
  • maximizes expected (reward − cost)
• given a ____ horizon
  • finite
  • infinite
  • indefinite
• assuming full observability

(rewards/costs may be discounted or undiscounted)
![Page 5: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/5.jpg)
Role of Discount Factor (γ)
• Keeps the total reward / total cost finite
  • useful for infinite horizon problems
• Intuition (economics):
  • money today is worth more than money tomorrow
• Total reward: r1 + γ·r2 + γ²·r3 + …
• Total cost: c1 + γ·c2 + γ²·c3 + …
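A minimal sketch of this sum in Python, assuming a finite sequence of per-step rewards:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum r1 + gamma*r2 + gamma^2*r3 + ... for a finite reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.9 + 0.81 = 2.71
```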
![Page 6: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/6.jpg)
Examples of MDPs
• Goal-directed, indefinite horizon, cost minimization MDP
  • <S, A, Pr, C, G, s0>
  • most often studied in the planning and graph theory communities
• Infinite horizon, discounted reward maximization MDP (the most popular model)
  • <S, A, Pr, R, γ>
  • most often studied in the machine learning, economics, and operations research communities
• Oversubscription planning: non-absorbing goals, reward maximization MDP
  • <S, A, Pr, G, R, s0>
  • a relatively recent model
![Page 7: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/7.jpg)
AND/OR Acyclic Graphs vs. MDPs
[Figure: two AND/OR graphs over states {P, Q, R, S, T, G}.
Left (acyclic): in P, action a leads to Q (0.6) or R (0.4); action b leads to S (0.5) or T (0.5); Q, R, S, T each reach the goal G via action c.
Right (cyclic): in P, action a leads back to P (0.6) or to R (0.4); action b leads to S (0.5) or T (0.5); R, S, T each reach G via action c.
Costs: C(a) = 5, C(b) = 10, C(c) = 1.]
• Left (acyclic) graph: Expectimin works
  • V(Q/R/S/T) = 1
  • V(P) = 6, via action a
• Right (cyclic) graph: Expectimin doesn't work
  • infinite loop
  • V(R/S/T) = 1
  • Q(P,b) = 11
  • Q(P,a) = ????
  • suppose I decide to take a in P
  • Q(P,a) = 5 + 0.4×1 + 0.6·Q(P,a), which solves to 13.5
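Spelling out that last step: the cyclic edge makes Q(P,a) appear on both sides, and solving the resulting linear equation gives 13.5.

```latex
\begin{aligned}
Q(P,a) &= 5 + 0.4 \cdot V(R) + 0.6 \cdot Q(P,a)\\
(1 - 0.6)\,Q(P,a) &= 5 + 0.4 \cdot 1\\
Q(P,a) &= \frac{5.4}{0.4} = 13.5
\end{aligned}
```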
![Page 8: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/8.jpg)
Bellman Equations for MDP1
• <S, A, Pr, C, G, s0>
• Define J*(s) (optimal cost) as the minimum expected cost to reach a goal from this state.
• J* should satisfy the following equation:

  J*(s) = min_{a∈Ap(s)} Σ_{s'} Pr(s'|s,a) [ C(s,a,s') + J*(s') ], with J*(s) = 0 for s ∈ G
![Page 9: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/9.jpg)
Bellman Equations for MDP2
• <S, A, Pr, R, s0, γ>

• Define V*(s) (optimal value) as the maximum expected discounted reward from this state.
• V* should satisfy the following equation:

  V*(s) = max_{a∈Ap(s)} Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V*(s') ]
![Page 10: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/10.jpg)
Bellman Backup (MDP2)
• Given an estimate of the V* function (say Vn),
• backup the Vn function at state s
  • calculate a new estimate (Vn+1):

    Qn+1(s,a) = Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ Vn(s') ]
    Vn+1(s) = max_{a∈Ap(s)} Qn+1(s,a)

• Qn+1(s,a): the value of the strategy that executes action a in s and follows πn afterwards, where πn(s) = argmax_{a∈Ap(s)} Qn(s,a)
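A minimal sketch of this backup in Python, reusing the hypothetical dict layout from the MDP definition above (P[s][a] as (s', probability) pairs; note R[s][a] here depends only on s and a, a simplification of the slide's R(s,a,s')):

```python
def bellman_backup(s, V, P, R, gamma=0.9):
    """One Bellman backup at state s: returns (V_{n+1}(s), greedy action)."""
    best_value, best_action = float("-inf"), None
    for a, outcomes in P[s].items():                     # a in Ap(s)
        # Q_{n+1}(s,a) = R(s,a) + gamma * sum_s' Pr(s'|s,a) * V_n(s')
        q = R[s][a] + gamma * sum(p * V[s2] for s2, p in outcomes)
        if q > best_value:
            best_value, best_action = q, a
    return best_value, best_action
```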
![Page 11: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/11.jpg)
Bellman Backup (example)

[Figure: state s0 with three actions. a1 (reward 2) leads to s1; a2 (reward 5) leads to s2 with probability 0.9 or s3 with probability 0.1; a3 (reward 4.5) leads to s3. Initial estimates: V0(s1) = 0, V0(s2) = 1, V0(s3) = 2.]

Q1(s0,a1) = 2 + 0 = 2
Q1(s0,a2) = 5 + 0.9×1 + 0.1×2 = 6.1
Q1(s0,a3) = 4.5 + 2 = 6.5

max → V1(s0) = 6.5, agreedy = a3
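Feeding the slide's numbers into the bellman_backup sketch above (with γ = 1) reproduces the result:

```python
P = {"s0": {"a1": [("s1", 1.0)],
            "a2": [("s2", 0.9), ("s3", 0.1)],
            "a3": [("s3", 1.0)]}}
R = {"s0": {"a1": 2.0, "a2": 5.0, "a3": 4.5}}
V0 = {"s1": 0.0, "s2": 1.0, "s3": 2.0}
print(bellman_backup("s0", V0, P, R, gamma=1.0))  # (6.5, 'a3')
```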
![Page 12: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/12.jpg)
Value iteration [Bellman’57]
• assign an arbitrary assignment V0 to each state
• repeat
  • for all states s
    • compute Vn+1(s) by a Bellman backup at s
• until max_s |Vn+1(s) − Vn(s)| < ε

The quantity |Vn+1(s) − Vn(s)| computed in iteration n+1 is the residual at s; stopping once the largest residual falls below ε is called ε-convergence.
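Put together, the loop looks roughly like this: a sketch on top of bellman_backup, assuming every state in `states` has at least one applicable action in P:

```python
def value_iteration(states, P, R, gamma=0.9, epsilon=1e-6):
    V = {s: 0.0 for s in states}        # arbitrary initial assignment V0
    while True:
        # one synchronous iteration: back up every state
        V_new = {s: bellman_backup(s, V, P, R, gamma)[0] for s in states}
        residual = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if residual < epsilon:          # epsilon-convergence
            return V
```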
![Page 13: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/13.jpg)
Comments
• Decision-theoretic algorithm
• Dynamic programming
• Fixed point computation
• Probabilistic version of the Bellman-Ford algorithm
  • for shortest path computation
  • MDP1: stochastic shortest path problem

• Time complexity
  • one iteration: O(|S|²|A|)
  • number of iterations: poly(|S|, |A|, 1/(1−γ))
• Space complexity: O(|S|)

• Factored MDPs = planning under uncertainty
  • exponential space, exponential time
![Page 14: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/14.jpg)
Convergence Properties
• Vn → V* in the limit as n → ∞
• ε-convergence: the Vn function is within ε of V*
• Optimality: the current greedy policy is within 2ε of optimal

• Monotonicity
  • V0 ≤p V* ⇒ Vn ≤p V* (Vn monotonic from below)
  • V0 ≥p V* ⇒ Vn ≥p V* (Vn monotonic from above)
  • otherwise Vn is non-monotonic
![Page 15: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/15.jpg)
Policy Computation
• The optimal policy is stationary (time-independent)
  • for infinite/indefinite horizon problems

  π*(s) = argmax_{a∈Ap(s)} Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V*(s') ]

Policy Evaluation

• For a fixed policy π, Vπ is the solution of a system of linear equations in |S| variables:

  Vπ(s) = Σ_{s'} Pr(s'|s,π(s)) [ R(s,π(s),s') + γ Vπ(s') ]
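Because the system is linear, it can be solved directly. A minimal numpy sketch, assuming the policy's dynamics are given in matrix form (P_pi the n×n transition matrix under π, r_pi the vector of expected one-step rewards):

```python
import numpy as np

def evaluate_policy(P_pi, r_pi, gamma=0.9):
    """Solve V = r_pi + gamma * P_pi @ V, i.e. (I - gamma*P_pi) V = r_pi."""
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```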
![Page 16: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/16.jpg)
Changing the Search Space
• Value iteration
  • search in value space
  • compute the resulting policy
• Policy iteration
  • search in policy space
  • compute the resulting value
![Page 17: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/17.jpg)
Policy iteration [Howard’60]
• assign an arbitrary assignment π0 to each state
• repeat
  • Policy Evaluation: compute Vn+1, the evaluation of πn
  • Policy Improvement: for all states s
    • compute πn+1(s) = argmax_{a∈Ap(s)} Qn+1(s,a)
• until πn+1 = πn

Advantage
• searching in a finite (policy) space, as opposed to an uncountably infinite (value) space ⇒ faster convergence
• all other properties follow!

(Exact policy evaluation is costly: O(n³). Approximating it by value iteration with the policy held fixed gives Modified Policy Iteration.)
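A sketch of the full loop, built on evaluate_policy above; the array layout (P of shape (n, m, n) with P[s, a, s'] = Pr(s'|s,a), R of shape (n, m)) is an assumption made for this example:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    n, m = R.shape                       # n states, m actions
    pi = np.zeros(n, dtype=int)          # arbitrary initial policy pi_0
    while True:
        # Policy Evaluation: solve the linear system for the current policy
        P_pi = P[np.arange(n), pi]       # (n, n) transitions under pi
        r_pi = R[np.arange(n), pi]       # (n,) rewards under pi
        v = evaluate_policy(P_pi, r_pi, gamma)
        # Policy Improvement: greedy action in every state
        q = R + gamma * P @ v            # (n, m) Q-values
        pi_new = np.argmax(q, axis=1)
        if np.array_equal(pi_new, pi):   # until pi_{n+1} = pi_n
            return pi, v
        pi = pi_new
```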
![Page 18: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/18.jpg)
Modified Policy iteration
• assign an arbitrary assignment π0 to each state
• repeat
  • Policy Evaluation: compute Vn+1, the approximate evaluation of πn
  • Policy Improvement: for all states s
    • compute πn+1(s) = argmax_{a∈Ap(s)} Qn+1(s,a)
• until πn+1 = πn

Advantage
• probably the most competitive synchronous dynamic programming algorithm
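The only change from policy iteration is the evaluation step. A sketch of the approximate version in the same matrix form as before: k fixed-policy backups instead of an exact linear solve, where k is a tuning knob (large k approaches exact evaluation, while k = 1 behaves much like value iteration):

```python
def approx_evaluate_policy(P_pi, r_pi, v, gamma=0.9, k=10):
    """k backups with the policy held fixed, starting from the previous v."""
    for _ in range(k):
        v = r_pi + gamma * P_pi @ v
    return v
```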
![Page 19: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/19.jpg)
Applications

• Stochastic games
• Robotics: navigation, helicopter maneuvers, …
• Finance: options, investments
• Communication networks
• Medicine: radiation planning for cancer
• Controlling workflows
• Optimizing bidding decisions in auctions
• Traffic flow optimization
• Aircraft queueing for landing; airline meal provisioning
• Optimizing software on mobiles
• Forest firefighting
• …
![Page 20: Markov Decision Processes Chapter 17](https://reader035.fdocuments.in/reader035/viewer/2022062407/56812c67550346895d90fca6/html5/thumbnails/20.jpg)
Extensions

• Heuristic search + dynamic programming
  • AO*, LAO*, RTDP, …
• Factored MDPs
  • add planning-graph-style heuristics
  • use goal regression to generalize better
• Hierarchical MDPs
  • hierarchy of sub-tasks, actions to scale better
• Reinforcement learning
  • learning the probabilities and rewards
  • acting while learning – connections to psychology
• Partially Observable Markov Decision Processes (POMDPs)
  • noisy sensors; partially observable environment
  • popular in robotics