Montague Meets Markov: Deep Semantics with Probabilistic Logical Form
Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov...
Transcript of Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov...
![Page 1: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/1.jpg)
Probabilistic Planning with
Markov Decision Processes
Andrey Kolobov and Mausam
Computer Science and Engineering
University of Washington, Seattle
1
TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAAAAAA
![Page 2: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/2.jpg)
Goal
2
TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAAAAAA
an extensive introduction to theory and algorithms in probabilistic planning
![Page 3: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/3.jpg)
Outline of the Tutorial
• Introduction
• Fundamentals of MDPs
• Uninformed Algorithms
• Heuristic Search Algorithms
• Approximation Algorithms
• Extension of MDPs
3
(10 mins)
(1+ hr)
(1 hr)
(1 hr)
(1+ hr)
(remaining time)
![Page 4: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/4.jpg)
Outline of the Tutorial
• Introduction
• Fundamentals of MDPs
• Uninformed Algorithms
• Heuristic Search Algorithms
• Approximation Algorithms
• Extension of MDPs
4
(10 mins)
(1+ hr)
(1 hr)
(1 hr)
(1+ hr)
(remaining time)
![Page 5: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/5.jpg)
INTRODUCTION
5
TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAAAAAA
![Page 6: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/6.jpg)
Planning
What action
next?
Percepts Actions
Environment
Static vs. Dynamic
Fully
vs.
Partially
Observable
Perfect
vs.
Noisy
Deterministic vs.
Stochastic
Instantaneous vs.
Durative
Sequential vs.
Concurrent
![Page 7: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/7.jpg)
Classical Planning
What action
next?
Percepts Actions
Environment
Fully
Observable
Perfect
Instantaneous
Sequential
Deterministic
Static
![Page 8: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/8.jpg)
Probabilistic Planning
What action
next?
Percepts Actions
Environment
Static
Fully
Observable
Perfect
Stochastic
Instantaneous
Sequential
![Page 9: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/9.jpg)
Markov Decision Processes
• A fundamental framework for prob. planning
• History – 1950s: early works of Bellman and Howard
– 50s-80s: theory, basic set of algorithms, applications
– 90s: MDPs in AI literature
• MDPs in AI – reinforcement learning
– probabilistic planning
9
we focus on this
![Page 10: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/10.jpg)
What are MDPs good for?
• Uncertain Domain Dynamics
• Sequential Decision Making • Cyclic Domain Structures
• Full Observability and Perfect Sensors
• Fair Nature
• Rational Decision Making
10
![Page 11: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/11.jpg)
Markov Decision Process
Operations
Research
Artificial
Intelligence
Gambling
Theory
Graph
Theory
Robotics Neuroscience
/Psychology
Control
Theory
Economics
An MDP-Centric View
![Page 12: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/12.jpg)
Shameless Plug
12
Mausam and Andrey Kolobov “Planning with Markov Decision Processes: An AI Perspective”
Morgan and Claypool Publishers (Synthesis Lectures Series on Artificial Intelligence)
![Page 13: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/13.jpg)
Outline of the Tutorial
• Introduction
• Fundamentals of MDPs
• Uninformed Algorithms
• Heuristic Search Algorithms
• Approximation Algorithms
• Extension of MDPs
13
(10 mins)
(1+ hr)
(1 hr)
(1 hr)
(1+ hr)
(remaining time)
![Page 14: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/14.jpg)
FUNDAMENTALS OF MARKOV DECISION PROCESSES
14
TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAAAAAA
![Page 15: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/15.jpg)
3 Questions
• What is an MDP?
• What is an MDP solution?
• What it an optimal MDP solution?
15
![Page 16: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/16.jpg)
MDP: A Broad Definition
MDP is a tuple <S, D, A, T, R>, where:
• S is a finite state space
• D is a sequence of discrete time steps/decision epochs (1,2,3, … , L), L may be ∞
• A is a finite action set
• T: S x A x S x D [0, 1] is a transition function
• R: S x A x S x D R is a reward function
16
S1
S2
S3
S1
S2
S3
S1
S2
S3
a1
a1
a1 a1
a1
a1
a2
R(s3, a1, s3, 2) = 3.5
T(s3, a1, s3, 2) = 1.0
R(s3, a1, s2, 1) = -7.2
T(s3, a1, s2, 1) = 1.0
t =1 t =2 t = …
a2
a2
a2
a2
a2
![Page 17: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/17.jpg)
MDP Solution: A Broad Definition
• Want a way to choose an action in a state, i.e., a policy π
• What does a policy look like?
– Can pick action based on states visited + actions used so far, i.e., execution history h = s(1) a(1) s(2) a(2)… s
– Can pick actions randomly
• Thus, in general an MDP solution is a probabilistic history-dependent π: H x A [0,1]
17
![Page 18: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/18.jpg)
Evaluating MDP Solutions
• Executing a policy yields a sequence of rewards
• Let R1, R2, … be a sequence of random vars for rewards due to executing a policy
18
S S‘ S’’
t = i t = i +1 t = i + 2
r1 = R(s, a, s’, i) r2 = R(s’, a’, s’’, i+1)
a a’ …
![Page 19: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/19.jpg)
Evaluating MDP Solutions
• Define utility function u(R1, R2, … ) to be some “quality measure” of a reward sequence
– Need to be careful with definition, more on this later
• Define value function as V: H [-∞, ∞]
• Define value function of a policy after history h to be some utility function of subsequent rewards:
Vπ(h) = u (R1, R2, … )
19
h π
![Page 20: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/20.jpg)
Optimal MDP Solution: A Broad Definition
• Want: a behavior that is “best” in every situation.
• π* is an optimal policy if V*(h) ≥ Vπ(h) for all π, for all h
• Intuitively, a policy is optimal if its utility vector dominates
– π* not necessarily unique!
20
![Page 21: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/21.jpg)
3 Questions Revisited
• What is an MDP?
– M = <S, D, A, T, R>
• What is an MDP solution?
– Policy π: H x A [0,1], a mapping from histories to distributions over actions
• What it an optimal MDP solution?
– π* s.t. V*(h) ≥ Vπ(h) for all π and h, where Vπ(h) is some utility of rewards obtained after executing history h
21
![Page 22: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/22.jpg)
Anything Wrong w/ These Definitions?
• What is an MDP?
– M = <S, D, A, T, R>
• What is an MDP solution?
– Policy π: H x A [0,1], a mapping from histories to distributions over actions.
• What it an optimal MDP solution?
– π* s.t. V*(h) ≥ Vπ(h) for all π and h, where Vπ(h) is some utility of rewards obtained after executing history h
22
Optimality criterion is underspecified,
optimal policy may not exist!
![Page 23: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/23.jpg)
Fundamentals of MDPs
General MDP Definition
• Expected Linear Additive Utility
• The Optimality Principle
• Finite-Horizon MDPs
• Infinite-Horizon Discounted-Reward MDPs
• Stochastic Shortest-Path MDPs
• A Hierarchy of MDP Classes
• Factored MDPs
• Computational Complexity
23
![Page 24: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/24.jpg)
Dealing with Optimal Solution Existence
• Need to be careful when defining utility u(R1, R2, … ),
– E.g., u(R1, R2, … ) = R1 + R2 + … for the same h can be different across policy executions (i.e., not a well-defined function)
• Even for a well-defined u(R1, R2, … ), a policy π* s.t. V*(h) ≥ Vπ(h) for all π and h may not exist!
25
![Page 25: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/25.jpg)
Expected Linear Additive Utility
• Let’s use expected linear additive utility (ELAU)
u(R1, R2, … ) = E[R1 + γR2 + γ2R3 …]
where γ is the discount factor
• Assume γ = 1 unless stated otherwise
– 0 ≤ γ < 1: agent prefers more immediate rewards
– γ > 1: agent prefers more distant rewards
– γ = 1: rewards are equally valuable, independently of time
26
![Page 26: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/26.jpg)
Is ELAU What We Want?
Policy 1 Policy 2
– If it lands heads, you get $2M
– If it lands tails, you get nothin’
27
… your $1M … flip a fair coin
![Page 27: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/27.jpg)
Is ELAU What We Want?
• ELAU: “the utility of a policy is as good as the amount of reward the policy is expected to bring” – Agents using ELAU are “rational” (sometimes, a bad misnomer!)
• Assumes the agent is risk-neutral
– Indifferent to policies with equal reward expectation – E.g., disregards policies’ variance (in the previous example, policy
1 has lower variance)
• Not always the exact criterion we want, but... – “Good enough” – Convenient to work with – Guarantees the Optimality Principle
28
![Page 28: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/28.jpg)
Fundamentals of MDPs
General MDP Definition
Expected Linear Additive Utility
• The Optimality Principle
• Finite-Horizon MDPs
• Infinite-Horizon Discounted-Reward MDPs
• Stochastic Shortest-Path MDPs
• A Hierarchy of MDP Classes
• Factored MDPs
• Computational Complexity
29
![Page 29: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/29.jpg)
The Optimality Principle
If the quality of every policy can be measured by its expected linear additive utility, there is a policy that is optimal at every time step.
(Stated in various forms by
Bellman, Denardo, and others)
30
Guarantees that an optimal policy exists when ELAU is
well-defined!
![Page 30: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/30.jpg)
The Optimality Principle: Caveat #1
• When can policy quality not be measured by ELAU?
– Utility of above policy at s1 oscillates between 1 and 0
• ELAU isn’t well-defined unless the limit of the series E[R1 + R2 + …] exists
31
S1 S2
a1
a1
R(s2, a1, s1) = -1
R(s1, a1, s2) = 1
![Page 31: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/31.jpg)
The Optimality Principle: Caveat #2
• The utility of many policies may be infinite
– Every policy allows for ∞ reward from every state above
• ELAU may not be a meaningful criterion unless u(R1, R2, … ) = E[R1 + R2 + …] is bounded above.
32
S1 S2
a1
a1
R(s2, a1, s1) = 1
R(s1, a1, s2) = 1
a2 a2
R(s1, a2, s1) = 1 R(s2, a2, s2) = 1
![Page 32: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/32.jpg)
Recap So Far
• What is an MDP?
– M = <S, D, A, T, R>
• What is an MDP solution?
– Policy π: H x A [0,1], a mapping from histories to distributions over actions.
• What it an optimal MDP solution?
– π* s.t. V*(h) ≥ Vπ(h) for all π and h,where Vπ(h) is the expected linear additive utility of rewards obtained after executing h
33
![Page 33: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/33.jpg)
Coming Up Next
• What is an MDP?
– Stationary M = <S, D, A, T, R>
• What is an MDP solution?
– Policy π: H x A [0,1], a mapping from histories to distributions over actions.
• What it an optimal MDP solution?
– π* s.t. V*(h) ≥ Vπ(h) for all π and h,where Vπ(h) is the expected linear additive utility of rewards obtained after executing h
34
Make sure ELAU is well-defined
![Page 34: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/34.jpg)
3 Models with Well-Defined Policy ELAU
1) Finite-horizon MDPs
2) Infinite-horizon discounted-reward MDPs
3) Stochastic Shortest-Path MDPs
35
![Page 35: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/35.jpg)
Fundamentals of MDPs
General MDP Definition
Expected Linear Additive Utility
The Optimality Principle
• Finite-Horizon MDPs
• Infinite-Horizon Discounted-Reward MDPs
• Stochastic Shortest-Path MDPs
• A Hierarchy of MDP Classes
• Factored MDPs
• Computational Complexity
36
![Page 36: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/36.jpg)
Finite-Horizon MDPs: Motivation
• Assume the agent acts for a finite # time steps, L
• Example applications:
– Inventory management
“How much X to order from
the supplier every day ‘til
the end of the season?”
– Maintenance scheduling
“When to schedule
disruptive maintenance
jobs by their deadline?”
37
![Page 37: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/37.jpg)
Finite-Horizon MDPs: Definition
FH MDP is a tuple <S, A, D, T, R>, where:
• S is a finite state space
• D is a sequence of time steps (1,2,3, …, L) up to a finite horizon L
• A is a finite action set
• T: S x A x S x D [0, 1] is a transition function
• R: S x A x S x D R is a reward function
Policy value = ELAU over the remaining time steps
38
Puterman, 1994
![Page 38: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/38.jpg)
Aside: Deterministic Markovian Policies
• For FH MDPs, we can consider only deterministic Markovian solutions – Will shortly see why
• A policy is deterministic if for every history, it assigns all probability mass to one action:
π: H A
• A policy is deterministic Markovian if its decision in each
state is independent of execution history:
π: S x D A
39
![Page 39: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/39.jpg)
Aside: Markovian Value Functions
• Markovian policies can be evaluated with Markovian value functions
• Let hs,t denote history ending in state s at time t
• Vπ(hs,t) = Vπ(h’s,t) for all hs,t, h’s,t if π is Markovian
• Call V Markovian if for all hs,t, h’s,t, V(hs,t) = V(h’s,t)
– For each s, t denote Markovian V as V(s,t)
40
![Page 40: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/40.jpg)
Finite-Horizon MDPs: Optimality Principle
For an FH MDP with horizon|D| = L < ∞, let:
– Vπ(hs,t) = Eh,s,t[R1 + … + RL - t] for all 1 ≤ t ≤ L – Vπ(hs,L+1) = 0
Then:
– V* exists and is Markovian, π* exists and is det. Markovian – For all s and 1 ≤ t ≤ L:
V*(s,t) = maxa in A [ ∑s’ in S T(s, a, s’, t) [ R(s, a, s’, t) + V*(s’, t+1) ] ]
π*(s,t) =argmaxa in A [ ∑s’ in S T(s, a, s’, t) [ R(s, a, s’, t) + V*(s’, t+1) ] ]
41
π
Exp. Lin. Add. Utility
Each E[Ri] is finite
# terms in the series is finite
For every history, the value of every policy
is well-defined!
Highest utility derivable from s at
time t
Highest utility derivable from the
next state
Immediate utility of the next action
In expectation
If you act optimally now
![Page 41: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/41.jpg)
Perks of the FH MDP Optimality Principle
• V*, π* Markovian consider only Markovian V, π!
• Can easily compute π*! – For all s, compute V*(s, t) and π*(s, t) for t = L, …, 1
42
Probabilistic history-dep. π
Deterministic Markovian π
Number ∞ |A||S||D|
Size of each Ginormous! O(|S||D|)
![Page 42: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/42.jpg)
Moving to In(de)finite Horizon
• Finite known horizon sometimes not good enough
– Doesn’t cover autonomous agents with long lifespans
• Two other options:
– Infinite horizon (horizon known to be infinite)
– Indefinite horizon (horizon known to be unbounded)
43
![Page 43: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/43.jpg)
• Finite known horizon sometimes not good enough
– Doesn’t cover autonomous agents with long lifespans
• Two other options:
– Infinite horizon (horizon known to be infinite)
– Indefinite horizon (horizon known to be unbounded)
44
Moving to Infinite Horizon
![Page 44: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/44.jpg)
Analyzing MDPs with In(de)finite Horizon
• Hard to specify time-dependent T, R, etc. for a large (infinite) # steps
• Need stationary (time-independent) functions: – Stationary transition function of the form
T: S x A x S [0, 1]
– Stationary reward function of the form R: S x A x S R
– Stationary deterministic Markovian policy of the form π: S A – Stationary Markovian value function of the form V: S [-∞, ∞]
45
![Page 45: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/45.jpg)
Fundamentals of MDPs
General MDP Definition
Expected Linear Additive Utility
The Optimality Principle
Finite-Horizon MDPs
• Infinite-Horizon Discounted-Reward MDPs
• Stochastic Shortest-Path MDPs
• A Hierarchy of MDP Classes
• Factored MDPs
• Computational Complexity
46
![Page 46: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/46.jpg)
Infinite-Horizon Discounted-Reward MDPs: Motivation
• Assume the agent acts for an infinitely long time
• Example applications:
– Portfolio management
“How to invest money
under a given rate of
inflation?”
– Unstable system control
“How to help fly
a B-2 bomber?”
47
![Page 47: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/47.jpg)
Infinite-Horizon Discounted MDPs: Definition
IHDR MDP is a tuple <S, A, T, R, γ>, where:
• S is a finite state space
• (D is an infinite sequence (1,2, …))
• A is a finite action set
• T: S x A x S [0, 1] is a stationary transition function
• R: S x A x S R is a stationary reward function
• γ is a discount factor satisfying 0 ≤ γ < 1
Policy value = discounted ELAU over infinite time steps
48
Puterman, 1994
![Page 48: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/48.jpg)
Infinite-Horizon Discounted-Reward MDPs: Optimality Principle
For an IHDR MDP, let:
– Vπ(h) = Eh [R1 + γR2 + γ2R3 +… ] for all h
Then:
– V* exists and is stationary Markovian, π* exists and is
stationary deterministic Markovian – For all s:
V*(s) = maxa in A [ ∑s’ in S T(s, a, s’) [ R(s, a, s’) + γV*(s’) ] ]
π*(s) =argmaxa in A [ ∑s’ in S T(s, a, s’) [ R(s, a, s’) + γV*(s’) ] ]
49
π
Exp. Lin. Add. Utility
All γiE[Ri] are bounded by some finite K and converge
geometrically
For every history, the value of a policy is
well-defined thanks to 0 ≤ γ < 1!
Future utility is discounted Optimal utility is time-independent!
![Page 49: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/49.jpg)
Perks of the IFHD MDP Optimality Principle
• V*, π* stationary Markovian consider only stationary Markovian V, π!
50
Deterministic Markovian π
Stationary deterministic Markovian π
Number ∞ |A||S|
Size of each ∞ O(|S|)
![Page 50: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/50.jpg)
Where Does γ Come From?
• γ can affect optimal policy significantly
– γ = 0 + ε: yields myopic policies for “impatient” agents
– γ = 1 - ε: yields far-sighted policies, inefficient to compute
• How to set it?
– Sometimes suggested by data (e.g., inflation or interest rate)
– Often set to whatever gives a reasonable policy
51
![Page 51: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/51.jpg)
Moving to Indefinite Horizon
• Finite known horizon sometimes not good enough
– Doesn’t cover autonomous agents with long lifespans.
• Two other options:
– Infinite horizon (horizon known to be infinite)
– Indefinite horizon (horizon known to be unbounded)
52
![Page 52: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/52.jpg)
Fundamentals of MDPs
General MDP Definition
Expected Linear Additive Utility
The Optimality Principle
Finite-Horizon MDPs
Infinite-Horizon Discounted-Reward MDPs
• Stochastic Shortest-Path MDPs
• A Hierarchy of MDP Classes
• Factored MDPs
• Computational Complexity
53
![Page 53: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/53.jpg)
Stochastic Shortest-Path MDPs: Motivation
• Assume the agent pays cost to achieve a goal
• Example applications:
– Controlling a Mars rover
“How to collect scientific
data without damaging
the rover?”
– Navigation
“What’s the fastest way
to get to a destination, taking
into account the traffic jams?”
54
![Page 54: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/54.jpg)
Stochastic Shortest-Path MDPs: Definition
SSP MDP is a tuple <S, A, T, C, G>, where: • S is a finite state space • (D is an infinite sequence (1,2, …)) • A is a finite action set • T: S x A x S [0, 1] is a stationary transition function • C: S x A x S R is a stationary cost function (= -R: S x A x S R) • G is a set of absorbing cost-free goal states
Under two conditions: • There is a proper policy (reaches a goal with P= 1 from all states) • Every improper policy incurs a cost of ∞ from every state from
which it does not reach the goal with P=1
55
Bertsekas, 1995
![Page 55: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/55.jpg)
SSP MDP Details
• In SSP, maximizing ELAU = minimizing exp. cost
• Every cost-minimizing policy is proper!
• Thus, an optimal policy = cheapest way to a goal
• Why are SSP MDPs called “indefinite-horizon”?
– If a policy is optimal, it will take a finite, but apriori unknown, time to reach goal
56
![Page 56: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/56.jpg)
SSP MDP Example
57
S1 S2
a1
C(s2, a1, s1) = -1
C(s1, a1, s2) = 1
a2 a2
C(s1, a2, s1) = 7.2
C(s2, a2, sG) = 1
SG
C(sG, a2, sG) = 0
C(sG, a1, sG) = 0
C(s2, a2, s2) = -3
T(s2, a2, sG) = 0.3
T(s2, a2, sG) = 0.7
S3
C(s3, a2, s3) = 0.8 C(s3, a1, s3) = 2.4
a1 a2
C(s2, a1, s3) = 5
a1
T(s2, a1, s3) = 0.6
T(s2, a1, s1) = 0.4
No dead ends allowed!
, not!
a1
a2
![Page 57: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/57.jpg)
SSP MDP Example
58
S1 S2
a1
a1 C(s2, a1, s1) = -1
C(s1, a1, s2) = 1
a2 a2
C(s1, a2, s1) = 7.2
C(s2, a2, sG) = 1
SG
C(sG, a2, sG) = 0
C(sG, a1, sG) = 0
C(s2, a2, s2) = -3
T(s2, a2, sG) = 0.3
T(s2, a2, sG) = 0.7
No cost-free “loops” allowed!
, also not!
a2
a1
![Page 58: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/58.jpg)
SSP MDP Example
59
S1 S2
a1
a1 C(s2, a1, s1) = 0
C(s1, a1, s2) = 1
a2 a2
C(s1, a2, s1) = 7.2
C(s2, a2, sG) = 1
SG
C(sG, a2, sG) = 0
C(sG, a1, sG) = 0
C(s2, a2, s2) = 1
T(s2, a2, sG) = 0.3
T(s2, a2, sG) = 0.7
![Page 59: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/59.jpg)
SSP MDPs: Optimality Principle
For an SSP MDP, let:
– Vπ(h) = Eh[C1 + C2 + …] for all h
Then:
– V* exists and is stationary Markovian, π* exists and is stationary
deterministic Markovian – For all s:
V*(s) = mina in A [ ∑s’ in S T(s, a, s’) [ C(s, a, s’) + V*(s’) ] ]
π*(s) = argmina in A [ ∑s’ in S T(s, a, s’) [ C(s, a, s’) + V*(s’) ] ]
60
π
Exp. Lin. Add. Utility
Every policy either takes a finite exp. # of steps to reach a goal, or has an infinite cost.
For every history, the value of a policy
is well-defined!
![Page 60: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/60.jpg)
Fundamentals of MDPs
General MDP Definition
Expected Linear Additive Utility
The Optimality Principle
Finite-Horizon MDPs
Infinite-Horizon Discounted-Reward MDPs
Stochastic Shortest-Path MDPs
• A Hierarchy of MDP Classes
• Factored MDPs
• Computational Complexity
61
![Page 61: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/61.jpg)
SSP and Other MDP Classes
• FH => SSP: turn all states (s, L) into goals
• IHDR => SSP: add (1-γ)-probability transitions to goal
• Will concentrate on SSP in the rest of the tutorial
62
SSP IHDR FH
![Page 62: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/62.jpg)
Fundamentals of MDPs
General MDP Definition
Expected Linear Additive Utility
The Optimality Principle
Finite-Horizon MDPs
Infinite-Horizon Discounted-Reward MDPs
Stochastic Shortest-Path MDPs
A Hierarchy of MDP Classes
• Factored MDPs
• Computational Complexity
63
![Page 63: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/63.jpg)
Factored SSP MDPs: Motivation
• How to describe an MDP instance? – S = {s1, … , sn} – flat representation – T(si, aj, sk) = pi,j,k for every state, action, state triplet – …
• Flat representation too cumbersome! – Real MDPs have billions of billions of states – Can’t enumerate transition function explicitly
• Flat representation too uninformative! – State space has no meaningful distance measure – Tabulated transition/reward function has no structure
64
![Page 64: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/64.jpg)
Factored SSP MDPs: Definition
Factored SSP MDP is a tuple <X, A, T, C, G>, where: • X is a finite set of state variables (domain variables, features)
• (D is an infinite sequence (1,2, …))
• A is a finite action set
• T: (dom(X1) x … x dom(Xn)) x A x (dom(X1) x … x dom(Xn)) [0, 1] is a stationary transition function
• C: (dom(X1) x … x dom(Xn)) x A x (dom(X1) x … x dom(Xn)) R is a stationary cost function
• G, given by a conjunction over a subset of X, is a set of goal states
The conditions of the flat SSP MDP definition still apply
65
![Page 65: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/65.jpg)
Factored Representation Languages
• PPDDL – Prob. Planning Domain Definition Language [Younes and Littman, 2004]
• RDDL – Relational Domain Definition Language [Sanner, 2011]
66
![Page 66: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/66.jpg)
Example Factored SSP MDP in PPDDL
• Gremlin wants to sabotage an airplane
• Can use tools to fulfill its objective
• Needs to pick up the tools
• X = { }
67
![Page 67: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/67.jpg)
Example Factored SSP MDP in PPDDL
68
G e t S
G e t
W
G e t H
A = T = C =
1.0 1
1.0 1
0.4 1
G =
0.6 1
Preconditions
Effects/outcomes
![Page 68: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/68.jpg)
Example Factored SSP MDP in PPDDL
69
S m a s h
T w e a k
0.9 0.1
A = T = C =
1.0 2 1 100
![Page 69: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/69.jpg)
Example Factored SSP MDP in RDDL
• Sysadmin needs to maintain a network of servers until time L – Gets paid proportionately to the # of servers running at each time step
• Each server can go up or down with some probability – And drag its neighbors down – probability of going down increases with the
number of down neighbors
• Sysadmin can restart just one server per time step
• Enormous number of uncorrelated effects for each action – 2N for a problem with N servers
• X = { 1 , … , N}, G = any state at time L
70
? ?
![Page 70: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/70.jpg)
Example Factored SSP MDP in RDDL
71
Restart(Ser2)
Restart(Ser1)
Restart(Ser3)
P(Ser1t |Restartt-1(Ser1), Ser1
t-1, Ser2t-1)
P(Ser2t |Restartt-1(Ser2), Ser1
t-1, Ser2t-1, Ser3
t-1)
P(Ser3t |Restartt-1(Ser3), Ser2
t-1, Ser3t-1)
Time t-1 Time t
T: A: C: -∑i [Seri = ↑]
![Page 71: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/71.jpg)
Factored Representation Languages Summary
• PPDDL – Prob. Planning Domain Definition Language – Represents MDP actions as templates
– Good for MDPs with strongly correlated effects
– Inconvenient for MDPs with uncorrelated effects
• RDDL – Relational Domain Definition Language – Represents MDP as a Dynamic Bayes Net
– Shows how each variable evolves under every action
– Good for MDPs with uncorrelated effects
– Inconvenient for MDPs with uncorrelated strongly correlated effects
72
![Page 72: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/72.jpg)
Benefits of Factored Representations
• Can meaningfully group states
– E.g., by similarity
– And assign the same policy to each group
• Can meaningfully express V as a function of state variables
– Using mathematical operations, e.g. V(s) = X1(s) + … + Xn(s)
– Basis of dimensionality reduction techniques
• Can manipulate values of sets of states
– Symbolic and approximate algorithms, more on this later
73
![Page 73: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/73.jpg)
Fundamentals of MDPs
General MDP Definition
Expected Linear Additive Utility
The Optimality Principle
Finite-Horizon MDPs
Infinite-Horizon Discounted-Reward MDPs
Stochastic Shortest-Path MDPs
A Hierarchy of MDP Classes
Factored MDPs
• Computational Complexity
74
![Page 74: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/74.jpg)
Computational Complexity of MDPs
• Good news:
– Solving IHDR, SSP in flat representation is P-complete
– Solving FH in flat representation is P-hard
– That is, they don’t benefit from parallelization, but are solvable in polynomial time!
75
![Page 75: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/75.jpg)
Computational Complexity of MDPs
• Bad news:
– Solving FH, IHDR, SSP in factored representation is EXPTIME-complete!
– Flat representation doesn’t make MDPs harder to solve, it makes big ones easier to describe.
76
![Page 76: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/76.jpg)
Computational Complexity of MDPs
• Consolation:
– Introduce factored SSPs0 (FHs0, IFHDs0)– factored MDP with a designated initial state s0
– Assume an optimal policy starting at s0 visits at most O(poly|X|) states
– FH, IHDR SSPs0 with O(poly|X|) optimal policy size are PSPACE-complete!
77
![Page 77: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/77.jpg)
Summary So Far
• Introduced a broad MDP definition – It had an ill-defined optimal solution concept
• Imposed restrictions on the general definition to make
optimal solution well-defined – Based on expected linear additive utility – Gave rise to FH, IHDR, and SSP
• Introduced factored representations
– Convenient to use, but make MDPs look hard to solve
– In fact, they are hard to solve…
78
![Page 78: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/78.jpg)
Outline of the Tutorial
• Introduction
• Fundamentals of MDPs
• Uninformed Algorithms
• Heuristic Search Algorithms
• Approximation Algorithms
• Extension of MDPs
79
(10 mins)
(1+ hr)
(1 hr)
(1 hr)
(1+ hr)
(remaining time)
![Page 79: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/79.jpg)
UNINFORMED ALGORITHMS
80
TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAAAAAA
![Page 80: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/80.jpg)
Uninformed Algorithms
• Definitions
• Fundamental Algorithms
• Prioritized Algorithms
• Partitioned Algorithms
• Other models
81
![Page 81: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/81.jpg)
Stochastic Shortest-Path MDPs: Definition
SSP MDP is a tuple <S, A, T, C, G>, where: • S is a finite state space • A is a finite action set • T: S x A x S [0, 1] is a stationary transition function • C: S x A x S R is a stationary cost function • G is a set of absorbing cost-free goal states
Under two conditions: • There is a proper policy (reaches a goal with P=1 from all states) • Every improper policy incurs a cost of ∞ from every state from which
it does not reach the goal with P=1
• Solution of an SSP: policy (¼: S!A)
82
![Page 82: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/82.jpg)
Uninformed Algorithms
• Definitions
• Fundamental Algorithms
• Prioritized Algorithms
• Partitioned Algorithms
• Other models
83
![Page 83: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/83.jpg)
Brute force Algorithm
• Go over all policies ¼ – How many? |A||S|
• Evaluate each policy – V¼(s) Ã expected cost of reaching goal from s
• Choose the best – We know that best exists (SSP optimality principle)
– V¼*(s) · V¼(s)
84
finite
how to evaluate?
![Page 84: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/84.jpg)
Policy Evaluation
• Given a policy ¼: compute V¼
• TEMPORARY ASSUMPTION: ¼ is proper
– execution of ¼ reaches a goal from any state
85
![Page 85: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/85.jpg)
Deterministic SSPs
• Policy Graph for ¼
¼(s0) = a0; ¼(s1) = a1
• V¼(s1) = 1
• V¼(s0) = 6
86
s0 s1 sg
C=5 C=1
a0 a1
add costs on path to goal
![Page 86: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/86.jpg)
Acyclic SSPs
• Policy Graph for ¼
• V¼(s1) = 1
• V¼(s2) = 4
• V¼(s0) = 0.6(5+1) + 0.4(2+4) = 6
87
s0
s1
s2
sg
Pr=0.6 C=5
Pr=0.4 C=2
C=1
C=4
a0 a1
a2
backward pass in reverse topological order
![Page 87: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/87.jpg)
General SSPs can be cyclic!
• V¼(s1) = 1
• V¼(s2) = ?? (depends on V¼(s0))
• V¼(s0) = ?? (depends on V¼(s2))
88
a2
Pr=0.7 C=4
Pr=0.3 C=3
s0
s1
s2
sg
Pr=0.6 C=5
Pr=0.4 C=2
C=1
a0 a1
cannot do a simple single pass
![Page 88: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/88.jpg)
General SSPs can be cyclic!
• V¼(g) = 0 • V¼(s1) = 1+V¼(sg) = 1 • V¼(s2) = 0.7(4+V¼(sg)) + 0.3(3+V¼(s0)) • V¼(s0) = 0.6(5+V¼(s1)) + 0.4(2+V¼(s2))
89
a2
Pr=0.7 C=4
Pr=0.3 C=3
s0
s1
s2
sg
Pr=0.6 C=5
Pr=0.4 C=2
C=1
a0 a1
a simple system of linear equations
![Page 89: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/89.jpg)
Policy Evaluation (Approach 1)
• Solving the System of Linear Equations
• |S| variables.
• O(|S|3) running time
90
V ¼(s) = 0 if s 2 G=
X
s02ST (s; ¼(s); s0) [C(s; ¼(s); s0) + V ¼(s0)]
![Page 90: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/90.jpg)
Iterative Policy Evaluation
91
a2
Pr=0.7 C=4
Pr=0.3 C=3
s0
s1
s2
sg
Pr=0.6 C=5
Pr=0.4 C=2
C=1
a0 a1
0
1
3.7+0.3V¼(s0) 3.7
5.464 5.67568
5.7010816 5.704129…
4.4+0.4V¼(s2) 0
5.88 6.5856
6.670272 6.68043..
![Page 91: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/91.jpg)
Policy Evaluation (Approach 2)
92
iterative refinement
V ¼n (s)Ã
X
s02ST (s; ¼(s); s0)
£C(s; ¼(s); s0) + V ¼
n¡1(s0)¤
(1)
V ¼(s) =X
s02ST (s; ¼(s); s0) [C(s; ¼(s); s0) + V ¼(s0)]
![Page 92: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/92.jpg)
Iterative Policy Evaluation
93
iteration n
²-consistency
termination condition
![Page 93: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/93.jpg)
Convergence & Optimality
For a proper policy ¼
Iterative policy evaluation
converges to the true value of the policy, i.e.
irrespective of the initialization V0
94
limn!1V ¼n = V ¼
![Page 94: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/94.jpg)
Brute force Algorithm
• Go over all policies ¼: – How many? |A||S|
• Evaluate each policy – V¼(s) Ã expected cost of reaching goal from s
• Choose the best – We know that best exists (SSP optimality principle)
– V¼*(s) · V¼(s)
95
how to evaluate?
too slow choose an intelligent order for ¼
![Page 95: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/95.jpg)
Q-Value under a Value Function V
• The Q-value of state s and action a under a value function V
– denoted as QV(s,a)
• one-step lookahead computation of the value of a
– assuming V is true expected cost to reach goal
96
QV (s; a) =X
s02ST (s; a; s0) [C(s; a; s0) + V (s0)]
![Page 96: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/96.jpg)
Greedy Action/Policy
• Define a greedy action a wrt V
– an action that has the lowest Q-value, i.e.
– a = argmina’QV(s,a’)
• Define a greedy policy ¼V
– Policy with all greedy actions wrt V for each state
97
![Page 97: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/97.jpg)
Policy Iteration [Howard 60]
• initialize ¼0 as a random proper policy
• repeat
Policy Evaluation: Compute V¼n-1
Policy Improvement: Construct ¼n greedy wrt V¼n-1
• until ¼n==¼n-1
• return ¼n
98
choose ¼n-1
if multiple greedy actions
![Page 98: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/98.jpg)
Properties
• Policy Iteration for an SSP
(initialized with a proper policy ¼0)
Successively improves the policy in each iteration, i.e.
V¼n(s) · V¼n -1(s), and
converges to an optimal policy
99
![Page 99: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/99.jpg)
Modified Policy Iteration [van Nunen 76]
• initialize ¼0 as a random proper policy
• repeat
Approximate Policy Evaluation: Compute V¼n-1
by running only few iterations of iterative policy eval.
Policy Improvement: Construct ¼n greedy wrt V¼n-1
• until …
• return ¼n
100
![Page 100: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/100.jpg)
Limitations of PI
• Why do we need to start with a proper policy?
– Policy Evaluation will diverge
• How to get a proper policy?
– No domain independent algorithm
• PI for SSPs is not generally applicable
101
![Page 101: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/101.jpg)
Policy Iteration Value Iteration
• Changing the search space.
• Policy Iteration – Search over policies
– Compute the resulting value
• Value Iteration – Search over values
– Compute the resulting policy
102
![Page 102: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/102.jpg)
Optimality Principle/Bellman Equations
103
V ¤(s) = 0 if s 2 G= min
a2A
X
s02ST (s; a; s0) [C(s; a; s0) + V ¤(s0)]
Q*(s,a)
V*(s) = mina Q*(s,a)
![Page 103: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/103.jpg)
Fixed Point Computation in VI
104
iterative refinement
V ¤(s) = mina2A
X
s02ST (s; a; s0) [C(s; a; s0) + V ¤(s0)]
Vn(s)Ãmina2A
X
s02ST (s; a; s0) [C(s; a; s0) + Vn¡1(s
0)]
non-linear
![Page 104: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/104.jpg)
Running Example
105
s0
s2
s1
sg Pr=0.6
a00 s4
s3
Pr=0.4 a01
a21 a1
a20 a40
C=5 a41
a3 C=2
![Page 105: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/105.jpg)
V0= 0
V0= 2
Q1(s4,a40) = 5 + 0
Q1(s4,a41) = 2 + 0.6£ 0
+ 0.4£ 2
= 2.8
min
V1= 2.8
agreedy = a41
a41
a40
s4
sg
s3
Bellman Backup
C=5
C=2
sg Pr=0.6
s4
s3
Pr=0.4
a40
C=5 a41
a3 C=2
![Page 106: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/106.jpg)
Value Iteration [Bellman 57]
107
iteration n
²-consistency
termination condition
No restriction on initial value function
![Page 107: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/107.jpg)
Running Example
108
s0
s2
s1
sg Pr=0.6
a00 s4
s3
Pr=0.4 a01
a21 a1
a20 a40
C=5 a41
a3 C=2
n Vn(s0) Vn(s1) Vn(s2) Vn(s3) Vn(s4)
0 3 3 2 2 1
1 3 3 2 2 2.8
2 3 3 3.8 3.8 2.8
3 4 4.8 3.8 3.8 3.52
4 4.8 4.8 4.52 4.52 3.52
5 5.52 5.52 4.52 4.52 3.808
20 5.99921 5.99921 4.99969 4.99969 3.99969
![Page 108: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/108.jpg)
Convergence & Optimality
• For an SSP MDP, 8s2 S,
lim n!1 Vn(s) = V*(s)
irrespective of the initialization.
109
![Page 109: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/109.jpg)
Running Time
• Each Bellman backup: – Go over all states and all successors: O(|S||A|)
• Each VI Iteration – Backup all states: O(|S|2|A|)
• Number of iterations – General SSPs: no good bounds
– Special cases: better bounds • (e.g., when all costs positive [Bonet 07])
110
![Page 110: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/110.jpg)
SubOptimality Bounds
• General SSPs
– weak bounds exist on |Vn(s) – V*(s)|
• Special cases: much better bounds exist
– (e.g., when all costs positive [Hansen 11])
111
![Page 111: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/111.jpg)
Monotonicity
For all n>k
Vk ≤p V* ⇒ Vn ≤p V* (Vn monotonic from below) Vk ≥p V* ⇒ Vn ≥p V* (Vn monotonic from above)
112
![Page 112: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/112.jpg)
VI Asynchronous VI
• Is backing up all states in an iteration essential? – No!
• States may be backed up – as many times
– in any order
• If no state gets starved – convergence properties still hold!!
113
![Page 113: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/113.jpg)
Residual wrt Value Function V (ResV)
• Residual at s with respect to V
– magnitude(¢V(s)) after one Bellman backup at s
• Residual wrt respect to V
– max residual
– ResV = maxs (ResV(s))
114
ResV (s) =
¯̄¯̄¯V (s)¡min
a2A
X
s02ST (s; a; s0)[C(s; a; s0) + V (s0)]
¯̄¯̄¯
ResV <²
(²-consistency)
![Page 114: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/114.jpg)
(General) Asynchronous VI
115
![Page 115: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/115.jpg)
Uninformed Algorithms
• Definitions
• Fundamental Algorithms
• Prioritized Algorithms
• Partitioned Algorithms
• Other models
116
![Page 116: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/116.jpg)
Prioritization of Bellman Backups
• Are all backups equally important?
• Can we avoid some backups?
• Can we schedule the backups more appropriately?
117
![Page 117: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/117.jpg)
Useless Backups?
118
s0
s2
s1
sg Pr=0.6
a00 s4
s3
Pr=0.4 a01
a21 a1
a20 a40
C=5 a41
a3 C=2
n Vn(s0) Vn(s1) Vn(s2) Vn(s3) Vn(s4)
0 3 3 2 2 1
1 3 3 2 2 2.8
2 3 3 3.8 3.8 2.8
3 4 4.8 3.8 3.8 3.52
4 4.8 4.8 4.52 4.52 3.52
5 5.52 5.52 4.52 4.52 3.808
20 5.99921 5.99921 4.99969 4.99969 3.99969
![Page 118: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/118.jpg)
Useless Backups?
119
s0
s2
s1
sg Pr=0.6
a00 s4
s3
Pr=0.4 a01
a21 a1
a20 a40
C=5 a41
a3 C=2
n Vn(s0) Vn(s1) Vn(s2) Vn(s3) Vn(s4)
0 3 3 2 2 1
1 3 3 2 2 2.8
2 3 3 3.8 3.8 2.8
3 4 4.8 3.8 3.8 3.52
4 4.8 4.8 4.52 4.52 3.52
5 5.52 5.52 4.52 4.52 3.808
20 5.99921 5.99921 4.99969 4.99969 3.99969
![Page 119: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/119.jpg)
Asynch VI Prioritized VI
120
Convergence? Interleave synchronous VI iterations
![Page 120: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/120.jpg)
Which state to prioritize?
121
s1
s'
s'
s' ¢V=0
¢V=0
¢V=0
.
.
. .
.
.
. .
s2
s'
s'
s' ¢V=0
¢V=2
¢V=0
.
.
. .
.
.
. .
s3
s'
s'
s' ¢V=0
¢V=5
¢V=0
.
.
. .
.
.
. .
s1 is zero priority
0.8 0.1
s2 is higher priority s3 is low priority
![Page 121: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/121.jpg)
Prioritized Sweeping [Moore & Atkeson 93]
122
priorityPS(s) = max
½priorityPS(s);max
a2AfT (s; a; s0)ResV (s0)g
¾
• Convergence [Li&Littman 08]
Prioritized Sweeping converges to optimal in the limit,
if all initial priorities are non-zero.
(does not need synchronous VI iterations)
![Page 122: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/122.jpg)
Prioritized Sweeping
123
s0
s2
s1
sg Pr=0.6
a00 s4
s3
Pr=0.4 a01
a21 a1
a20 a40
C=5 a41
a3 C=2
V(s0) V(s1) V(s2) V(s3) V(s4)
Initial V 3 3 2 2 1
3 3 2 2 2.8
Priority 0 0 1.8 1.8 0
Updates 3 3 3.8 3.8 2.8
Priority 2 2 0 0 1.2
Updates 3 4.8 3.8 3.8 2.8
![Page 123: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/123.jpg)
Generalized Prioritized Sweeping [Andre et al 97]
124
priorityGPS2(s) = ResV (s)
• Instead of estimating residual
– compute it exactly
• Slightly different implementation
– first backup then push!
![Page 124: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/124.jpg)
Intuitions
• Prioritized Sweeping
– if a state’s value changes prioritize its predecessors
• Myopic
• Which state should be backed up?
– state closer to goal?
– or farther from goal?
125
![Page 125: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/125.jpg)
Useless Intermediate Backups?
126
s0
s2
s1
sg Pr=0.6
a00 s4
s3
Pr=0.4 a01
a21 a1
a20 a40
C=5 a41
a3 C=2
n Vn(s0) Vn(s1) Vn(s2) Vn(s3) Vn(s4)
0 3 3 2 2 1
1 3 3 2 2 2.8
2 3 3 3.8 3.8 2.8
3 4 4.8 3.8 3.8 3.52
4 4.8 4.8 4.52 4.52 3.52
5 5.52 5.52 4.52 4.52 3.808
20 5.99921 5.99921 4.99969 4.99969 3.99969
![Page 126: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/126.jpg)
Improved Prioritized Sweeping [McMahan&Gordon 05]
127
priorityIPS(s) =ResV (s)
V (s)
• Intuition – Low V(s) states (closer to goal) are higher priority initially
– As residual reduces for those states, • priority of other states increase
• A specific tradeoff – sometimes may work well
– sometimes may not work that well
![Page 127: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/127.jpg)
Tradeoff
• Priority queue increases information flow
• Priority queue adds overhead
• If branching factor is high
– each backup may result in many priority updates!
128
![Page 128: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/128.jpg)
Backward VI [Dai&Hansen 07]
• Prioritized VI without priority queue! • Backup states in reverse order starting from goal
– don‘t repeat a state in an iteration – other optimizations
• (backup only states in current greedy subgraph)
• Characteristics – no overhead of priority queue – good information flow – doesn‘t capture the intuition:
• higher states be converged before propagating further
129
![Page 129: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/129.jpg)
Comments
• Which algorithm to use? – Synchronous VI: when states highly interconnected
– PS/GPS: sequential dependencies
– IPS: specific way to tradeoff proximity to goal/info flow
– BVI: better for domains with fewer predecessors
• Prioritized VI is a meta-reasoning algorithm – reasoning about what to compute!
– costly meta-reasoning can hurt.
130
![Page 130: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/130.jpg)
Uninformed Algorithms
• Definitions
• Fundamental Algorithms
• Prioritized Algorithms
• Partitioned Algorithms
• Other models
131
![Page 131: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/131.jpg)
Partitioning of States
s0
s2
s1
sg Pr=0.6
a00 s4
s3
Pr=0.4 a01
a21 a1
a20 a40
C=5 a41
a3 C=2
![Page 132: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/132.jpg)
(General) Partitioned VI
133
How to construct a partition? How many backups to perform per partition?
How to construct priorities?
![Page 133: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/133.jpg)
Topological VI [Dai&Goldsmith 07]
• Identify strongly-connected components
• Perform topological sort of partitions
• Backup partitions to ²-consistency: reverse top. order
134
s0
s2
s1
sg Pr=0.6
a00 s4
s3
Pr=0.4 a01
a21 a1
a20 a40
C=5 a41
a3 C=2
![Page 134: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/134.jpg)
Other Benefits of Partitioning
• External-memory algorithms
– PEMVI [Dai etal 08, 09]
• partitions live on disk
• get each partition to the disk and backup all states
• Cache-efficient algorithms
– P-EVA algorithm [Wingate&Seppi 04a]
• Parallelized algorithms
– P3VI (Partitioned, Prioritized, Parallel VI) [Wingate&Seppi 04b]
135
![Page 135: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/135.jpg)
Uninformed Algorithms
• Definitions
• Fundamental Algorithms
• Prioritized Algorithms
• Partitioned Algorithms
• Other models
136
![Page 136: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/136.jpg)
Linear Programming for MDPs
137
• |S| variables
• |S||A| constraints – too costly to solve!
![Page 137: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/137.jpg)
Infinite-Horizon Discounted-Reward MDPs
138
V ¤(s) = maxa2A
X
s02ST (s; a; s0) [R(s; a; s0) + °V ¤(s0)]
• VI/PI work even better than SSPs!!
– PI does not require a “proper” policy
– Error bounds are tighter
• Example. VI error bound: |V*(s)-V¼(s)| < 2²°/(1-°)
– We can bound #iterations
• polynomial in |S|, |A| and 1/(1-°)
![Page 138: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/138.jpg)
Finite-Horizon MDPs
139
V ¤(s; t) = 0 if t > L
= maxa2A
X
s02ST (s; a; s0) [R(s; a; s0) + V ¤(s0; t + 1)]
• Finite-Horizon MDPs are acyclic!
– There exists an optimal backup order
• t=Tmax to 0
– Returns optimal values (not just ²-consistent)
– Performs one backup per augmented state
![Page 139: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/139.jpg)
Summary of Uninformed Algorithms
• Definitions
• Fundamental Algorithms – Bellman Equations is the key
• Prioritized Algorithms – Different priority functions have different benefits
• Partitioned Algorithms – Topological analysis, parallelization, external memory
• Other models – Other popular models similar
140
![Page 140: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/140.jpg)
Outline of the Tutorial
• Introduction
• Fundamentals of MDPs
• Uninformed Algorithms
• Heuristic Search Algorithms
• Approximation Algorithms
• Extension of MDPs
141
(10 mins)
(1+ hr)
(1 hr)
(1 hr)
(1+ hr)
(remaining time)
![Page 141: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/141.jpg)
HEURISTIC SEARCH ALGORITHMS
142
TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAAAAAA
![Page 142: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/142.jpg)
Heuristic Search Algorithms
• Definitions
• Find & Revise Scheme.
• LAO* and Extensions
• RTDP and Extensions
• Other uses of Heuristics/Bounds • Heuristic Design
143
![Page 143: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/143.jpg)
Limitations of VI/PI/Extensions
• Scalability – Memory linear in size of state space
– Time at least polynomial or more
• Polynomial is good, no? – state spaces are usually huge.
• Think PPDDL.
– if n state vars then 2n states!
• Curse of Dimensionality!
144
![Page 144: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/144.jpg)
Heuristic Search
• Insight 1
– knowledge of a start state to save on computation
~ (all sources shortest path single source shortest path)
• Insight 2
– additional knowledge in the form of heuristic function
~ (dfs/bfs A*)
145
![Page 145: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/145.jpg)
Model
• SSP (as before) with an additional start state s0
– denoted by SSPs0
• What is the solution to an SSPs0
• Policy (S !A)?
– are states that are not reachable from s0 relevant?
– states that are never visited (even though reachable)?
146
![Page 146: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/146.jpg)
Partial Policy
• Define Partial policy
– ¼: S’ ! A, where S’µ S
• Define Partial policy closed w.r.t. a state s.
– is a partial policy ¼s
– defined for all states s’ reachable by ¼s starting from s
147
![Page 147: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/147.jpg)
Partial policy closed wrt s0
148
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
s9
![Page 148: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/148.jpg)
Partial policy closed wrt s0
149
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
s9
¼s0(s0)= a1
¼s0(s1)= a2
¼s0(s2)= a1
Is this policy closed wrt s0?
![Page 149: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/149.jpg)
Partial policy closed wrt s0
150
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
s9
¼s0(s0)= a1
¼s0(s1)= a2
¼s0(s2)= a1
¼s0(s6)= a1
Is this policy closed wrt s0?
![Page 150: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/150.jpg)
Policy Graph of ¼s0
151
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
s9
¼s0(s0)= a1
¼s0(s1)= a2
¼s0(s2)= a1
¼s0(s6)= a1
![Page 151: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/151.jpg)
Greedy Policy Graph
• Define greedy policy: ¼V = argmina QV(s,a)
• Define greedy partial policy rooted at s0 – Partial policy rooted at s0
– Greedy policy
– denoted by
• Define greedy policy graph – Policy graph of : denoted by
152
¼Vs0
¼Vs0 GVs0
![Page 152: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/152.jpg)
Heuristic Function
• h(s): S!R
– estimates V*(s)
– gives an indication about “goodness” of a state
– usually used in initialization V0(s) = h(s)
– helps us avoid seemingly bad states
• Define admissible heuristic
– optimistic
– h(s) · V*(s)
153
![Page 153: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/153.jpg)
Heuristic Search Algorithms
• Definitions
• Find & Revise Scheme.
• LAO* and Extensions
• RTDP and Extensions
• Other uses of Heuristics/Bounds • Heuristic Design
154
![Page 154: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/154.jpg)
A General Scheme for Heuristic Search in MDPs
• Two (over)simplified intuitions – Focus on states in greedy policy wrt V rooted at s0
– Focus on states with residual > ²
• Find & Revise: – repeat
• find a state that satisfies the two properties above
• perform a Bellman backup
– until no such state remains
155
![Page 155: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/155.jpg)
FIND & REVISE [Bonet&Geffner 03a]
• Convergence to V* is guaranteed
– if heuristic function is admissible
– ~no state gets starved in 1 FIND steps
156
(perform Bellman backups)
![Page 156: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/156.jpg)
F&R and Monotonicity
• Vk ≤p V* ⇒ Vn ≤p V* (Vn monotonic from below)
– If h is admissible: V0 = h(s) ·p V*
) Vn ·p V* (8n)
Q*(s,a1) < Q(s,a2) < Q*(s,a2) aaaa
a2 can’t be optimal aaaa 157
s Q(s,a1)=5
.
.
Q(s, a2)=10
.
.
All values < V*, Q* All values = V*, Q*
![Page 157: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/157.jpg)
Heuristic Search Algorithms
• Definitions
• Find & Revise Scheme.
• LAO* and Extensions
• RTDP and Extensions
• Other uses of Heuristics/Bounds • Heuristic Design
158
![Page 158: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/158.jpg)
159
LAO* family
add s0 to the fringe and to greedy policy graph
repeat FIND: expand some states on the fringe (in greedy graph) initialize all new states by their heuristic value choose a subset of affected states perform some REVISE computations on this subset recompute the greedy graph
until greedy graph has no fringe & residuals in greedy graph small
output the greedy graph as the final policy
![Page 159: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/159.jpg)
160
LAO* [Hansen&Zilberstein 98]
add s0 to the fringe and to greedy policy graph
repeat FIND: expand best state s on the fringe (in greedy graph) initialize all new states by their heuristic value subset = all states in expanded graph that can reach s perform PI on this subset recompute the greedy graph
until greedy graph has no fringe & residuals in greedy graph small
output the greedy graph as the final policy
![Page 160: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/160.jpg)
161
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
LAO*
add s0 in the fringe and in greedy graph
s0 V(s0) = h(s0)
![Page 161: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/161.jpg)
162
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
LAO*
s0 V(s0) = h(s0)
FIND: expand some states on the fringe (in greedy graph)
![Page 162: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/162.jpg)
163
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
LAO*
FIND: expand some states on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform PI on this subset
s0
s1 s2 s3 s4
V(s0)
h h h h
![Page 163: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/163.jpg)
164
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
LAO*
FIND: expand some states on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform PI on this subset
recompute the greedy graph
s0
s1 s2 s3 s4
V(s0)
h h h h
![Page 164: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/164.jpg)
165
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
LAO*
s0
s1 s2 s3 s4
s6 s7
FIND: expand some states on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform PI on this subset
recompute the greedy graph
h h h h
h h
V(s0)
![Page 165: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/165.jpg)
166
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
LAO*
s0
s1 s2 s3 s4
s6 s7
FIND: expand some states on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform PI on this subset
recompute the greedy graph
h h h h
h h
V(s0)
![Page 166: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/166.jpg)
167
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
LAO*
s0
s1 s2 s3 s4
s6 s7
FIND: expand some states on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform PI on this subset
recompute the greedy graph
h h V h
h h
V
![Page 167: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/167.jpg)
168
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
LAO*
s0
s1 s2 s3 s4
s6 s7
FIND: expand some states on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform PI on this subset
recompute the greedy graph
h h V h
h h
V
![Page 168: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/168.jpg)
169
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
LAO*
s0
Sg
s1 s2 s3 s4
s5 s6 s7
FIND: expand some states on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform PI on this subset
recompute the greedy graph
h h V h
h h
V
V
h 0
![Page 169: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/169.jpg)
170
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
LAO*
s0
Sg
s1 s2 s3 s4
s5 s6 s7
FIND: expand some states on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform PI on this subset
recompute the greedy graph
h h V h
h h
V
V
h 0
![Page 170: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/170.jpg)
171
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
LAO*
s0
Sg
s1 s2 s3 s4
s5 s6 s7
FIND: expand some states on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform PI on this subset
recompute the greedy graph
V h V h
h h
V
V
h 0
![Page 171: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/171.jpg)
172
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
LAO*
s0
Sg
s1 s2 s3 s4
s5 s6 s7
FIND: expand some states on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform PI on this subset
recompute the greedy graph
V h V h
h h
V
V
h 0
![Page 172: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/172.jpg)
173
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
LAO*
s0
Sg
s1 s2 s3 s4
s5 s6 s7
FIND: expand some states on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform PI on this subset
recompute the greedy graph
V V V h
h h
V
V
h 0
![Page 173: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/173.jpg)
174
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
LAO*
s0
Sg
s1 s2 s3 s4
s5 s6 s7
FIND: expand some states on the fringe (in greedy graph)
initialize all new states by their heuristic value
subset = all states in expanded graph that can reach s
perform PI on this subset
recompute the greedy graph
V V V h
h h
V
V
h 0
![Page 174: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/174.jpg)
175
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
LAO*
s0
Sg
s1 s2 s3 s4
s5 s6 s7
output the greedy graph as the final policy
V V V h
V h
V
V
h 0
![Page 175: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/175.jpg)
176
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
LAO*
s0
Sg
s1 s2 s3 s4
s5 s6 s7
output the greedy graph as the final policy
V V V h
V h
V
V
h 0
![Page 176: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/176.jpg)
177
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
LAO*
s0
Sg
s1 s2 s3 s4
s5 s6 s7
s4 was never expanded s8 was never touched
V V V h
V h
V
V
h 0 s8
![Page 177: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/177.jpg)
178
LAO* [Hansen&Zilberstein 98]
add s0 to the fringe and to greedy policy graph
repeat FIND: expand best state s on the fringe (in greedy graph) initialize all new states by their heuristic value subset = all states in expanded graph that can reach s perform PI on this subset recompute the greedy graph
until greedy graph has no fringe output the greedy graph as the final policy
one expansion
lot of computation
![Page 178: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/178.jpg)
179
Optimizations in LAO*
add s0 to the fringe and to greedy policy graph
repeat FIND: expand best state s on the fringe (in greedy graph) initialize all new states by their heuristic value subset = all states in expanded graph that can reach s VI iterations until greedy graph changes (or low residuals) recompute the greedy graph
until greedy graph has no fringe output the greedy graph as the final policy
![Page 179: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/179.jpg)
180
Optimizations in LAO*
add s0 to the fringe and to greedy policy graph
repeat FIND: expand all states in greedy fringe initialize all new states by their heuristic value subset = all states in expanded graph that can reach s VI iterations until greedy graph changes (or low residuals) recompute the greedy graph
until greedy graph has no fringe output the greedy graph as the final policy
![Page 180: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/180.jpg)
181
iLAO* [Hansen&Zilberstein 01]
add s0 to the fringe and to greedy policy graph
repeat FIND: expand all states in greedy fringe initialize all new states by their heuristic value subset = all states in expanded graph that can reach s only one backup per state in greedy graph recompute the greedy graph
until greedy graph has no fringe output the greedy graph as the final policy
in what order? (fringe start) DFS postorder
![Page 181: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/181.jpg)
• LAO* may spend huge time until a goal is found
– guided only by s0 and heuristic
• LAO* in the reverse graph
– guided only by goal and heuristic
• Properties
– Works when 1 or handful of goal states
– May help in domains with small fan in
183
Reverse LAO* [Dai&Goldsmith 06]
![Page 182: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/182.jpg)
• Go in both directions from start state and goal
• Stop when a bridge is found
184
Bidirectional LAO* [Dai&Goldsmith 06]
![Page 183: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/183.jpg)
regular graph
soln:(shortest) path
A*
acyclic AND/OR graph
soln:(expected shortest)
acyclic graph
AO* [Nilsson’71]
cyclic AND/OR graph
soln:(expected shortest)
cyclic graph
LAO* [Hansen&Zil.’98]
All algorithms able to make effective use of reachability information!
A* LAO*
![Page 184: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/184.jpg)
186
AO* for Acyclic MDPs [Nilsson 71]
add s0 to the fringe and to greedy policy graph
repeat FIND: expand best state s on the fringe (in greedy graph) initialize all new states by their heuristic value subset = all states in expanded graph that can reach s a single backup pass from fringe states to start state recompute the greedy graph
until greedy graph has no fringe output the greedy graph as the final policy
![Page 185: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/185.jpg)
Heuristic Search Algorithms
• Definitions
• Find & Revise Scheme.
• LAO* and Extensions
• RTDP and Extensions
• Other uses of Heuristics/Bounds • Heuristic Design
187
![Page 186: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/186.jpg)
Real Time Dynamic Programming [Barto et al 95]
• Original Motivation – agent acting in the real world
• Trial – simulate greedy policy starting from start state;
– perform Bellman backup on visited states
– stop when you hit the goal
• RTDP: repeat trials forever – Converges in the limit #trials ! 1
188
![Page 187: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/187.jpg)
Trial
189
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
![Page 188: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/188.jpg)
Trial
190
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
h h h h
V
start at start state
repeat
perform a Bellman backup
simulate greedy action
![Page 189: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/189.jpg)
Trial
191
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
h h h h
V
start at start state
repeat
perform a Bellman backup
simulate greedy action
h h
![Page 190: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/190.jpg)
Trial
192
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
h h V h
V
start at start state
repeat
perform a Bellman backup
simulate greedy action
h h
![Page 191: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/191.jpg)
Trial
193
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
h h V h
V
start at start state
repeat
perform a Bellman backup
simulate greedy action
h h
![Page 192: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/192.jpg)
Trial
194
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
h h V h
V
start at start state
repeat
perform a Bellman backup
simulate greedy action
V h
![Page 193: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/193.jpg)
Trial
195
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
h h V h
V
start at start state
repeat
perform a Bellman backup
simulate greedy action
until hit the goal
V h
![Page 194: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/194.jpg)
Trial
196
s0
Sg
s1 s2 s3 s4
s5 s6 s7 s8
h h V h
V
start at start state
repeat
perform a Bellman backup
simulate greedy action
until hit the goal
V h
Backup all states on trajectory
RTDP
repeat forever
![Page 195: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/195.jpg)
Real Time Dynamic Programming [Barto et al 95]
• Original Motivation – agent acting in the real world
• Trial – simulate greedy policy starting from start state;
– perform Bellman backup on visited states
– stop when you hit the goal
• RTDP: repeat trials forever – Converges in the limit #trials ! 1
197
No termination condition!
![Page 196: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/196.jpg)
RTDP Family of Algorithms
repeat s à s0
repeat //trials REVISE s; identify agreedy
FIND: pick s’ s.t. T(s, agreedy, s’) > 0 s à s’ until s 2 G until termination test 198
![Page 197: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/197.jpg)
• Admissible heuristic & monotonicity
⇒ V(s) · V*(s)
⇒ Q(s,a) · Q*(s,a)
• Label a state s as solved
– if V(s) has converged
best action
ResV(s) < ²
) V(s) won’t change! label s as solved
sg s
![Page 198: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/198.jpg)
Labeling (contd)
200
best action
ResV(s) < ²
s' already solved ) V(s) won’t change! label s as solved
sg s
s'
![Page 199: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/199.jpg)
Labeling (contd)
201
best action
ResV(s) < ²
s' already solved
) V(s) won’t change!
label s as solved
sg s
s'
best action
ResV(s) < ²
ResV(s’) < ²
V(s), V(s’) won’t change! label s, s’ as solved
sg s
s' best action
![Page 200: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/200.jpg)
Labeled RTDP [Bonet&Geffner 03b]
repeat s à s0 label all goal states as solved
repeat //trials REVISE s; identify agreedy
FIND: sample s’ from T(s, agreedy, s’) s à s’ until s is solved
for all states s in the trial try to label s as solved until s0 is solved
202
![Page 201: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/201.jpg)
• terminates in finite time
– due to labeling procedure
• anytime
– focuses attention on more probable states
• fast convergence
– focuses attention on unconverged states
203
LRTDP
![Page 202: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/202.jpg)
Picking a Successor Take 2
• Labeled RTDP/RTDP: sample s’ / T(s, agreedy, s’)
– Adv: more probable states are explored first
– Labeling Adv: no time wasted on converged states
– Disadv: labeling is a hard constraint
– Disadv: sampling ignores “amount” of convergence
• If we knew how much V(s) is expected to change?
– sample s’ / expected change
204
![Page 203: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/203.jpg)
Upper Bounds in SSPs
• RTDP/LAO* maintain lower bounds
– call it Vl
• Additionally associate upper bound with s
– Vu(s) ¸ V*(s)
• Define gap(s) = Vu(s) – Vl(s)
– low gap(s): more converged a state
– high gap(s): more expected change in its value
205
![Page 204: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/204.jpg)
Backups on Bounds
• Recall monotonicity
• Backups on lower bound – continue to be lower bounds
• Backups on upper bound – continues to be upper bounds
• Intuitively – Vl will increase to converge to V* – Vu will decrease to converge to V*
206
![Page 205: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/205.jpg)
Bounded RTDP [McMahan et al 05]
repeat s à s0 repeat //trials identify agreedy based on Vl
FIND: sample s’ / T(s, agreedy, s’).gap(s’) s à s’ until gap(s) < ²
for all states s in trial in reverse order REVISE s
until gap(s0) < ²
207
![Page 206: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/206.jpg)
Focused RTDP [Smith&Simmons 06]
• Similar to Bounded RTDP except – a more sophisticated definition of priority that
combines gap and prob. of reaching the state
– adaptively increasing the max-trial length
208
![Page 207: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/207.jpg)
Picking a Successor Take 3
[Slide adapted from Scott Sanner] 209
Q(s,a1) Q(s,a2)
Q(s,a2) Q(s,a1)
Q(s,a2) Q(s,a1)
Q(s,a2) Q(s,a1)
![Page 208: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/208.jpg)
• What is the expected value of knowing V(s’)
• Estimates EVPI(s’)
– using Bayesian updates
– picks s’ with maximum EVPI
210
Value of Perfect Information RTDP [Sanner et al 09]
![Page 209: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/209.jpg)
Heuristic Search Algorithms
• Definitions
• Find & Revise Scheme.
• LAO* and Extensions
• RTDP and Extensions
• Other uses of Heuristics/Bounds • Heuristic Design
211
![Page 210: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/210.jpg)
Action Elimination
If Ql(s,a1) > Vu(s) then a1 cannot be optimal for s.
212
Q(s,a1) Q(s,a2)
![Page 211: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/211.jpg)
Topological VI [Dai&Goldsmith 07]
• Identify strongly-connected components
• Perform topological sort of partitions
• Backup partitions to ²-consistency: reverse top. order
213
s0
s2
s1
sg Pr=0.6
a00 s4
s3
Pr=0.4 a01
a21 a1
a20 a40
C=5 a41
a3 C=2
![Page 212: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/212.jpg)
Topological VI [Dai&Goldsmith 07]
• Identify strongly-connected components
• Perform topological sort of partitions
• Backup partitions to ²-consistency: reverse top. order
214
s0
s2
s1
sg Pr=0.6
a00 s4
s3
Pr=0.4 a01
a21 a1
a20 a40
C=5 a41
a3 C=2
![Page 213: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/213.jpg)
Focused Topological VI [Dai et al 09]
• Topological VI
– hopes there are many small connected components
– can‘t handle reversible domains…
• FTVI
– initializes Vl and Vu
– LAO*-style iterations to update Vl and Vu
– eliminates actions using action-elimination
– Runs TVI on the resulting graph
215
![Page 214: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/214.jpg)
Factors Affecting Heuristic Search
• Quality of heuristic
• #Goal states
• Search Depth
216
![Page 215: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/215.jpg)
One Set of Experiments [Dai et al 09]
217
What if the number of reachable states is large?
![Page 216: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/216.jpg)
Heuristic Search Algorithms
• Definitions
• Find & Revise Scheme.
• LAO* and Extensions
• RTDP and Extensions
• Other uses of Heuristics/Bounds • Heuristic Design
218
![Page 217: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/217.jpg)
Admissible Heuristics
• Basic idea
– Relax probabilistic domain to deterministic domain
– Use heuristics(classical planning)
• All-outcome Determinization
– For each outcome create a different action
• Admissible Heuristics
– Cheapest cost solution for determinized domain
– Classical heuristics over determinized domain
219
s1 s
s2
a
s1 s
s2
a1
a2
![Page 218: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/218.jpg)
Summary of Heuristic Search
• Definitions
• Find & Revise Scheme – General scheme for heuristic search
• LAO* and Extensions – LAO*, iLAO*, RLAO*, BLAO*
• RTDP and Extensions – RTDP, LRTDP, BRTDP, FRTDP, VPI-RTDP
• Other uses of Heuristics/Bounds – Action Elimination, FTVI
• Heuristic Design – Determinization-based heuristics
220
![Page 219: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/219.jpg)
A QUICK DETOUR
221
TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAAAAAA
![Page 220: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/220.jpg)
Domains with Deadends
• Dead-end state – a state from which goal is unreachable
• Common in real-world – rover
– traffic
– exploding blocksworld!
• SSP/SSPs0 do not model such domains – assumption of at-least one proper policy
222
![Page 221: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/221.jpg)
Modeling Deadends
• How should we model dead-end states? – V(s) is undefined for deadends
) VI does not converge!!
• Proposal 1 – Add a penalty of reaching the dead-end state = P
• Is everything well-formed?
• Are there any issues with the model?
223
![Page 222: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/222.jpg)
Simple Dead-end Penalty P
• V*(s) = ²(P+1) + ².0 + (1-²).P
= P + ²
224
d s
sg
a Pr=1-²
Pr=²
C=²(P+1)
V(non-deadend) > P
![Page 223: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/223.jpg)
Proposal 2
• fSSPDE: Finite-Penalty SSP with Deadends
• Agent allowed to stop at any state – by paying a price = penalty P
• Equivalent to SSP with special astop action – applicable in each state
– leads directly to goal by paying cost P
• SSP = fSSPDE
225
V ¤(s) = min
ÃP;min
a2A
X
s02ST (s; a; s0)C(s; a; s0) + V ¤(s0)]
!
![Page 224: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/224.jpg)
fSSPDE Algorithms
• All SSP algorithms applicable…
– PI works for all domains
• Initial proper policy: (all states: astop)
– Other algorithms also work.
• Efficiency: unknown so far…
– Efficiency hit due to presence of deadends
– Efficiency hit due to magnitude of P
– Efficiency hit due to change of topology (e.g., TVI)
226
![Page 225: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/225.jpg)
SSPs0 with Dead-ends
• SSPADE: SSP with Avoidable Dead-ends [Kolobov et al 12] – dead-ends can be avoided from s0
– there exists a proper (partial) policy rooted at s0
• Heuristic Search Algorithms – LAO*: may not converge
• V(dead-ends) will get unbounded: VI may not converge
– iLAO*: will converge • only 1 backup ) greedy policy will exit dead-ends
– RTDP/LRTDP: may not converge • once stuck in dead-end won’t reach the goal • add max #steps in a trial… how many? adaptive?
227
![Page 226: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/226.jpg)
Unavoidable Dead-ends
• fSSPUDE: Finite-Penalty SSP with Unavoidable
Dead-Ends [Kolobov et al 12] – same as fSSPDE but now with a start state
• Same transformation applies
– add an astop action from every state
• SSPs0 = fSSPUDE
228
![Page 227: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/227.jpg)
Outline of the Tutorial
• Introduction
• Fundamentals of MDPs
• Uninformed Algorithms
• Heuristic Search Algorithms
• Approximation Algorithms
• Extension of MDPs
229
(10 mins)
(1+ hr)
(1 hr)
(1 hr)
(1+ hr)
(remaining time)
![Page 228: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/228.jpg)
APPROXIMATION ALGORITHMS
230
TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAAAAAA
![Page 229: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/229.jpg)
Motivation
• Even π* closed wr.t. s0 is often too large to fit in memory…
• … and/or too slow to compute …
• … for MDPs with complicated characteristics – Large branching factors/high-entropy transition function
– Large distance to goal
– Etc.
• Must sacrifice optimality to get a “good enough” solution
231
![Page 230: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/230.jpg)
Overview
232
Determinization-based techniques
Monte-Carlo planning
Heuristic search with inadmissible
heuristics
Hybridized planning
Hierarchical planning
Dimensionality reduction
Offline Online
![Page 231: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/231.jpg)
Overview
• Not a “golden standard” classification
– In some aspects, arguable
– Others possible, e.g., optimal in the limit vs. suboptimal in the limit
• All techniques assume factored fSSPUDE MDPs (SSPs0 MDPs with a finite dead-end penalty)
• Approaches differ in the quality aspect they sacrifice
– Probability of reaching the goal
– Expected cost of reaching the goal
– Both
233
![Page 232: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/232.jpg)
Approximation Algorithms
Overview
• Online Algorithms – Determinization-based Algorithms
– Monte-Carlo Planning
• Offline Algorithms – Heuristic Search with Inadmissible Heuristics
– Dimensionality Reduction
– Hierarchical Planning
– Hybridized Planning
234
![Page 233: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/233.jpg)
Online Algorithms: Motivation
• Defining characteristics:
– Planning + execution are interleaved
– Little time to plan • Need to be fast!
– Worthwhile to compute policy only for visited states • Would be wasteful for all
states
235
![Page 234: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/234.jpg)
Determinization-based Techniques
• A way to get a quick’n’dirty solution:
– Turn the MDP into a classical planning problem
– Classical planners are very fast
• Main idea:
1. Compile MDP into its determinization
2. Generate plans in the determinization
3. Use the plans to choose an action in the curr. state
4. Execute, repeat
236
![Page 235: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/235.jpg)
All-Outcome Determinization
Each outcome of each probabilistic action separate action
237
P = 9/10
P = 1/10
![Page 236: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/236.jpg)
Most-Likely-Outcome Determinization
238
P = 4/10 G e t H
P = 6/10
![Page 237: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/237.jpg)
FF-Replan: Overview & Example 1) Find a goal plan in
a determinization
239
2) Try executing it
in the original MDP
3) Replan&repeat if
unexpected outcome
Yoon, Fern, Givan, 2007
![Page 238: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/238.jpg)
FF-Replan: Details
• Uses either the AO or the MLO determinization
– MLO is smaller/easier to solve, but misses possible plans
– AO contains all possible plans, but bigger/harder to solve
• Uses the FF planner to solve the determinization
– Super fast
– Other fast planners, e.g., LAMA, possible
• Does not cache computed plans
– Recomputes the plan in the 3rd step in the example
240
![Page 239: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/239.jpg)
FF-Replan: Theoretical Properties
• Optimizes the MAXPROB criterion – PG of reaching the goal – In SSPs, this is always 1.0 – FF-Replan always tries to avoid cycles!
– Super-efficient on SSPs w/o dead ends
– Largely ignores expected cost
• Ignores probability of deviation from the found plan – Results in long-winded paths to the goal
– Troubled by probabilistically interesting MDPs [Little, Thiebaux, 2007] • There, an unexpected outcome may lead to catastrophic consequences
• In particular, breaks down in the presence of dead ends – Originally designed for MDPs without them
241
![Page 240: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/240.jpg)
g
FF-Replan and Dead Ends
242
Deterministic plan: Its possible execution:
b
![Page 241: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/241.jpg)
Putting “Probabilistic” Back Into Planning
• FF-Replan is oblivious to probabilities
– Its main undoing
– How do we take them into account?
• Sample determinizations probabilistically!
– Hopefully, probabilistically unlikely plans will be rarely found
• Basic idea behind FF-Hindsight
243
![Page 242: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/242.jpg)
FF-Hindsight: Overview (Estimating Q-Value, Q(s,a))
1. For Each Action A, Draw Future Samples
2. Solve Time-Dependent Classical Problems
3. Aggregate the solutions for each action
4. Select the action with best aggregation
S: Current State, A(S) → S’
Each Sample is a Deterministic Planning Problem
See if you have goal-reaching solutions, estimate Q(s,A)
Max A Q(s,A)
244
Slide courtesy of S. Yoon, A. Fern, R. Givan, and R. Kambhampati
![Page 243: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/243.jpg)
FF-Hindsight: Example Action
Probabilistic Outcome
Time 1
Time 2
Goal State
245
Action
State
Objective: Optimize MAXPROB criterion
Dead End
Left Outcomes are more likely
A1 A2
A1 A2 A1 A2 A1 A2 A1 A2
I
Slide courtesy of S. Yoon, A. Fern, R. Givan, and R. Kambhampati
![Page 244: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/244.jpg)
FF-Hindsight: Sampling a Future-1 Action
Probabilistic Outcome
Time 1
Time 2
Goal State
246
Action
State
Maximize Goal Achievement
Dead End A1: 1 A2: 0
Left Outcomes are more likely
A1 A2
A1 A2 A1 A2 A1 A2 A1 A2
I
Slide courtesy of S. Yoon, A. Fern, R. Givan, and R. Kambhampati
![Page 245: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/245.jpg)
FF-Hindsight: Sampling a Future-2 Action
Probabilistic Outcome
Time 1
Time 2
Goal State
247
Action
State
Maximize Goal Achievement
Dead End
Left Outcomes are more likely
A1: 2 A2: 1
A1 A2
A1 A2 A1 A2 A1 A2 A1 A2
I
Slide courtesy of S. Yoon, A. Fern, R. Givan, and R. Kambhampati
![Page 246: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/246.jpg)
FF-Hindsight: Sampling a Future-3 Action
Probabilistic Outcome
Time 1
Time 2
Goal State
248
Action
State
Maximize Goal Achievement
Dead End
Left Outcomes are more likely
A1: 2 A2: 1
A1 A2
A1 A2 A1 A2 A1 A2 A1 A2
I
![Page 247: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/247.jpg)
FF-Hindsight: Details & Theoretical Properties
• For each s, FF-Hindsight samples w L-horizon futures FL – In factored MDPs, amounts to choosing a’s outcome for each h
• Futures are solved by the FF planner – Fast, since they are much smaller than the AO determinization
• With enough futures, will find MAXPROB-optimal policy – If horizon H is large enough and a few other assumptions
• Much better than FF-Replan on MDPs with dead ends – But also slower – lots of FF invocations!
250
![Page 248: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/248.jpg)
Providing Solution Guarantees
• FF-Replan provides no solution guarantees
– May have PG = 0 on SSPs with dead ends, even if P*G > 0
– Wastes solutions: generates them, then forgets them
• FF-Hindsight provides some theoretical guarantees
– Practical implementations distinct from theory
– Wastes solutions: generates them, then forgets them
• RFF (Robust FF) provides quality guarantees in practice
– Constructs a policy tree out of deterministic plans
251
![Page 249: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/249.jpg)
RFF: Overview
252
Make sure the probability of ending up in an unknown
state is < ε
F. Teichteil-Königsbuch, U. Kuter, G. Infantes, AAMAS’10
![Page 250: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/250.jpg)
RFF: Initialization
253
S0 G
1. Generate either the AO or MLO determinization. Start with the policy graph consisting of the initial state s0 and all goal states G
![Page 251: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/251.jpg)
RFF: Finding an Initial Plan
254
S0 G
2. Run FF on the chosen determinization and add all the states along the found plan to the policy graph.
![Page 252: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/252.jpg)
RFF: Adding Alternative Outcomes
255
S0 G
3. Augment the graph with states to which other outcomes of the actions in the found plan could lead and that are not in the graph already. They are the policy graph’s fringe states.
![Page 253: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/253.jpg)
RFF: Run VI (Optional)
256
S0 G
4. Run VI to propagate heuristic values of the newly added states. This possibly changes the graph’s fringe and helps avoid dead ends!
![Page 254: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/254.jpg)
RFF: Computing Replanning Probability
257
S0 G
5. Estimate the probability P(failure) of reaching the fringe states (e.g., using Monte-Carlo sampling) from s0. This is the current partial policy’s failure probability w.r.t. s0.
If P(failure) > ε
P(failure) = ?
Else, done!
![Page 255: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/255.jpg)
RFF: Finding Plans from the Fringe
258
S0 G
6. From each of the fringe states, run FF to find a plan to reach the goal or one of the states already in the policy graph.
Go back to step 3: Adding Alternative Outcomes
![Page 256: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/256.jpg)
RFF: Details
• Can use either the AO or the MLO determinization
– Slower, but better solutions with AO
• When finding plans in st. 5, can set graph states as goals
– Or the MDP goals themselves
• Using the optional VI step is beneficial for solution quality
– Without this step, actions chosen under FF guidance
– With it – under VI guidance
– But can be expensive
259
![Page 257: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/257.jpg)
RFF: Theoretical Properties
• Fast
– FF-Replan forgets computed policies
– RFF essentially memorizes them
• When using AO determinization, guaranteed to find a policy that with P = 1 - ε will not need replanning
260
![Page 258: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/258.jpg)
Anticipatory Vs. Preemptive Planning
• FF-Hindsight and RFF use an anticipatory strategy
– Try to foresee deviations from a deterministic plan
• Can also try to use deterministic plans that will likely not be deviated from
– Main idea of HMDPP
– Implemented with a self-loop determinization
261
![Page 259: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/259.jpg)
Self-Loop Determinization
262 262
0.1
T = 0.9 C = 1 1
T = 1.0 C = 1/0.9 = 1.11
T = 1.0 C = 1/0.1 = 10
![Page 260: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/260.jpg)
Self-Loop Determinization
• Like AO determinization, but modifies action costs
– Assumes that getting “unexpected” outcome when executing a deterministic plan means staying in the current state
– In SL det, CSL(Outcome(a, i)) is the expected cost of repeating a in the MDP to get Outcome(a, i).
– Thus, CSL(Outcome(a, i)) = C(a) / T(Outcome(a, i))
• “Unlikely” deterministic plans look expensive in SL det.!
• Estimate hSL (s’) ≈ cost of the cheapest goal plan in the SL det.
263
![Page 261: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/261.jpg)
HMDPP: Overview
264
1. For Each Action A, Estimate QSL(s, a) and Qpdb(s, a) • QSL(s, a) = C(a) + ∑s’[T(s,a,s’) + hSL (s’)] • Qpdb(s, a) = C(a) + ∑s’[T(s,a,s’) + hpdb (s’)]
• hpdb (s’) helps recognize dead ends
2. Choose an action based on a combination of QSL(s, a) and Qpdb(s, a)
S: Current State, A(S) → S’
![Page 262: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/262.jpg)
Summary of Determinization Approaches
• Revolutionized SSP MDPs approximation techniques – Harnessed the speed of classical planners – Eventually, “learned” to take into account probabilities – Help optimize for a “proxy” criterion, MAXPROB
• Classical planners help by quickly finding paths to a goal – Takes “probabilistic” MDP solvers a while to find them on their own
• However… – Still almost completely disregard expect cost of a solution – Often assume uniform action costs (since many classical planners do) – So far, not useful on FH and IHDR MDPs turned into SSPs
• Reaching a goal in them is trivial, need to approximate reward more directly
– Impractical on problems with large numbers of outcomes
265
![Page 263: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/263.jpg)
Approximation Algorithms
Overview
• Online Algorithms – Determinization-based Algorithms
– Monte-Carlo Planning
• Offline Algorithms – Heuristic Search with Inadmissible Heuristics
– Dimensionality Reduction
– Hierarchical Planning
– Hybridized Planning
266
![Page 264: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/264.jpg)
Monte-Carlo Planning
• Recall the Sysadmin problem:
267
Restart(Ser2)
Restart(Ser1)
Restart(Ser3)
P(Ser1t |Restartt-1(Ser1), Ser1
t-1, Ser2t-1)
P(Ser2t |Restartt-1(Ser2), Ser1
t-1, Ser2t-1, Ser3
t-1)
P(Ser3t |Restartt-1(Ser3), Ser2
t-1, Ser3t-1)
Time t-1 Time t
T: A: R: ∑i [Seri = ↑]
![Page 265: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/265.jpg)
Monte-Carlo Planning: Motivation
• Characteristics of Sysadmin:
– FH MDP turned SSPs0 MDP • Reaching the goal is trivial, determinization approaches not really helpful
– Enormous reachable state space
– High-entropy T (2|X| outcomes per action, many likely ones) • Building determinizations can be super-expensive
• Doing Bellman backups can be super-expensive
• Try Monte-Carlo planning
– Does not manipulate T or C/R explicitly – no Bellman backups
– Relies on a world simulator – indep. of MDP description size
268
![Page 266: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/266.jpg)
UCT: A Monte-Carlo Planning Algorithm
• UCT [Kocsis & Szepesvari, 2006] computes a solution by simulating the current best policy and improving it – Similar principle as RTDP
– But action selection, value updates, and guarantees are different
• Success stories: – Go (thought impossible in ‘05, human grandmaster level at 9x9 in ‘08)
– Klondike Solitaire (wins 40% of games)
– General Game Playing Competition
– Real-Time Strategy Games
– Probabilistic Planning Competition
– The list is growing…
269
![Page 267: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/267.jpg)
Current World State
Rollout policy
Terminal (reward = 1)
1
1
1
1
At a leaf node perform a random rollout
Initially tree is single leaf
UCT Example
Slide courtesy of A. Fern 270
![Page 268: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/268.jpg)
Current World State
1
1
1
1
Must select each action at a node at least once
0
Rollout Policy
Terminal (reward = 0)
Slide courtesy of A. Fern 271
UCT Example
![Page 269: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/269.jpg)
Current World State
1
1
1
1
Must select each action at a node at least once
0
0
0
0
Slide courtesy of A. Fern 272
UCT Example
![Page 270: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/270.jpg)
Current World State
1
1
1
1
0
0
0
0
When all node actions tried once, select action according to tree policy
Tree Policy
Slide courtesy of A. Fern 273
UCT Example
Can throw away (“forget”) the states beyond the tree policy
that were visited by the
rollouts
![Page 271: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/271.jpg)
Current World State
1
1
1
1
When all node actions tried once, select action according to tree policy
0
0
0
0
Tree Policy
0
Rollout Policy
Slide courtesy of A. Fern 274
UCT Example
![Page 272: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/272.jpg)
Current World State
1
1
1
1/2
When all node actions tried once, select action according to tree policy
0
0
0
0 Tree Policy
0
0
0
0
What is an appropriate tree policy? Rollout policy?
Slide courtesy of A. Fern 275
UCT Example
![Page 273: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/273.jpg)
• Rollout policy:
– Basic UCT uses random
• Tree policy: – Q(s,a) : average reward received in current trajectories after
taking action a in state s
– n(s,a) : number of times action a taken in s
– n(s) : number of times state s encountered
),(
)(ln),(maxarg)(
asn
sncasQs aUCT
Theoretical constant that must be selected empirically in practice.
Slide courtesy of A. Fern 276
UCT Details
Exploration term
![Page 274: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/274.jpg)
Current World State
1
1
1
1/2
When all node actions tried once, select action according to tree policy
0
0
0
0 Tree Policy
0
0
0
0
a1 a2 ),(
)(ln),(maxarg)(
asn
sncasQs aUCT
Slide courtesy of A. Fern 277
UCT Example
![Page 275: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/275.jpg)
• To select an action at a state s – Build a tree using N iterations of Monte-Carlo tree search
• Default policy is uniform random up to level L • Tree policy is based on bandit rule
– Select action that maximizes Q(s,a) (note that this final action selection does not take the exploration term into account, just the Q-value estimate)
• The more simulations, the more accurate
– Guaranteed to pick suboptimal actions exponentially rarely after convergence (under some assumptions)
• Possible improvements
– Initialize the state-action pairs with a heuristic (need to pick a weight) – Think of a better-than-random rollout policy
Slide courtesy of A. Fern 278
UCT Summary & Theoretical Properties
![Page 276: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/276.jpg)
Approximation Algorithms
Overview
Online Algorithms – Determinization-based Algorithms
– Monte-Carlo Planning
• Offline Algorithms – Heuristic Search with Inadmissible Heuristics
– Dimensionality Reduction
– Hierarchical Planning
– Hybridized Planning
279
![Page 277: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/277.jpg)
Moving on to Approximate Offline Planning
• Useful when there is no time to plan as you go …
– E.g., when playing a fast-paced game
• … and not much time/space to plan in advance, either
• Like in online planning, oftern, no quality guarantees
• Some online methods (e.g., MCP) can be used offline too
280
![Page 278: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/278.jpg)
Inadmissible Heuristic Search
• Why? – May require less space than admissible heuristic search
• Sometimes, intuitive suboptimal policies are small – E.g., taking a more expensive direct flight vs a cheaper 2-leg
• Apriori, no reason to expect an arbitrary inadmissible heuristic to yield a small solution – But, empirically, those based on determinization often do
• Same algos as for admissible HS, only heuristics differ
281
![Page 279: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/279.jpg)
The FF Heuristic
• Taken directly from deterministic planning – A major component of the formidable FF planner
• Uses the all-outcome determinization of a PPDDL MDP
– But ignores the delete effects (negative literals in action outcomes) – Actions never “unachieve” literals, always make progress to goal
• hFF(s) = approximate cost of a plan from s to a goal in the delete relaxation
• Very fast due to using the delete relaxation
• Very informative
282
Hoffmann and Nebel, 2001
![Page 280: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/280.jpg)
The GOTH Heuristic
• Designed for MDPs at the start (not adapted classical)
• Motivation: would be good to estimate h(s) as cost of a non-relaxed deterministic goal plan from s
– But too expensive to call a classical planner from every s
– Instead, call from only a few s and generalize estimates to others
• Uses AO determinization and the FF planner
283
Kolobov, Mausam, Weld, 2010a
![Page 281: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/281.jpg)
GOTH Overview
284
AOdet(M)
Start running an MDP solver (e.g., LRTDP)
MDP M
State s
Policy
hGOTH (s)
GOTH
Evaluate s
Plan prec & cost
Determinize M
Plan
Run a classical planner (e.g., FF)
Regress plan SixthSense
State s
Dead End
Nogoods
![Page 282: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/282.jpg)
Regressing Trajectories
285
Plan preconditions
= 1
= 2
Precondition costs
![Page 283: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/283.jpg)
Plan Preconditions
286
![Page 284: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/284.jpg)
Nogoods
287
Nogood
Kolobov, Mausam, Weld, 2010b
![Page 285: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/285.jpg)
288
Computing Nogoods
• Machine learning algorithm
– Adaptively scheduled generate-and-test procedure
• Fast, sound
• Beyond the scope of this tutorial…
![Page 286: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/286.jpg)
Estimating State Values
• Intuition
– Each plan precondition cost is a “candidate” heuristic value
• Define hGOTH(s) as MIN of all available plan precondition values applicable in s
– If none applicable in s, run a classical planner and find some
– Amortizes the cost of classical planning across many states
289
![Page 287: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/287.jpg)
Open Questions in Inadmissible HS
• hGOTH is still much more expensive to compute than hFF…
• … but also more informative, so LRTDP+hGOTH is more space/time efficient than LRTDP+hFF on most benchmarks
• Still not clear when and why determinization-based inadmissible heuristics appear to work well – Because they guide to goals along short routes?
– Due to an experimental bias (MDPs with uniform action costs)?
• Need more research to figure it out…
290
![Page 288: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/288.jpg)
Approximation Algorithms
Overview
Online Algorithms – Determinization-based Algorithms
– Monte-Carlo Planning
• Offline Algorithms – Heuristic Search with Inadmissible Heuristics
– Dimensionality Reduction
– Hierarchical Planning
– Hybridized Planning
291
![Page 289: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/289.jpg)
Dimensionality Reduction: Motivation
• No approximate methods so far explicitly try to save space – Inadmissible HS can easily run out of memory
– MCP runs out of space unless allowed to “forget” visited states
• Dimenstionality reduction attempts to do exactly that – Insight: V* and π* are functions of ~|S| parameters (states)
– Replace it with an approximation with r << |S| params …
– … in order to save space
• How to do it? – Factored representations are crucial for this
– View V/π as functions of state variables, not states themselves!
292
![Page 290: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/290.jpg)
ReTrASE
• Largely similar to hGOTH – Uses preconditions of deterministic plan to evaluate states
• For each plan precondition p, defines a basis function
– Bp(s) = 1 iff p holds in s, ∞ otherwise
• Represents V(s) = minp wpBp(s) – Thus, the parameters are wp for each basis function – Problem boils down to learning wp – Does this with modified RTDP
• Crucial observation: # plan preconditions sufficient for representing V is typically much smaller than |S| – Because one plan precondition can hold in several states – Hence, the problem dimension is reduced!
293
Kolobov, Mausam, Weld, 2009
![Page 291: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/291.jpg)
ReTrASE Theoretical Properties
• Empirically, gives a large reduction in memory vs LRTDP
• Produces good policies (in terms of MAXPROB) when/if converges
• Not guaranteed to converge (weights may oscillate)
• No convergence detection/stopping criterion
295
![Page 292: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/292.jpg)
Approximate PI/LP: Motivation
• ReTrASE considers a very restricted type of basis functions
– Capture goal reachability information
– Not appropriate in FH and IHDR MDPs; e.g., in Sysadmin:
296
Restart(Ser2)
Restart(Ser1)
Restart(Ser3)
Time t-1 Time t
R(s) = ∑i [Seri = ↑]
A server is less likely to go down if its neighbors are up
State value ~increases with the number of
running servers!
![Page 293: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/293.jpg)
Approximate PI/LP: Motivation
• Define basis function bi(s) = 1 if Seri = ↑, 0 otherwise
• In Sysadmin (and other MDPs), good to let V(s)=∑iwi bi(s)
– A linear value function approximation
• If general, if a user gives a set B of basis functions, how do we pick w1, …, w|B| s.t. |V* - ∑iwi bi| is the smallest?
– Use API/ALP!
297
![Page 294: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/294.jpg)
Approximate Policy Iteration
• Assumes IHDR MDPs
• Reminder: Policy Iteration
– Policy evaluation
– Policy improvement
• Approximate Policy Iteration
– Policy evaluation: compute the best linear approx. of Vπ
– Policy improvement: same as for PI
298
![Page 295: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/295.jpg)
Approximate Policy Iteration
• To compute the best linear approximation, find
• Linear program in |B| variables and 2|S| constraints
• Does API converge?
– In theory, no; can oscillate if linear approx. for some policies coincide
– In practice, usually, yes
– If converges, can bound solution quality
299
A linear approximation
toVπ
Bellman backup applied to the linear approximation Vπ
Guestrin, Koller, Parr, Venkataraman, 2003
![Page 296: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/296.jpg)
Approximate Linear Programming
• Same principle as API: replace V(s) with ∑iwi bi(s) in LP
• Linear program in |B| variables and |S||A| constraints
• But wait a second…
– We have at least one constraint per state! Solution dimension is reduced, but finding solution is still at least linear in |S|!
300
![Page 297: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/297.jpg)
Making API and ALP More Efficient
• Insight: assume each b depends on at most z << |X| vars
• Then, can reformulate LPs with only O(2z) constraints
– Much smaller than O(2|X|)
• Very nontrivial…
301
![Page 298: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/298.jpg)
FPG
• Directly learns a policy, not a value function
• For each action, defines a desirability function
• Mapping from state variable values to action “quality” – Represented as a neural network
– Parameters to learn are network weights θa,1, …, θa,m for each a 302
X1 Xn … …
fa (X1, …, Xn)
θa,1 θa,2 θa,m-1 θa,m θa, …
θa, …
[Buffet and Aberdeen, 2006, 2009]
![Page 299: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/299.jpg)
FPG
• Policy (distribution over actions) is given by a softmax
• To learn the parameters:
– Run trials (similar to RTDP)
– After taking each action, compute the gradient w.r.t. weights
– Adjust weights in the direction of the gradient
– Makes actions causing expensive trajectories to be less desirable
303
![Page 300: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/300.jpg)
FPG Details & Theoretical Properties
• Can speed up by using FF to guide trajectories to the goal
• Gradient is computed approximately
• Not guaranteed to converge to the optimal policy
• Nonetheless, works well
304
![Page 301: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/301.jpg)
Approximation Algorithms
Overview
Online Algorithms – Determinization-based Algorithms
– Monte-Carlo Planning
• Offline Algorithms – Heuristic Search with Inadmissible Heuristics
– Dimensionality Reduction
– Hierarchical Planning
– Hybridized Planning
305
![Page 302: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/302.jpg)
Hierarchical Planning: Motivation
• Some MDPs are too hard to solve w/o prior knowledge
– Also, arbitrary policies for such MDPs may be hard to interpret
• Need a way to bias the planner towards “good” policies
– And to help the planner by providing guidance
• That’s what hierarchical planning does
– Given some prespecified (e.g., by the user) parts of a policy …
– … planner “fills in the details”
– Essentially, breaks up a large problem into smaller ones
306
![Page 303: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/303.jpg)
Hierarchical Planning with Options
• Suppose a robot knows precomputed policies (options) for some primitive behaviors
307
![Page 304: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/304.jpg)
Hierarchical Planning with Options
• Options are almost like actions, but their transition function needs to be computed
• Suppose you want to teach the robot how to dance
• You provide a hierarchical planner with options for the robot’s primitive behaviors
• Planner estimates the transition function and computes a policy for dancing that uses options as subroutines.
308
![Page 305: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/305.jpg)
Task Hierarchies
• The user breaks down a task into a hierarchy of subgoals
• The planner chooses which subgoals to achieve at each level, and how
– Subgoals are just hints
– Not all subgoals may be necessary to achieve the higher-level goal
309
Get into the car
Walk up to the car
Open left front door
Open right front door
Go outside
Cross the street
… …
![Page 306: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/306.jpg)
Hierarchies of Abstract Machines (HAMs)
• More general hierarchical representation
• Each machine is a finite-state automaton w/ 4 node types
• The user supplies a HAM
• The planner needs to decide what to do in choice nodes
310
Execute action a
Call another
machine H
Choose a machine from {H1, … , Hn} and execute it
Stop execution/return control to a higher-level
machine
![Page 307: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/307.jpg)
Optimality in Hierarchical Planning
• Hierarchy constraints may disallow globally optimal π*
• Next-best thing: a hierarchically optimal policy – The best policy obeying the hierarchy constraints
– Not clear how to find it efficiently
• A more practical notion: a recursively optimal policy – A policy optimal at every hierarchy level, assuming that policies at lower
hierarchy levels are fixed
– Optimization = finding optimal policy starting from lowest level
• Hierarchically optimal doesn’t imply recursively optimal, and v. v. – But hierarchically optimal is always at least as good as recursively optimal
311
![Page 308: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/308.jpg)
Learning Hierarchies
• Identifying useful subgoals
– States in “successful” and not in “unsuccessful” trajectories
– Such states are similar to landmarks
• Breaking up an MDP into smaller ones
– State abstraction (removing variables irrelevant to the subgoal)
• Still very much an open problem
312
![Page 309: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/309.jpg)
Approximation Algorithms
Overview
Online Algorithms – Determinization-based Algorithms
– Monte-Carlo Planning
• Offline Algorithms – Heuristic Search with Inadmissible Heuristics
– Dimensionality Reduction
– Hierarchical Planning
– Hybridized Planning
313
![Page 310: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/310.jpg)
Hybridized Planning: Motivation
• Sometimes, need to arrive at a provably “reasonable” (but possibly suboptimal) solution ASAP
314
Fast suboptimal planner with guarantees
Slower optimal planner
Suboptimal policy
Optimal policy (if enough time)
Hybridize!
![Page 311: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/311.jpg)
Hybridized Planning
• Hybridize MBP and LRTDP
• MBP is a non-deterministic planner
– Gives a policy guaranteed to reach the goal from everywhere
– Very fast
• LRTDP is an optimal probabilistic planner
– Amends MBP’s solution to have a good expected cost
• Optimal in the limit, produces a proper policy quickly
315
[Mausam, Bertoli, Weld, 2007]
![Page 312: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/312.jpg)
Summary
• Surveyed 6 different approximation families – Dimensionality reduction
– Monte-Carlo sampling
– Inadmissible heuristic search
– Dimensionality reduction
– Hierarchical planning
– Hybridized planning
• Sacrifice different solution quality aspects
• Lots of work to be done in each of these areas
316
![Page 313: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/313.jpg)
Outline of the Tutorial
• Introduction
• Fundamentals of MDPs
• Uninformed Algorithms
• Heuristic Search Algorithms
• Approximation Algorithms
• Extension of MDPs
317
(10 mins)
(1+ hr)
(1 hr)
(1 hr)
(1+ hr)
(remaining time)
![Page 314: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/314.jpg)
One set of techniques we didn’t cover
• Compact value function representations
• ADD-based planners – Symbolic VI (SPUDD)
– Symbolic Prioritized Sweeping
– Symbolic LAO*
– Symbolic RTDP
– Approximations (APRICODD)
• Better representations: Affine ADDs.
318
![Page 315: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/315.jpg)
Continuous State/Action MDPs
• See Scott’s Tutorial.
319
![Page 316: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/316.jpg)
Concurrent Probabilistic Temporal Planning
What action
next?
Percepts Actions
Environment
Static
Fully
Observable
Perfect
Stochastic
Durative
Concurrent
![Page 317: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/317.jpg)
Results
• MDPs with Durative Actions, No Concurrency – VI, RTDP, Incremental Contingency Planning
– Simple Temporal Nets, Piecewise linear vfs…
• MDPs with Concurrent Actions, No Time – CoMDPs
– Action Elimination, ALP, Hierarchical planning, …
• MDPs with Concurrent, Durative Actions – Generalized Semi-Markov Decision Process (GSMDP)
– Augmented state MDPs, Generate-test-debug, hybridized planning…
321
![Page 318: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/318.jpg)
Relational MDPs
• PPDDL/RDDL are first-order representations
– Algorithms ground it into propositional domains
• Relational MDPs actively use first-order structure
– First-order VI, PI, ALP
– Inductive approaches
• Generalizes to many problems
– with variable number of objects
322
![Page 319: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/319.jpg)
Well-formed MDPs beyond SSPs
• All improper policies may not have infinite cost • VI doesn’t work
– has multiple fixed points – greedy policy over optimal value may not be optimal
• Heuristic Search much trickier • Generalized SSP MDPs [Kolobov et al 11]
• Stochastic Safest & Shortest Path [Teichteil-Konigsbuch 12]
• Fun recent work… 323
S0 0
0 S1 S2 S3 S4 0
0
0 1
0
0 G
![Page 320: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/320.jpg)
Other Models
• Reinforcement Learning – model/costs unknown – Monte-Carlo planning
• Partially Observable MDP – MDP with incomplete state information – Large Continuous MDP – Lots of applications
• Multi-objective MDP • MDPs with Imprecise Probabilities • Collaborative Multi-agent MDPs • Adversarial Multi-agent MDPs
324
![Page 321: Probabilistic Planning with Markov Decision Processes · Probabilistic Planning with Markov Decision Processes Andrey Kolobov and Mausam Computer Science and Engineering University](https://reader030.fdocuments.in/reader030/viewer/2022041208/5d660c0288c993e15a8b8f45/html5/thumbnails/321.jpg)
Thanks!
325
Mausam and Andrey Kolobov “Planning with Markov Decision Processes: An AI Perspective”
Morgan and Claypool Publishers (Synthesis Lectures Series on Artificial Intelligence)