
A Hybridized Planner for Stochastic Domains

Mausam and Daniel S. Weld, University of Washington, Seattle

Piergiorgio Bertoli, ITC-IRST, Trento

Planning under Uncertainty

(ICAPS’03 Workshop)

Qualitative (disjunctive) uncertainty

Which real problem can you solve?

Quantitative (probabilistic) uncertainty

Which real problem can you model?

The Quantitative View

Markov Decision Process
- models uncertainty with probabilistic outcomes
- general decision-theoretic framework
- algorithms are slow

Do we need the full power of decision theory?
Is an unconverged partial policy any good?

The Qualitative View

Conditional Planning
- models uncertainty as a logical disjunction of outcomes
- exploits classical planning techniques ⇒ FAST
- ignores probabilities ⇒ poor solutions

How bad are pure qualitative solutions?
Can we improve the qualitative policies?

HybPlan: A Hybridized Planner

- combines probabilistic + disjunctive planners
- produces good solutions in intermediate times
- anytime: makes effective use of resources
- bounds termination with a quality guarantee

Quantitative view: completes the partial probabilistic policy by using qualitative policies in some states

Qualitative view: improves the qualitative policies in the more important regions

Outline

- Motivation
- Planning with Probabilistic Uncertainty (RTDP)
- Planning with Disjunctive Uncertainty (MBP)
- Hybridizing RTDP and MBP (HybPlan)
- Experiments
- Conclusions and Future Work

Markov Decision Process

< S, A, Pr, C, s0, G >

S : a set of states

A : a set of actions

Pr : prob. transition model

C : cost model

s0 : start state

G : a set of goals

Find a policy (S → A) that minimizes the expected cost to reach a goal, over an indefinite horizon, for a fully observable Markov decision process.

The optimal cost function J* yields an optimal policy.
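
For reference, J* is the solution of the standard Bellman optimality equations for this goal-directed setting (standard background, only implicit on the slide):

    J*(s) = 0                                                      if s ∈ G
    J*(s) = min over a ∈ A of [ C(s,a) + Σ_s' Pr(s'|s,a) · J*(s') ]  otherwise

The optimal policy picks the minimizing action at each state.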

[Figure: gridworld with start state s0 and the Goal]

Example

Longer path

Wrong direction, but goal still reachable

All states are dead-ends

[Figure: example gridworld with the above regions annotated; Goal in a corner]

Optimal State Costs

[Figure: gridworld labeled with the optimal cost of each state, down to the Goal]

Optimal Policy

[Figure: the corresponding optimal policy]

Real Time Dynamic Programming
(Barto et al. '95; Bonet & Geffner '03)

Bellman backup: create a better approximation to the cost function at s

Trial = simulate the greedy policy and update the visited states

Repeat trials until the cost function converges
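
A minimal Python sketch of an RTDP trial with Bellman backups (not the authors' code; `actions`, `cost`, and `transitions` are hypothetical callbacks, with `transitions(s, a)` returning (successor, probability) pairs):

    import random

    def bellman_backup(s, J, actions, cost, transitions):
        """One Bellman backup at s; returns the greedy action.
        Q(s,a) = C(s,a) + sum over s' of Pr(s'|s,a) * J(s')."""
        best_a, best_q = None, float("inf")
        for a in actions(s):
            q = cost(s, a) + sum(p * J.get(t, 0.0) for t, p in transitions(s, a))
            if q < best_q:
                best_a, best_q = a, q
        J[s] = best_q          # better approximation to the cost function at s
        return best_a

    def rtdp_trial(s0, goals, J, actions, cost, transitions, max_steps=1000):
        """One trial: simulate the greedy policy, updating every visited state."""
        s = s0
        for _ in range(max_steps):
            if s in goals:
                break
            a = bellman_backup(s, J, actions, cost, transitions)
            r, acc = random.random(), 0.0
            for t, p in transitions(s, a):   # sample an outcome of the greedy action
                acc += p
                if r <= acc:
                    s = t
                    break

Unvisited states default to cost 0, matching the admissible all-zero initialization shown in the following slides.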

Planning with Disjunctive Uncertainty

< S, A, T, s0, G >

S : a set of states

A : a set of actions

T : disjunctive transition model

s0 : the start state

G : a set of goals

Find a strong-cyclic policy (S → A) that guarantees reaching a goal, over an indefinite horizon, for a fully observable planning problem.

Model Based Planner (Bertoli et al.)

- states, transitions, etc. are represented logically
- uncertainty = multiple possible successor states

Planning algorithm: iteratively removes "bad" states, where bad = states that don't reach anywhere or that reach other bad states
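
A toy Python rendering of this pruning idea over an explicit state graph (MBP itself works on symbolic, BDD-based representations; here `trans[s][a]` is assumed to be the set of possible successors, and `goals` a set of goal states):

    def strong_cyclic_states(states, goals, trans):
        """Fixpoint sketch: keep a state only if some action keeps all of its
        possible successors inside the kept set AND the goal stays reachable
        through kept states; everything else is 'bad' and gets removed."""
        good = set(states)
        while True:
            # states with a 'safe' action: every possible outcome stays in good
            safe = {s for s in good
                    if s in goals
                    or any(succs <= good for succs in trans.get(s, {}).values())}
            # backward reachability: which safe states can still reach a goal?
            reach = set(goals) & safe
            grew = True
            while grew:
                grew = False
                for s in safe - reach:
                    if any(succs <= safe and succs & reach
                           for succs in trans.get(s, {}).values()):
                        reach.add(s)
                        grew = True
            if reach == good:
                return good   # a strong-cyclic policy exists exactly on these states
            good = reach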

[Figure: MBP's policy on the example gridworld (a sub-optimal solution)]

Outline

- Motivation
- Planning with Probabilistic Uncertainty (RTDP)
- Planning with Disjunctive Uncertainty (MBP)
- Hybridizing RTDP and MBP (HybPlan)
- Experiments
- Conclusions and Future Work

HybPlan Top-Level Code

0. run MBP to find a solution to the goal
1. run RTDP for some time
2. compute the partial greedy policy (rtdp)
3. compute the hybridized policy (hyb) by:
   a. hyb(s) = rtdp(s) if visited(s) > threshold
   b. hyb(s) = mbp(s) otherwise
4. clean hyb by removing:
   a. dead-ends
   b. probability-1 cycles
5. evaluate hyb
6. save the best policy obtained so far

Repeat until (1) resources are exhausted, or (2) a satisfactory policy is found.
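
The same loop as schematic Python. This is a control-flow sketch, not the actual implementation: `mbp` is MBP's policy as a dict, and `run_rtdp_trials`, `greedy_policy`, `clean`, `evaluate`, and `out_of_resources` are placeholder callables for the corresponding steps, passed in so the sketch stays self-contained:

    def hybplan(s0, goals, mbp, run_rtdp_trials, greedy_policy,
                clean, evaluate, out_of_resources, threshold, target_error):
        J, visited = {}, {}                  # RTDP cost table and visit counts
        best, best_cost = None, float("inf")
        while not out_of_resources():
            run_rtdp_trials(s0, goals, J, visited)   # 1. run RTDP for some time
            rtdp = greedy_policy(J, visited)         # 2. partial greedy policy
            hyb = dict(mbp)                          # 3b. MBP's action by default ...
            for s, a in rtdp.items():
                if visited.get(s, 0) > threshold:
                    hyb[s] = a                       # 3a. ... RTDP's where well visited
            hyb = clean(hyb, goals)          # 4. drop dead-ends, break prob-1 cycles
            cost = evaluate(hyb, s0)         # 5. expected cost of hyb from s0
            if cost < best_cost:
                best, best_cost = hyb, cost  # 6. keep the best policy found so far
            if best_cost - J.get(s0, 0.0) <= target_error:
                break                        # quality guarantee met (see Error Bound)
        return best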

First RTDP Trial
1. run RTDP for some time

[Figure: gridworld with every state's cost initialized to 0]

Bellman Backup
1. run RTDP for some time

Q1(s,N) = 1 + 0.5 × 0 + 0.5 × 0 = 1
Q1(s,S) = Q1(s,W) = Q1(s,E) = 1
J1(s) = 1

Let the greedy action be North
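
In code, with the slide's numbers (action cost 1; the two possible outcomes are equally likely and both currently have value 0), the backup is just:

    # Q(s, North) = C(s, North) + 0.5 * J(s') + 0.5 * J(s'')
    q_north = 1 + 0.5 * 0 + 0.5 * 0   # = 1; S, W, E work out the same, so J(s) = 1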

Simulation of Greedy Action
1. run RTDP for some time

[Figure: the backed-up state now shows cost 1; the greedy action North is simulated and a successor is sampled]

Continuing First Trial
1. run RTDP for some time

[Figure: the trial continues from the sampled successor, backing up each visited state]


Finishing First Trial
1. run RTDP for some time

[Figure: the trial ends when it reaches the Goal]

Cost Function after First Trial
1. run RTDP for some time

[Figure: the states visited in the trial now carry updated costs; all other states remain 0]

Partial Greedy Policy
2. compute greedy policy (rtdp)

[Figure: the greedy policy, defined only at the states visited so far]
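
A short sketch of this extraction step, reusing the hypothetical `actions`, `cost`, and `transitions` conventions from the RTDP sketch above:

    def greedy_policy(J, visited, actions, cost, transitions):
        """Partial greedy policy: the greedy action at every state RTDP has
        visited; left undefined everywhere else."""
        def q(s, a):
            return cost(s, a) + sum(p * J.get(t, 0.0)
                                    for t, p in transitions(s, a))
        return {s: min(actions(s), key=lambda a: q(s, a)) for s in visited}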

Construct Hybridized Policy w/ MBP
3. compute hybridized policy (hyb)
(threshold = 0)

[Figure: RTDP's greedy actions at the visited states; MBP's actions everywhere else]

Evaluate Hybridized Policy
5. evaluate hyb
6. store hyb

After the first trial: J(hyb) = 5

[Figure: expected costs under the hybridized policy, evaluated back from the Goal]
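
One simple way to carry out the evaluation step, sketched with the same hypothetical `cost` and `transitions` callbacks (iterative policy evaluation; a linear solve would also work):

    def evaluate_policy(policy, s0, goals, cost, transitions, sweeps=1000):
        """Expected cost-to-goal of a fixed policy; the sweeps converge
        when the policy is proper (reaches the goal with probability 1)."""
        V = {s: 0.0 for s in policy}
        for _ in range(sweeps):
            for s, a in policy.items():
                if s in goals:
                    continue
                V[s] = cost(s, a) + sum(p * V.get(t, 0.0)
                                        for t, p in transitions(s, a))
        return V.get(s0, 0.0)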

Second Trial

[Figure: RTDP's second trial updates costs along a new path]

Partial Greedy Policy

[Figure: the partial greedy policy after the second trial]

Absence of MBP Policy

MBP Policy doesn’t exist!no path to goal

[Figure: a state for which MBP has no policy is crossed out; the goal is unreachable from it]

Third Trial

[Figure: the third trial updates further states]

Partial Greedy Policy

[Figure: the partial greedy policy after the third trial]

Probability 1 Cycles

repeat
    find a state s in the cycle
    hyb(s) = mbp(s)
until the cycle is broken

[Figure: the greedy policy contains a cycle it can never leave; states in the cycle are reassigned MBP's actions]
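
A Python sketch of this repair loop, assuming a hypothetical `successors(s, a)` that returns the set of possible next states; a state is "in a cycle" here exactly when it cannot reach the goal through the current policy graph:

    def break_prob1_cycles(hyb, mbp, goals, successors):
        """While some policy state cannot reach the goal under hyb (i.e. it
        sits in a probability-1 cycle), reassign it MBP's action.
        Assumes mbp's policy covers the trapped states."""
        while True:
            reach = set(goals)            # states that can reach a goal under hyb
            grew = True
            while grew:
                grew = False
                for s, a in hyb.items():
                    if s not in reach and successors(s, a) & reach:
                        reach.add(s)
                        grew = True
            trapped = [s for s in hyb if s not in reach]
            if not trapped:
                return hyb                # every cycle has been broken
            s = trapped[0]                # find a state s in a cycle ...
            hyb[s] = mbp[s]               # ... and set hyb(s) = mbp(s)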


[Figure: the hybridized policy after the first trial, with J(hyb) = 5 at the start state]

Error Bound

J*(s0) ≤ 5 (hyb is a proper policy with J(hyb)(s0) = 5)
J*(s0) ≥ 1 (RTDP's cost function is a lower bound)
⇒ Error(hyb) = 5 − 1 = 4

Termination
- when a policy with the required error bound is found
- when the planning time is exhausted
- when the available memory is exhausted

Properties
- outputs a proper policy
- anytime algorithm (once MBP terminates)
- HybPlan = RTDP, if infinite resources are available
- HybPlan = MBP, if resources are extremely limited
- HybPlan = better than both, otherwise

Outline

- Motivation
- Planning with Probabilistic Uncertainty (RTDP)
- Planning with Disjunctive Uncertainty (MBP)
- Hybridizing RTDP and MBP (HybPlan)
- Experiments
  - Anytime Properties
  - Scalability
- Conclusions and Future Work

Domains

- NASA Rover domain
- Factory domain
- Elevator domain

Anytime Properties

[Plots: solution quality as a function of running time for HybPlan vs. RTDP]

Scalability

Problem   Time before memory exhausts   J(rtdp)   J(mbp)    J(hyb)
Rov5      ~1100 sec                     55.36     67.04     48.16
Rov2      ~800 sec                      ∞         65.22     49.91
Mach9     ~1500 sec                     143.95    66.50     48.49
Mach6     ~300 sec                      ∞         71.56     71.56
Elev14    ~10000 sec                    ∞         46.49     44.48
Elev15    ~10000 sec                    ∞         233.07    87.46

Conclusions

First algorithm that integrates disjunctive and probabilistic planners.

Experiments show that HybPlan:
- is anytime
- scales better than RTDP
- produces better-quality solutions than MBP
- can interleave planning and execution

Hybridized Planning: A General Notion

Hybridize other pairs of planners:
- an optimal or close-to-optimal planner
- a sub-optimal but fast planner
to yield a planner that produces a good-quality solution in intermediate running times.

Examples
- POMDP: RTDP/PBVI with POND/MBP/BBSP
- Oversubscription Planning: A* with greedy solutions
- Concurrent MDP: sampled RTDP with single-action RTDP