
Partial Plan Development for Hierarchical POMDPs

Philipp Heeg

19. November 2013


Contents

1 Motivation and Introduction

2 Background Information
   POMDPs and FSCs
   HPOMDPs and PAFSCs
   The UCT-Algorithm

3 Approach
   Partial Plan Development for HPOMDPs
   Partial Plan Development with UCT

4 Evaluation

5 Summary


1. Motivation and Introduction

Planning with Uncertainty and Partial Observability

POMDPs can be used to model partially observable, uncertain domains, but solving them is PSPACE-complete

in Hierarchical POMDPs, expert knowledge is introduced to optimize planning

Partial Plan Development

usually, the whole plan is developed before execution is started ⇒ all eventualities have to be accounted for

Idea: alternating between partial execution and planning, so the information gained in execution can be used to guide further planning


2.1 POMDPs and FSCs

POMDP: Partially Observable Markov Decision Process

A POMDP has 7 components: S, A, O, T, Z, R, H

a finite set of states S

a finite set of actions A

a finite set of observations O

a transition function T with T(s, a, s′) ∈ [0, 1]

an observation function Z with Z(s, a, o) ∈ [0, 1]

a reward function R with R(s, a) ∈ ℝ

a horizon H ∈ ℕ

Solution: A policy that maximizes the total expected reward
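For illustration, the 7-tuple can be written down directly as a container. The following is a minimal Python sketch (not part of the talk), with the probability and reward functions stored as lookup tables.

```python
# Illustrative sketch only (not from the talk): the POMDP 7-tuple as a plain container.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class POMDP:
    S: List[str]                          # finite set of states
    A: List[str]                          # finite set of actions
    O: List[str]                          # finite set of observations
    T: Dict[Tuple[str, str, str], float]  # T(s, a, s') in [0, 1]
    Z: Dict[Tuple[str, str, str], float]  # Z(s, a, o) in [0, 1]
    R: Dict[Tuple[str, str], float]       # R(s, a), a real-valued reward
    H: int                                # finite horizon
```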


2.1 POMDPs and FSCs

FSC: Finite State Controller

A policy representation that uses internal states instead of belief states.

An FSC has 3 components: N, α, δ

a set of controller-nodes N

an action association function α with α(n) ∈ A

a transition function δ with δ(n, n′) ∈ 2^O

[Figure: example FSC — four controller-nodes with associated actions a_v, a_w, a_x, a_y, a start node, and transitions labelled with observations o1 and o2]
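Executing an FSC against an environment is a simple loop over controller-nodes. The sketch below is illustrative only; it stores δ as the equivalent lookup (node, observation) → next node, and `step` is an assumed environment interface returning an observation and a reward.

```python
# Illustrative sketch only: executing an FSC policy.
# alpha: node -> action; delta: (node, observation) -> next node.
def run_fsc(start_node, alpha, delta, step, horizon):
    node, total_reward = start_node, 0.0
    for _ in range(horizon):
        obs, reward = step(alpha[node])   # execute the action associated with the current node
        total_reward += reward
        node = delta[(node, obs)]         # follow the edge labelled with the received observation
    return total_reward
```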


2.2 HPOMDPs and PAFSCs

HPOMDP: Hierarchical POMDP

Extension to POMDPs that allows for exploitation of expert knowledge:

a new set of abstract actions Aa

a new set of abstract observations Oa


PAFSC: Partially abstract FSC

controller-nodes can be associated with either primitive or abstract actions.

abstract nodes can be decomposed using Method-FSCs (MFSCs)


2.2 HPOMDPs and PAFSCs

[Figure: Initial PAFSC with an abstract findObject node and an analyzeObject node (transition true)]


[Figure: Method-FSC for findObject — nodes moveLeft and findObject, transitions foundObject, ¬foundObject and true (decomposition findObject ⇒ MFSC)]


[Figure: the decomposition findObject ⇒ MFSC — the initial PAFSC, the Method-FSC for findObject, and the resulting decomposed PAFSC with nodes moveLeft, findObject, analyzeObject and transitions foundObject, ¬foundObject, true]
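A decomposition like the one shown above can be viewed as a splice on the controller graph. The following is a simplified, hypothetical sketch (not the thesis implementation) that assumes the Method-FSC has a single terminal node and that transitions are stored as observation → successor maps.

```python
# Illustrative sketch only: splicing a Method-FSC in place of an abstract controller-node.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Node:
    action: str
    abstract: bool = False
    transitions: Dict[str, "Node"] = field(default_factory=dict)  # observation -> successor node

def decompose(nodes, abstract_node, mfsc_start, mfsc_terminal):
    """Simplified: assumes the MFSC has exactly one terminal node."""
    # Redirect every edge that pointed at the abstract node to the MFSC's start node.
    for node in nodes:
        for obs, target in node.transitions.items():
            if target is abstract_node:
                node.transitions[obs] = mfsc_start
    # The MFSC's terminal node inherits the abstract node's outgoing edges.
    mfsc_terminal.transitions.update(abstract_node.transitions)
    # (If the abstract node was the start node, the caller would also re-point the start marker.)
```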


2.3 The UCT-Algorithm

UCT: Upper Confidence Bound for Trees

UCT is an instance of Monte Carlo Tree Search (MCTS)

MCTS builds a partial search tree by interacting with a domain simulator


A domain simulator consists of 4 components:

a set of states S with initial state s0

a set of actions A

a transition simulator T

a reward simulator R
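As a rough illustration, this simulator interface can be captured as a small protocol; the names below are assumptions, not the thesis API.

```python
# Illustrative sketch only: the generative-model interface that MCTS/UCT samples from.
from typing import List, Protocol

class DomainSimulator(Protocol):
    s0: object                                 # initial state
    def actions(self, s) -> List[object]: ...  # actions A applicable in state s
    def sample_transition(self, s, a): ...     # draws a successor state (transition simulator T)
    def reward(self, s, a) -> float: ...       # immediate reward (reward simulator R)
```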


2.3 The UCT-Algorithm

[Figure: the UCT search tree after iteration 1 — a single simulated path with total reward 2; every visited node has n(s) = 1, n(s, a) = 1 and Q_UCT(s, a) = 2]


[Figure: the UCT search tree after iteration 2 — a second branch is added with total reward −3; the root now has n(s) = 2 and the new branch carries Q_UCT(s, a) = −3]


[Figure: the UCT search tree after iteration 3 — the first branch is revisited and extended, the new simulation yields total reward −4, and the averaged value along the revisited branch drops to Q_UCT(s, a) = −1 (root: n(s) = 3)]


2.3 The UCT-Algorithm

Applying UCT to HPOMDP problems

states are reachable PAFSCs

actions are decompositions

state transitions are deterministic

rewards are generated by simulating an execution of the final primitive FSC


since the order of decompositions is irrelevant, the generation of decompositions at each node in the search tree can be limited to one controller-node
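Under this encoding a rollout is just a sequence of random decompositions followed by one simulated execution. A minimal sketch follows; all four callables are assumed interfaces, not the thesis code.

```python
# Illustrative sketch only: a random rollout in the HPOMDP planning problem described above.
# Tree states are PAFSCs, tree actions are decompositions (deterministic transitions),
# and the reward comes from simulating the fully primitive FSC at the end.
import random

def rollout(pafsc, legal_decompositions, apply_decomposition, simulate_execution):
    while legal_decompositions(pafsc):                  # PAFSC still contains abstract nodes
        d = random.choice(legal_decompositions(pafsc))
        pafsc = apply_decomposition(pafsc, d)
    return simulate_execution(pafsc)                    # total reward of the primitive FSC
```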


3.1 Partial Plan Development for HPOMDPs

Partial executability of PAFSCs

PAFSCs can be partially executed until the current controller-node is associated with an abstract action

Partial Plan Development: Alternating between an execution phase and a planning phase. Total planning time is distributed over all planning phases.

Execution phase: partially executing the current PAFSC until an abstract controller-node is reached

Planning phase: refining the current PAFSC by applying a decomposition to the abstract controller-node that was reached in the last execution phase
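Putting the two phases together gives a loop like the following sketch; the function names are placeholders, not the thesis code.

```python
# Illustrative sketch only: alternating execution and planning phases.
def partial_plan_development(pafsc, execute_until_abstract, plan_decomposition, time_for_phase):
    phase = 1
    while True:
        # Execution phase: run the PAFSC until an abstract controller-node (or the end) is reached.
        abstract_node, finished = execute_until_abstract(pafsc)
        if finished:
            return
        # Planning phase: refine exactly that node within this phase's share of the time budget.
        pafsc = plan_decomposition(pafsc, abstract_node, time_for_phase(phase))
        phase += 1
```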


3.1 Partial Plan Development for HPOMDPs

[Figure: a PAFSC with nodes n0, n1 and abstract node na1 (transitions o1, o2, true), shown at the beginning of execution phase 1, after one executed action, and after two executed actions]


3.1 Partial Plan Development for HPOMDPs

[Figure: the MFSC that was selected in planning phase 1 — nodes n2, na2 and terminal node nt1 (transitions true)]


[Figure: the MFSC selected in planning phase 1 and the resulting decomposed PAFSC (nodes n0, n1, n2, na2) at the beginning and at the end of execution phase 2]


3.1 Partial Plan Development for HPOMDPs

Goal: higher total reward with same total planning time

Disadvantage: less time to plan for earlier decompositions

Advantage: planning specifically for the controller-node that was reached in the last execution phase


3.2 Partial Plan Development with UCT

Idea: reusing the same search tree over all planning phases

in each planning phase, the latest PAFSC is used as root node

only one decomposition applied at plan extraction

as first decomposition in each planning phase, only decompositions for the abstract node that was reached in the latest execution phase are allowed
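Reusing the tree amounts to re-rooting it at the child that corresponds to the decomposition that was actually applied. A hypothetical sketch follows; the tree-node fields are assumptions, not the thesis data structures.

```python
# Illustrative sketch only: keep the UCT statistics across planning phases by re-rooting the tree.
def advance_root(tree, applied_decomposition):
    new_root = tree.root.children[applied_decomposition]  # subtree below the applied decomposition
    new_root.parent = None                                # detach it from the old root
    tree.root = new_root                                  # Q_UCT values and visit counts are retained
    return tree
```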


3.2 Partial Plan Development with UCT

[Figure: the UCT tree in the first planning phase — root C0 (Q_UCT(C0) = 6.50, n(C0) = 500) with children C1 (via d1, Q_UCT(C1) = 7.54, n(C1) = 346) and C2 (via d2, Q_UCT(C2) = 4.17, n(C2) = 154), expanded further via d3–d8 into C3–C8]


[Figure: the same tree in the second planning phase — the statistics of C0, C1 and C2 are retained and the latest PAFSC (C1) is now used as the root]


3.2 Partial Plan Development with UCT

Problem: limiting decompositions to one abstract controller-node for each tree-node

[Figure: a PAFSC fragment with abstract node na1, node n1 and abstract node na2 (transitions o1, o2)]


[Figure: a second PAFSC fragment that additionally contains abstract node na3 (transition true)]


Therefore: allow decompositions for all abstract controller-nodes for which there is a primitive path from the initial controller-node
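Collecting those candidates is a breadth-first search that stops at abstract nodes. A small sketch, assuming `is_abstract` and `successors(node)` accessors that are not part of the original planner:

```python
# Illustrative sketch only: abstract controller-nodes reachable from the initial node
# along paths that pass only through primitive nodes.
from collections import deque

def decomposable_nodes(start, is_abstract, successors):
    candidates, frontier, seen = [], deque([start]), {start}
    while frontier:
        node = frontier.popleft()
        if is_abstract(node):
            candidates.append(node)   # eligible for decomposition; do not search past it
            continue
        for nxt in successors(node):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return candidates
```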


4 Evaluation

comparing the partial planner to the non-partial planner (Christian Spath, Master's thesis), using the same total planning time

for the partial planner, the total time Z is distributed over the planning phases by a geometric series:

t(n) = (1 − q) · q^(n−1) · Z

3 different evaluation domains with several instances each
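For illustration, the schedule above can be computed directly; with q = 0.5 and Z = 100 s the phases get 50 s, 25 s, 12.5 s, …, and the infinite series sums to Z.

```python
# Illustrative sketch only: geometric distribution of the total planning time Z over phases.
def planning_time(n, q, Z):
    """Time budget of planning phase n (n = 1, 2, ...): t(n) = (1 - q) * q**(n - 1) * Z."""
    return (1.0 - q) * q ** (n - 1) * Z

print([planning_time(n, q=0.5, Z=100.0) for n in (1, 2, 3)])  # [50.0, 25.0, 12.5]
```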


4 Evaluation

Results for the Reconnaissance domain, 100 s planning time

             non-partial  q = 0.1  q = 0.3  q = 0.5  q = 0.7  q = 0.9     Ø
Instance 1      0.58       0.52     0.52     0.64     0.52     0.64     0.57
Instance 2      1.81       1.16     1.64     1.27     1.66     1.49     1.45
Instance 3      0.92       0.67     1.23     0.69     0.56     1.09     0.85
Instance 4      0.77       0.54     0.95     0.5      0.65     0.73     0.67
Instance 5      0.92       0.8      0.93     1.01     0.43     1.17     0.87
Instance 6      1.66       1.13     0.52     0.52     0.32     1.76     0.85
Instance 7      0.88       0.47     0.41     0.81     0.68     0.49     0.57
Instance 8      0.39       0.24     0.16     0.24     0.4      0.26     0.26
Instance 9      0.61       0.7      0.58     0.4      0.5      0.26     0.49
Instance 10     0.66       0.79     0.69     0.48     0.83     0.54     0.66
Øw              1.28       0.99     1.02     0.92     0.96     1.11     1


5 Summary

Partial Plan Development for Hierarchical POMDPs using the UCT-Algorithm

alternating between execution phases and planning phases

the same UCT tree is used for all planning phases, with a changing root node

the branching factor had to be increased to allow for directed plan development

therefore slightly worse performance in the Reconnaissance domain compared to the non-partial planner


2.3 The UCT-Algorithm

MCTS uses a tree policy and a rollout policy for selecting actions during simulations.

UCT uses an adapted Upper Confidence Bound algorithm as its tree policy. The selected action a* is determined by:

a* = argmax_{a ∈ A} [ Q_UCT(s, a) + c · √( ln n(s) / n(s, a) ) ]

Q_UCT(s, a) = average simulation reward when a was selected in s

n(s) = number of visits of s in previous simulations

n(s, a) = number of times a was executed in s in previous simulations
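As a concrete illustration, a standard UCB1-style tree policy (not the thesis code) can be written as follows.

```python
# Illustrative sketch only: UCB1-style action selection used as the UCT tree policy.
import math

def uct_select(actions, Q, n_s, n_sa, c=1.0):
    """a* = argmax_a [ Q(s, a) + c * sqrt(ln n(s) / n(s, a)) ]; untried actions are chosen first."""
    untried = [a for a in actions if n_sa[a] == 0]
    if untried:
        return untried[0]
    return max(actions, key=lambda a: Q[a] + c * math.sqrt(math.log(n_s) / n_sa[a]))
```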
