Solving POMDPs through Macro Decomposition


Page 1: Solving POMDPs through Macro Decomposition

Solving POMDPs through Macro Decomposition

Larry Bush

Tony Jimenez

Brian Bairstow

Page 2: Solving POMDPs through Macro Decomposition

Outline

• Introduction to POMDPs – Brian Bairstow

• Demonstration of POMDPs – Larry Bush

• Approximating POMDPs with Macro Actions – Tony Jimenez

Page 3: Solving POMDPs through Macro Decomposition

Introduction to POMDPs

• Introduction to POMDPs
  – Markov Decision Processes (MDPs)
  – Value Iteration
  – Partially Observable Markov Decision Processes (POMDPs)
  – Overview of Techniques

• Demonstration of POMDPs

• Approximating POMDPs with Macro Actions

Page 4: Solving POMDPs through Macro Decomposition

Navigation of a Building

• Robot
  – Knows building map
  – Wants to reach goal
  – Uncertainty in actions
  – What is the best way to get there?

Page 5: Solving POMDPs through Macro Decomposition

Markov Decision Processes

• Model
  – States, S
  – Actions, A
  – Transition probabilities, p(s,a)
  – Rewards, r(s,a)

• Process
  – Observe state s_t in S
  – Choose action a_t in A
  – Receive reward r_t = r(s_t, a_t)
  – State becomes s_t+1 according to probabilities p(s,a)

• Goal
  – Create a policy for choosing actions that maximizes the lifetime reward

• Discount Factor γ
  Value = r_0 + γ r_1 + γ² r_2 + …

Page 6: Solving POMDPs through Macro Decomposition

MDP Model Example

• States A, B, C
• Actions 1, 2
• Transition probabilities p and rewards r in diagram
• C is terminal state

[Diagram: transition graph. From A, action 1: to A (p = 0.7, r = 0), to C (p = 0.3, r = 1); action 2: to B (p = 0.6, r = 0.2), to C (p = 0.4, r = 1). From B, action 1: to A (p = 0.3, r = 0), to C (p = 0.7, r = 1); action 2: to B (p = 0.5, r = 0.2), to C (p = 0.5, r = 1).]

Page 7: Solving POMDPs through Macro Decomposition

Decision Tree Representation of MDP

[Diagram: one-step decision tree rooted at A. Action 1: to A (p = 0.7, reward 0), to C (p = 0.3, reward 1), expected reward 0.3; action 2: to B (p = 0.6, reward 0.2), to C (p = 0.4, reward 1), expected reward 0.52. Max Q value: 0.52.]

Max Q Value

Page 8: Solving POMDPs through Macro Decomposition

Decision Tree Representation of MDP

[Diagram: two-step decision tree rooted at A. Each successor state is expanded with its own one-step tree, giving max Q values of 0.52 at A and 0.7 at B (C is terminal, value 0). The total future reward along each branch adds the immediate reward to the successor's max Q value: action 1: 0.7·(0 + 0.52) + 0.3·(1 + 0) = 0.664; action 2: 0.6·(0.2 + 0.7) + 0.4·(1 + 0) = 0.94. Max Q value: 0.94.]

Page 9: Solving POMDPs through Macro Decomposition

Value Iteration

• Finite Horizon, 1 step: Q_1(s,a) = r(s,a)

• Finite Horizon, n steps: Q_n(s,a) = r(s,a) + γ Σ_s' p(s'|s,a) max_a' Q_n-1(s',a')

• Infinite Horizon: Q(s,a) = r(s,a) + γ Σ_s' p(s'|s,a) max_a' Q(s',a')

• Policy: π(s) = argmax_a Q(s,a)
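To make the recursion concrete, here is a minimal Python sketch of finite-horizon value iteration on the three-state example MDP from Page 6. The transition and reward table is read off that diagram (rewards interpreted per transition, since the diagram labels them on edges); the undiscounted setting (γ = 1) and the horizon are illustrative choices that match the hand-computed decision-tree values on Pages 7 and 8.

```python
# Minimal finite-horizon value iteration on the Page 6 example MDP.
# model maps (state, action) -> list of (next_state, probability, reward).
model = {
    ('A', 1): [('A', 0.7, 0.0), ('C', 0.3, 1.0)],
    ('A', 2): [('B', 0.6, 0.2), ('C', 0.4, 1.0)],
    ('B', 1): [('A', 0.3, 0.0), ('C', 0.7, 1.0)],
    ('B', 2): [('B', 0.5, 0.2), ('C', 0.5, 1.0)],
}
states, actions, terminal = ['A', 'B', 'C'], [1, 2], {'C'}
gamma = 1.0   # undiscounted, to match the hand-computed tree values

def value_iteration(horizon):
    V = {s: 0.0 for s in states}                       # V_0(s) = 0
    Q = {}
    for _ in range(horizon):
        # Q_n(s,a) = sum_s' p(s'|s,a) * (r + gamma * V_{n-1}(s'))
        Q = {(s, a): sum(p * (r + gamma * V[s2])
                         for s2, p, r in model[(s, a)])
             for s in states if s not in terminal for a in actions}
        V = {s: (0.0 if s in terminal
                 else max(Q[(s, a)] for a in actions)) for s in states}
    return Q, V

Q, V = value_iteration(horizon=2)
print(Q[('A', 1)], Q[('A', 2)])   # ~0.664 and 0.94, matching the Page 8 tree
```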

Page 10: Solving POMDPs through Macro Decomposition

Q Reinforcement Learning

• Q can be calculated only if p(s,a) and r(s,a) are known

• Otherwise Q can be trained:
  Q_t+1(s,a) = (1 - α) Q_t(s,a) + α [R + γ max_a' Q_t(s',a')]

• Perform trial runs; get data from observations

• Keep running combinations of s and a until convergence
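A sketch of the corresponding Q-learning loop, with the learning rate α and discount γ written out explicitly; the `env` interface (reset/step) is a hypothetical stand-in for the trial runs described above.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning:
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (R + gamma * max_a' Q(s',a')).

    `env` is a hypothetical simulator with reset() -> s and step(s, a) -> (s', r, done).
    """
    Q = defaultdict(float)                      # unseen (s, a) pairs start at 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy choice, so all (s, a) combinations keep being visited
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(s, a)
            # bootstrapped target; terminal states keep Q = 0, so target reduces to r
            target = r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
    return Q
```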

Page 11: Solving POMDPs through Macro Decomposition

Partially Observable MDPs

• Can no longer directly observe the state of the system
• Instead, at each step make an observation O, and know the probability of each observation for each state, p(O|S)
• Also have initial state distribution p(s_0)

[Diagram: the POMDP as a graphical model over time steps t1, t2, t3. Hidden states S evolve under actions A via p(s,a); observable observations O are emitted from the states via p(O|S).]

Page 12: Solving POMDPs through Macro Decomposition

State Estimation

• States and Observations not independent of past observations

• Maintain a distribution of state probabilities, called a belief b:
  b_t = p(s_t | o_t, a_t, o_t-1, a_t-1, …, o_0, a_0, s_0)

[Diagram: observations o feed the State Estimator, which outputs the belief b to the Policy, which outputs the action a.]
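The state estimator is, in effect, a Bayes filter over the hidden state. A minimal sketch of the belief update it performs, assuming the transition model p(s'|s,a) and observation model p(o|s) are available as dictionaries (the names and data layout are illustrative):

```python
def update_belief(b, a, o, trans, obs):
    """One Bayes-filter step: b'(s') is proportional to p(o|s') * sum_s p(s'|s,a) * b(s).

    b     : dict state -> probability (current belief)
    trans : dict (s, a) -> dict of next-state probabilities
    obs   : dict s -> dict of observation probabilities
    """
    # prediction step: push the belief through the transition model
    b_pred = {}
    for s, p_s in b.items():
        for s2, p in trans[(s, a)].items():
            b_pred[s2] = b_pred.get(s2, 0.0) + p_s * p
    # correction step: weight by the likelihood of the observation
    b_new = {s2: obs[s2].get(o, 0.0) * p for s2, p in b_pred.items()}
    z = sum(b_new.values())
    # fall back to the prediction if the observation has zero likelihood under the model
    return {s2: p / z for s2, p in b_new.items()} if z > 0 else b_pred
```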

Page 13: Solving POMDPs through Macro Decomposition

Value Iteration with POMDPs

[Plot: Q(b,a) versus b(s2) on [0, 1]. Q(a1) = Q(s1,a1)·(1 - b(s2)) + Q(s2,a1)·b(s2); Q(a2) is the corresponding line for action a2.]

• Finite horizon value function is piecewise linear and convex
  – Depending on the value of b(s2), a different action is optimal

• Here is a one-step finite horizon problem
  – Need to consider effect on future belief states
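A small numerical sketch of the picture above: for a two-state belief parameterized by b(s2), each action's value is linear in b(s2), and the one-step value function is the max over those lines (hence piecewise linear and convex). The Q(s,a) numbers below are placeholders, not values from the slides.

```python
# One-step POMDP value over a two-state belief, parameterized by b2 = b(s2).
# Q(b, a) = Q(s1, a) * (1 - b2) + Q(s2, a) * b2 is linear in b2 for each action,
# so max_a Q(b, a) is piecewise linear and convex in b2.

Q = {('s1', 'a1'): 1.0, ('s2', 'a1'): 0.2,   # placeholder state-action values
     ('s1', 'a2'): 0.3, ('s2', 'a2'): 0.9}

def q_belief(b2, a):
    return Q[('s1', a)] * (1 - b2) + Q[('s2', a)] * b2

for b2 in [0.0, 0.25, 0.5, 0.75, 1.0]:
    best = max(['a1', 'a2'], key=lambda a: q_belief(b2, a))
    print(f"b(s2)={b2:.2f}  Q(a1)={q_belief(b2, 'a1'):.2f}  "
          f"Q(a2)={q_belief(b2, 'a2'):.2f}  best={best}")
```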

Page 14: Solving POMDPs through Macro Decomposition

Decision Tree Representation of POMDP

[Diagram: belief-space decision tree. The root belief b1 branches on actions a1 and a2; each action branches on observations o1 and o2, giving successor beliefs b211, b212, b221, b222, each of which branches on actions a1 and a2 again.]

Page 15: Solving POMDPs through Macro Decomposition

Approximation: Point-Based

• Concentrate on specific points in continuous belief space

• How should beliefs be chosen?
• Problems with high dimensionality

[Diagram: belief points b1 to b8 scattered through a continuous belief space.]

Page 16: Solving POMDPs through Macro Decomposition

Approximation: Monte Carlo

• Estimate values of actions for samples of belief space
• Interpolate for arbitrary values
• Computationally difficult to learn policy

[Diagram: action a1 evaluated at sampled beliefs b1, bi, bj, bk.]

Page 17: Solving POMDPs through Macro Decomposition

Approximation: Grid-Based

• Like point-based approximation, but with regular spacing
• Can use variable resolution
• Size increases with dimensionality and resolution

[Diagram: regularly spaced grid of belief points b1 to b14.]

Page 18: Solving POMDPs through Macro Decomposition

Heuristic: Maximum-Likelihood

• Assume the state is the most likely state, and act according to the corresponding MDP policy.

• Only works if this is a good assumption!

  POMDP: b = <0.08, 0.02, 0.1, 0.8>   →   MDP: b = <0, 0, 0, 1>
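A minimal sketch of this heuristic, assuming an MDP policy (state → action) has already been computed for the fully observable problem; the state names and policy are illustrative.

```python
def most_likely_state_action(belief, mdp_policy):
    """Maximum-likelihood heuristic: collapse the belief to its most likely
    state and act as the underlying MDP policy would in that state."""
    s_ml = max(belief, key=belief.get)   # e.g. <0.08, 0.02, 0.1, 0.8> -> fourth state
    return mdp_policy[s_ml]

# Hypothetical example:
belief = {'s1': 0.08, 's2': 0.02, 's3': 0.1, 's4': 0.8}
mdp_policy = {'s1': 'up', 's2': 'up', 's3': 'right', 's4': 'right'}
print(most_likely_state_action(belief, mdp_policy))   # 'right'
```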

Page 19: Solving POMDPs through Macro Decomposition

Heuristic: Information Gathering

• MDP is easy to solve

• Valuable to learn about the state

b = < 1, 0, 0, 0 >        b = < 0.1, 0.2, 0.4, 0.3 >

Page 20: Solving POMDPs through Macro Decomposition

Heuristic: Hierarchical Method

• Split into a top-level POMDP and low-level POMDPs
• Smaller state space reduces amount of computation

[Diagram: the belief space partitioned into a POMDP of POMDPs.]

Page 21: Solving POMDPs through Macro Decomposition

Heuristic: Policy Search

• Reinforcement learning on policy instead of value

• Value and Policy Search
  – Combine value and policy reinforcement learning into a single update rule

Page 22: Solving POMDPs through Macro Decomposition

Outline

• Introduction to POMDPs

• Demonstration of POMDPs
  – Problem : UAV Navigation
  – MDP Visualizations
  – POMDP Visualizations

• Approximating POMDPs with Macro Actions

Page 23: Solving POMDPs through Macro Decomposition

Demonstrations*

• MATLAB

• MDP
  – Map to Problem Domain
  – Value Iteration
  – Policy Execution
  – Visualization**

• POMDP
  – Map to Problem Domain
  – Belief Space Search
  – Policy Execution
  – Visualization**

* All coding is mine except for some basic MDP value iteration functionality.

** Visualizations run in line with code.

Page 24: Solving POMDPs through Macro Decomposition

Problem : Introduction

[Visualization: "UAV Land at Airport" map showing the UAV (+), the take-off runway, and the landing runway.]

Page 25: Solving POMDPs through Macro Decomposition

Outline

• Introduction to POMDPs

• Demonstration of POMDPs
  – Problem : UAV Navigation
  – MDP Visualizations
    • Value Iteration
    • Policy Execution
  – POMDP Visualizations

• Approximating POMDPs with Macro Actions

Page 26: Solving POMDPs through Macro Decomposition

Problem : Utility Overlay

[Visualization: MDP utility overlay on the map. Legend: UAV (+), reward (#), state utility, exclusion zone. Transition model: cost 0.04 per step; moves succeed as intended with probability 0.8 and slip to each side with probability 0.1.]

Page 27: Solving POMDPs through Macro Decomposition

MDP Value Iteration : Until Convergence

[Visualization: value iteration running on the map until convergence. Legend: UAV (+), reward (#), state utility, policy. Transition model: cost 0.04 per step; intended move with probability 0.8, each side with probability 0.1.]

Page 28: Solving POMDPs through Macro Decomposition

MDP Value Iteration : Until Convergence

[Visualization: a later frame of the same value-iteration animation.]

Page 29: Solving POMDPs through Macro Decomposition

MDP Policy Execution

[Visualization: the UAV executing the converged MDP policy on the map. Legend: UAV (+), reward (#), state utility, policy. Transition model: cost 0.04 per step; intended move with probability 0.8, each side with probability 0.1.]

Page 30: Solving POMDPs through Macro Decomposition

MDP Policy Execution

[Visualization: a later frame of the MDP policy execution.]

Page 31: Solving POMDPs through Macro Decomposition

MDP Policy Execution

[Visualization: a later frame of the MDP policy execution.]

Page 32: Solving POMDPs through Macro Decomposition

MDP Policy Execution

[Visualization: a later frame of the MDP policy execution.]

Page 33: Solving POMDPs through Macro Decomposition

Outline

• Introduction to POMDPs

• Demonstration of POMDPs
  – Problem : UAV Navigation
  – MDP Visualizations
  – POMDP Visualizations
    • Belief State Space Search
    • Policy Comparison
      – 3 Step Horizon
      – 10 Step Horizon
      – AIMA (Russell & Norvig)
      – Fuel Observation

• Approximating POMDPs with Macro Actions

Page 34: Solving POMDPs through Macro Decomposition

POMDP Belief State Space Search : No Observations

[Visualization: belief state space search with a 3 step horizon. Legend: UAV (+), reward (#), state probability, total belief state reward, policy.]

Page 35: Solving POMDPs through Macro Decomposition

POMDP Belief State Space Search : No Observations

[Visualization: a later frame of the same 3-step-horizon belief state space search.]

Page 36: Solving POMDPs through Macro Decomposition

POMDP Policy Execution : No Observations

Policy (3 Iterations): < Right, Right, Right >

[Visualization: the UAV executing this policy. Legend: UAV (+), reward (#), state probability, total belief state reward.]

Page 37: Solving POMDPs through Macro Decomposition

POMDP Policy Execution : No Observations

Policy (3 Iterations): < Right, Right, Right >

[Visualization: a later frame of the same execution.]

Page 38: Solving POMDPs through Macro Decomposition

POMDP Policy Execution : No Observations

Policy (10 Iterations): < Left, Up, Up, Right, Up, Up, Right, Right, Right, Right >

[Visualization: the UAV executing this policy. Legend: UAV (+), reward (#), state probability, total belief state reward.]

Page 39: Solving POMDPs through Macro Decomposition

POMDP Policy Execution : No Observations

Policy (10 Iterations): < Left, Up, Up, Right, Up, Up, Right, Right, Right, Right >

[Visualization: a later frame of the same execution.]

Page 40: Solving POMDPs through Macro Decomposition

POMDP Policy Execution : No Observations

Policy (AIMA): < Left, Up, Up, Right, Up, Up, Right, Up, Up, Right >

[Visualization: the UAV executing the AIMA (Russell & Norvig) policy. Legend: UAV (+), reward (#), state probability, total belief state reward.]

Page 41: Solving POMDPs through Macro Decomposition

POMDP Policy Execution : No Observations

Policy (AIMA): < Left, Up, Up, Right, Up, Up, Right, Up, Up, Right >

[Visualization: a later frame of the same execution.]

Page 42: Solving POMDPs through Macro Decomposition

POMDP Policy Execution : Fuel Observation

Policy (Fuel Observation): < Up, Up, Right, Right, Right, Up, Right, Right, Right, Right >

[Visualization: the UAV executing this policy. Legend: UAV (+), reward (#), state probability, total belief state reward.]

Page 43: Solving POMDPs through Macro Decomposition

POMDP Policy Execution : Fuel Observation

Policy (Fuel Observation): < Up, Up, Right, Right, Right, Up, Right, Right, Right, Right >

[Visualization: a later frame of the same execution.]

Page 44: Solving POMDPs through Macro Decomposition

Demonstration Conclusion

• POMDPs are difficult to solve

• Efficient methods are needed

Page 45: Solving POMDPs through Macro Decomposition

Outline

• Introduction to POMDPs

• Demonstration of POMDPs

• Approximating POMDPs with Macro Actions
  – Belief Compression using Grid Approximation
  – Macro-Actions in the Semi-MDP Framework
  – Reinforcement Learning with a Model

Page 46: Solving POMDPs through Macro Decomposition

Approximation Methods

• Point-Based Approximations

• Monte Carlo Method

• Grid-Based Approximations
  – Dynamic Grid-Resolution

Page 47: Solving POMDPs through Macro Decomposition

Policy Heuristics

• Maximum-Likelihood

• Information Gathering

• Hierarchical Methods
  – Macro-Actions

• Policy Search

Page 48: Solving POMDPs through Macro Decomposition

Grid Approximation

[Diagram: the belief simplex with corners S1 (1,0,0), S2 (0,1,0), S3 (0,0,1), shown discretized at resolutions 1, 2, and 4; higher resolutions place more grid points on the simplex.]

Page 49: Solving POMDPs through Macro Decomposition

Dynamic Grid Allocation

• Allocate grid points from a uniformly spaced grid dynamically
  – Simulate trajectories of the agent through the belief space

• When experiencing a belief state:
  – Find the grid point closest to the belief state
  – Add it to the set of grid points explicitly considered

• The derived set of grid points is small and adapted to the parts of belief space typically inhabited by the agent (see the sketch below)
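A simplified sketch of the allocation step, assuming beliefs are tuples of probabilities. The mapping to the nearest grid point uses a simple scaling and largest-remainder scheme rather than a proper simplex-based discretization, and new points are initialized to 0.0 instead of being interpolated from coarser resolutions; all names are illustrative.

```python
def nearest_grid_point(belief, resolution):
    """Map a belief (tuple of probabilities) onto the regular grid
    {g : g_i = k_i / resolution, sum g_i = 1} by scaling, flooring,
    and handing out the remaining units by largest fractional part."""
    scaled = [p * resolution for p in belief]
    ints = [int(x) for x in scaled]
    remainder = resolution - sum(ints)
    order = sorted(range(len(belief)), key=lambda i: scaled[i] - ints[i], reverse=True)
    for i in order[:remainder]:
        ints[i] += 1
    return tuple(k / resolution for k in ints)

class DynamicGrid:
    """Grid points are allocated lazily, only when a simulated belief lands near them."""
    def __init__(self, resolution):
        self.resolution = resolution
        self.values = {}                        # grid point -> value estimate

    def touch(self, belief):
        g = nearest_grid_point(belief, self.resolution)
        if g not in self.values:
            self.values[g] = 0.0                # placeholder initial value
        return g
```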

Page 50: Solving POMDPs through Macro Decomposition

Macro-Actions

• Temporally Extended Actions
  – For example:
    • Move-Down-Corridor
    • Go-to-Chicago
  – Versus primitive actions, such as:
    • Turn-Right
    • One-Step-Forward

• POMDPs break down into a hierarchical structure of smaller POMDPs

Page 51: Solving POMDPs through Macro Decomposition

Macro-Actions

[Diagram: comparison of an MDP, whose actions each take a single time step, with an SMDP, whose macro-actions extend over multiple time steps.]

Page 52: Solving POMDPs through Macro Decomposition

Semi-Markov Decision Process

• Defined as a five-tuple (S, A, P, R, F)
  – S : finite set of states
  – A : set of actions
  – P : state and action transition probability function
  – R : reward function
  – F : probability of transition time for each state-action pair
    • F(s', τ | s, a) : specifies the joint probability that a transition from state s to state s' occurs after τ steps with action a

• Q-learning rule for SMDPs:
  Q_t+1(s,a) = (1 - α) Q_t(s,a) + α [R + γ^τ max_a' Q_t(s',a')]
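A sketch of that update rule in code, with the symbols written out: α is the learning rate, γ the discount, τ the number of time steps the macro-action took, and R the discounted reward accumulated while it ran. The dictionary layout is an assumption.

```python
def smdp_q_update(Q, s, mu, s_next, R, tau, macros, alpha=0.1, gamma=0.95):
    """SMDP Q-learning:
    Q(s, mu) <- (1 - alpha) * Q(s, mu) + alpha * (R + gamma**tau * max_mu' Q(s', mu')).

    R is the discounted reward accumulated over the tau steps that macro-action mu ran.
    """
    best_next = max(Q.get((s_next, m), 0.0) for m in macros)
    Q[(s, mu)] = (1 - alpha) * Q.get((s, mu), 0.0) + alpha * (R + gamma ** tau * best_next)
    return Q
```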

Page 53: Solving POMDPs through Macro Decomposition

Problem Reformulation

• Monte Carlo Updates of the Q-values for the SMDP are used to learn the policy

[Diagram: POMDP → (grid approximation) → MDP → (macro-actions) → SMDP.]

Page 54: Solving POMDPs through Macro Decomposition

Graphical Overview of the Algorithm

[Diagram: belief states b, b', b'' experienced along a trajectory, each mapped to nearby grid points g, g1, g2, g3.]

Page 55: Solving POMDPs through Macro Decomposition

Reinforcement Learning With a Model

1. Assume a current true state
  • Which is the physical true location of the agent
  • A state with b(s) > 0

2. Discretize the current belief state: b → g_i
  • g_i is the closest grid point in a regular discretization of the belief space
  • If g_i is missing, add it to the table of belief states to explicitly consider
  • Interpolate its initial value from coarser resolutions

Page 56: Solving POMDPs through Macro Decomposition

Reinforcement Learning With a Model

3. Choose the best macro-action from the current Q values
  • Random action ε% of the time
  • Interpolate over the Q values of the vertices of the sub-simplex that contains b

Page 57: Solving POMDPs through Macro Decomposition

Reinforcement Learning With a Model

4. Estimate E[R(g_i, μ) + γ^t V(b')] by sampling:
  • Sample a state s from the current grid belief state g_i
    – Choose a primitive action a according to the macro-action μ
    – Sample the next state s' and observation z from the models
    – Store the reward and update the belief state
    – Set t = t + 1, b = b', s = s' and repeat with the next primitive action until μ terminates
  • Compute the value of the resulting belief state b'
    – Interpolate over the vertices of the resulting belief sub-simplex
  • Repeat multiple times and average the estimate (see the sketch below)
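A sketch of the sampling loop in step 4, under assumed interfaces: `macro.act(belief)` returns the next primitive action (or None once the macro-action terminates), `sim.step(s, a)` samples (s', z, r) from the transition, observation, and reward models, and `value_of(belief)` stands in for the interpolation over the vertices of the containing sub-simplex.

```python
def estimate_macro_value(g_i, macro, sim, update_belief, value_of, sample_state,
                         gamma=0.95, n_samples=20):
    """Monte Carlo estimate of E[ R(g_i, mu) + gamma**t * V(b') ]  (step 4).

    g_i is the current grid belief (dict state -> probability); the helper
    functions are assumed to be supplied by the surrounding algorithm.
    """
    total = 0.0
    for _ in range(n_samples):
        b, s, R, t = dict(g_i), sample_state(g_i), 0.0, 0
        while True:
            a = macro.act(b)                      # next primitive action of the macro
            if a is None:                         # macro-action has terminated
                break
            s, z, r = sim.step(s, a)              # sample s' and observation z from the models
            R += (gamma ** t) * r                 # store the (discounted) reward
            b = update_belief(b, a, z)            # update the belief state
            t += 1                                # t = t + 1, b = b', s = s'
        total += R + (gamma ** t) * value_of(b)   # value of the resulting belief state b'
    return total / n_samples                      # average over the repeated samples
```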

Page 58: Solving POMDPs through Macro Decomposition

Reinforcement Learning With a Model

5. Update the state-action value:
  Q(g_i, μ) = (1 - α) Q(g_i, μ) + α [R + γ^t V(b')]

6. Update the state value:
  V(g_i) = max_μ Q(g_i, μ)
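The corresponding updates in code, reusing the sampled estimate from step 4; α and the table layout are assumptions.

```python
def update_grid_point(Q, V, g_key, mu, estimate, macros, alpha=0.1):
    """Step 5: Q(g_i, mu) <- (1 - alpha) * Q(g_i, mu) + alpha * estimate,
       where `estimate` is the averaged sample of R + gamma**t * V(b').
       Step 6: V(g_i) <- max over macro-actions of Q(g_i, mu)."""
    Q[(g_key, mu)] = (1 - alpha) * Q.get((g_key, mu), 0.0) + alpha * estimate
    V[g_key] = max(Q.get((g_key, m), 0.0) for m in macros)
    return Q, V
```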

Page 59: Solving POMDPs through Macro Decomposition

Reinforcement Learning With a Model

7. Execute the macro-action starting from the belief state until termination
  • Generate observations during execution
  • Set b = b' and s = s', and go to step 2

8. Repeat this learning epoch multiple times starting from the same b

Page 60: Solving POMDPs through Macro Decomposition

Tested Results

• Benefits of using dynamic grid-resolution
  – For 1068 world states and a resolution of 4, the number of grid points is:
    • 5.4 × 10^10 with regular discretization
    • ~3500 with dynamic grid-resolution
    • ~1000 with dynamic grid-resolution and macro-actions
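As a rough sanity check on the first figure: a regular discretization of resolution r over an n-state belief simplex has C(n + r - 1, r) grid points (the number of ways to split r resolution units among n coordinates), which for n = 1068 and r = 4 is about 5.4 × 10^10.

```python
from math import comb

def regular_grid_size(n_states, resolution):
    """Number of grid points whose coordinates are multiples of 1/resolution and sum to 1."""
    return comb(n_states + resolution - 1, resolution)

print(regular_grid_size(1068, 4))   # 54514365885, i.e. ~5.4e10
```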

Page 61: Solving POMDPs through Macro Decomposition

Tested Results

• Benefits of using macro-actions
  – Requires fewer training steps per episode
  – Performs better with increased grid-resolution in the dynamic grid model
  – Better quality policy overall

• Benefits of using reward-shaping
  – Successful completion of the test with macro-actions: 90~95%
  – 0~10% without reward-shaping

Page 62: Solving POMDPs through Macro Decomposition

What Is Next?

• Able to solve POMDPs with 1000+ states

• How do we generate a set of "good" macro-actions?
  – Current research is focused on the dynamic generation of macro-actions

Page 63: Solving POMDPs through Macro Decomposition

End

Page 64: Solving POMDPs through Macro Decomposition

An Extra Visualization

• This visualization (and others) was cut from the talk for brevity. However, we thought it was worthwhile to include this one in the tutorial slides.

Page 65: Solving POMDPs through Macro Decomposition

POMDP Belief State Space Search : Coastal Observation

[Visualization: belief state space search with a coastal observation and a 3 step horizon. Legend: UAV (+), reward (#), state probability, total belief state reward, policy.]

Page 66: Solving POMDPs through Macro Decomposition

POMDP Belief State Space Search : Coastal Observation

[Visualization: a later frame of the same belief state space search.]

Page 67: Solving POMDPs through Macro Decomposition

POMDP Policy Execution

Policy (Coastal Observation): < Up, Up, Right, Up, Right, Up, Right, Up, Right >

[Visualization: the UAV executing the coastal-observation policy. Legend: UAV (+), reward (#), state probability, total belief state reward.]

Page 68: Solving POMDPs through Macro Decomposition

POMDP Policy Execution

Policy (Coastal Observation): < Up, Up, Right, Up, Right, Up, Right, Up, Right >

[Visualization: a later frame of the same execution.]