Solving POMDPs through Macro Decomposition


Larry Bush

Tony Jimenez

Brian Bairstow

Outline

• Introduction to POMDPs – Brian Bairstow

• Demonstration of POMDPs – Larry Bush

• Approximating POMDPs with Macro Actions – Tony Jimenez

Introduction to POMDPs

• Introduction to POMDPs
  – Markov Decision Processes (MDPs)
  – Value Iteration
  – Partially Observable Markov Decision Processes (POMDPs)
  – Overview of Techniques

• Demonstration of POMDPs

• Approximating POMDPs with Macro Actions

Navigation of a Building

• Robot
  – Knows building map
  – Wants to reach goal
  – Uncertainty in actions
  – What is the best way to get there?

Markov Decision Processes

• Model
  – States, S
  – Actions, A
  – Transition probabilities, p(s'|s,a)
  – Rewards, r(s,a)

• Process
  – Observe state st in S
  – Choose action at in A
  – Receive reward rt = r(st,at)
  – State becomes st+1 according to the probabilities p(s'|st,at)

• Goal
  – Create a policy for choosing actions that maximizes the lifetime reward

• Discount factor γ

  Value = r0 + γ·r1 + γ²·r2 + …

MDP Model Example

• States A, B, C
• Actions 1, 2
• Transition probabilities p and rewards r in the diagram
• C is the terminal state

[Diagram: transition graph over states A, B, C — each state–action pair has outcome arcs labeled with a probability p (0.3–0.7) and a reward r (0, 0.2, or 1)]

Decision Tree Representation of MDP

[Figure: one-step decision tree rooted at state A — each action (1, 2) branches over outcomes with their probabilities and rewards; the expected rewards give a max Q value of 0.52]

Decision Tree Representation of MDP

[Figure: two-step decision tree rooted at state A — the expected rewards of the second-step subtrees are added to the first-step rewards to give total future rewards, yielding a max Q value of 0.94]

Value Iteration

• Finite horizon, 1 step:
  Q1(s,a) = r(s,a)

• Finite horizon, n steps:
  Qn(s,a) = r(s,a) + γ Σs' p(s'|s,a) maxa' Qn-1(s',a')

• Infinite horizon:
  Q(s,a) = r(s,a) + γ Σs' p(s'|s,a) maxa' Q(s',a')

• Policy:
  π(s) = argmaxa Q(s,a)
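A minimal sketch of this recursion in Python (the model arrays are illustrative placeholders, not the values from the example diagram):

    # Finite-horizon value iteration: Q_n(s,a) = r(s,a) + γ Σ_s' p(s'|s,a) max_a' Q_{n-1}(s',a')
    import numpy as np

    n_states, n_actions, gamma, horizon = 3, 2, 0.9, 10

    # p[s, a, s'] = transition probability, r[s, a] = immediate reward (placeholder model)
    p = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    r = np.random.rand(n_states, n_actions)

    Q = np.zeros((n_states, n_actions))      # Q_0 = 0
    for _ in range(horizon):
        V = Q.max(axis=1)                    # V_{n-1}(s') = max_a' Q_{n-1}(s',a')
        Q = r + gamma * (p @ V)              # backup for every (s,a) at once
    policy = Q.argmax(axis=1)                # π(s) = argmax_a Q(s,a)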

Q Reinforcement Learning

• Q can be calculated only if p(s'|s,a) and r(s,a) are known

• Otherwise Q can be trained:
  Qt+1(s,a) = (1-α)Qt(s,a) + α[R + γ maxa' Qt(s',a')]

• Perform trial runs; get data from observations
• Keep running combinations of s and a until convergence
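A tabular Q-learning sketch of this update (the `env` object with `reset`/`step` is a hypothetical stand-in for the trial-run environment, not a specific library):

    import numpy as np

    def q_learning(env, n_states, n_actions, episodes=500,
                   alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # epsilon-greedy choice of action for exploration
                a = np.random.randint(n_actions) if np.random.rand() < epsilon \
                    else int(Q[s].argmax())
                s_next, R, done = env.step(a)
                # Q_{t+1}(s,a) = (1-α) Q_t(s,a) + α [R + γ max_a' Q_t(s',a')]
                Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (R + gamma * Q[s_next].max())
                s = s_next
        return Q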

Partially Observable MDP’s

• Can no longer directly observe the state of the system
• Instead, at each step make an observation O, and know the probability of each observation for each state, p(O|S)
• Also have an initial state distribution p(s0)

[Figure: the states evolve under actions via p(s'|s,a) but remain hidden; only the observations at times t1, t2, t3, generated by p(O|S), are observable]

State Estimation

• States and observations are not independent of past observations

• Maintain a distribution over state probabilities, called a belief b:
  bt = p(st | ot, at, ot-1, at-1, …, o0, a0, s0)

[Figure: a state estimator maps each observation o and the previous belief to an updated belief b; the policy maps the belief b to an action a]
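A minimal belief-update sketch for the state estimator, assuming known placeholder model arrays p_trans[s, a, s'] and p_obs[s, o]: b'(s') ∝ p(o|s') Σs p(s'|s,a) b(s).

    import numpy as np

    def update_belief(b, a, o, p_trans, p_obs):
        predicted = b @ p_trans[:, a, :]          # Σ_s p(s'|s,a) b(s), indexed by s'
        unnormalized = p_obs[:, o] * predicted    # weight each s' by the observation likelihood
        return unnormalized / unnormalized.sum()  # renormalize to a probability distribution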

Value Iteration with POMDPs

[Figure: Q(b,a) plotted against b(s2) from 0 to 1 — the lines for Q(a1) and Q(a2) cross, so the optimal action changes with the belief]

  Q(b,a1) = Q(s1,a1)·(1 - b(s2)) + Q(s2,a1)·b(s2)

• The finite-horizon value function is piecewise linear and convex
  – Depending on the value of b(s2), a different action is optimal

• Here is only a one-step finite-horizon problem
  – Need to consider the effect on future belief states
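A short sketch of the one-step case for a two-state belief (the Q_sa numbers are placeholders): each Q(b,a) is linear in b(s2), so their maximum is piecewise linear and convex.

    import numpy as np

    Q_sa = np.array([[1.0, 0.2],    # Q(s1,a1), Q(s1,a2)  (illustrative values)
                     [0.0, 0.8]])   # Q(s2,a1), Q(s2,a2)

    def one_step(b_s2):
        b = np.array([1.0 - b_s2, b_s2])   # belief over (s1, s2)
        Q_b = b @ Q_sa                     # Q(b,a) = Σ_s b(s) Q(s,a) for each action
        return float(Q_b.max()), int(Q_b.argmax())

    for b_s2 in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(b_s2, one_step(b_s2))        # the optimal action switches as b(s2) grows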

Decision Tree Representation of POMDP

[Figure: decision tree over beliefs — from belief b1, each action (a1, a2) and observation (o1, o2) leads to a successor belief (b211, b212, b221, b222), each of which branches again over actions]

Approximation: Point-Based

• Concentrate on specific points in continuous belief space

• How should beliefs be chosen?
• Problems with high dimensionality

[Figure: a handful of belief points b1–b8 scattered irregularly over the belief space]

Approximation: Monte Carlo

• Estimate values of actions for samples of belief space
• Interpolate for arbitrary values

• Computationally difficult to learn policy

[Figure: the value of action a1 at a new belief b1 is interpolated from sampled beliefs bi, bj, bk]

Approximation: Grid-Based

• Like point-based approximation, but with regular spacing
• Can use variable resolution

• Size increases with dimensionality and resolution

[Figure: belief points b1–b14 arranged on a regular grid over the belief space]

Heuristic: Maximum-Likelihood

• Assume that the state is the most likely state, and execute the corresponding MDP policy.

• Only works if this is a good assumption!

  b = < 0.08, 0.02, 0.1, 0.8 >   POMDP
  b = <   0 ,   0 ,  0 ,  1  >   MDP
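A minimal sketch of this heuristic (mdp_policy is a hypothetical precomputed MDP policy table):

    import numpy as np

    def ml_action(b, mdp_policy):
        s_ml = int(np.argmax(b))   # e.g. argmax <0.08, 0.02, 0.1, 0.8> = index 3 (the fourth state)
        return mdp_policy[s_ml]    # act as if the most likely state were the true state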

Heuristic: Information Gathering

• The MDP is easy to solve

• It is valuable to take actions that gather information about the state

  b = < 0.1, 0.2, 0.4, 0.3 >  →  b = < 1, 0, 0, 0 >

Heuristic: Hierarchical Method

• Split into top-level POMDP and low-level POMDPs
• Smaller state space reduces amount of computation

[Figure: the belief space is partitioned into regions — a "POMDP of POMDPs"]

Heuristic: Policy Search

• Reinforcement learning on policy instead of value

• Value and Policy Search
  – Combine value and policy reinforcement learning into a single update rule

Outline

• Introduction to POMDPs

• Demonstration of POMDPs
  – Problem : UAV Navigation
  – MDP Visualizations
  – POMDP Visualizations

• Approximating POMDPs with Macro Actions

Demonstrations*

• MATLAB

• MDP
  – Map to Problem Domain
  – Value Iteration
  – Policy Execution
  – Visualization**

• POMDP
  – Map to Problem Domain
  – Belief Space Search
  – Policy Execution
  – Visualization**

* All coding is mine except for some basic MDP value iteration functionality.

** Visualizations run in line with code.

Problem : Introduction — UAV Land at Airport

[Visualization: airport map showing the landing runway, the take-off runway, and the UAV position (+)]

Outline

• Introduction to POMDPs

• Demonstration of POMDPs
  – Problem : UAV Navigation
  – MDP Visualizations
    • Value Iteration
    • Policy Execution
  – POMDP Visualizations

• Approximating POMDPs with Macro Actions

Problem : Utility Overlay — MDP Utility Overlay

[Visualization: grid map with rewards (#), state utilities, the exclusion zone, and the UAV (+); transition model: cost 0.04, probabilities 0.8 / 0.1 / 0.1]

MDP Value Iteration : Until Convergence

[Visualization: successive frames of MDP value iteration on the map, showing the state utilities and policy converging; legend: UAV (+), reward (#), state utility, policy; transition model: cost 0.04, probabilities 0.8 / 0.1 / 0.1]

MDP Policy Execution

[Visualization: successive frames of the UAV (+) executing the converged MDP policy on the map; legend: reward (#), state utility, policy; transition model: cost 0.04, probabilities 0.8 / 0.1 / 0.1]

Outline

• Introduction to POMDPs

• Demonstration of POMDPs
  – Problem : UAV Navigation
  – MDP Visualizations
  – POMDP Visualizations
    • Belief State Space Search
    • Policy Comparison
      – 3 Step Horizon
      – 10 Step Horizon
      – AIMA (Russell & Norvig)
      – Fuel Observation

• Approximating POMDPs with Macro Actions

POMDP Belief State Space Search : No Observations

[Visualization: frames of the belief state space search over a 3-step horizon; legend: UAV (+), reward (#), state probability, total belief state reward, policy]

POMDP Policy Execution : No Observations — POMDP Policy (3 Iterations)

Policy: < Right, Right, Right >

[Visualization: frames of the UAV executing the 3-step policy; legend: UAV (+), reward (#), state probability, total belief state reward]

POMDP Policy Execution : No Observations — POMDP Policy (10 Iterations)

Policy: < Left, Up, Up, Right, Up, Up, Right, Right, Right, Right >

[Visualization: frames of the UAV executing the 10-step policy; legend as above]

POMDP Policy Execution : No Observations — POMDP Policy (AIMA)

Policy: < Left, Up, Up, Right, Up, Up, Right, Up, Up, Right >

[Visualization: frames of the UAV executing the AIMA (Russell & Norvig) policy; legend as above]

POMDP Policy Execution : Fuel Observation — POMDP Policy (Fuel Observation)

Policy: < Up, Up, Right, Right, Right, Up, Right, Right, Right, Right >

[Visualization: frames of the UAV executing the policy with a fuel observation; legend as above]

Demonstration Conclusion

• POMDPs are difficult to solve

• Efficient methods are needed

Outline

• Introduction to POMDPs

• Demonstration of POMDPs

• Approximating POMDPs with Macro Actions
  – Belief Compression using Grid Approximation
  – Macro-Actions in the Semi-MDP Framework
  – Reinforcement Learning with a Model

Approximation Methods

• Point-Based Approximations

• Monte Carlo Method

• Grid-Based Approximations
  – Dynamic Grid Resolution

Policy Heuristics

• Maximum-Likelihood

• Information Gathering

• Hierarchical Methods
  – Macro-Actions

• Policy Search

Grid Approximation

[Figure: the belief simplex with vertices S1 (1,0,0), S2 (0,1,0), S3 (0,0,1), discretized at resolutions 1, 2, and 4 — higher resolutions add regularly spaced grid points]

Dynamic Grid Allocation

• Allocate grid points from a uniformly spaced grid dynamically
  – Simulate trajectories of the agent through the belief space

• When experiencing a belief state:
  – Find the grid point closest to the belief state
  – Add it to the set of grid points explicitly considered

• The derived set of grid points is small and adapted to the parts of the belief space typically inhabited by the agent
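A small sketch of the allocation step, assuming a regular resolution-r grid over the belief simplex (the rounding scheme here is a simple largest-remainder illustration, not the exact discretization used in the original work):

    import numpy as np

    def nearest_grid_point(b, resolution):
        scaled = np.asarray(b) * resolution
        g = np.floor(scaled).astype(int)
        # hand out the remaining units to the largest fractional parts so g sums to `resolution`
        remainder = resolution - g.sum()
        order = np.argsort(scaled - g)[::-1]
        g[order[:remainder]] += 1
        return tuple(g)                     # integer coordinates summing to `resolution`

    grid_values = {}                        # grown only as beliefs are actually experienced
    def visit(b, resolution=4):
        gp = nearest_grid_point(b, resolution)
        grid_values.setdefault(gp, 0.0)     # add the grid point if not yet explicitly considered
        return gp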

Macro-Actions

• Temporally extended actions
  – For example:
    • Move-Down-Corridor
    • Go-to-Chicago
  – Versus primitive actions, such as:
    • Turn-Right
    • One-Step-Forward

• The POMDP breaks down into a hierarchical structure of smaller POMDPs

Macro-Actions

[Figure: with primitive actions the problem is an MDP; with temporally extended macro-actions it becomes a semi-Markov decision process (SMDP)]

Semi-Markov Decision Process

• Defined as a five-tuple (S, A, P, R, F)
  – S : finite set of states
  – A : set of actions
  – P : state and action transition probability function
  – R : reward function
  – F : transition-time probability function for each state–action pair
    • F(s', τ | s, a) : specifies the joint probability that a transition from state s to state s' occurs after τ steps with action a

• Q-learning rule for SMDPs:
  Qt+1(s,a) = (1-α)Qt(s,a) + α[R + γ^τ maxa' Qt(s',a')]
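A sketch of this SMDP Q-learning update, where R is the discounted reward accumulated while the macro-action ran and τ is its duration in primitive steps (execute_macro is a hypothetical rollout routine):

    import numpy as np

    def smdp_q_update(Q, s, macro, execute_macro, alpha=0.1, gamma=0.95):
        s_next, R, tau = execute_macro(s, macro)   # accumulated discounted reward R, duration τ
        target = R + (gamma ** tau) * Q[s_next].max()
        # Q_{t+1}(s,μ) = (1-α) Q_t(s,μ) + α [R + γ^τ max_μ' Q_t(s',μ')]
        Q[s, macro] = (1 - alpha) * Q[s, macro] + alpha * target
        return s_next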

Problem Reformulation

• Monte Carlo updates of the Q-values for the SMDP are used to learn the policy

[Figure: the POMDP is reduced to an MDP over grid points by the grid approximation of the belief space, and becomes an SMDP once macro-actions are added]

Graphical Overview of the Algorithm

b’

g

b

b’’

g3

g2

g1

Reinforcement LearningWith a Model

1. Assume a current true state
   • The physical true location of the agent
   • A state with b(s) > 0

2. Discretize the current belief state: b → gi
   • gi is the closest grid point in a regular discretization of the belief space
   • If gi is missing, add it to the table of belief states to explicitly consider
   • Interpolate its initial value from coarser resolutions

Reinforcement LearningWith a Model

3. Choose the best macro-action μ from the current Q values
   • Take a random macro-action ε% of the time
   • Interpolate over the Q values of the vertices of the sub-simplex that contains b

Reinforcement LearningWith a Model

4. Estimate E[R(gi,μ) + γ^t V(b')] by sampling:
   • Sample a state s from the current grid belief state gi
     – Choose a primitive action a according to macro-action μ
     – Sample the next state s' and observation z from the models
     – Store the reward and update the belief state
     – Set t = t+1, b = b', s = s', and repeat with the next primitive action until μ terminates
   • Compute the value of the resulting belief state b'
     – Interpolate over the vertices of the resulting belief sub-simplex
   • Repeat multiple times and average the estimates
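A sketch of this sampling estimate (sample_state, macro_policy, step_model, update_belief, and interpolated_value are hypothetical hooks for the grid belief, the macro-action's primitive policy, the transition/observation models, and the grid interpolation of V):

    import numpy as np

    def estimate_macro_value(g_i, mu, sample_state, macro_policy, step_model,
                             update_belief, interpolated_value,
                             gamma=0.95, n_samples=20):
        estimates = []
        for _ in range(n_samples):
            s = sample_state(g_i)                  # sample a state from the grid belief
            b = np.asarray(g_i, dtype=float)       # current belief (assumed normalized)
            t, R = 0, 0.0
            while True:
                a = macro_policy(mu, b)            # primitive action chosen by the macro-action
                if a is None:                      # macro-action terminates
                    break
                s, z, r = step_model(s, a)         # sample next state, observation, reward
                R += (gamma ** t) * r              # accumulate discounted reward
                b = update_belief(b, a, z)
                t += 1
            estimates.append(R + (gamma ** t) * interpolated_value(b))
        return float(np.mean(estimates))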

Reinforcement LearningWith a Model

5. Update the state–action value:
   Q(gi,μ) = (1-α)Q(gi,μ) + α[R + γ^t V(b')]

6. Update the state value:
   V(gi) = maxμ Q(gi,μ)

Reinforcement LearningWith a Model

7. Execute the macro-action starting from the belief state until termination
   • Generate observations during execution
   • Set b = b' and s = s', and go to step 2

8. Repeat this learning epoch multiple times, starting from the same b

Tested Results

• Benefits of using dynamic grid resolution
  – For 1068 world states and a resolution of 4, the number of grid points is:
    • ~5.4 × 10^10 with regular discretization
    • ~3500 with dynamic grid resolution
    • ~1000 with dynamic grid resolution and macro-actions
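A quick sanity check of the regular-discretization count, assuming the grid points are the nonnegative integer vectors over the 1068 states that sum to the resolution (a standard way of counting a regular simplex grid):

    from math import comb

    n_states, resolution = 1068, 4
    # number of length-1068 nonnegative integer vectors summing to 4: C(1068+4-1, 4)
    print(comb(n_states + resolution - 1, resolution))   # ≈ 5.45e10, i.e. ~5.4 × 10^10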

Tested Results

• Benefits of using macro-actions
  – Requires fewer training steps per episode
  – Performs better with increased grid resolution in the dynamic grid model
  – Better quality policy overall

• Benefits of using reward-shaping
  – Successful completion of the test with macro-actions: 90–95%
  – Without reward-shaping: 0–10%

What Is Next?

• Able to solve POMDPs with 1000+ states

• How do we generate a set of "good" macro-actions?
  – Current research is focused on the dynamic generation of macro-actions

End

An Extra Visualization

• This visualization (and others) was cut from the talk for brevity. However, we thought it worthwhile to include this one in the tutorial slides.

POMDP Belief State Space Search : Coastal Observation

[Visualization: frames of the belief state space search with a coastal observation over a 3-step horizon; legend: UAV (+), reward (#), state probability, total belief state reward, policy]

POMDP Policy Execution — POMDP Policy (Coastal Observation)

Policy: < Up, Up, Right, Up, Right, Up, Right, Up, Right >

[Visualization: frames of the UAV executing the coastal-observation policy; legend as above]