POMDPs
Transcript of POMDPs
Slides based on Hansen et al.'s tutorial + R&N 3rd Ed. Sec. 17.4
Planning using Partially Observable Markov Decision Processes: A Tutorial
Presenters: Eric Hansen, Mississippi State University
Daniel Bernstein, University of Massachusetts/Amherst
Zhengzhu Feng, University of Massachusetts/Amherst
Rong Zhou, Mississippi State University
Introduction and foundations
Definition of POMDP
Goals, rewards and optimality criteria
Examples and applications
Computational complexity
Belief states and Bayesian conditioning
Planning under partial observability
[Figure: an agent acts on the environment and receives imperfect observations while pursuing a goal]
Two Approaches to Planning under Partial Observability
Nondeterministic planning: uncertainty is represented by a set of possible states; no possibility is considered more likely than any other
Probabilistic (decision-theoretic) planning: uncertainty is represented by a probability distribution over possible states
In this tutorial we consider the second, more general approach
Markov models
                       Prediction             Planning
Fully observable       Markov chain           MDP (Markov decision process)
Partially observable   Hidden Markov model    POMDP (Partially observable Markov decision process)
Definition of POMDP
[Figure: influence diagram unrolled over time]
hidden states: s0, s1, s2, …
observations: z0, z1, z2, …
actions: a0, a1, a2, …
rewards: r0, r1, r2, …
Goals, rewards and optimality criteria
Rewards are additive and time-separable; the objective is to maximize expected total reward
Traditional planning goals can be encoded in the reward function. Example: achieving a state satisfying property P at minimal cost is encoded by making any state satisfying P a zero-reward absorbing state and assigning all other states negative reward.
POMDP allows partial satisfaction of goals and tradeoffs among competing goals
Planning horizon can be finite, infinite or indefinite
Machine Maintenance
Canonical application of POMDPs in Operations Research
Robot Navigation
Actions: N, S, E, W, Stop
[Figure: 4×3 grid world with a Start cell and +1 / –1 terminal cells; moves succeed with probability 0.8 and slip to either side with probability 0.1]
Canonical application of POMDPs in AI; toy example from Russell & Norvig's AI textbook
Observations: sense surrounding walls
Many other applications:
Helicopter control [Bagnell & Schneider 2001]
Dialogue management [Roy, Pineau & Thrun 2000]
Preference elicitation [Boutilier 2002]
Optimal search and sensor scheduling [Krishnamurthy & Singh 2000]
Medical diagnosis and treatment [Hauskrecht & Fraser 2000]
Packet scheduling in computer networks [Chang et al. 2000; Bent & Van Hentenryck 2004]
Computational complexity
Finite-horizon: PSPACE-hard [Papadimitriou & Tsitsiklis 1987]; NP-complete if unobservable
Infinite-horizon: undecidable [Madani, Hanks & Condon 1999]; NP-hard for ε-approximation [Lusena, Goldsmith & Mundhenk 2001]; NP-hard for the memoryless or bounded-memory control problem [Littman 1994; Meuleau et al. 1999]
POMDP is a tuple <S, A, T, R, Ω, O>
S, A, T, R as in an MDP
Ω – finite set of observations
O: S × A → Π(Ω)
Belief state (information state) b – a probability distribution over S; b(s) is the probability of state s
POMDP
Goal is to maximize expected long-term reward from the initial state distribution
State is not directly observed
[Figure: agent–world loop – action a goes out, observation o comes in]
Two sources of POMDP complexity:
Curse of dimensionality – size of state space; shared by other planning problems
Curse of memory – size of value function (number of vectors) or, equivalently, size of controller (memory); unique to POMDPs
Complexity of each iteration of DP: roughly |S|² |A| |V_{n−1}|^{|Z|}, where |S| reflects dimensionality and |V_{n−1}| memory
Two representations of policy:
A policy maps history to action. Since history grows exponentially with the horizon, it needs to be summarized, especially in the infinite-horizon case.
Two ways to summarize history:
belief state
finite-state automaton – partitions history into a finite number of "states"
Belief simplex
[Figure: with 2 states the belief simplex is the segment from (1, 0) to (0, 1); with 3 states it is the triangle with corners (1, 0, 0), (0, 1, 0), (0, 0, 1)]
Belief state has Markov property
The process of maintaining the belief state is Markovian
For any belief state, the successor belief state depends only on the action and observation
[Figure: from a belief on the interval between P(s0) = 1 and P(s0) = 0, each action a1, a2 and observation z1, z2 leads to a unique successor belief]
Belief-state MDP
State space: the belief simplex
Actions: same as before
State transition function: P(b′|b, a) = Σ_{e∈E} P(b′|b, a, e) P(e|b, a)
Reward function: r(b, a) = Σ_{s∈S} b(s) r(s, a)
Bellman optimality equation: V(b) = max_{a∈A} [ r(b, a) + Σ_{b′} P(b′|b, a) V(b′) ]
(Strictly, the sum over successor beliefs should be an integration, but only finitely many successors have nonzero probability.)
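The belief update that drives this belief-state MDP can be sketched in a few lines. This is an illustrative sketch, not code from the tutorial; the dictionaries `T` (transition function) and `O` (observation function) and the state list are hypothetical stand-ins for the POMDP model.

```python
def belief_update(b, a, z, states, T, O):
    """Bayesian belief update: b'(s') is proportional to
    P(z|s',a) * sum_s P(s'|s,a) * b(s)."""
    unnorm = {s2: O[s2][a][z] * sum(T[s][a][s2] * b[s] for s in states)
              for s2 in states}
    norm = sum(unnorm.values())  # this is P(z | b, a)
    if norm == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return {s2: p / norm for s2, p in unnorm.items()}

# Tiny two-state illustration (loosely modeled on the stay/go example
# later in these slides): staying keeps the state with prob 0.9, and a
# sensor reports the true state with prob 0.6.
states = ["s0", "s1"]
T = {"s0": {"stay": {"s0": 0.9, "s1": 0.1}},
     "s1": {"stay": {"s0": 0.1, "s1": 0.9}}}
O = {"s0": {"stay": {"z0": 0.6, "z1": 0.4}},
     "s1": {"stay": {"z0": 0.4, "z1": 0.6}}}
b1 = belief_update({"s0": 0.5, "s1": 0.5}, "stay", "z1", states, T, O)
```

Starting from the uniform belief, the predicted belief is still uniform, and the 0.6-accurate observation shifts it to (0.4, 0.6). The successor depends only on b, a, and z, which is exactly the Markov property of the belief state.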
Belief-state controller
[Figure: observation e and the current belief b (held in a register) feed the state estimator P(b′|b, a, e); the updated belief feeds the policy, which outputs action a]
Update the belief state after each action and observation. The policy maps belief state to action, and is found by solving the belief-state MDP.
“State Estimation”
POMDP as MDP in Belief Space
Dynamic Programming for POMDPs
We'll start with some important concepts:
[Figure: a belief state (e.g. s1: 0.25, s2: 0.40, s3: 0.35); a policy tree with actions a1, a2, a3 at the nodes, branching on observations o1, o2; and the linear value function the tree induces over belief space]
Dynamic Programming for POMDPs
[Figure: value functions of the one-step policy trees a1 and a2 over the belief interval from s1 to s2]
Dynamic Programming for POMDPs
[Figure: value functions of all eight two-step policy trees – each root action a1 or a2 combined with each assignment of a next action to observations o1 and o2]
Dynamic Programming for POMDPs
[Figure: the same value functions after pruning – only trees that are maximal at some belief state are kept]
Dynamic Programming for POMDPs
[Figure: the upper envelope of the remaining value functions is the horizon-2 value function]
POMDP Value Iteration: Basic Idea [Finite Horizon Case]
First Problem Solved
Key insight: the value function is piecewise linear & convex (PWLC)
Convexity makes intuitive sense:
Middle of belief space – high entropy, can't select actions appropriately, less long-term reward
Near corners of simplex – low entropy, take actions more likely to be appropriate for the current world state, gain more reward
Each line (hyperplane) is represented by the vector of its coefficients, e.g. V(b) = c1 × b(s1) + c2 × (1 − b(s1))
To find the value at b, find the vector with the largest dot product with b
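The "largest dot product" rule can be sketched directly. This is our own illustration (the function names are not from the tutorial): each α-vector stores the coefficients of one hyperplane, and the PWLC value function is their upper envelope.

```python
def pwlc_value(b, vectors):
    """V(b) = max over alpha-vectors of dot(alpha, b)."""
    return max(sum(c * p for c, p in zip(alpha, b)) for alpha in vectors)

def best_alpha(b, vectors):
    """The vector (and hence the plan) that achieves the max at b."""
    return max(vectors, key=lambda alpha: sum(c * p for c, p in zip(alpha, b)))

# Two hyperplanes over a 2-state belief space:
vectors = [[1.0, 0.0], [0.0, 1.5]]
v = pwlc_value([0.25, 0.75], vectors)  # = 1.125, from the second vector
```

The same argmax identifies which plan (policy tree) to execute at b, so one set of vectors encodes both the value function and the policy.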
Example: two states, 0 and 1; R(0) = 0, R(1) = 1
[stay]: 0.9 stay, 0.1 go; [go]: 0.9 go, 0.1 stay
Sensor reports the correct state with probability 0.6
Discount factor = 1
POMDP Value Iteration: Phase 1: One action plans
POMDP Value Iteration: Phase 2: Two action (conditional) plans
[Figure: value functions of the conditional "stay" plans over the belief interval from 0 to 1]
Point-based Value Iteration: Approximating with Exemplar Belief States
Solving infinite-horizon POMDPs
Value iteration: iterating the dynamic programming operator computes a value function that is arbitrarily close to optimal
The optimal value function is not necessarily piecewise linear, since optimal control may require infinite memory
But in many cases, as Sondik (1978) and Kaelbling et al. (1998) noticed, value iteration converges to a finite set of vectors. In these cases, an optimal policy is equivalent to a finite-state controller.
Policy evaluation
[Figure: a two-node controller (q1, q2) crossed with system states s1, s2; transitions labeled by observations o1, o2 form a Markov chain over (state, node) pairs]
As in the fully observable case, policy evaluation involves solving a system of linear equations. There is one unknown (and one equation) for each pair of system state and controller node
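These evaluation equations can be sketched as follows. Note a deliberate substitution: instead of solving the linear system exactly, this sketch iterates the equations to their fixed point, which converges for discount γ < 1. All model names (`T`, `O`, `R`, `action_of`, `delta`) are hypothetical stand-ins.

```python
def evaluate_controller(states, nodes, obs, action_of, delta, T, O, R,
                        gamma, iters=500):
    """Fixed point of
    V(s,q) = R(s,a_q) + gamma * sum_{s',z} P(s'|s,a_q) P(z|s',a_q) V(s', delta(q,z)),
    one unknown per (system state, controller node) pair."""
    V = {(s, q): 0.0 for s in states for q in nodes}
    for _ in range(iters):
        V = {(s, q): R[s][action_of[q]] + gamma * sum(
                T[s][action_of[q]][s2] * O[s2][action_of[q]][z]
                * V[(s2, delta[(q, z)])]
                for s2 in states for z in obs)
             for s in states for q in nodes}
    return V
```

With a single state, a single node whose action earns reward 1, and γ = 0.9, the fixed point is 1/(1 − γ) = 10, which the iteration approaches geometrically.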
Policy improvement
[Figure: three snapshots of a finite-state controller – nodes labeled with actions a0, a1 and transitions labeled with observations z0, z1 – together with the value functions V(b) of the nodes over the belief interval from 0 to 1, showing dominated nodes being pruned and transitions redirected]
Per-Iteration Complexity of POMDP value iteration:
Number of α-vectors needed at the t-th iteration
Time for computing each α-vector
Approximating the POMDP value function with bounds
It is possible to get approximate value functions for a POMDP in two ways:
Over-constrain it to be a NOMDP (non-observable MDP): you get the blind value function, which ignores the observations
A "conformant" policy – for infinite horizon it is the same action always (only |A| policies)
This under-estimates value (over-estimates cost)
Relax it to be a FOMDP (fully observable MDP): you assume that the state is fully observable
A "state-based" policy
This over-estimates value (under-estimates cost)
Upper bounds for leaf nodes can come from FOMDP VI and lower bounds from NOMDP VI
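Both bounds can be sketched concretely. The code below is our illustration, not the tutorial's: FOMDP value iteration gives an upper bound Σs b(s)V*(s), and the best blind (single-action) policy gives a lower bound. We run it on a two-state stay/go machine with γ = 0.9 (the slide example uses discount 1, so these numbers are only illustrative).

```python
def fomdp_values(states, actions, T, R, gamma, iters=500):
    """Value iteration for the underlying fully observable MDP."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in states)
                    for a in actions) for s in states}
    return V

def blind_values(states, actions, T, R, gamma, iters=500):
    """Value of each 'conformant' policy that repeats one action forever."""
    out = {}
    for a in actions:
        V = {s: 0.0 for s in states}
        for _ in range(iters):
            V = {s: R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in states)
                 for s in states}
        out[a] = V
    return out

def belief_bounds(b, states, actions, T, R, gamma):
    Vstar = fomdp_values(states, actions, T, R, gamma)
    upper = sum(b[s] * Vstar[s] for s in states)
    blind = blind_values(states, actions, T, R, gamma)
    lower = max(sum(b[s] * blind[a][s] for s in states) for a in actions)
    return lower, upper

# Two-state machine: reward equals the state (0 or 1); "stay" keeps the
# state with prob 0.9, "go" switches it with prob 0.9.
states = [0, 1]
actions = ["stay", "go"]
T = {0: {"stay": {0: 0.9, 1: 0.1}, "go": {0: 0.1, 1: 0.9}},
     1: {"stay": {0: 0.1, 1: 0.9}, "go": {0: 0.9, 1: 0.1}}}
R = {s: {a: float(s) for a in actions} for s in states}
lo, up = belief_bounds({0: 0.5, 1: 0.5}, states, actions, T, R, 0.9)
```

At the uniform belief this gives lower bound 5.0 and upper bound 8.6; the true POMDP value lies in between, since the real agent knows more than the blind agent and less than a fully informed one.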
Observations are written as o or z
Comparing POMDPs with Non-deterministic conditional planning
[Table: POMDP vs. the non-deterministic case]
RTDP-Bel doesn't do lookahead, and also stores the current estimate of the value function (see update)
---SLIDES BEYOND THIS NOT COVERED--
Two Problems
How to represent the value function over a continuous belief space?
How to update value function Vt from Vt−1?
POMDP → MDP:
S ⇒ B, the set of belief states
A ⇒ same
T ⇒ τ(b, a, b′)
R ⇒ ρ(b, a)
Running Example
POMDP with:
Two states (s1 and s2)
Two actions (a1 and a2)
Three observations (z1, z2, z3)
1D belief space for a 2-state POMDP: the axis is the probability that the state is s1
Second Problem
Can't iterate over all belief states (infinite) for value iteration, but…
Given vectors representing Vt−1, generate vectors representing Vt
Horizon 1
No future: the value function consists only of immediate reward
e.g. R(s1, a1) = 1, R(s2, a1) = 0, R(s1, a2) = 0, R(s2, a2) = 1.5
b = <0.25, 0.75>
Value of doing a1 = 1 × b(s1) + 0 × b(s2) = 1 × 0.25 + 0 × 0.75 = 0.25
Value of doing a2 = 0 × b(s1) + 1.5 × b(s2) = 0 × 0.25 + 1.5 × 0.75 = 1.125
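The horizon-1 computation is just one dot product per action; a minimal sketch (our own, with the reward vectors chosen to match the values computed above):

```python
# Immediate-reward vectors, one per action, over states (s1, s2).
R = {"a1": [1.0, 0.0], "a2": [0.0, 1.5]}

def horizon1(b):
    """Return (value, action): the best one-step value at belief b."""
    vals = {a: sum(r * p for r, p in zip(alpha, b)) for a, alpha in R.items()}
    best = max(vals, key=vals.get)
    return vals[best], best

value, action = horizon1([0.25, 0.75])  # value 1.125, action "a2"
```

These two reward vectors are exactly the α-vectors of the horizon-1 value function: its upper envelope over the belief interval.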
Second Problem
Break the problem down into 3 steps:
- Compute value of belief state given action and observation
- Compute value of belief state given action
- Compute value of belief state
Horizon 2 – Given action & obs
If in belief state b, what is the best value of doing action a1 and seeing z1?
Best value = best value of immediate action + best value of next action
Best value of immediate action = horizon 1 value function
Horizon 2 – Given action & obs
Assume the best immediate action is a1 and the obs is z1
What's the best action for the b′ that results from the initial b when we perform a1 and observe z1?
It is not feasible to do this for all belief states (there are infinitely many)
Horizon 2 – Given action & obs
Construct a function over the entire (initial) belief space from the horizon 1 value function, with the belief transformation built in
S(a1, z1) corresponds to the paper's S(); it has built in:
- the horizon 1 value function
- the belief transformation
- the "weight" of seeing z after performing a
- the discount factor
- the immediate reward
S() is PWLC
Second Problem
Break the problem down into 3 steps:
- Compute value of belief state given action and observation
- Compute value of belief state given action
- Compute value of belief state
Horizon 2 – Given action
What is the horizon 2 value of a belief state, given that the immediate action is a1?
Horizon 2: do action a1. Horizon 1: do action…?
Horizon 2 – Given action
What's the best strategy at b?
How to compute the line (vector) representing the best strategy at b? (easy)
How many strategies are there in the figure?
What's the max number of strategies (after taking immediate action a1)?
Horizon 2 – Given action
How can we represent the 4 regions (strategies) as a value function?
Note: each region is a strategy
Horizon 2 – Given action
Sum up the vectors representing each region. A sum of vectors is a vector (add lines, get lines). This corresponds to the paper's transformation.
Horizon 2 – Given action
What does each region represent? Why is this step hard (alluded to in the paper)?
Second Problem
Break the problem down into 3 steps:
- Compute value of belief state given action and observation
- Compute value of belief state given action
- Compute value of belief state
Horizon 2
[Figure: the value functions for immediate actions a1 and a2; taking their union (∪) and purging dominated vectors yields the horizon 2 value function]
This tells you how to act!
Second Problem
Break the problem down into 3 steps:
- Compute value of belief state given action and observation
- Compute value of belief state given action
- Compute value of belief state
Use horizon 2 value function to update horizon 3’s ...
The Hard Step
It is easy to obtain the different regions by visual inspection, but in a higher dimensional space, with many actions and observations, it is a hard problem
Naïve way – enumerate
How does Incremental Pruning do it?
Incremental Pruning
How does IP improve on the naïve method?
Will IP ever do worse than the naïve method?
Combinations
Purge/Filter
Incremental Pruning
What other novel idea(s) are in IP?
RR: come up with a smaller set D as the argument to Dominate()
RR has more linear programs but fewer constraints in the worst case. Empirically, the reduction in constraints saves more time than the extra linear programs require.
Incremental Pruning
What other novel idea(s) are in IP?
RR: come up with a smaller set D as the argument to Dominate()
Why are the terms after the union (∪) needed?
Identifying Witness
Witness Thm:
- Let Ua be a set of vectors representing the value function
- Let u be in Ua (e.g. u = αz1,a2 + αz2,a1 + αz3,a1)
- If there is a vector v which differs from u in one observation (e.g. v = αz1,a1 + αz2,a1 + αz3,a1) and there is a b such that b·v > b·u,
- then Ua is not equal to the true value function
Witness Algorithm
Randomly choose a belief state b
Compute the vector representing the best value at b (easy)
Add the vector to the agenda
While the agenda is not empty:
• Get vector Vtop from the top of the agenda
• b′ = Dominate(Vtop, Ua)
• If b′ is not null (there is a witness), compute the vector u for the best value at b′ and add it to Ua; compute all vectors v that differ from u at one observation and add them to the agenda
Linear Support
If the value function is incorrect, the biggest difference is at the edges (by convexity)
Linear Support
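For a two-state problem the linear-support idea gives a simple pruning routine: by convexity, a vector is useful only if it attains the upper envelope somewhere, and it suffices to check the simplex corners and the pairwise crossing points of the lines. A sketch under that two-state assumption (the function name is ours):

```python
def prune_2state(vectors, eps=1e-9):
    """Keep the vectors [v(s1), v(s2)] that are maximal at some belief
    (p, 1-p); checks corners and pairwise intersection points only."""
    def val(v, p):
        return v[0] * p + v[1] * (1.0 - p)
    points = [0.0, 1.0]                      # simplex corners
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            u, w = vectors[i], vectors[j]
            denom = (u[0] - u[1]) - (w[0] - w[1])
            if abs(denom) > eps:
                p = (w[1] - u[1]) / denom    # crossing point of the two lines
                if 0.0 <= p <= 1.0:
                    points.append(p)
    kept = []
    for v in vectors:
        # v survives if it reaches the upper envelope at some candidate point
        if any(val(v, p) >= max(val(w, p) for w in vectors) - eps
               for p in points) and v not in kept:
            kept.append(v)
    return kept
```

In higher dimensions the same test needs a linear program per vector, which is why pruning dominates the cost of exact algorithms.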
Number of policy trees: |A|^((|Z|^T − 1)/(|Z| − 1)) at horizon T (one action choice per node of the depth-T observation tree)
Example for |A| = 4 and |Z| = 2:
Horizon   # of policy trees
0         1
1         4
2         64
3         16,384
4         1,073,741,824
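The count follows from the fact that a depth-T policy tree has (|Z|^T − 1)/(|Z| − 1) nodes, each of which independently chooses one of |A| actions. A quick sketch reproducing the table (assumes |Z| ≥ 2):

```python
def num_policy_trees(num_actions, num_obs, horizon):
    """|A| ** ((|Z|**T - 1) // (|Z| - 1)); assumes |Z| >= 2."""
    nodes = (num_obs ** horizon - 1) // (num_obs - 1)
    return num_actions ** nodes

counts = [num_policy_trees(4, 2, t) for t in range(5)]
# [1, 4, 64, 16384, 1073741824]
```

The doubly exponential growth in the horizon is why exhaustive enumeration is hopeless and pruning is essential.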
Policy graph
[Figure: a finite-state policy graph for a 2-state problem – nodes labeled with actions a0, a1, a2, transitions labeled with observations z0, z1 – together with the value function V(b) induced by its nodes]
Policy iteration for POMDPs
Sondik's (1978) algorithm represents the policy as a mapping from belief states to actions: only works under special assumptions, very difficult to implement, never used
Hansen's (1998) algorithm represents the policy as a finite-state controller: fully general, easy to implement, faster than value iteration
Properties of policy iteration
Theoretical:
Monotonically improves the finite-state controller
Converges to an ε-optimal finite-state controller after a finite number of iterations
Empirical:
Runs from 10 to over 100 times faster than value iteration
Scaling up
State abstraction and factored representation
Belief compression
Forward search and sampling approaches
Hierarchical task decomposition
State abstraction and factored representation of POMDP
DP algorithms are typically state-based; most AI representations are "feature-based"
|S| is typically exponential in the number of features (or variables) – the "curse of dimensionality"
State-based representations for problems with more than a few variables are impractical
Factored representations exploit regularities in the transition probabilities, observation probabilities, and reward
Example: Part-painting problem [Draper, Hanks, Weld 1994]
Boolean state variables: flawed (FL), blemished (BL), painted (PA), processed (PR), notified (NO)
Actions: Inspect, Paint, Ship, Reject, Notify
Cost function: cost of 1 for each action; cost of 1 for shipping an unflawed part that is not painted; cost of 10 for shipping a flawed part or rejecting an unflawed part
Initial belief state: Pr(FL) = 0.3, Pr(BL|FL) = 1.0, Pr(BL|¬FL) = 0.0, Pr(PA) = 0.0, Pr(PR) = 0.0, Pr(NO) = 0.0
Factored representation of MDP[Boutilier et al. 1995; Hoey, St. Aubin, Hu, & Boutilier 1999]
Dynamic Bayesian network captures variable independence Algebraic decision diagram captures value independence
[Figure: dynamic Bayesian network linking each variable to its primed successor, with probability tables, e.g.:
  FL   FL′
  T    1.0
  F    0.0

  PA   SH   RE   NO   PA′
  T    T/F  T/F  T/F  1.0
  F    F    F    F    0.95
  F    T    T/F  T/F  0.0
  F    T/F  T    T/F  0.0
  F    T/F  T/F  T    0.0
and the equivalent decision diagrams]
Decision diagrams
[Figure: a binary decision diagram (BDD) over variables X, Y, Z with terminals TRUE / FALSE, and algebraic decision diagrams (ADDs) with numeric terminals such as 5.8, 3.6 and 18.6, 9.5]
Operations: addition (subtraction), multiplication (division), minimum (maximum), marginalization, expected value
Complexity of the operators depends on the size of the decision diagrams, not the number of states!
Operations on decision diagrams
[Figure: adding two ADDs node by node – one with terminals 10.0, 20.0, 30.0 and one with terminals 1.0, 2.0, 3.0, 33.0 – yields an ADD with terminals 11.0, 12.0, 22.0, 23.0]
Symbolic dynamic programming for factored POMDPs [Hansen & Feng 2000]
Factored representation of value function: replace |S|-vectors with ADDs that only make relevant state distinctions
Two steps of DP algorithm Generate new ADDs for value function Prune dominated ADDs
State abstraction is based on aggregating states with the same value
[Figure: DP backup for a factored POMDP – the action at time t, transition probabilities, observation probabilities for obs1, obs2, obs3, and the action reward combine to give the value at time t+1]
Generation step: symbolic implementation
Pruning step: symbolic implementation
Pruning is the most computationally expensive part of the algorithm: a linear program must be solved for each (potential) ADD in the value function. Because state abstraction reduces the dimensionality of the linear programs, it significantly improves efficiency.
Improved performance
Degree of abstraction = (number of abstract states) / (number of primitive states)
Speedup factors:
Test problem   Degree of abstraction   Generate   Prune
1              0.01                    42         26
2              0.03                    17         11
3              0.10                    0.4        3
4              0.12                    0.8        0.6
5              0.44                    -3.4       0.4
6              0.65                    -0.7       0.1
7              1.00                    -6.5       -0.1
Optimal plan (controller) for part-painting problem
[Figure: finite-state controller using actions Inspect, Paint, Ship, Reject, Notify; transitions labeled OK / ~OK; nodes annotated with abstract states over the variables FL, BL, PA, PR, NO]
Approximate state aggregation
ε = 0.4
Simplify each ADD in the value function by merging leaves that differ in value by less than ε.
Approximate pruning
Prune vectors from the value function that add less than ε to the value of any belief state
[Figure: vectors a1–a4 over the belief interval from (0,1) to (1,0)]
Error bound
These two methods of approximation share the same error bound
"Weak convergence," i.e., convergence to within 2ε/(1−γ) of optimal (where γ is the discount factor)
After "weak convergence," decreasing ε allows further improvement
Starting with a relatively high ε and gradually decreasing it accelerates convergence
Strategy: ignore differences of value less than some threshold ε
Complementary methods: approximate state aggregation and approximate pruning address the two sources of complexity – the size of the state space and the size of the value function (memory)
Approximate dynamic programming
Belief compression
Reduce the dimensionality of belief space by approximating the belief state
Examples of approximate belief states:
tuple of most-likely state plus entropy of the belief state [Roy & Thrun 1999]
belief features learned by exponential-family Principal Components Analysis [Roy & Gordon 2003]
Standard POMDP algorithms can be applied in the lower-dimensional belief space, e.g., grid-based approximation
Forward search
[Figure: search tree rooted at the current belief, branching first on actions a0, a1, then on observations z0, z1, and so on to a fixed depth]
Sparse sampling
Forward search can be combined with Monte Carlo sampling of possible observations and action outcomes [Kearns et al 2000; Ng & Jordan 2000]
Remarkably, the complexity is independent of the size of the state space!
An on-line planner selects an ε-optimal action for the current belief state
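A depth-limited sampled lookahead over the belief-state MDP can be sketched as follows. Here `step` stands in for a generative model that samples a reward and successor belief for (b, a); all names are our own illustration, not the algorithm of Kearns et al.

```python
def sampled_value(b, depth, actions, step, gamma, width):
    """Estimate V(b) by depth-limited lookahead, sampling `width`
    successors per action instead of enumerating all outcomes."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(width):
            r, b2 = step(b, a)  # one sample from the generative model
            total += r + gamma * sampled_value(b2, depth - 1, actions,
                                               step, gamma, width)
        best = max(best, total / width)
    return best
```

The search tree has (|A| × width)^depth leaves regardless of |S|, which is the source of the state-space-independence claim above; the price is an estimate that is only ε-accurate with high probability.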
State-space decomposition
For some POMDPs, each action/observation pair identifies a specific region of the state space
Motivating Example Continued
A "deterministic observation" reveals that the world is in one of a small number of possible states
The same holds for "hybrid POMDPs", which are POMDPs with some fully observable and some partially observable state variables
Region-based dynamic programming
Tetrahedron and surfaces
Hierarchical task decomposition
We have considered abstraction in state space; now we consider abstraction in action space
For fully observable MDPs:
Options [Sutton, Precup & Singh 1999]
HAMs [Parr & Russell 1997]
Region-based decomposition [Hauskrecht et al 1998]
MAXQ [Dietterich 2000]
A hierarchical approach may cause sub-optimality, but limited forms of optimality can be guaranteed: hierarchical optimality (Parr and Russell); recursive optimality (Dietterich)
Hierarchical approaches to POMDPs
Theocharous & Mahadevan (2002): based on hierarchical hidden Markov models; approximation; ~1000-state robot hallway-navigation problem
Pineau et al (2003): based on Dietterich's MAXQ decomposition; approximation; ~1000-state robot navigation and dialogue
Hansen & Zhou (2003): also based on Dietterich's MAXQ decomposition; convergence guarantees and ε-optimality
Macro action as finite-state controller
Allows exact modeling of macro’s effects macro state transition probabilities macro rewards
[Figure: a navigation macro as a finite-state controller over the actions North, South, East, West, Stop, with transitions labeled wall / clear / goal]
Taxi example [Dietterich 2000]
Task hierarchy [Dietterich 2000]
[Figure: the Taxi task decomposes into Get and Put; Get into Pickup and Navigate, Put into Putdown and Navigate; Navigate into the primitive actions North, South, East, West]
Hierarchical finite-state controller
[Figure: the controllers for Get and Put invoke sub-controllers for Navigate, Pickup, and Putdown, which bottom out in the primitive actions North, South, East, West, Stop]
MAXQ-hierarchical policy iteration
Create an initial sub-controller for each sub-POMDP in the hierarchy
Repeat until the error bound is less than ε:
Identify the subtask that contributes most to the overall error
Use policy iteration to improve the corresponding controller
For each node of the controller, create an abstract action (for the parent task) and compute its model
Propagate error up through the hierarchy
Modular structure of controller
Complexity reduction
Per-iteration complexity of policy iteration: |A| |Q|^|Z|, where A is the set of actions, Z the set of observations, and Q the set of controller nodes
Per-iteration complexity of hierarchical PI: |A| Σᵢ |Qᵢ|^|Z|, where |Q| = Σᵢ |Qᵢ|
With hierarchical decomposition, the complexity is a sum over the subproblems instead of a product
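The reduction is easy to see numerically. A small illustration with made-up sizes (|A| = 4, |Z| = 2, a 30-node controller split into three 10-node sub-controllers; these numbers are our own, not from the tutorial):

```python
def flat_cost(num_actions, num_nodes, num_obs):
    """Per-iteration cost of flat policy iteration: |A| * |Q|**|Z|."""
    return num_actions * num_nodes ** num_obs

def hier_cost(num_actions, sub_sizes, num_obs):
    """Hierarchical PI: |A| * sum_i |Q_i|**|Z|, with |Q| = sum_i |Q_i|."""
    return num_actions * sum(q ** num_obs for q in sub_sizes)

flat = flat_cost(4, 30, 2)            # 4 * 900 = 3600
hier = hier_cost(4, [10, 10, 10], 2)  # 4 * 300 = 1200
```

The gap widens quickly as |Z| or the controller size grows, since the flat cost is polynomial of degree |Z| in the full |Q| while the hierarchical cost is that polynomial applied only to each piece.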
Scalability
MAXQ-hierarchical policy iteration can solve any POMDP, provided it can decompose it into sub-POMDPs that can be solved by policy iteration
Although each sub-controller is limited in size, the hierarchical controller is not limited in size
Although the (abstract) state space of each subtask is limited in size, the total state space is not limited in size
Multi-Agent Planning with POMDPs
Partially observable stochastic games
Generalized dynamic programming
Multi-Agent Planning with POMDPs
Many planning problems involve multiple agents acting in a partially observable environment
The POMDP framework can be extended to address this
[Figure: two agents interact with the world – agent 1 takes action a1 and receives observation z1 and reward r1; agent 2 takes action a2 and receives z2 and r2]
Partially observable stochastic game (POSG)
A POSG is <S, A1, A2, Z1, Z2, P, r1, r2>, where
S is a finite state set, with initial state s0
A1, A2 are finite action sets
Z1, Z2 are finite observation sets
P(s′|s, a1, a2) is the state transition function
P(z1, z2|s, a1, a2) is the observation function
r1(s, a1, a2) and r2(s, a1, a2) are reward functions
Special cases: all agents share the same reward function; zero-sum games
Plans and policies
A local policy is a mapping πᵢ : Zᵢ* → Aᵢ
A joint policy is a pair <π1, π2>
Each agent wants to maximize its own long-term expected reward
Although execution is distributed, planning can be centralized
Beliefs in POSGs
With a single agent, a belief is a distribution over states
How does this generalize to multiple agents?
Could have beliefs over beliefs over beliefs, but there is no algorithm for working with these
Example
[Figure: two robots on a grid]
States: grid cell pairs
Actions: [four movement directions]
Transitions: noisy
Goal: pick up balls
Observations: red lines (see figure)
Another Example
States: who has a message to send?
Actions: send or don't send
Reward: +1 for a successful broadcast, 0 if collision or channel not used
Observations: was there a collision? (noisy)
[Figure: two nodes with message buffers sharing a broadcast channel]
Strategy Elimination in POSGs
Could simply convert to normal form, but the number of strategies is doubly exponential in the horizon length
[Figure: normal-form payoff matrix with entries (R11¹, R11²) … (R1n¹, R1n²), …, (Rm1¹, Rm1²) … (Rmn¹, Rmn²)]
Generalized dynamic programming
Initialize 1-step policy trees to be actions
Repeat:
Evaluate all pairs of t-step trees from the current sets
Iteratively prune dominated policy trees
Form exhaustive sets of (t+1)-step trees from the remaining t-step trees
What Generalized DP Does
The algorithm performs iterated elimination of dominated strategies in the normal form game without first writing it down
For cooperative POSGs, the final sets contain the optimal joint policy
Some Implementation Issues
As before, pruning can be done using linear programming
The algorithm keeps the value function and policy trees in memory (unlike the POMDP case)
Currently there is no way to prune in an incremental fashion
A Better Way to Do Elimination
We use dynamic programming to eliminate dominated strategies without first converting to normal form
Pruning a subtree eliminates the set of trees containing it
[Figure: pruning one two-step policy tree eliminates every larger tree that contains it as a subtree]
Dynamic Programming
Build the policy tree sets simultaneously
Prune using a generalized belief space
[Figure: agent 1's generalized belief space is over pairs (system state, agent 2's policy tree); agent 2's is over pairs (system state, agent 1's policy tree)]
Dynamic Programming
[Figure: each agent starts with the one-step trees a1 and a2]
Dynamic Programming
[Figure: all two-step policy trees for both agents – each root action combined with each assignment of next actions to observations o1, o2]
Dynamic Programming
[Figure: dominated trees are pruned from agent 1's set]
Dynamic Programming
[Figure: dominated trees are pruned from agent 2's set]
Dynamic Programming
[Figure: pruning alternates between the agents, shrinking both sets until no more trees are dominated]
Complexity of POSGs
The cooperative finite-horizon case is NEXP-hard, even with two agents whose observations completely determine the state [Bernstein et al. 2002]
Implications:
The problem is provably intractable (because P ≠ NEXP)
It probably requires doubly exponential time to solve in the worst case