POMDPs
Transcript of POMDPs
Slides based on Hansen et al.'s tutorial + R&N 3rd Ed. Sec. 17.4
Planning using Partially Observable Markov Decision Processes: A Tutorial
Presenters: Eric Hansen, Mississippi State University
Daniel Bernstein, University of Massachusetts/Amherst
Zhengzhu Feng, University of Massachusetts/Amherst
Rong Zhou, Mississippi State University
Introduction and foundations
Definition of POMDP
Goals, rewards and optimality criteria
Examples and applications
Computational complexity
Belief states and Bayesian conditioning
Planning under partial observability
[Figure: an agent acts on the environment and receives imperfect observations while pursuing a goal]
Two Approaches to Planning under Partial Observability
Nondeterministic planning: uncertainty is represented by a set of possible states; no possibility is considered more likely than any other
Probabilistic (decision-theoretic) planning: uncertainty is represented by a probability distribution over possible states
In this tutorial we consider the second, more general approach
Markov models
                       Prediction             Planning
Fully observable       Markov chain           MDP (Markov decision process)
Partially observable   Hidden Markov model    POMDP (Partially observable Markov decision process)
Definition of POMDP
[Figure: influence diagram unrolled over time]
hidden states: s0, s1, s2, …
observations: z0, z1, z2, …
actions: a0, a1, a2, …
rewards: r0, r1, r2, …
Goals, rewards and optimality criteria
Rewards are additive and time-separable; the objective is to maximize expected total reward
Traditional planning goals can be encoded in the reward function. Example: achieving a state satisfying property P at minimal cost is encoded by making any state satisfying P a zero-reward absorbing state and assigning all other states negative reward.
POMDP allows partial satisfaction of goals and tradeoffs among competing goals
Planning horizon can be finite, infinite or indefinite
Machine Maintenance
Canonical application of POMDPs in Operations Research
Robot Navigation
Actions: N, S, E, W, Stop
[Figure: 4×3 grid world with a Start cell and +1 / –1 terminal cells; moves succeed with probability 0.8 and slip to either side with probability 0.1]
Canonical application of POMDPs in AI; toy example from Russell & Norvig's AI textbook
Observations: sense surrounding walls
Many other applications:
Helicopter control [Bagnell & Schneider 2001]
Dialogue management [Roy, Pineau & Thrun 2000]
Preference elicitation [Boutilier 2002]
Optimal search and sensor scheduling [Krishnamurthy & Singh 2000]
Medical diagnosis and treatment [Hauskrecht & Fraser 2000]
Packet scheduling in computer networks [Chang et al. 2000; Bent & Van Hentenryck 2004]
Computational complexity
Finite-horizon: PSPACE-hard [Papadimitriou & Tsitsiklis 1987]; NP-complete if unobservable
Infinite-horizon: undecidable [Madani, Hanks & Condon 1999]; NP-hard for ε-approximation [Lusena, Goldsmith & Mundhenk 2001]; NP-hard for the memoryless or bounded-memory control problem [Littman 1994; Meuleau et al. 1999]
POMDP is a tuple <S, A, T, R, Ω, O>
S, A, T, R as in an MDP
Ω – finite set of observations
O: S × A → Π(Ω)
Belief state (information state) b – a probability distribution over S; b(s) is the probability of state s
POMDP
Goal is to maximize expected long-term reward from the initial state distribution
State is not directly observed
[Figure: agent–world loop – action a goes out, observation o comes in]
Two sources of POMDP complexity:
Curse of dimensionality – size of state space; shared by other planning problems
Curse of memory – size of value function (number of vectors) or, equivalently, size of controller (memory); unique to POMDPs
Complexity of each iteration of DP: roughly |S|² |A| |V_{n−1}|^{|Z|}, where |S| reflects dimensionality and |V_{n−1}| memory
Two representations of policy:
A policy maps history to action. Since history grows exponentially with the horizon, it needs to be summarized, especially in the infinite-horizon case.
Two ways to summarize history:
belief state
finite-state automaton – partitions history into a finite number of "states"
Belief simplex
[Figure: with 2 states the belief simplex is the segment from (1, 0) to (0, 1); with 3 states it is the triangle with corners (1, 0, 0), (0, 1, 0), (0, 0, 1)]
Belief state has Markov property
The process of maintaining the belief state is Markovian
For any belief state, the successor belief state depends only on the action and observation
[Figure: from a belief on the interval between P(s0) = 1 and P(s0) = 0, each action a1, a2 and observation z1, z2 leads to a unique successor belief]
Belief-state MDP
State space: the belief simplex
Actions: same as before
State transition function: P(b′|b, a) = Σ_{e∈E} P(b′|b, a, e) P(e|b, a)
Reward function: r(b, a) = Σ_{s∈S} b(s) r(s, a)
Bellman optimality equation: V(b) = max_{a∈A} [ r(b, a) + Σ_{b′} P(b′|b, a) V(b′) ]
(Strictly, the sum over successor beliefs should be an integration, but only finitely many successors have nonzero probability.)
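The belief update that drives this belief-state MDP can be sketched in a few lines. This is an illustrative sketch, not code from the tutorial; the dictionaries `T` (transition function) and `O` (observation function) and the state list are hypothetical stand-ins for the POMDP model.

```python
def belief_update(b, a, z, states, T, O):
    """Bayesian belief update: b'(s') is proportional to
    P(z|s',a) * sum_s P(s'|s,a) * b(s)."""
    unnorm = {s2: O[s2][a][z] * sum(T[s][a][s2] * b[s] for s in states)
              for s2 in states}
    norm = sum(unnorm.values())  # this is P(z | b, a)
    if norm == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return {s2: p / norm for s2, p in unnorm.items()}

# Tiny two-state illustration (loosely modeled on the stay/go example
# later in these slides): staying keeps the state with prob 0.9, and a
# sensor reports the true state with prob 0.6.
states = ["s0", "s1"]
T = {"s0": {"stay": {"s0": 0.9, "s1": 0.1}},
     "s1": {"stay": {"s0": 0.1, "s1": 0.9}}}
O = {"s0": {"stay": {"z0": 0.6, "z1": 0.4}},
     "s1": {"stay": {"z0": 0.4, "z1": 0.6}}}
b1 = belief_update({"s0": 0.5, "s1": 0.5}, "stay", "z1", states, T, O)
```

Starting from the uniform belief, the predicted belief is still uniform, and the 0.6-accurate observation shifts it to (0.4, 0.6). The successor depends only on b, a, and z, which is exactly the Markov property of the belief state.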
Belief-state controller
[Figure: observation e and the current belief b (held in a register) feed the state estimator P(b′|b, a, e); the updated belief feeds the policy, which outputs action a]
Update the belief state after each action and observation. The policy maps belief state to action, and is found by solving the belief-state MDP.
“State Estimation”
POMDP as MDP in Belief Space
Dynamic Programming for POMDPs
We'll start with some important concepts:
[Figure: a belief state (e.g. s1: 0.25, s2: 0.40, s3: 0.35); a policy tree with actions a1, a2, a3 at the nodes, branching on observations o1, o2; and the linear value function the tree induces over belief space]
Dynamic Programming for POMDPs
[Figure: value functions of the one-step policy trees a1 and a2 over the belief interval from s1 to s2]
Dynamic Programming for POMDPs
[Figure: value functions of all eight two-step policy trees – each root action a1 or a2 combined with each assignment of a next action to observations o1 and o2]
Dynamic Programming for POMDPs
[Figure: the same value functions after pruning – only trees that are maximal at some belief state are kept]
Dynamic Programming for POMDPs
[Figure: the upper envelope of the remaining value functions is the horizon-2 value function]
POMDP Value Iteration: Basic Idea [Finite Horizon Case]
First Problem Solved
Key insight: the value function is piecewise linear & convex (PWLC)
Convexity makes intuitive sense:
Middle of belief space – high entropy, can't select actions appropriately, less long-term reward
Near corners of simplex – low entropy, take actions more likely to be appropriate for the current world state, gain more reward
Each line (hyperplane) is represented by the vector of its coefficients, e.g. V(b) = c1 × b(s1) + c2 × (1 − b(s1))
To find the value at b, find the vector with the largest dot product with b
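The "largest dot product" rule can be sketched directly. This is our own illustration (the function names are not from the tutorial): each α-vector stores the coefficients of one hyperplane, and the PWLC value function is their upper envelope.

```python
def pwlc_value(b, vectors):
    """V(b) = max over alpha-vectors of dot(alpha, b)."""
    return max(sum(c * p for c, p in zip(alpha, b)) for alpha in vectors)

def best_alpha(b, vectors):
    """The vector (and hence the plan) that achieves the max at b."""
    return max(vectors, key=lambda alpha: sum(c * p for c, p in zip(alpha, b)))

# Two hyperplanes over a 2-state belief space:
vectors = [[1.0, 0.0], [0.0, 1.5]]
v = pwlc_value([0.25, 0.75], vectors)  # = 1.125, from the second vector
```

The same argmax identifies which plan (policy tree) to execute at b, so one set of vectors encodes both the value function and the policy.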
Example: two states, 0 and 1; R(0) = 0, R(1) = 1
[stay]: 0.9 stay, 0.1 go; [go]: 0.9 go, 0.1 stay
Sensor reports the correct state with probability 0.6
Discount factor = 1
POMDP Value Iteration: Phase 1: One action plans
POMDP Value Iteration: Phase 2: Two action (conditional) plans
[Figure: value functions of the conditional "stay" plans over the belief interval from 0 to 1]
Point-based Value Iteration: Approximating with Exemplar Belief States
Solving infinite-horizon POMDPs
Value iteration: iterating the dynamic programming operator computes a value function that is arbitrarily close to optimal
The optimal value function is not necessarily piecewise linear, since optimal control may require infinite memory
But in many cases, as Sondik (1978) and Kaelbling et al. (1998) noticed, value iteration converges to a finite set of vectors. In these cases, an optimal policy is equivalent to a finite-state controller.
Policy evaluation
[Figure: a two-node controller (q1, q2) crossed with system states s1, s2; transitions labeled by observations o1, o2 form a Markov chain over (state, node) pairs]
As in the fully observable case, policy evaluation involves solving a system of linear equations. There is one unknown (and one equation) for each pair of system state and controller node
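These evaluation equations can be sketched as follows. Note a deliberate substitution: instead of solving the linear system exactly, this sketch iterates the equations to their fixed point, which converges for discount γ < 1. All model names (`T`, `O`, `R`, `action_of`, `delta`) are hypothetical stand-ins.

```python
def evaluate_controller(states, nodes, obs, action_of, delta, T, O, R,
                        gamma, iters=500):
    """Fixed point of
    V(s,q) = R(s,a_q) + gamma * sum_{s',z} P(s'|s,a_q) P(z|s',a_q) V(s', delta(q,z)),
    one unknown per (system state, controller node) pair."""
    V = {(s, q): 0.0 for s in states for q in nodes}
    for _ in range(iters):
        V = {(s, q): R[s][action_of[q]] + gamma * sum(
                T[s][action_of[q]][s2] * O[s2][action_of[q]][z]
                * V[(s2, delta[(q, z)])]
                for s2 in states for z in obs)
             for s in states for q in nodes}
    return V
```

With a single state, a single node whose action earns reward 1, and γ = 0.9, the fixed point is 1/(1 − γ) = 10, which the iteration approaches geometrically.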
Policy improvement
[Figure: three snapshots of a finite-state controller – nodes labeled with actions a0, a1 and transitions labeled with observations z0, z1 – together with the value functions V(b) of the nodes over the belief interval from 0 to 1, showing dominated nodes being pruned and transitions redirected]
Per-Iteration Complexity of POMDP value iteration:
Number of α-vectors needed at the t-th iteration
Time for computing each α-vector
Approximating the POMDP value function with bounds
It is possible to get approximate value functions for a POMDP in two ways:
Over-constrain it to be a NOMDP (non-observable MDP): you get the blind value function, which ignores the observations
A "conformant" policy – for infinite horizon it is the same action always (only |A| policies)
This under-estimates value (over-estimates cost)
Relax it to be a FOMDP (fully observable MDP): you assume that the state is fully observable
A "state-based" policy
This over-estimates value (under-estimates cost)
Upper bounds for leaf nodes can come from FOMDP VI and lower bounds from NOMDP VI
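Both bounds can be sketched concretely. The code below is our illustration, not the tutorial's: FOMDP value iteration gives an upper bound Σs b(s)V*(s), and the best blind (single-action) policy gives a lower bound. We run it on a two-state stay/go machine with γ = 0.9 (the slide example uses discount 1, so these numbers are only illustrative).

```python
def fomdp_values(states, actions, T, R, gamma, iters=500):
    """Value iteration for the underlying fully observable MDP."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in states)
                    for a in actions) for s in states}
    return V

def blind_values(states, actions, T, R, gamma, iters=500):
    """Value of each 'conformant' policy that repeats one action forever."""
    out = {}
    for a in actions:
        V = {s: 0.0 for s in states}
        for _ in range(iters):
            V = {s: R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in states)
                 for s in states}
        out[a] = V
    return out

def belief_bounds(b, states, actions, T, R, gamma):
    Vstar = fomdp_values(states, actions, T, R, gamma)
    upper = sum(b[s] * Vstar[s] for s in states)
    blind = blind_values(states, actions, T, R, gamma)
    lower = max(sum(b[s] * blind[a][s] for s in states) for a in actions)
    return lower, upper

# Two-state machine: reward equals the state (0 or 1); "stay" keeps the
# state with prob 0.9, "go" switches it with prob 0.9.
states = [0, 1]
actions = ["stay", "go"]
T = {0: {"stay": {0: 0.9, 1: 0.1}, "go": {0: 0.1, 1: 0.9}},
     1: {"stay": {0: 0.1, 1: 0.9}, "go": {0: 0.9, 1: 0.1}}}
R = {s: {a: float(s) for a in actions} for s in states}
lo, up = belief_bounds({0: 0.5, 1: 0.5}, states, actions, T, R, 0.9)
```

At the uniform belief this gives lower bound 5.0 and upper bound 8.6; the true POMDP value lies in between, since the real agent knows more than the blind agent and less than a fully informed one.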
Observations are written as o or z
Comparing POMDPs with Non-deterministic conditional planning
[Table: POMDP vs. the non-deterministic case]
RTDP-Bel doesn't do lookahead, and also stores the current estimate of the value function (see update)
---SLIDES BEYOND THIS NOT COVERED--
Two Problems
How to represent the value function over a continuous belief space?
How to update value function Vt from Vt−1?
POMDP → MDP:
S ⇒ B, the set of belief states
A ⇒ same
T ⇒ τ(b, a, b′)
R ⇒ ρ(b, a)
Running Example
POMDP with:
Two states (s1 and s2)
Two actions (a1 and a2)
Three observations (z1, z2, z3)
1D belief space for a 2-state POMDP: the axis is the probability that the state is s1
Second Problem
Can't iterate over all belief states (infinite) for value iteration, but…
Given vectors representing Vt−1, generate vectors representing Vt
Horizon 1
No future: the value function consists only of immediate reward
e.g. R(s1, a1) = 1, R(s2, a1) = 0, R(s1, a2) = 0, R(s2, a2) = 1.5
b = <0.25, 0.75>
Value of doing a1 = 1 × b(s1) + 0 × b(s2) = 1 × 0.25 + 0 × 0.75 = 0.25
Value of doing a2 = 0 × b(s1) + 1.5 × b(s2) = 0 × 0.25 + 1.5 × 0.75 = 1.125
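The horizon-1 computation is just one dot product per action; a minimal sketch (our own, with the reward vectors chosen to match the values computed above):

```python
# Immediate-reward vectors, one per action, over states (s1, s2).
R = {"a1": [1.0, 0.0], "a2": [0.0, 1.5]}

def horizon1(b):
    """Return (value, action): the best one-step value at belief b."""
    vals = {a: sum(r * p for r, p in zip(alpha, b)) for a, alpha in R.items()}
    best = max(vals, key=vals.get)
    return vals[best], best

value, action = horizon1([0.25, 0.75])  # value 1.125, action "a2"
```

These two reward vectors are exactly the α-vectors of the horizon-1 value function: its upper envelope over the belief interval.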
Second Problem
Break the problem down into 3 steps:
- Compute value of belief state given action and observation
- Compute value of belief state given action
- Compute value of belief state
Horizon 2 – Given action & obs
If in belief state b, what is the best value of doing action a1 and seeing z1?
Best value = best value of immediate action + best value of next action
Best value of immediate action = horizon 1 value function
Horizon 2 – Given action & obs
Assume the best immediate action is a1 and the obs is z1
What's the best action for the b′ that results from the initial b when we perform a1 and observe z1?
It is not feasible to do this for all belief states (there are infinitely many)
Horizon 2 – Given action & obs
Construct a function over the entire (initial) belief space from the horizon 1 value function, with the belief transformation built in
S(a1, z1) corresponds to the paper's S(); it has built in:
- the horizon 1 value function
- the belief transformation
- the "weight" of seeing z after performing a
- the discount factor
- the immediate reward
S() is PWLC
Second Problem
Break the problem down into 3 steps:
- Compute value of belief state given action and observation
- Compute value of belief state given action
- Compute value of belief state
Horizon 2 – Given action
What is the horizon 2 value of a belief state, given that the immediate action is a1?
Horizon 2: do action a1. Horizon 1: do action…?
Horizon 2 – Given action
What's the best strategy at b?
How to compute the line (vector) representing the best strategy at b? (easy)
How many strategies are there in the figure?
What's the max number of strategies (after taking immediate action a1)?
Horizon 2 – Given action
How can we represent the 4 regions (strategies) as a value function?
Note: each region is a strategy
Horizon 2 – Given action
Sum up the vectors representing each region. A sum of vectors is a vector (add lines, get lines). This corresponds to the paper's transformation.
Horizon 2 – Given action
What does each region represent? Why is this step hard (alluded to in the paper)?
Second Problem
Break the problem down into 3 steps:
- Compute value of belief state given action and observation
- Compute value of belief state given action
- Compute value of belief state
Horizon 2
[Figure: the value functions for immediate actions a1 and a2; taking their union (∪) and purging dominated vectors yields the horizon 2 value function]
This tells you how to act!
Second Problem
Break the problem down into 3 steps:
- Compute value of belief state given action and observation
- Compute value of belief state given action
- Compute value of belief state
Use horizon 2 value function to update horizon 3’s ...
The Hard Step
It is easy to obtain the different regions by visual inspection, but in a higher dimensional space, with many actions and observations, it is a hard problem
Naïve way – enumerate
How does Incremental Pruning do it?
Incremental Pruning
How does IP improve on the naïve method?
Will IP ever do worse than the naïve method?
Combinations
Purge/Filter
Incremental Pruning
What other novel idea(s) are in IP?
RR: come up with a smaller set D as the argument to Dominate()
RR has more linear programs but fewer constraints in the worst case. Empirically, the reduction in constraints saves more time than the extra linear programs require.
Incremental Pruning
What other novel idea(s) are in IP?
RR: come up with a smaller set D as the argument to Dominate()
Why are the terms after the union (∪) needed?
Identifying Witness
Witness Thm:
- Let Ua be a set of vectors representing the value function
- Let u be in Ua (e.g. u = αz1,a2 + αz2,a1 + αz3,a1)
- If there is a vector v which differs from u in one observation (e.g. v = αz1,a1 + αz2,a1 + αz3,a1) and there is a b such that b·v > b·u,
- then Ua is not equal to the true value function
Witness Algorithm
Randomly choose a belief state b
Compute the vector representing the best value at b (easy)
Add the vector to the agenda
While the agenda is not empty:
• Get vector Vtop from the top of the agenda
• b′ = Dominate(Vtop, Ua)
• If b′ is not null (there is a witness), compute the vector u for the best value at b′ and add it to Ua; compute all vectors v that differ from u at one observation and add them to the agenda
Linear Support
If the value function is incorrect, the biggest difference is at the edges (by convexity)
Linear Support
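For a two-state problem the linear-support idea gives a simple pruning routine: by convexity, a vector is useful only if it attains the upper envelope somewhere, and it suffices to check the simplex corners and the pairwise crossing points of the lines. A sketch under that two-state assumption (the function name is ours):

```python
def prune_2state(vectors, eps=1e-9):
    """Keep the vectors [v(s1), v(s2)] that are maximal at some belief
    (p, 1-p); checks corners and pairwise intersection points only."""
    def val(v, p):
        return v[0] * p + v[1] * (1.0 - p)
    points = [0.0, 1.0]                      # simplex corners
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            u, w = vectors[i], vectors[j]
            denom = (u[0] - u[1]) - (w[0] - w[1])
            if abs(denom) > eps:
                p = (w[1] - u[1]) / denom    # crossing point of the two lines
                if 0.0 <= p <= 1.0:
                    points.append(p)
    kept = []
    for v in vectors:
        # v survives if it reaches the upper envelope at some candidate point
        if any(val(v, p) >= max(val(w, p) for w in vectors) - eps
               for p in points) and v not in kept:
            kept.append(v)
    return kept
```

In higher dimensions the same test needs a linear program per vector, which is why pruning dominates the cost of exact algorithms.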
Number of policy trees: |A|^((|Z|^T − 1)/(|Z| − 1)) at horizon T (one action choice per node of the depth-T observation tree)
Example for |A| = 4 and |Z| = 2:
Horizon   # of policy trees
0         1
1         4
2         64
3         16,384
4         1,073,741,824
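The count follows from the fact that a depth-T policy tree has (|Z|^T − 1)/(|Z| − 1) nodes, each of which independently chooses one of |A| actions. A quick sketch reproducing the table (assumes |Z| ≥ 2):

```python
def num_policy_trees(num_actions, num_obs, horizon):
    """|A| ** ((|Z|**T - 1) // (|Z| - 1)); assumes |Z| >= 2."""
    nodes = (num_obs ** horizon - 1) // (num_obs - 1)
    return num_actions ** nodes

counts = [num_policy_trees(4, 2, t) for t in range(5)]
# [1, 4, 64, 16384, 1073741824]
```

The doubly exponential growth in the horizon is why exhaustive enumeration is hopeless and pruning is essential.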
Policy graph
[Figure: a finite-state policy graph for a 2-state problem – nodes labeled with actions a0, a1, a2, transitions labeled with observations z0, z1 – together with the value function V(b) induced by its nodes]
Policy iteration for POMDPs
Sondik's (1978) algorithm represents the policy as a mapping from belief states to actions: only works under special assumptions, very difficult to implement, never used
Hansen's (1998) algorithm represents the policy as a finite-state controller: fully general, easy to implement, faster than value iteration
Properties of policy iteration
Theoretical:
Monotonically improves the finite-state controller
Converges to an ε-optimal finite-state controller after a finite number of iterations
Empirical:
Runs from 10 to over 100 times faster than value iteration
Scaling up
State abstraction and factored representation
Belief compression
Forward search and sampling approaches
Hierarchical task decomposition
State abstraction and factored representation of POMDP
DP algorithms are typically state-based; most AI representations are "feature-based"
|S| is typically exponential in the number of features (or variables) – the "curse of dimensionality"
State-based representations for problems with more than a few variables are impractical
Factored representations exploit regularities in the transition probabilities, observation probabilities, and reward
Example: Part-painting problem [Draper, Hanks, Weld 1994]
Boolean state variables: flawed (FL), blemished (BL), painted (PA), processed (PR), notified (NO)
Actions: Inspect, Paint, Ship, Reject, Notify
Cost function: cost of 1 for each action; cost of 1 for shipping an unflawed part that is not painted; cost of 10 for shipping a flawed part or rejecting an unflawed part
Initial belief state: Pr(FL) = 0.3, Pr(BL|FL) = 1.0, Pr(BL|¬FL) = 0.0, Pr(PA) = 0.0, Pr(PR) = 0.0, Pr(NO) = 0.0
Factored representation of MDP[Boutilier et al. 1995; Hoey, St. Aubin, Hu, & Boutilier 1999]
Dynamic Bayesian network captures variable independence Algebraic decision diagram captures value independence
[Figure: dynamic Bayesian network linking each variable to its primed successor, with probability tables, e.g.:
  FL   FL′
  T    1.0
  F    0.0

  PA   SH   RE   NO   PA′
  T    T/F  T/F  T/F  1.0
  F    F    F    F    0.95
  F    T    T/F  T/F  0.0
  F    T/F  T    T/F  0.0
  F    T/F  T/F  T    0.0
and the equivalent decision diagrams]
Decision diagrams
[Figure: a binary decision diagram (BDD) over variables X, Y, Z with terminals TRUE / FALSE, and algebraic decision diagrams (ADDs) with numeric terminals such as 5.8, 3.6 and 18.6, 9.5]
Operations: addition (subtraction), multiplication (division), minimum (maximum), marginalization, expected value
Complexity of the operators depends on the size of the decision diagrams, not the number of states!
Operations on decision diagrams
[Figure: adding two ADDs node by node – one with terminals 10.0, 20.0, 30.0 and one with terminals 1.0, 2.0, 3.0, 33.0 – yields an ADD with terminals 11.0, 12.0, 22.0, 23.0]
Symbolic dynamic programming for factored POMDPs [Hansen & Feng 2000]
Factored representation of value function: replace |S|-vectors with ADDs that only make relevant state distinctions
Two steps of DP algorithm Generate new ADDs for value function Prune dominated ADDs
State abstraction is based on aggregating states with the same value
[Figure: DP backup for a factored POMDP – the action at time t, transition probabilities, observation probabilities for obs1, obs2, obs3, and the action reward combine to give the value at time t+1]
Generation step: symbolic implementation
Pruning step: symbolic implementation
Pruning is the most computationally expensive part of the algorithm: a linear program must be solved for each (potential) ADD in the value function. Because state abstraction reduces the dimensionality of the linear programs, it significantly improves efficiency.
Improved performance
Degree of abstraction = (number of abstract states) / (number of primitive states)
Speedup factors:
Test problem   Degree of abstraction   Generate   Prune
1              0.01                    42         26
2              0.03                    17         11
3              0.10                    0.4        3
4              0.12                    0.8        0.6
5              0.44                    -3.4       0.4
6              0.65                    -0.7       0.1
7              1.00                    -6.5       -0.1
Optimal plan (controller) for part-painting problem
[Figure: finite-state controller using actions Inspect, Paint, Ship, Reject, Notify; transitions labeled OK / ~OK; nodes annotated with abstract states over the variables FL, BL, PA, PR, NO]
Approximate state aggregation
ε = 0.4
Simplify each ADD in the value function by merging leaves that differ in value by less than ε.
Approximate pruning
Prune vectors from the value function that add less than ε to the value of any belief state
[Figure: vectors a1–a4 over the belief interval from (0,1) to (1,0)]
Error bound
These two methods of approximation share the same error bound
"Weak convergence," i.e., convergence to within 2ε/(1−γ) of optimal (where γ is the discount factor)
After "weak convergence," decreasing ε allows further improvement
Starting with a relatively high ε and gradually decreasing it accelerates convergence
Strategy: ignore differences of value less than some threshold ε
Complementary methods: approximate state aggregation and approximate pruning address the two sources of complexity – the size of the state space and the size of the value function (memory)
Approximate dynamic programming
Belief compression
Reduce the dimensionality of belief space by approximating the belief state
Examples of approximate belief states:
tuple of most-likely state plus entropy of the belief state [Roy & Thrun 1999]
belief features learned by exponential-family Principal Components Analysis [Roy & Gordon 2003]
Standard POMDP algorithms can be applied in the lower-dimensional belief space, e.g., grid-based approximation
Forward search
[Figure: search tree rooted at the current belief, branching first on actions a0, a1, then on observations z0, z1, and so on to a fixed depth]
Sparse sampling
Forward search can be combined with Monte Carlo sampling of possible observations and action outcomes [Kearns et al 2000; Ng & Jordan 2000]
Remarkably, the complexity is independent of the size of the state space!
An on-line planner selects an ε-optimal action for the current belief state
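A depth-limited sampled lookahead over the belief-state MDP can be sketched as follows. Here `step` stands in for a generative model that samples a reward and successor belief for (b, a); all names are our own illustration, not the algorithm of Kearns et al.

```python
def sampled_value(b, depth, actions, step, gamma, width):
    """Estimate V(b) by depth-limited lookahead, sampling `width`
    successors per action instead of enumerating all outcomes."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(width):
            r, b2 = step(b, a)  # one sample from the generative model
            total += r + gamma * sampled_value(b2, depth - 1, actions,
                                               step, gamma, width)
        best = max(best, total / width)
    return best
```

The search tree has (|A| × width)^depth leaves regardless of |S|, which is the source of the state-space-independence claim above; the price is an estimate that is only ε-accurate with high probability.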
State-space decomposition
For some POMDPs, each action/observation pair identifies a specific region of the state space
Motivating Example Continued
A "deterministic observation" reveals that the world is in one of a small number of possible states
The same holds for "hybrid POMDPs", which are POMDPs with some fully observable and some partially observable state variables
Region-based dynamic programming
Tetrahedron and surfaces
Hierarchical task decomposition
We have considered abstraction in state space; now we consider abstraction in action space
For fully observable MDPs:
Options [Sutton, Precup & Singh 1999]
HAMs [Parr & Russell 1997]
Region-based decomposition [Hauskrecht et al 1998]
MAXQ [Dietterich 2000]
A hierarchical approach may cause sub-optimality, but limited forms of optimality can be guaranteed: hierarchical optimality (Parr and Russell); recursive optimality (Dietterich)
Hierarchical approaches to POMDPs
Theocharous & Mahadevan (2002): based on hierarchical hidden Markov models; approximation; ~1000-state robot hallway-navigation problem
Pineau et al (2003): based on Dietterich's MAXQ decomposition; approximation; ~1000-state robot navigation and dialogue
Hansen & Zhou (2003): also based on Dietterich's MAXQ decomposition; convergence guarantees and ε-optimality
Macro action as finite-state controller
Allows exact modeling of macro’s effects macro state transition probabilities macro rewards
[Figure: a navigation macro as a finite-state controller over the actions North, South, East, West, Stop, with transitions labeled wall / clear / goal]
Taxi example [Dietterich 2000]
Task hierarchy [Dietterich 2000]
[Figure: the Taxi task decomposes into Get and Put; Get into Pickup and Navigate, Put into Putdown and Navigate; Navigate into the primitive actions North, South, East, West]
Hierarchical finite-state controller
[Figure: the controllers for Get and Put invoke sub-controllers for Navigate, Pickup, and Putdown, which bottom out in the primitive actions North, South, East, West, Stop]
MAXQ-hierarchical policy iteration
Create an initial sub-controller for each sub-POMDP in the hierarchy
Repeat until the error bound is less than ε:
Identify the subtask that contributes most to the overall error
Use policy iteration to improve the corresponding controller
For each node of the controller, create an abstract action (for the parent task) and compute its model
Propagate error up through the hierarchy
Modular structure of controller
Complexity reduction
Per-iteration complexity of policy iteration: |A| |Q|^|Z|, where A is the set of actions, Z the set of observations, and Q the set of controller nodes
Per-iteration complexity of hierarchical PI: |A| Σᵢ |Qᵢ|^|Z|, where |Q| = Σᵢ |Qᵢ|
With hierarchical decomposition, the complexity is a sum over the subproblems instead of a product
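The reduction is easy to see numerically. A small illustration with made-up sizes (|A| = 4, |Z| = 2, a 30-node controller split into three 10-node sub-controllers; these numbers are our own, not from the tutorial):

```python
def flat_cost(num_actions, num_nodes, num_obs):
    """Per-iteration cost of flat policy iteration: |A| * |Q|**|Z|."""
    return num_actions * num_nodes ** num_obs

def hier_cost(num_actions, sub_sizes, num_obs):
    """Hierarchical PI: |A| * sum_i |Q_i|**|Z|, with |Q| = sum_i |Q_i|."""
    return num_actions * sum(q ** num_obs for q in sub_sizes)

flat = flat_cost(4, 30, 2)            # 4 * 900 = 3600
hier = hier_cost(4, [10, 10, 10], 2)  # 4 * 300 = 1200
```

The gap widens quickly as |Z| or the controller size grows, since the flat cost is polynomial of degree |Z| in the full |Q| while the hierarchical cost is that polynomial applied only to each piece.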
Scalability
MAXQ-hierarchical policy iteration can solve any POMDP, provided it can decompose it into sub-POMDPs that can be solved by policy iteration
Although each sub-controller is limited in size, the hierarchical controller is not limited in size
Although the (abstract) state space of each subtask is limited in size, the total state space is not limited in size
Multi-Agent Planning with POMDPs
Partially observable stochastic games
Generalized dynamic programming
Multi-Agent Planning with POMDPs
Many planning problems involve multiple agents acting in a partially observable environment
The POMDP framework can be extended to address this
[Figure: two agents interact with the world – agent 1 takes action a1 and receives observation z1 and reward r1; agent 2 takes action a2 and receives z2 and r2]
Partially observable stochastic game (POSG)
A POSG is <S, A1, A2, Z1, Z2, P, r1, r2>, where
S is a finite state set, with initial state s0
A1, A2 are finite action sets
Z1, Z2 are finite observation sets
P(s′|s, a1, a2) is the state transition function
P(z1, z2|s, a1, a2) is the observation function
r1(s, a1, a2) and r2(s, a1, a2) are reward functions
Special cases: all agents share the same reward function; zero-sum games
Plans and policies
A local policy is a mapping πᵢ : Zᵢ* → Aᵢ
A joint policy is a pair <π1, π2>
Each agent wants to maximize its own long-term expected reward
Although execution is distributed, planning can be centralized
Beliefs in POSGs
With a single agent, a belief is a distribution over states
How does this generalize to multiple agents?
Could have beliefs over beliefs over beliefs, but there is no algorithm for working with these
Example
[Figure: two robots on a grid]
States: grid cell pairs
Actions: [four movement directions]
Transitions: noisy
Goal: pick up balls
Observations: red lines (see figure)
Another Example
States: who has a message to send?
Actions: send or don't send
Reward: +1 for a successful broadcast, 0 if collision or channel not used
Observations: was there a collision? (noisy)
[Figure: two nodes with message buffers sharing a broadcast channel]
Strategy Elimination in POSGs
Could simply convert to normal form, but the number of strategies is doubly exponential in the horizon length
[Figure: normal-form payoff matrix with entries (R11¹, R11²) … (R1n¹, R1n²), …, (Rm1¹, Rm1²) … (Rmn¹, Rmn²)]
Generalized dynamic programming
Initialize 1-step policy trees to be actions
Repeat:
Evaluate all pairs of t-step trees from the current sets
Iteratively prune dominated policy trees
Form exhaustive sets of (t+1)-step trees from the remaining t-step trees
What Generalized DP Does
The algorithm performs iterated elimination of dominated strategies in the normal form game without first writing it down
For cooperative POSGs, the final sets contain the optimal joint policy
Some Implementation Issues
As before, pruning can be done using linear programming
The algorithm keeps the value function and policy trees in memory (unlike the POMDP case)
Currently there is no way to prune in an incremental fashion
A Better Way to Do Elimination
We use dynamic programming to eliminate dominated strategies without first converting to normal form
Pruning a subtree eliminates the set of trees containing it
[Figure: pruning one two-step policy tree eliminates every larger tree that contains it as a subtree]
Dynamic Programming
Build the policy tree sets simultaneously
Prune using a generalized belief space
[Figure: agent 1's generalized belief space is over pairs (system state, agent 2's policy tree); agent 2's is over pairs (system state, agent 1's policy tree)]
Dynamic Programming
[Figure: each agent starts with the one-step trees a1 and a2]
Dynamic Programming
[Figure: all two-step policy trees for both agents – each root action combined with each assignment of next actions to observations o1, o2]
Dynamic Programming
[Figure: dominated trees are pruned from agent 1's set]
Dynamic Programming
[Figure: dominated trees are pruned from agent 2's set]
Dynamic Programming
[Figure: pruning alternates between the agents, shrinking both sets until no more trees are dominated]
Complexity of POSGs
The cooperative finite-horizon case is NEXP-hard, even with two agents whose observations completely determine the state [Bernstein et al. 2002]
Implications:
The problem is provably intractable (because P ≠ NEXP)
It probably requires doubly exponential time to solve in the worst case