Advances in Multiagent Decision Making under Uncertainty

Transcript of "Advances in Multiagent Decision Making under Uncertainty" (77 slides)

Page 1

Advances in Multiagent Decision Making under Uncertainty

Frans A. Oliehoek

Maastricht University

Coauthors: Matthijs Spaan (TUD), Shimon Whiteson (UvA), Nikos Vlassis (U. Luxembourg), Jilles Dibangoye (INRIA), Chris Amato (MIT)

Page 2

Dynamics, Decisions & Uncertainty: Why care about formal decision making?

Page 3

Uncertainty

Outcome uncertainty

Partial observability

Multiagent systems: uncertainty about others

Page 4

Outline

Background: sequential decision making

Optimal solutions of Decentralized POMDPs [JAIR'13]: incremental clustering, incremental expansion; sufficient plan-time statistics [IJCAI'13]

Other/current work: exploiting structure [AAMAS'13]

Multiagent RL under uncertainty [MSDM'13]

Page 5

Background: sequential decision making

Page 6

Single-Agent Decision Making: Background on MDPs & POMDPs

An MDP is a tuple ⟨S, A, PT, R, h⟩:
S – set of states
A – set of actions
PT – transition function P(s'∣s,a)
R – reward function R(s,a)
h – horizon (finite)

A POMDP adds observations: ⟨S, A, PT, O, PO, R, h⟩
O – set of observations
PO – observation function P(o∣a,s')

Page 7

Example: Predator-Prey Domain

1 agent: the predator; the prey is part of the environment

Formalization:
states – e.g. (-3,4)
actions – N, W, S, E
transitions – failing to move, prey moves
rewards – reward for capturing

Page 8

Example: Predator-Prey Domain

Markov decision process (MDP)
► Markovian state s... (which is observed!)
► policy π maps states → actions
► Value function Q(s,a)
► Compute via value iteration / policy iteration

Q(s,a) = R(s,a) + γ ∑s' P(s'∣s,a) maxa' Q(s',a')
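To make the backup above concrete, here is a minimal tabular value-iteration sketch (not part of the slides). The data layout is assumed: S and A are lists, T[s][a] is a dict of successor probabilities, R[s][a] is a scalar, and gamma is the discount factor.

```python
# Minimal tabular value iteration for the Bellman backup above (a sketch, not from the
# slides). Assumed layout: S, A are lists; T[s][a] is a dict {s2: prob}; R[s][a] is a scalar.

def value_iteration(S, A, T, R, gamma=0.95, tol=1e-6):
    Q = {(s, a): 0.0 for s in S for a in A}
    while True:
        delta = 0.0
        for s in S:
            for a in A:
                q_new = R[s][a] + gamma * sum(
                    p * max(Q[(s2, a2)] for a2 in A) for s2, p in T[s][a].items())
                delta = max(delta, abs(q_new - Q[(s, a)]))
                Q[(s, a)] = q_new
        if delta < tol:
            return Q   # greedy policy: pi(s) = argmax_a Q[(s, a)]
```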

Page 9

Partial Observability

Now: partial observability, e.g., limited range of sight

MDP + observations: explicit observations, observation probabilities; noisy observations (detection probability)

o = 'nothing'

Page 10

o = (−1,1)

Page 11

Cannot observe the state → need to maintain a belief over states b(s) → policy maps beliefs to actions: π(b) = a
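The belief b(s) is maintained by a Bayes update after each action and observation. A minimal sketch under an assumed data layout (T[s][a] a dict of successor probabilities, Obs[a][s2] a dict of observation probabilities); this is illustrative, not the slides' code:

```python
# Bayes update of a POMDP belief: b'(s') ∝ P(o | a, s') * Σ_s P(s' | s, a) b(s).
# Sketch only; assumed layout: b is {s: prob}, T[s][a] is {s2: prob}, Obs[a][s2] is {o: prob}.

def belief_update(b, a, o, S, T, Obs):
    b_new = {}
    for s2 in S:
        pred = sum(T[s][a].get(s2, 0.0) * b[s] for s in b)    # prediction step
        b_new[s2] = Obs[a][s2].get(o, 0.0) * pred             # correction step
    norm = sum(b_new.values())
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief and action")
    return {s2: p / norm for s2, p in b_new.items()}
```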


Page 12

Multiple Agents

Multiple agents, fully observable

Formalization:
states – e.g. ((3,-4), (1,1), (-2,0))
actions – {N,W,S,E}
joint actions – {(N,N,N), (N,N,W), ..., (E,E,E)}
transitions – probability of failing to move, prey moves
rewards – reward for capturing jointly

Can coordinate based upon the state → reduction to a single 'puppeteer' agent that takes joint actions

Page 13

Multiple Agents & Partial Observability

Dec-POMDP [Bernstein et al. '02]

Reduction possible → MPOMDP (multiagent POMDP), but it requires broadcasting observations! Instantaneous, cost-free, noise-free communication → optimal [Pynadath and Tambe 2002]

Without such communication: no easy reduction.

Page 14

Acting Based On Local Observations

Acting on global information can be impractical:

communication may not be possible, or may have significant cost (e.g. battery power)
not instantaneous or noise-free
scales poorly with the number of agents!

Page 15

Formal Model

A Dec-POMDP is a tuple ⟨S, A, PT, O, PO, R, h⟩:
n agents
S – set of states
A – set of joint actions a = ⟨a1, a2, ..., an⟩
PT – transition function P(s'∣s,a)
O – set of joint observations o = ⟨o1, o2, ..., on⟩
PO – observation function P(o∣a,s')
R – reward function R(s,a)
h – horizon (finite)
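For later illustration it helps to have this tuple in code. A possible container (a sketch; the field layout is an assumption, not from the slides):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

JointAction = Tuple[str, ...]   # one action per agent
JointObs = Tuple[str, ...]      # one observation per agent

@dataclass
class DecPOMDP:
    n_agents: int
    S: List[str]                                               # states
    A: List[JointAction]                                       # joint actions
    O: List[JointObs]                                          # joint observations
    P_T: Callable[[str, JointAction], Dict[str, float]]        # P(s' | s, a)
    P_O: Callable[[JointAction, str], Dict[JointObs, float]]   # P(o | a, s')
    R: Callable[[str, JointAction], float]                     # R(s, a)
    h: int                                                     # finite horizon
    b0: Dict[str, float]                                       # initial state distribution
```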

Page 16

Running Example

2 generals problem: the enemy army may be small or large

Page 17

Running Example: 2 generals problem

S – { sL, sS }
Ai – { (O)bserve, (A)ttack }
Oi – { (L)arge, (S)mall }

Transitions
● Both Observe → no state change
● At least 1 Attack → reset (50% probability sL, sS)

Observations
● Probability of a correct observation: 0.85
● E.g., P(<L,L> | sL) = 0.85 * 0.85 = 0.7225

Rewards
● 1 general attacks → he loses the battle: R(*,<A,O>) = -10
● Both generals Observe → small cost: R(*,<O,O>) = -1
● Both Attack → depends on the state: R(sL,<A,A>) = -20, R(sS,<A,A>) = +5
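The slide's numbers can be written out directly with the DecPOMDP container sketched above. The horizon h = 3 and the uniform initial belief b0 are assumptions made for illustration; everything else follows the slide:

```python
from itertools import product

S = ["sL", "sS"]                          # enemy army: large / small
A_i = ["O", "A"]                          # per-agent: (O)bserve or (A)ttack
O_i = ["L", "S"]                          # per-agent observation: (L)arge or (S)mall
joint_A = [a for a in product(A_i, repeat=2)]
joint_O = [o for o in product(O_i, repeat=2)]

def P_T(s, a):
    if a == ("O", "O"):
        return {s: 1.0}                   # both Observe: no state change
    return {"sL": 0.5, "sS": 0.5}         # at least one Attack: reset, 50/50

def P_O(a, s2):
    correct = "L" if s2 == "sL" else "S"
    p1 = lambda oi: 0.85 if oi == correct else 0.15
    return {o: p1(o[0]) * p1(o[1]) for o in joint_O}   # e.g. P(<L,L>|sL) = 0.7225

def R(s, a):
    if a == ("O", "O"):
        return -1.0                       # both Observe: small cost
    if "O" in a:
        return -10.0                      # only one general attacks: he loses the battle
    return -20.0 if s == "sL" else 5.0    # both Attack: depends on the state

# h = 3 and the uniform b0 are assumptions for illustration only.
generals = DecPOMDP(n_agents=2, S=S, A=joint_A, O=joint_O,
                    P_T=P_T, P_O=P_O, R=R, h=3, b0={"sL": 0.5, "sS": 0.5})
```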

Page 18

Off-line / On-line phases

Off-line planning; on-line execution is decentralized.

(Smart generals make a plan in advance!)

Planning phase → execution phase: joint policy π = ⟨π1, π2⟩

Page 19

Goal of Planning

Find an optimal joint policy π∗ = ⟨π1, π2⟩, where each πi : O⃗i → Ai maps observation histories to actions.

Value: expected sum of rewards: V(π) = E[ ∑t=0..h−1 R(s,a) ∣ π, b0 ]

No compact representation...

The problem is NEXP-complete [Bernstein et al. 2002]
► Also for ε-approximate solutions! [Rabinovich et al. 2003]
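Although finding an optimal joint policy is NEXP-complete, evaluating a given deterministic joint policy is straightforward: push the joint distribution over (state, per-agent observation histories) forward for h stages and accumulate the expected reward. A sketch using the DecPOMDP container from above; pol[i] is an assumed dict mapping agent i's observation history (a tuple) to an action:

```python
# Exact evaluation of a deterministic joint policy for a Dec-POMDP (illustrative sketch).

def evaluate(m, pol):
    dist = {(s, tuple(() for _ in range(m.n_agents))): p for s, p in m.b0.items()}
    value = 0.0
    for _ in range(m.h):
        nxt = {}
        for (s, ohs), p in dist.items():
            a = tuple(pol[i][ohs[i]] for i in range(m.n_agents))       # joint action
            value += p * m.R(s, a)                                     # expected reward
            for s2, p_t in m.P_T(s, a).items():
                for o, p_o in m.P_O(a, s2).items():
                    key = (s2, tuple(ohs[i] + (o[i],) for i in range(m.n_agents)))
                    nxt[key] = nxt.get(key, 0.0) + p * p_t * p_o
        dist = nxt
    return value
```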

Page 20

Should we give up on optimality?

But we care about these problems...

Complexity is worst case → may be able to optimally solve important problems.

Optimal methods can provide insight into problems, serve as inspiration for approximate methods, and are needed for benchmarking: otherwise there are no usable upper bounds.

Page 21

Advances in Exact Planning Methods

Heuristic search + limitations
Interpret search-tree nodes as 'Bayesian Games'
Incremental clustering
Incremental expansion
Sufficient plan-time statistics

Page 22

Heuristic Search – 1

Incrementally construct all (joint) policies 'forward in time'

[figure: policy trees]

1 joint policy

Page 23

Heuristic Search – 1

Incrementally construct all (joint) policies 'forward in time'

1 partial joint policy

Start with an unspecified policy

Page 24

Heuristic Search – 1

Incrementally construct all (joint) policies 'forward in time'

1 partial joint policy (actions for stage t=0 assigned)

Page 25

Heuristic Search – 1

Incrementally construct all (joint) policies 'forward in time'

1 partial joint policy (actions for stages t=0 and t=1 assigned)

Page 26

Heuristic Search – 1

Incrementally construct all (joint) policies 'forward in time'

1 complete joint policy (full-length)

Page 27

Heuristic Search – 2

Creating ALL joint policies → tree structure!

Root node: an unspecified joint policy

Page 28

Heuristic Search – 2

Creating ALL joint policies → tree structure!

Creating a child node: assignment of actions at t=0

Page 29

Heuristic Search – 2

Creating ALL joint policies → tree structure!

Node expansion: create all children

Page 30

Heuristic Search – 2

Creating ALL joint policies → tree structure!

t=0

Page 31

Heuristic Search – 2

Creating ALL joint policies → tree structure!

... t=1

Next expansion: more children! Need to assign an action to 4 OHs now: 2^4 = 16 children.

Page 32

Heuristic Search – 2

Creating ALL joint policies → tree structure!

... t=2

Last stage: even more! Need to assign an action to 8 OHs now: 2^8 = 256 children (for each node at level 2!).

Page 33

Heuristic Search – 3

The tree is too big to create completely... Idea: use heuristics to avoid going down non-promising branches!

Apply A* → Multiagent A* (MAA*) [Szer et al. 2005]
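For intuition, a generic best-first-search skeleton in the spirit of MAA* is sketched below; it is not the authors' implementation. Nodes are partial joint policies carrying F = G + H (H an overestimate) and G; `expand` and `is_complete` are assumed helpers.

```python
import heapq, itertools

# Best-first search over partial joint policies, in the spirit of MAA* (illustrative sketch).
# expand(node) yields the node's children; is_complete(node) flags full-horizon joint policies.

def maa_star(root, expand, is_complete):
    tie = itertools.count()                            # tie-breaker so nodes are never compared
    open_list = [(-root.F, next(tie), root)]           # max-heap via negated F
    best_value, best_policy = float("-inf"), None
    while open_list:
        neg_F, _, node = heapq.heappop(open_list)
        if -neg_F <= best_value:                       # nothing left can beat the incumbent
            break
        for child in expand(node):
            if is_complete(child):                     # full-length policy: G is its true value
                if child.G > best_value:
                    best_value, best_policy = child.G, child
            elif child.F > best_value:                 # keep only potentially useful children
                heapq.heappush(open_list, (-child.F, next(tie), child))
    return best_policy, best_value
```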

Page 34

Heuristic Search – 4

Use heuristics: F(n) = G(n) + H(n)

G(n) – actual reward of reaching n: a node at depth t specifies φt (i.e., the actions for the first t stages) → can compute V(φt) over stages 0...t-1

H(n) – should overestimate! E.g., pretend that it is an MDP and compute
H(n) = H(φt) = ∑s P(s ∣ φt, b0) V̂MDP(s)

Page 35

Heuristics

QPOMDP: solve the 'underlying POMDP'; corresponds to immediate communication

QBG: corresponds to 1-step delayed communication

Hierarchy of upper bounds [Oliehoek et al. 2008]:
Q∗ ≤ Q̂kBG ≤ Q̂BG ≤ Q̂POMDP ≤ Q̂MDP

H(φt) = ∑θ⃗t P(θ⃗t ∣ φt, b0) V̂POMDP(bθ⃗t)

Page 36

MAA* Limitations

The number of children grows doubly exponentially with node depth.

For a node at the last stage, the number of children is O(∣A∗∣^(n ∣O∗∣^(h−1))).
The total number of joint policies is O(∣A∗∣^(n (∣O∗∣^h − 1)/(∣O∗∣ − 1))).

→ MAA* can only solve 1 horizon longer than brute-force search... [Seuken & Zilberstein '08]

We introduce methods to fix this.
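A quick sanity check of these growth formulas on the 2 generals problem (n = 2, |A*| = 2, |O*| = 2, h = 3), reproducing the 256 children mentioned a few slides back:

```python
# Sanity check of the growth formulas on the 2 generals problem (numbers implied by the slides).
n, A, O, h = 2, 2, 2, 3
children_last_stage = A ** (n * O ** (h - 1))              # 2^(2*4) = 256 children
total_joint_policies = A ** (n * (O ** h - 1) // (O - 1))  # 2^(2*7) = 16384 joint policies
print(children_last_stage, total_joint_policies)
```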

Page 37

Collaborative Bayesian Games

agents, actions
types θi ↔ histories
probabilities: P(θ)
payoffs: Q(θ,a)

[figure: example payoff matrix over the agents' types (large/small) and actions (A/O)]

Page 38

MAA* via Bayesian Games

Each node ↔ a φt, i.e., a decision problem for stage t

Page 39

MAA* via Bayesian Games – 2

MAA* perspective:
node ↔ φt
joint decision rule δ maps OHs to actions
Expansion: appending all next-stage decision rules: φt+1=(φt,δt)

BG perspective:
node ↔ a BG
joint BG policy β maps 'types' to actions
Expansion: enumeration of all joint BG policies: φt+1=(φt,βt)

Direct correspondence: δ ↔ β

Page 40

MAA* via Bayesian Games – 2

What is the point?
► Generalized MAA* [Oliehoek & Vlassis '07]
► Unified perspective of MAA* and the 'BAGA' approximation [Emery-Montemerlo et al. '04]
► No direct improvements...

However...
► BGs provide an abstraction layer
► Facilitated two improvements that lead to state-of-the-art performance [Oliehoek et al. '13]
  • Clustering of histories
  • Incremental expansion

Page 41

The Decentralized Tiger Problem

Two agents in a hallway

States: tiger left (sL) or right (sR)
Actions: listen, open left, open right
Observations: hear left (HL), hear right (HR)

<Listen,Listen>: 85% prob. of getting the right observation, e.g. P(<HL,HL> | <Li,Li>, sL) = 0.85 * 0.85 = 0.7225; otherwise: uniform random

Reward: get the reward; acting jointly is better

Page 42

Lossless Clustering

Two types (= action-observation histories) θ⃗i,a and θ⃗i,b in a BG are probabilistically equivalent iff

P(θ⃗−i ∣ θ⃗i,a) = P(θ⃗−i ∣ θ⃗i,b)
P(s ∣ θ⃗−i, θ⃗i,a) = P(s ∣ θ⃗−i, θ⃗i,b)

Note: φt, b0 are implicit (in the conditioning).

Page 43

Page 44

Lossless Clustering

Clustering is lossless: restricting the policy space to clustered policies does not sacrifice optimality.
► histories are best-response equivalent
► if the criterion holds → same 'multiagent belief' bi(s, q−i)
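A sketch of the probabilistic-equivalence test (illustrative only): given the joint distribution P(s, θ⃗ ∣ b0, φt) as a dict, two histories of agent i may be merged iff they induce the same conditional distribution over (θ⃗−i, s). The data layout is an assumption, not from the slides.

```python
# joint maps (s, theta) -> P(s, theta | b0, phi_t), with theta a tuple of per-agent histories.

def prob_equivalent(joint, i, hist_a, hist_b, tol=1e-9):
    def conditional(hist_i):
        # P(theta_{-i}, s | theta_i = hist_i)
        cond, total = {}, 0.0
        for (s, theta), p in joint.items():
            if theta[i] == hist_i:
                key = (theta[:i] + theta[i + 1:], s)
                cond[key] = cond.get(key, 0.0) + p
                total += p
        return {k: v / total for k, v in cond.items()} if total > 0.0 else {}
    ca, cb = conditional(hist_a), conditional(hist_b)
    return all(abs(ca.get(k, 0.0) - cb.get(k, 0.0)) <= tol for k in set(ca) | set(cb))
```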

Page 45

Incremental Clustering

No need to cluster from scratch: probabilistic equivalence 'extends forwards', i.e., identical extensions of two PE histories are also PE.

→ can bootstrap from the CBG of the previous stage: 'incremental clustering' (B(φ0) → B(φ1) → B(φ2) → ...)

Page 46

Incremental Expansion

Key idea: nodes have many children, but only few are useful, i.e., only few will be selected for further expansion; the others have too low a heuristic value.

If we can generate the nodes in decreasing heuristic order → can avoid the expansion of redundant nodes.
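The bookkeeping can be sketched as follows (an illustration, not the paper's code): when a node is popped, only its next-best child is generated, and the parent goes back on the open list as a placeholder valued at that child's F, since any remaining child can only be worse. `next_best_child` stands in for the incremental CBG solver described a few slides later.

```python
import heapq, itertools

# Open-list entries are (-F, tie, node); next_best_child(node) is an assumed helper that
# returns the node's highest-F child that has not been generated yet (or None).

def expand_incrementally(open_list, tie, next_best_child):
    _, _, node = heapq.heappop(open_list)              # most promising node
    child = next_best_child(node)
    if child is not None:
        heapq.heappush(open_list, (-child.F, next(tie), child))
        # reinsert the parent as a placeholder at the child's F: its remaining,
        # not-yet-generated children can only have F <= child.F
        heapq.heappush(open_list, (-child.F, next(tie), node))
    return child

# usage: tie = itertools.count(); open_list = [(-root.F, next(tie), root)]; call repeatedly
```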

Page 47

Incremental Expansion

Node a, F=7

Open list: a – 7

Page 48

Incremental Expansion

Node a, F=7

Open list: a – 7

Select for expansion →

Page 49

Incremental Expansion

Nodes: a, F=7; new child b, F=6

Open list: b – 6

1) best child has F=6

Page 50

Incremental Expansion

Nodes: a, F=6; b, F=6

Open list: b – 6, a – 6

1) best child has F=6
2) reinsert parent as placeholder (with F=6)

Page 51

Incremental Expansion

Nodes: a, F=6; b, F=6

Open list: b – 6, a – 6

Select for expansion →

Page 52

Incremental Expansion

Nodes: a, F=6; b, F=4; new child c, F=4

Open list: a – 6, c – 4, b – 4

Page 53

Incremental Expansion

Nodes: a, F=5.5; new child d, F=5.5; b, F=4; c, F=4

Open list: d – 5.5, a – 5.5, c – 4, b – 4

Page 54

Incremental Expansion

Nodes: a, F=5.5; d, F=5.5; b, F=4; c, F=4 (additional children shown: F=-10, F=-41)

Open list: d – 5.5, a – 5.5, c – 4, b – 4

Page 55

Incremental Expansion: How?

How do we generate the next-best child?

Node ↔ BG, so... find the solutions of the BG in decreasing order of value, i.e., an 'incremental BG solver'.

Modification of BaGaBaB [Oliehoek et al. 2010]: stop searching when the next solution is found; save the search tree for the next time the node is visited.

Nested A*!

Page 56

Results

GMAA*-ICE can solve higher horizons than listed.

Incremental expansion complements incremental clustering.

Page 57

Results

[results table: cases that compress well; GMAA* vs GMAA*-ICE; * excluding heuristic]

Page 58

Sufficient Plan-Time Statistics [Oliehoek 2013]

The optimal decision rule depends on the past joint policy φt → search tree.

In fact, it is possible to give an expression for the optimal value function based on φt [Oliehoek et al. 2008].

Recent insight: a reformulation based on a sufficient statistic
→ compact formulation of Q*
→ search tree becomes a DAG ("sufficient statistic-based pruning")

Page 59

Optimal Value Functions

2 parts: value propagation and value optimization.

Page 60

Optimal Value Functions

Value propagation, last stage t = h−1:

Q∗(φh−1, θ⃗h−1, δh−1) = R(θ⃗h−1, δh−1(θ⃗h−1))

(arguments: past joint policy, joint AOH, decision rule; value: the expected reward)

where a joint decision rule acts per agent: δt(θ⃗t) = ⟨δ1t(θ⃗1t), ..., δnt(θ⃗nt)⟩

Page 61

Optimal Value Functions

Value propagation, t < h−1:

Q∗(φt, θ⃗t, δt) = R(θ⃗t, δt(θ⃗t)) + ∑o P(o ∣ θ⃗t, δt(θ⃗t)) Q∗(φt+1, θ⃗t+1, δ∗t+1),  with φt+1 = (φt, δt)

Page 62

Optimal Value Functions

Value optimization (requires a 'stage-wise' maximization):

δ∗t+1 = argmaxδt+1 ∑θ⃗t+1 P(θ⃗t+1 ∣ b0, φt+1) Q∗(φt+1, θ⃗t+1, δt+1)

Page 63

Optimal Value Functions

Q vs V? → we can interpret this as a 'plan-time' MDP:
► state: φ
► actions: δ

Q∗(φt, δt) = ∑θ⃗t P(θ⃗t ∣ b0, φt) Q∗(φt, θ⃗t, δt)

V(φt) = maxδt Q∗(φt, δt)

Page 64

Page 65

Page 66

Page 67

Optimal Value Functions

But: the dependence on b0 and φ is only through the probability term P(θ⃗t+1 ∣ b0, φt+1)!

Page 68

Sufficient Statistic – 1

Value propagation:

Q∗(σt, θ⃗t, δt) = R(θ⃗t, δt(θ⃗t)) + ∑o P(o ∣ θ⃗t, δt(θ⃗t)) Q∗(σt+1, θ⃗t+1, δ∗t+1)

Value optimization:

δ∗t+1 = argmaxδt+1 ∑θ⃗t+1 σt+1(θ⃗t+1) Q∗(σt+1, θ⃗t+1, δt+1)

Page 69

Sufficient Statistic – 1

Limited use: every deterministic past joint policy induces a different σ!

Page 70

Sufficient Statistic – 2

use: σt(s, o⃗t)

Page 71

Sufficient Statistic – 2

use: σt(s, o⃗t)
► substitute AOH → OH
► but then also → adapt R(..) and P(o|...)

Page 72

Sufficient Statistic – 2

Value propagation:

Q∗(σt, o⃗t, δt) = R(σt, o⃗t, δt) + ∑o P(o ∣ σt, o⃗t, δt) Q∗(σt+1, o⃗t+1, δ∗t+1)

Value optimization:

δ∗t+1 = argmaxδt+1 ∑o⃗t+1 σt+1(o⃗t+1) Q∗(σt+1, o⃗t+1, δt+1)
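Pushing the statistic forward one stage follows the Dec-POMDP dynamics under the chosen decision rule. A sketch with an assumed data layout, consistent with the earlier snippets: sigma is a dict over (state, joint observation history), and delta maps a joint observation history to a joint action.

```python
# Advancing the plan-time statistic one stage:
#   sigma_{t+1}(s', oh + (o,)) += P(o | a, s') * P(s' | s, a) * sigma_t(s, oh),  a = delta(oh).
# P_T and P_O are as in the DecPOMDP container sketched earlier.

def advance_statistic(sigma, delta, P_T, P_O):
    nxt = {}
    for (s, oh), p in sigma.items():
        a = delta(oh)                                  # joint action dictated by the decision rule
        for s2, p_t in P_T(s, a).items():
            for o, p_o in P_O(a, s2).items():
                key = (s2, oh + (o,))
                nxt[key] = nxt.get(key, 0.0) + p * p_t * p_o
    return nxt
```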

Page 73

Results – 1

Reduction in the size of Q*

Page 74

Sufficient statistic-based pruning

[figure: search tree before pruning]

Page 75

Sufficient statistic-based pruning

[figure: search tree collapsed over statistics σ0, σ1, σ2]

Now: many φ ↔ the same σ.

GMAA*-ICE with SSBP: perform GMAA*-ICE, but at each node compute σ; if a node has the same σ as one seen before but a lower G-value → prune the branch.

Page 76

Results – 2

Speed-up of GMAA*-ICE due to SSBP

Promising, but does not address the current bottleneck...

Page 77

References

Most references can be found in:
Frans A. Oliehoek. Decentralized POMDPs. In Wiering, Marco and van Otterlo, Martijn, editors, Reinforcement Learning: State of the Art, Adaptation, Learning, and Optimization, pp. 471–503, Springer Berlin Heidelberg, Berlin, Germany, 2012.

Other:
Dibangoye, Amato, Buffet, & Charpillet. Optimally Solving Dec-POMDPs as Continuous-State MDPs. IJCAI, 2013.
Oliehoek, Spaan, Amato, & Whiteson. Incremental Clustering and Expansion for Faster Optimal Planning in Decentralized POMDPs. JAIR, 2013.
Oliehoek. Sufficient Plan-Time Statistics for Decentralized POMDPs. IJCAI, 2013.