Multiagent Planning with Factored MDPs
Carlos Guestrin
Stanford University
Collaborative Multiagent Planning

Domains: search and rescue, factory management, supply chain, firefighting, network routing, air traffic control.

Long-term goals + multiple agents + coordinated decisions ⇒ collaborative multiagent planning.
Exploiting Structure
Real-world problems have:
Hundreds of objects
Googols of states

Real-world problems have structure!
Approach: Exploit structured representation to obtain efficient approximate solution
Real-time Strategy Game

[Figure: game screenshot labeling a peasant, a footman, and a building.]

Peasants collect resources and build
Footmen attack enemies
Buildings train peasants and footmen
Joint Decision Space

Markov Decision Process (MDP) representation:
State space: joint state x of the entire system
Action space: joint action a = {a1, …, an} for all agents
Reward function: total reward R(x, a)
Transition model: dynamics of the entire system P(x' | x, a)
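To make the joint representation concrete, here is a minimal Python sketch of an MDP whose states and actions are tuples over variables and agents; the class and field names are illustrative, not from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[str, ...]    # joint state x: one component per state variable
Action = Tuple[str, ...]   # joint action a = (a1, ..., an): one per agent

@dataclass
class JointMDP:
    states: List[State]            # enumerable joint states (tiny problems only)
    actions: List[Action]          # enumerable joint actions
    reward: Callable[[State, Action], float]                    # R(x, a)
    transition: Dict[Tuple[State, Action], Dict[State, float]]  # P(x' | x, a)
    gamma: float = 0.95            # discount factor
```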
Policy

Policy: π(x) = a. At state x, joint action a for all agents.

π(x0) = both peasants get wood
π(x1) = one peasant gets gold, the other builds the barracks
π(x2) = peasants get gold, footmen attack
Value of Policy

Value V_π(x): the expected long-term reward obtained by starting from state x and following policy π. Starting from x0:

$$V_\pi(x_0) = E\left[R(x_0) + \gamma R(x_1) + \gamma^2 R(x_2) + \gamma^3 R(x_3) + \gamma^4 R(x_4) + \cdots\right]$$

Future rewards are discounted by $\gamma \in [0, 1)$.

[Figure: a trajectory tree starting from x0; at each state the policy π(x) picks the joint action, the world transitions stochastically (e.g., to x1, x1', or x1''), and each visited state contributes its discounted reward R(x).]
Optimal Long-term Plan

Optimal policy $\pi^*(x)$; optimal value function $V^*(x)$.

Bellman equations:

$$V^*(x) = \max_a \left[ R(x, a) + \gamma \sum_{x'} P(x' \mid x, a)\, V^*(x') \right]$$

$$Q^*(x, a) = R(x, a) + \gamma \sum_{x'} P(x' \mid x, a)\, V^*(x')$$

Optimal policy: $$\pi^*(x) = \arg\max_a Q^*(x, a)$$
Solving an MDP

Many algorithms solve the Bellman equations:
Policy iteration [Howard ’60, Bellman ‘57]
Value iteration [Bellman ‘57]
Linear programming [Manne ’60]
…

Solving the Bellman equation yields the optimal value V*(x) and the optimal policy π*(x).
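As one example of such a solver, a minimal value-iteration sketch over the JointMDP above, applying the Bellman backup until convergence; this is a textbook routine under the assumed interface, not code from the talk.

```python
def value_iteration(mdp, tol=1e-6):
    """Iterate V(x) <- max_a [ R(x,a) + gamma * sum_x' P(x'|x,a) V(x') ]."""
    V = {x: 0.0 for x in mdp.states}
    while True:
        delta = 0.0
        for x in mdp.states:
            backup = max(
                mdp.reward(x, a)
                + mdp.gamma * sum(p * V[x2]
                                  for x2, p in mdp.transition[(x, a)].items())
                for a in mdp.actions)
            delta = max(delta, abs(backup - V[x]))
            V[x] = backup
        if delta < tol:
            return V
```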
LP Solution to MDP [Manne ’60]

Value computed by linear programming:

minimize: $$\sum_x V(x)$$
subject to: $$V(x) \ge Q(x, a) \quad \forall\, x, a$$

One variable V(x) for each state; one constraint for each state x and action a; polynomial-time solution.
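A sketch of this exact LP using scipy.optimize.linprog, assuming tabular numpy arrays R (states × actions) and P (states × actions × states); each constraint V(x) ≥ R(x,a) + γ Σ_x' P(x'|x,a) V(x') is rewritten in ≤ form.

```python
import numpy as np
from scipy.optimize import linprog

def mdp_lp(R, P, gamma):
    """R: (S, A) rewards; P: (S, A, S) transitions. Returns V* (length S)."""
    S, A = R.shape
    c = np.ones(S)                       # minimize sum_x V(x)
    A_ub, b_ub = [], []
    for x in range(S):
        for a in range(A):
            row = gamma * P[x, a] - np.eye(S)[x]   # gamma*P(.|x,a) - e_x
            A_ub.append(row)                       # row @ V <= -R(x,a)
            b_ub.append(-R[x, a])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * S)       # V may be negative
    return res.x
```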
Planning under Bellman’s “Curse”

Planning is polynomial in #states and #actions, but:
#states is exponential in the number of variables
#actions is exponential in the number of agents

⇒ Efficient approximation by exploiting structure!
Structure in Representation: Factored MDP [Boutilier et al. ’95]

State, dynamics, decisions, rewards.

[Figure: dynamic Bayesian network over time slices t and t+1, with state variables Peasant, Footman, Enemy, Gold (P', F', E', G'), action variables APeasant, ABuild, AFootman, and reward R. For example, P(F' | F, G, AB, AF).]

Complexity of representation: exponential in # parents (worst case).
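A sketch of how such a factored (DBN) transition model might be stored: one small conditional probability table per next-state variable, keyed only by that variable's parents, so the model size is exponential in the # of parents rather than the # of variables. The parent values and probabilities below are made up.

```python
import random

# P(F' | F, G, AB, AF): footman's next state given footman status, gold,
# build action, and footman action (illustrative entries only).
cpt_footman = {
    ("alive", "low", "wait", "attack"):  {"alive": 0.7, "dead": 0.3},
    ("alive", "high", "wait", "attack"): {"alive": 0.9, "dead": 0.1},
    # ... one entry per joint assignment of the parents only
}

def sample_next(cpt, parents, rng=random):
    """Sample a next value for one variable from its local CPT."""
    dist = cpt[parents]
    outcomes, probs = zip(*dist.items())
    return rng.choices(outcomes, weights=probs, k=1)[0]
```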
Factored MDP ⇒ Structure in V*?

[Figure: a DBN over variables X, Y, Z with reward R, unrolled over time steps t, t+1, t+2, t+3; as the network is unrolled, influence spreads across all variables, so V* is not exactly factored.]

Factored MDP ⇒ structure in V*? Almost! A structured V yields a good approximate value function.
Structured Value Functions

Linear combination of restricted-domain functions [Bellman et al. ‘63] [Tsitsiklis & Van Roy ’96] [Koller & Parr ’99,’00] [Guestrin et al. ’01]:

$$\tilde V(x) = \sum_i w_i h_i(x)$$

Each h_i is the status of a small part of the complex system, e.g.: state of footman and enemy; status of barracks; status of barracks and state of footman.

Structured V ⇒ structured Q:

$$\tilde Q = \sum_i Q_i$$

where each Q_i involves only a small # of A_i's and X_j's.

Must find w giving a good approximate value function.
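A sketch of this linear architecture: each basis function reads only a small subset of the state variables, so Ṽ(x) = Σ_i w_i h_i(x) is cheap to evaluate. The particular functions and scopes below are illustrative.

```python
def h_footman_enemy(x):    # scope: footman status and enemy status
    return 1.0 if x["footman"] == "alive" and x["enemy"] == "dead" else 0.0

def h_barracks(x):         # scope: barracks status only
    return 1.0 if x["barracks"] == "built" else 0.0

basis = [h_footman_enemy, h_barracks]

def v_approx(w, x):
    """V~(x) = sum_i w_i * h_i(x)."""
    return sum(wi * hi(x) for wi, hi in zip(w, basis))
```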
Approximate LP Solution [Schweitzer and Seidmann ‘85]

minimize: $$\sum_x \sum_i w_i h_i(x)$$
subject to: $$\sum_i w_i h_i(x) \ge \sum_i Q_i(a, x) \quad \forall\, x, a$$

One variable w_i for each basis function ⇒ polynomial number of LP variables.
One constraint for every state and action ⇒ exponentially many LP constraints.
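A brute-force sketch of this approximate LP for a toy tabular problem, writing Q(x,a) = R(x,a) + γ E[Ṽ(x') | x,a] so every constraint is linear in w. It still enumerates all state-action pairs; the factored representation on the next slides is what removes that enumeration.

```python
import numpy as np
from scipy.optimize import linprog

def approx_lp(states, actions, R, P, gamma, basis):
    """min sum_x V~(x)  s.t.  V~(x) >= R(x,a) + gamma * E[V~(x')|x,a],
    where V~(x) = sum_i w_i h_i(x); one LP variable per basis weight."""
    H = np.array([[h(x) for h in basis] for x in states])   # |X| x k
    c = H.sum(axis=0)                                       # objective on w
    A_ub, b_ub = [], []
    for ix, x in enumerate(states):
        for a in actions:
            # backprojection: g_i(x,a) = sum_x' P(x'|x,a) * h_i(x')
            g = sum(p * H[states.index(x2)] for x2, p in P[(x, a)].items())
            A_ub.append(gamma * g - H[ix])   # (gamma*g - h(x)) @ w <= -R(x,a)
            b_ub.append(-R(x, a))
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * len(basis))
    return res.x    # basis weights w
```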
Representing Exponentially Many Constraints [Guestrin, Koller, Parr ’01]

subject to: $$0 \ge \sum_i Q_i(a, x) - \sum_i w_i h_i(x) \quad \forall\, x, a$$

Exponentially many linear constraints = one nonlinear constraint:

$$0 \ge \max_{a,x} \left[\sum_i Q_i(a, x) - \sum_i w_i h_i(x)\right]$$

But this is a maximization over an exponential space.
Variable Elimination

Goal: compute $$\max_{a,x} \left[\sum_i Q_i(a, x) - \sum_i w_i h_i(x)\right]$$ without enumerating the joint space. Use variable elimination to maximize over the state space [Bertele & Brioschi ‘72]:

$$\max_{A,B,C,D}\; f_1(A,B) + f_2(A,C) + f_3(C,D) + f_4(B,D)$$
$$= \max_{A,B,C}\; f_1(A,B) + f_2(A,C) + \max_D \left[f_3(C,D) + f_4(B,D)\right]$$
$$= \max_{A,B,C}\; f_1(A,B) + f_2(A,C) + g_1(B,C)$$

[Figure: factor graph over variables A, B, C, D with factors f1(A,B), f2(A,C), f3(C,D), f4(B,D).]

Here we need only 23 instead of 63 sum operations.

The maximization is only exponential in the largest factor; tree-width characterizes the complexity. Tree-width is a graph-theoretic measure of “connectedness” that arises in many settings: integer programming, Bayes nets, computational geometry, …

For the structured value function, $$\max_{a_1,\dots,a_m,\, x_1,\dots,x_n} \sum_i Q_i - \sum_i w_i h_i$$ is tractable because each Q_i involves a small # of A_i's and X_j's, and each h_i a small # of X_j's.
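A compact max-sum variable-elimination sketch on exactly this four-factor example (binary variables, made-up factor values); eliminating D first produces the intermediate factor g1(B,C) from the slide.

```python
from itertools import product

def eliminate_max(factors, var):
    """Max out `var`: combine all factors mentioning it into one new factor."""
    touching = [f for f in factors if var in f["scope"]]
    rest = [f for f in factors if var not in f["scope"]]
    scope = sorted({v for f in touching for v in f["scope"]} - {var})
    table = {}
    for vals in product([0, 1], repeat=len(scope)):
        assign = dict(zip(scope, vals))
        table[vals] = max(
            sum(f["fn"]({**assign, var: d}) for f in touching) for d in (0, 1))
    return rest + [{"scope": scope,
                    "fn": lambda a, t=table, s=scope: t[tuple(a[v] for v in s)]}]

# f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D), illustrative values
factors = [
    {"scope": ["A", "B"], "fn": lambda a: 2.0 * a["A"] * a["B"]},
    {"scope": ["A", "C"], "fn": lambda a: 1.0 * (a["A"] ^ a["C"])},
    {"scope": ["C", "D"], "fn": lambda a: 3.0 * a["C"] * a["D"]},
    {"scope": ["B", "D"], "fn": lambda a: 1.5 * (a["B"] ^ a["D"])},
]
for v in ["D", "C", "B", "A"]:      # elimination order
    factors = eliminate_max(factors, v)
print(factors[0]["fn"]({}))         # max over all joint assignments
```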
Representing the Constraints

Use variable elimination to represent the constraints:

$$0 \ge \max_{a,x} \left[\sum_i Q_i(a, x) - \sum_i w_i h_i(x)\right]$$

On the example, $$0 \ge \max_{A,B,C,D}\; f_1(A,B) + f_2(A,C) + f_3(C,D) + f_4(B,D)$$ becomes

$$0 \ge \max_{A,B,C}\; f_1(A,B) + f_2(A,C) + g_1(B,C)$$
$$g_1(B,C) \ge \max_D\; f_3(C,D) + f_4(B,D)$$

with one linear constraint per assignment to each factor's scope. The number of constraints is exponentially smaller!
Understanding Scaling Properties

Number of constraints (k = tree-width): explicit LP 2^n; factored LP (n+1−k)·2^k.

[Figure: number of constraints (0 to 40,000) vs. number of variables (2 to 16). The explicit LP curve blows up, while the factored LP curves for k = 3, 5, 8, 10, 12 grow slowly with n.]
Network Management Problem

Topologies: ring, star, ring of rings, k-grid.

Computer status = {good, dead, faulty}
Dead neighbors increase the probability of dying
Computer runs processes; reward for successful processes
Each SysAdmin takes a local action = {reboot, not reboot}

A problem with n machines has 9^n states and 2^n actions.
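To make the blow-up concrete, a two-line check of those counts (reading 9 as the number of joint status/process values per machine, an assumption consistent with the 9^n on the slide):

```python
n = 12          # machines
print(9 ** n)   # 282429536481 joint states
print(2 ** n)   # 4096 joint actions
```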
Running Time

[Figure: running time (s, 0 to 3000) vs. number of machines (0 to 12); k = tree-width. Curves: ring, exact solution; ring, single basis (k=4); star, single basis (k=4); 3-grid, single basis (k=5); star, pair basis (k=4); ring, pair basis (k=8).]
Summary of Algorithm

1. Pick local basis functions h_i
2. Factored LP computes the value function
3. Policy is argmax_a of Q
Large-scale Multiagent Coordination

An efficient algorithm computes V. The action at state x is $$\arg\max_a Q(x, a).$$

But the # of actions is exponential, and this assumes complete observability and full communication.
Distributed Q Function [Guestrin, Koller, Parr ’02]

$$Q(A_1,\dots,A_4, X_1,\dots,X_4) \approx Q_1(A_1, A_4, X_1, X_4) + Q_2(A_1, A_2, X_1, X_2) + Q_3(A_2, A_3, X_2, X_3) + Q_4(A_3, A_4, X_3, X_4)$$

Each agent maintains a part of the Q function: a distributed Q function (agents 1–4 arranged in a ring, each sharing a Q_i with its neighbors).
Multiagent Action Selection

Distributed Q function: Q_1(A_1, A_4, X_1, X_4) + Q_2(A_1, A_2, X_1, X_2) + Q_3(A_2, A_3, X_2, X_3) + Q_4(A_3, A_4, X_3, X_4).

1. Instantiate the current state x
2. Compute the maximal action argmax_a
Instantiate Current State x

Instantiating x turns each Q_i into a function of actions only: Q_1(A_1, A_4), Q_2(A_1, A_2), Q_3(A_2, A_3), Q_4(A_3, A_4).

Limited observability: agent i only observes the variables in Q_i (e.g., the agent holding Q_2 observes only X_1 and X_2).
Multiagent Action Selection

After instantiating x, the distributed Q function is Q_1(A_1, A_4) + Q_2(A_1, A_2) + Q_3(A_2, A_3) + Q_4(A_3, A_4); it remains to compute the maximal action argmax_a.
Coordination Graph

Use variable elimination for the maximization max_a [Q_1(A_1,A_4) + Q_2(A_1,A_2) + Q_3(A_2,A_3) + Q_4(A_3,A_4)]:

$$\max_{A_1,A_2,A_3,A_4}\; Q_1(A_1,A_4) + Q_2(A_1,A_2) + Q_3(A_2,A_3) + Q_4(A_3,A_4)$$
$$= \max_{A_1,A_2,A_4}\; Q_1(A_1,A_4) + Q_2(A_1,A_2) + \max_{A_3}\left[Q_3(A_2,A_3) + Q_4(A_3,A_4)\right]$$
$$= \max_{A_1,A_2,A_4}\; Q_1(A_1,A_4) + Q_2(A_1,A_2) + g_1(A_2,A_4)$$

Eliminating A_3 produces a table of the value of the optimal A_3 action:

| A_2 | A_4 | Value of optimal A_3 action |
|---|---|---|
| Attack | Attack | 5 |
| Attack | Defend | 6 |
| Defend | Attack | 8 |
| Defend | Defend | 12 |

Limited communication suffices for the optimal action choice: the communication bandwidth is the tree-width of the coordination graph.
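A sketch of this coordination step: eliminate A3, then read the joint maximizer back out. The Q3 and Q4 payoffs are invented, but chosen so that the resulting g1(A2, A4) reproduces the table above.

```python
from itertools import product

ACTIONS = ["Attack", "Defend"]
Q3 = {("Attack", "Attack"): 0, ("Attack", "Defend"): 0,
      ("Defend", "Attack"): 3, ("Defend", "Defend"): 6}   # Q3(A2, A3)
Q4 = {("Attack", "Attack"): 5, ("Attack", "Defend"): 6,
      ("Defend", "Attack"): 2, ("Defend", "Defend"): 6}   # Q4(A3, A4)

# Eliminate A3: g1(A2, A4) = max_{A3} [Q3(A2,A3) + Q4(A3,A4)],
# remembering the maximizing A3 for the traceback.
g1, best_a3 = {}, {}
for a2, a4 in product(ACTIONS, repeat=2):
    scores = {a3: Q3[a2, a3] + Q4[a3, a4] for a3 in ACTIONS}
    best_a3[a2, a4] = max(scores, key=scores.get)
    g1[a2, a4] = scores[best_a3[a2, a4]]

print(g1)   # {(Attack, Attack): 5, (Attack, Defend): 6,
            #  (Defend, Attack): 8, (Defend, Defend): 12}
# The remaining agents maximize Q1 + Q2 + g1 over A1, A2, A4, then A3 is
# read back from best_a3[(a2*, a4*)] -- only neighbors ever exchange messages.
```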
Coordination Graph Example

[Figure: a coordination graph over agents A1–A11, combining tree-structured parts with a cycle.]

Trees don’t increase communication requirements; cycles require graph triangulation.
Unified View: Function Approximation ⇔ Multiagent Coordination

Pair basis: Q_1(A_1,A_4,X_1,X_4) + Q_2(A_1,A_2,X_1,X_2) + Q_3(A_2,A_3,X_2,X_3) + Q_4(A_3,A_4,X_3,X_4) induces a coordination graph with an edge per shared agent pair.

Single basis: Q_1(A_1,X_1) + Q_2(A_2,X_2) + Q_3(A_3,X_3) + Q_4(A_4,X_4) induces no coordination edges.

The factored MDP and value function representations induce the communication and coordination structure: a tradeoff between communication and accuracy.
How good are the policies?
SysAdmin problem
Power grid problem [Schneider et al. ‘99]
SysAdmin Ring: Quality of Policies

[Figure: value per machine (1.5 to 4.5) vs. number of machines (0 to 10), with the utopic maximum value shown. Curves: exact solution; constraint sampling with single basis; constraint sampling with pair basis; factored LP with single basis.]
Power Grid: Factored Multiagent [Guestrin, Lagoudakis, Parr ‘02]

Lower is better!

[Figure: cost (0 to 100) on grids A, B, C, D for four methods: DR [Schneider et al. ’99]; DVF [Schneider et al. ’99]; factored multiagent with no communication; factored multiagent with pairwise communication.]
Summary of Algorithm

1. Pick local basis functions h_i
2. Factored LP computes the value function
3. Coordination graph computes argmax_a of Q
Planning Complex Environments

When faced with a complex problem, exploit structure: for planning (factored LP) and for action selection (coordination graph).

Given a new problem: replan from scratch? It is a different MDP, hence a new planning problem, and huge problems are intractable even with the factored LP.
Generalizing to New Problems

Solve Problem 1, solve Problem 2, …, solve Problem n ⇒ good solution to Problem n+1.

The MDPs are different: different sets of states, actions, rewards, transitions, … But many problems are “similar”.
Generalization with Relational MDPs [Guestrin, Koller, Gearhart, Kanodia ’03]

Avoid the need to replan; tackle larger problems.

“Similar” domains have similar “types” of objects. Exploit the similarities by computing generalizable value functions: Relational MDP ⇒ generalization.
Relational Models and MDPs

Classes: Peasant, Gold, Wood, Barracks, Footman, Enemy, …
Relations: Collects, Builds, Trains, Attacks, …
Instances: Peasant1, Peasant2, Footman1, Enemy1, …
Relational MDPs

Class-level transition probabilities depend on: attributes; actions; attributes of related objects. Class-level reward function.

[Figure: class-level DBN for the Peasant class: P → P' with action AP, linked through the Collects relation to Gold: G → G'.]

A very compact representation! It does not depend on the # of objects.
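A sketch of what a class-level model could look like in code: one transition and reward template per class, written against an object's attributes and its related objects, hence independent of how many instances a world contains. The names, actions, and probabilities are illustrative.

```python
# Class-level templates: defined once per class, reused for every instance.
class PeasantClass:
    actions = ["harvest", "build", "wait"]

    @staticmethod
    def transition(state, action, related):
        """P(P' | P, A_P, attributes of related objects, e.g. linked Gold)."""
        if action == "harvest" and related["gold"].amount > 0:
            return {"carrying": 0.9, "empty": 0.1}
        return {state: 1.0}    # otherwise the peasant's state persists

    @staticmethod
    def reward(state):
        return 1.0 if state == "carrying" else 0.0
```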
Tactical Freecraft: Relational Schema

[Figure: Enemy class with attribute Health (H') and reward R; Footman class with attribute Health (H'), action AFootman, and relation my_enemy; a Count aggregates the footmen attacking an enemy.]

The enemy’s health depends on the # of footmen attacking; a footman’s health depends on its enemy’s health.
World is a Large Factored MDP

An instantiation (world) specifies the # of instances of each class and the links between instances, giving a well-defined factored MDP:

Relational MDP + # of objects + links between objects ⇒ factored MDP
World with 2 Footmen and 2 Enemies

[Figure: the instantiated DBN: F1.Health and F1.A determine F1.H'; E1.Health determines E1.H'; likewise for Footman2 and Enemy2; rewards R1 and R2; Footman1 is linked to Enemy1 and Footman2 to Enemy2.]
World is a Large Factored MDP

Instantiate the world ⇒ a well-defined factored MDP ⇒ use the factored LP for planning.

But so far we have gained nothing! (Relational MDP + # of objects + links between objects ⇒ factored MDP.)
Class-level Value Functions

$$V(F_1.H, E_1.H, F_2.H, E_2.H) = V_{F_1}(F_1.H, E_1.H) + V_{E_1}(E_1.H) + V_{F_2}(F_2.H, E_2.H) + V_{E_2}(E_2.H)$$

Units are interchangeable: $V_{F_1} = V_{F_2} = V_F$ and $V_{E_1} = V_{E_2} = V_E$. At state x, each footman still makes a different contribution to V.

Given the class-level value functions $V_C$, we can instantiate a value function for any world.
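A sketch of instantiating a class-level value function in a new world: one learned component per class, summed over every object of that class. The V_F and V_E numbers and the world layout are invented for illustration.

```python
# Class-level components V_C, learned once and reused in any world.
def V_footman(f_health, e_health):      # V_F(F.H, my_enemy.H)
    return {"high": 2.0, "low": 0.5}[f_health] - 0.5 * (e_health == "high")

def V_enemy(e_health):                  # V_E(E.H)
    return -1.0 if e_health == "high" else 1.0

def instantiate_value(world, x):
    """V(x) = sum over footmen of V_F + sum over enemies of V_E."""
    total = 0.0
    for f in world["footmen"]:          # works for any number of footmen
        total += V_footman(x[f], x[world["my_enemy"][f]])
    for e in world["enemies"]:          # ... and any number of enemies
        total += V_enemy(x[e])
    return total

world = {"footmen": ["F1", "F2"], "enemies": ["E1", "E2"],
         "my_enemy": {"F1": "E1", "F2": "E2"}}
x = {"F1": "high", "F2": "low", "E1": "high", "E2": "low"}
print(instantiate_value(world, x))      # 2.0 for these made-up numbers
```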
Computing Class-level VC
:minimize
:subject to
, ax
),( xaQ)(xV
x
)(xV
C Co
CV )(][
x
C Co
CQ ),(][
ax
C Co
CV )(][
x
ax,,
Constraints for each world represented by factored LP
Number of worlds exponential or infinite
Sampling Worlds

Many worlds are similar, so sample a set I of worlds: the constraints for all ω, x, a are replaced by constraints for all ω ∈ I, x, a.
Theorem

Exponentially (infinitely) many worlds! Do we need exponentially many samples? NO!

With a polynomial number of sampled worlds, the value function is within ε of the class-level solution optimized for all worlds, with probability at least 1−δ. (R_max is the maximum class reward; the sample bound depends on R_max, ε, and δ.) The proof method is related to [de Farias, Van Roy ‘02].
Learning Classes of Objects

[Figure: per-object value charts (over good/faulty/dead status) for objects in sampled network worlds; objects with similar value profiles are grouped into the same class.]

Plan for sampled worlds separately
Find regularities between worlds
Objects with similar values belong to the same class

Decision tree regression was used in the experiments.
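One plausible rendering of that step with scikit-learn's DecisionTreeRegressor (the talk specifies only that decision tree regression was used; the object features and values here are invented): fit per-object values against object features and read the discovered classes off the leaves.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# One row per object across sampled worlds: illustrative features
# (e.g., #neighbors, is_server) and the object's value from planning.
X = np.array([[4, 1], [2, 0], [2, 0], [1, 0], [1, 0], [1, 0]])
y = np.array([48.0, 31.0, 29.5, 12.0, 11.0, 12.5])   # per-object values

tree = DecisionTreeRegressor(max_leaf_nodes=3).fit(X, y)
classes = tree.apply(X)   # leaf index = discovered class for each object
print(classes)            # e.g. three classes: server / intermediate / leaf
```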
Summary of Algorithm

1. Model the domain as a Relational MDP
2. Sample a set of worlds
3. Factored LP computes a class-level value function for the sampled worlds
4. Reuse the class-level value function in the new world
5. Coordination graph computes argmax_a of Q
Experimental Results
SysAdmin problem
Generalizing to New Problems

[Figure: estimated policy value per agent (3.0 to 4.6) on ring, star, and three-legs topologies, comparing the utopic maximum value, object-based value with complete replanning, and the class-based value function with no replanning.]
Learning Classes of Objects

[Figure: max-norm error of the value function (0 to 1.4) on ring, star, and three-legs topologies, comparing no class learning vs. learnt classes.]
Classes of Objects Discovered

[Figure: a network whose nodes are labeled by the learned classes.]

Learned 3 classes: Server, Intermediate, Leaf.
Strategic

World: 2 Peasants, 2 Footmen, 1 Enemy, Gold, Wood, Barracks. Reward for a dead enemy. About 1 million state/action pairs.

Algorithm: solve with the factored LP; coordination graph for action selection.
Strategic

World: 9 Peasants, 3 Footmen, 1 Enemy, Gold, Wood, Barracks. Reward for a dead enemy. About 3 trillion state/action pairs.

Algorithm: solve with the factored LP (grows exponentially in the # of agents); coordination graph for action selection.
Strategic

World: 9 Peasants, 3 Footmen, 1 Enemy, Gold, Wood, Barracks. Reward for a dead enemy. About 3 trillion state/action pairs.

Algorithm: use the generalized class-based value function (the instantiated Q-functions grow polynomially in the # of agents); coordination graph for action selection.
Tactical

Planned in 3 Footmen versus 3 Enemies; generalized to 4 Footmen versus 4 Enemies.

3 vs. 3 → generalize → 4 vs. 4
Contributions

Efficient planning with LP decomposition [Guestrin, Koller, Parr ’01]
Multiagent action selection [Guestrin, Koller, Parr ’02]
Generalization to new environments [Guestrin, Koller, Gearhart, Kanodia ’03]
Variable coordination structure [Guestrin, Venkataraman, Koller ’02]
Multiagent reinforcement learning [Guestrin, Lagoudakis, Parr ’02] [Guestrin, Patrascu, Schuurmans ’02]
Hierarchical decomposition [Guestrin, Gordon ’02]
Open Issues
High tree-width problems
Basis function selection
Variable relational structure
Partial observability
Daphne Koller

Committee: Leslie Kaelbling, Yoav Shoham, Claire Tomlin, Ben Van Roy

Co-authors: M.S. Apaydin, D. Brutlag, F. Cozman, C. Gearhart, G. Gordon, D. Hsu, N. Kanodia, D. Koller, E. Krotkov, M. Lagoudakis, J.C. Latombe, D. Ormoneit, R. Parr, R. Patrascu, D. Schuurmans, C. Varma, S. Venkataraman

DAGS members, Kristina and friends, my family
Conclusions

For a complex multiagent planning task, exploit structure:
In the planning problem – Factored LP
In action selection – Coordination graph
Between problems – Generalization

A formal framework for multiagent planning that scales to very large problems: state spaces whose sizes run to hundreds of digits.
Network Management Problem

Topologies: ring, star, ring of rings, k-grid.

Computer runs processes; computer status = {good, dead, faulty}
Dead neighbors increase the probability of dying
Reward for successful processes
Each SysAdmin takes a local action = {reboot, not reboot}
Multiagent Policy Quality

Comparing to the Distributed Reward and Distributed Value Function algorithms [Schneider et al. ‘99].

[Figure: estimated value per agent (3.4 to 4.4) vs. number of agents (2 to 16), with the utopic maximum value shown.]
Multiagent Policy Quality

Comparing to Distributed Reward and Distributed Value Function [Schneider et al. ‘99].

[Figure: same axes, adding the distributed reward and distributed value curves.]
Multiagent Policy Quality

Comparing to Distributed Reward and Distributed Value Function [Schneider et al. ‘99].

[Figure: same axes, adding the LP single-basis and LP pair-basis curves.]
Comparing to Apricodd [Boutilier et al.]

Apricodd exploits context-specific independence (CSI); the factored LP (rule-based) exploits CSI and linear independence.

[Figure, left panel: time (s, 0 to 50) vs. number of variables (6 to 20) for Apricodd and the rule-based factored LP, with fits y = 0.1473x^3 − 0.8595x^2 + 2.5006x − 1.5964 (R^2 = 0.9997) and y = 0.0254x^2 + 0.0363x + 0.0725 (R^2 = 0.9983).]

[Figure, right panel: time (s, 0 to 500) vs. number of variables (6 to 12) for the same methods, with fits y = 5.275x^3 − 29.95x^2 + 53.915x − 28.83 (R^2 = 1) and y = 3E-05·2^x − 0.0026·2^(x/2) + 5.6737 (R^2 = 0.9999).]
Apricodd

[Figure: four panels comparing the rule-based LP to Apricodd, on ring and star topologies. For each topology: running time (minutes) vs. number of machines (0 to 12), and discounted value of the policy (avg. of 50 runs of 100 steps) vs. number of machines.]