Achieving Goals in Decentralized POMDPs
Transcript of Achieving Goals in Decentralized POMDPs
University of Massachusetts, Amherst • Department of Computer Science
Achieving Goals in Decentralized POMDPs
Christopher Amato, Shlomo Zilberstein
UMass Amherst, May 14, 2009
Overview
- The importance of goals
- DEC-POMDP model
- Previous work on goals
- Indefinite-horizon DEC-POMDPs
- Goal-directed DEC-POMDPs
- Results and future work
Achieving goals in multiagent setting
- General setting
  - Problem proceeds over a sequence of steps until a goal is achieved
- Multiagent setting
  - Can terminate when any number of agents achieve local goals or when all agents achieve a global goal
- Many problems have this structure
  - Meeting or catching a target
  - Cooperatively completing a task
- How do we make use of this structure?
DEC-POMDPs
- Decentralized partially observable Markov decision process (DEC-POMDP)
- Multiagent sequential decision making under uncertainty
- At each stage, each agent receives:
  - A local observation rather than the actual state
  - A joint immediate reward
[Figure: each agent i sends an action ai to the environment and receives a local observation oi; the environment returns a joint reward r]
DEC-POMDP definition
- A two-agent DEC-POMDP can be defined with the tuple M = ⟨S, A1, A2, P, R, Ω1, Ω2, O⟩
  - S, a finite set of states with designated initial state distribution b0
  - A1 and A2, each agent's finite set of actions
  - P, the state transition model: P(s' | s, a1, a2)
  - R, the reward model: R(s, a1, a2)
  - Ω1 and Ω2, each agent's finite set of observations
  - O, the observation model: O(o1, o2 | s', a1, a2)
- This model can be extended to any number of agents
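The tuple above can be written down as a plain data structure. The following is a minimal sketch, not the authors' implementation: the class name `TwoAgentDecPOMDP`, the dictionary layout, and the two-state `toy_model` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (state, action of agent 1, action of agent 2)


@dataclass
class TwoAgentDecPOMDP:
    """Hypothetical container mirroring the slide's tuple for two agents."""
    states: List[str]                           # S
    b0: Dict[str, float]                        # initial state distribution
    actions1: List[str]                         # A1
    actions2: List[str]                         # A2
    P: Dict[Triple, Dict[str, float]]           # P[(s, a1, a2)][s'] = P(s' | s, a1, a2)
    R: Dict[Triple, float]                      # R[(s, a1, a2)] = joint immediate reward
    observations1: List[str]                    # agent 1's observation set
    observations2: List[str]                    # agent 2's observation set
    O: Dict[Triple, Dict[Tuple[str, str], float]]  # O[(s', a1, a2)][(o1, o2)]

    def is_well_formed(self, tol: float = 1e-9) -> bool:
        """Check that b0 and every row of P and O sums to one."""
        dists = [self.b0, *self.P.values(), *self.O.values()]
        return all(abs(sum(d.values()) - 1.0) <= tol for d in dists)


def toy_model() -> TwoAgentDecPOMDP:
    """Illustrative two-state problem: the goal is reached when both agents move."""
    states, acts, obs = ["s0", "goal"], ["move", "stay"], ["far", "near"]
    P, R, O = {}, {}, {}
    for s, a1, a2 in product(states, acts, acts):
        reached = s == "goal" or (a1 == "move" and a2 == "move")
        P[(s, a1, a2)] = {"goal": 1.0} if reached else {"s0": 1.0}
        R[(s, a1, a2)] = 0.0 if s == "goal" else -1.0
        # For O, the first key component plays the role of the resulting state s'.
        seen = "near" if s == "goal" else "far"
        O[(s, a1, a2)] = {(seen, seen): 1.0}
    return TwoAgentDecPOMDP(states, {"s0": 1.0}, acts, acts, P, R, obs, obs, O)
```

Extending the sketch to more agents would mean replacing the fixed pair of action and observation sets with lists indexed by agent.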
DEC-POMDP solutions
- A policy for each agent is a mapping from its observation sequences to actions, Ω* → A, allowing distributed execution
  - Note that planning can be centralized but execution is distributed
- A joint policy is a policy for each agent
- Finite-horizon case: goal is to maximize expected reward over a finite number of steps
- Infinite-horizon case: discount the reward to keep the sum finite using a factor γ
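The two objectives can be restated in the notation of the model definition. This is the standard textbook form rather than a formula taken from the slides; the horizon $T$, the time index $t$, and the discount factor $\gamma$ with $0 \le \gamma < 1$ are symbols introduced here for illustration.

```latex
\text{Finite horizon:}\quad
E\!\left[\sum_{t=0}^{T-1} R(s_t, a_{1,t}, a_{2,t}) \,\middle|\, b_0\right]
\qquad
\text{Infinite horizon:}\quad
E\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_{1,t}, a_{2,t}) \,\middle|\, b_0\right]
```

In both cases the expectation is over the states and observations generated when each agent follows its own policy starting from $b_0$.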
Achieving goals
- If the problem terminates after the goal is achieved, how do we model it?
- Unclear how many steps are needed until termination
- Want to avoid a discount factor: its value is often arbitrary and can change the solution
Previous work
- Some in POMDPs, but for DEC-POMDPs only Goldman and Zilberstein 04
  - Modeled problems with goals as finite horizon and studied the complexity
  - Same complexity unless agents have independent transitions and observations and one goal is always better
  - This assumes negative rewards for non-goal states and a no-op available at the goal
Indefinite-horizon DEC-POMDPs
- Extend POMDP assumptions
  - Patek 01 and Hansen 07
- Our assumptions
  - Each agent possesses a set of terminal actions
  - Negative rewards for non-terminal actions
- Problem stops when a terminal action is taken by each agent simultaneously
- Can capture uncertainty about reaching the goal
- Many problems can be modeled this way
- Example: capturing a target
  - All (or a subset) of the agents must simultaneously attack
  - Or agents and targets must meet at the same location
  - Agents are unsure when the goal is reached, but must choose when to terminate the problem
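The stopping rule for the indefinite-horizon model can be sketched in a few lines. The function name `problem_stops` and the action labels "attack" and "move" are hypothetical, chosen to echo the target-capture example.

```python
from typing import Sequence, Set


def problem_stops(joint_action: Sequence[str],
                  terminal_actions: Sequence[Set[str]]) -> bool:
    """Return True only if every agent chose one of its own terminal actions."""
    if len(joint_action) != len(terminal_actions):
        raise ValueError("need exactly one action per agent")
    return all(action in terminal
               for action, terminal in zip(joint_action, terminal_actions))


# Illustrative two-agent setup in which "attack" is each agent's terminal action.
TERMINAL = [{"attack"}, {"attack"}]
```

A lone attack does not end the problem under this rule, which is what forces the agents to coordinate on when to terminate.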
Optimal solution
Lemma 3.1. An optimal set of indefinite-horizon policy trees must have horizon less than $k_{\max} = (R_{now} - R_{\max}^{T}) / R_{\max}^{NT}$, where $R_{\max}^{T}$ is the value of the best combination of terminal actions, $R_{\max}^{NT}$ is the value of the best combination of non-terminal actions, and $R_{now}$ is the maximum value attained by choosing a set of terminal actions on the first step given the initial state distribution.
Theorem 3.2. Our dynamic programming algorithm for indefinite-horizon DEC-POMDPs returns an optimal set of policy trees for the given initial state distribution.
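An informal reading of the horizon bound in Lemma 3.1: a policy that takes roughly k non-terminal steps before terminating can earn at most k times the best non-terminal reward plus the best terminal reward, and it is only worth considering if that total is no worse than terminating immediately. Because non-terminal rewards are negative, rearranging gives the bound. The sketch below just evaluates the formula; the function name and the reward values are made up for illustration.

```python
def horizon_bound(r_now: float, r_max_t: float, r_max_nt: float) -> float:
    """Evaluate k_max = (R_now - R_max^T) / R_max^NT from Lemma 3.1.

    r_now:    best value of terminating on the first step
    r_max_t:  value of the best combination of terminal actions
    r_max_nt: value of the best combination of non-terminal actions (negative)
    """
    if r_max_nt >= 0:
        raise ValueError("the model assumes negative rewards for non-terminal actions")
    return (r_now - r_max_t) / r_max_nt


# Hypothetical numbers: terminating immediately is worth -50 in expectation,
# the best terminal reward is 10, and each non-terminal step costs at least 1.
example_bound = horizon_bound(r_now=-50.0, r_max_t=10.0, r_max_nt=-1.0)
```

With these numbers the bound is (-50 - 10) / (-1) = 60, so policy trees of horizon 60 or more need not be considered in this made-up instance.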
Goal-directed DEC-POMDPs
- Relax assumptions, but still have a goal
- Problem terminates when:
  - The set of agents reaches a global goal state
  - A single agent or set of agents reaches local goal states
  - Any chosen combination of actions and observations is taken or seen by the set of agents
- Can no longer guarantee termination, so this becomes a subclass of infinite-horizon
- More problems fall into this class (can terminate without agent knowledge)
- Example: completing a set of experiments
  - Robots must travel to different sites and perform different experiments at each
  - Some require cooperation (simultaneous action) while some can be completed independently
  - Problem ends when all necessary experiments are completed
Sample-based approach
- Use sampling to generate agent trajectories
  - From the known initial state until goal conditions are met
- Produces only action and observation sequences that lead to the goal
  - This reduces the number of policies to consider
- We prove a bound on the number of samples required to approach optimality (extended from Kearns, Mansour and Ng 99)
  - Showed that, with enough samples, the probability that the value attained is at least $\epsilon$ from optimal is at most $\delta$: $P[V^{\pi^*}(s_0) - V^{\tilde{\pi}}(s_0) \ge \epsilon] \le \delta$
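The sampling step can be sketched as repeated rollouts that are kept only when they reach the goal. This is a minimal illustration, not the authors' sampler: uniform random action selection, the `max_steps` cutoff, and the toy distance-to-target problem are all assumptions made here.

```python
import random
from typing import Callable, List, Sequence, Tuple

Joint = Tuple[str, str]
Step = Tuple[Joint, Joint]  # (joint action, joint observation)


def sample_goal_trajectories(
    initial_state: int,
    step_fn: Callable[[int, Joint, random.Random], Tuple[int, Joint]],
    is_goal: Callable[[int], bool],
    actions: Sequence[Sequence[str]],
    n_samples: int,
    max_steps: int,
    rng: random.Random,
) -> List[List[Step]]:
    """Roll out random joint actions and keep only the runs that reach the goal."""
    kept = []
    for _ in range(n_samples):
        state, trajectory = initial_state, []
        for _ in range(max_steps):
            joint_action = (rng.choice(actions[0]), rng.choice(actions[1]))
            state, joint_obs = step_fn(state, joint_action, rng)
            trajectory.append((joint_action, joint_obs))
            if is_goal(state):
                kept.append(trajectory)  # discard runs that never reach the goal
                break
    return kept


def toy_step(state: int, joint_action: Joint, rng: random.Random) -> Tuple[int, Joint]:
    """Hypothetical dynamics: distance shrinks only when both agents pick "fwd"."""
    if joint_action == ("fwd", "fwd"):
        state = max(0, state - 1)
    obs = "near" if state <= 1 else "far"
    return state, (obs, obs)


trajectories = sample_goal_trajectories(
    initial_state=3, step_fn=toy_step, is_goal=lambda s: s == 0,
    actions=[["fwd", "wait"], ["fwd", "wait"]],
    n_samples=200, max_steps=20, rng=random.Random(0),
)
```

Each agent's own trajectory is then the sequence of its components of the joint actions and observations, which is the input a per-agent controller would be built from.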
Getting more from fewer samples
- Optimize a finite-state controller
  - Use trajectories to create a controller
  - Ensures a valid DEC-POMDP policy
  - Allows solution to be more compact
  - Choose actions and adjust resulting transitions (permitting possibilities that were not sampled)
  - Optimize in the context of the other agents
- Trajectories create an initial controller which is then optimized to produce a high-valued policy
Generating controllers from trajectories
Trajectories:
- a1-o1 g
- a1-o3 a1-o1 g
- a1-o3 a1-o3 a1-o1 g
- a4-o4 a1-o2 a3-o1 g
- a4-o3 a1-o1 g
[Figure: the initial controller built from these trajectories, the reduced controller, and optimized controllers (a) and (b)]
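One plausible way to turn a single agent's trajectories into an initial controller is to merge shared prefixes, so that each node carries an action and each outgoing edge carries an observation. The sketch below does only that merging step, on the five trajectories listed on this slide as read from the transcript; the function name and data layout are hypothetical, and the later reduction and optimization steps are not shown.

```python
from typing import Dict, List, Tuple, Union

GOAL = "g"
Trajectory = List[Tuple[str, str]]  # (action, observation) pairs; the goal follows the last pair


def parse(line: str) -> Trajectory:
    """Turn a string such as "a1-o3 a1-o1" into [("a1", "o3"), ("a1", "o1")]."""
    return [tuple(step.split("-")) for step in line.split()]


def build_initial_controller(
    trajectories: List[Trajectory],
) -> Tuple[List[str], List[Dict[str, Union[int, str]]]]:
    """Merge trajectories with a common prefix into shared controller nodes.

    Returns (actions, edges): actions[n] is the action taken at node n and
    edges[n][o] is the node reached after observing o, or GOAL.
    If two trajectories disagree after the same prefix and observation,
    this sketch simply keeps the successor seen last.
    """
    ids: Dict[tuple, int] = {}
    actions: List[str] = []
    edges: List[Dict[str, Union[int, str]]] = []
    for traj in trajectories:
        pending = None  # (node, observation) whose successor is not yet known
        for i, (action, obs) in enumerate(traj):
            key = (tuple(traj[:i]), action)  # history so far plus the action chosen
            if key not in ids:
                ids[key] = len(actions)
                actions.append(action)
                edges.append({})
            node = ids[key]
            if pending is not None:
                edges[pending[0]][pending[1]] = node
            pending = (node, obs)
        if pending is not None:
            edges[pending[0]][pending[1]] = GOAL  # last observation leads to the goal
    return actions, edges


TRAJECTORIES = [parse(t) for t in (
    "a1-o1", "a1-o3 a1-o1", "a1-o3 a1-o3 a1-o1",
    "a4-o4 a1-o2 a3-o1", "a4-o3 a1-o1",
)]
node_actions, node_edges = build_initial_controller(TRAJECTORIES)
```

Because the trajectories begin with two different actions, the result has two entry nodes; choosing a single start node and filling in transitions for unsampled observations is left to the optimization described on the previous slide.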
Experiments
- Compared our goal-directed approach with leading approximate infinite-horizon algorithms
  - BFS: Szer and Charpillet 05
  - DEC-BPI: Bernstein, Hansen and Zilberstein 05
  - NLP: Amato, Bernstein and Zilberstein 07
- Each approach was run with increasingly large controllers until resources were exhausted (2GB or 4 hours)
- BFS provides an optimal deterministic controller for a given size
- The other algorithms were run 10 times, and mean times and values are reported
Experimental results
- We built controllers from a small number of the highest-valued trajectories
- Our sample-based approach outperforms other methods on these problems
[Results tables not recovered; surviving labels: # samples=1000000, 10; # samples=5000000, 25; # samples=1000000, 10; #=500000, 5]
Conclusions
- Make use of goal structure, when present, to improve efficiency and solution quality
- Indefinite-horizon approach
  - Created model for DEC-POMDPs
  - Developed algorithm and proved optimality
- Goal-directed problems
  - Described more general goal model
  - Developed sample-based algorithm and demonstrated high-quality results
  - Proved a bound on the number of samples needed to approach optimality
- Future: can extend this work to general finite and infinite-horizon problems
Thank you