Achieving Goals in Decentralized POMDPs

University of Massachusetts, Amherst • Department of Computer Science

Christopher Amato and Shlomo Zilberstein

UMass Amherst, May 14, 2009


Overview

- The importance of goals
- DEC-POMDP model
- Previous work on goals
- Indefinite-horizon DEC-POMDPs
- Goal-directed DEC-POMDPs
- Results and future work


Achieving goals in multiagent setting

- General setting
  - Problem proceeds over a sequence of steps until a goal is achieved
- Multiagent setting
  - Can terminate when any number of agents achieve local goals or when all agents achieve a global goal
- Many problems have this structure
  - Meeting or catching a target
  - Cooperatively completing a task

How do we make use of this structure?


DEC-POMDPs

- Decentralized partially observable Markov decision process (DEC-POMDP)
- Multiagent sequential decision making under uncertainty
- At each stage, each agent receives:
  - A local observation rather than the actual state
  - A joint immediate reward

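[Figure: two agents acting on a shared Environment. Agent 1 takes action a1 and receives observation o1, agent 2 takes action a2 and receives observation o2, and a joint reward r is returned.]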


DEC-POMDP definition

A two-agent DEC-POMDP can be defined with the tuple $M = \langle S, A_1, A_2, P, R, \Omega_1, \Omega_2, O \rangle$

- $S$, a finite set of states with designated initial state distribution $b_0$
- $A_1$ and $A_2$, each agent's finite set of actions
- $P$, the state transition model: $P(s' \mid s, a_1, a_2)$
- $R$, the reward model: $R(s, a_1, a_2)$
- $\Omega_1$ and $\Omega_2$, each agent's finite set of observations
- $O$, the observation model: $O(o_1, o_2 \mid s', a_1, a_2)$

This model can be extended to any number of agents
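As a rough illustration of the tuple above, the two-agent model can be held in a small container. This is only a sketch: the class name `TwoAgentDecPOMDP`, the dictionary layout, the `validate` check and the toy "meeting" model are all assumptions made for the example, not part of the talk.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple

Key = Tuple[str, str, str]  # (state, action of agent 1, action of agent 2)


def _is_distribution(dist: Dict, tol: float = 1e-9) -> bool:
    """True if the values are non-negative and sum to one."""
    return all(p >= 0 for p in dist.values()) and abs(sum(dist.values()) - 1.0) <= tol


@dataclass
class TwoAgentDecPOMDP:
    states: List[str]                              # S
    b0: Dict[str, float]                           # initial state distribution
    actions1: List[str]                            # A1
    actions2: List[str]                            # A2
    P: Dict[Key, Dict[str, float]]                 # P[(s, a1, a2)][s']
    R: Dict[Key, float]                            # R[(s, a1, a2)]
    obs1: List[str]                                # Omega1
    obs2: List[str]                                # Omega2
    O: Dict[Key, Dict[Tuple[str, str], float]]     # O[(s', a1, a2)][(o1, o2)]

    def validate(self) -> bool:
        """Check that every entry exists and every distribution is proper."""
        keys = list(product(self.states, self.actions1, self.actions2))
        return (
            _is_distribution(self.b0)
            and all(k in self.R for k in keys)
            and all(k in self.P and _is_distribution(self.P[k]) for k in keys)
            and all(k in self.O and _is_distribution(self.O[k]) for k in keys)
        )


def toy_model() -> TwoAgentDecPOMDP:
    """Hypothetical toy problem: the agents meet only if both move."""
    states, acts, obs = ["apart", "together"], ["stay", "move"], ["alone", "met"]
    keys = list(product(states, acts, acts))
    return TwoAgentDecPOMDP(
        states=states,
        b0={"apart": 1.0},
        actions1=acts,
        actions2=acts,
        P={(s, a1, a2): ({"together": 1.0} if a1 == a2 == "move" else {s: 1.0})
           for s, a1, a2 in keys},
        R={k: -1.0 for k in keys},
        obs1=obs,
        obs2=obs,
        O={(s, a1, a2): ({("met", "met"): 1.0} if s == "together"
                         else {("alone", "alone"): 1.0})
           for s, a1, a2 in keys},
    )
```

Extending the sketch to more agents would mean replacing the fixed pair of action and observation sets with lists indexed by agent.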


DEC-POMDP solutions

- A policy for each agent is a mapping from their observation sequences to actions, $\Omega^* \rightarrow A$, allowing distributed execution
- Note that planning can be centralized but execution is distributed
- A joint policy is a policy for each agent
- Finite-horizon case: goal is to maximize expected reward over a finite number of steps
- Infinite-horizon case: discount the reward to keep the sum finite using factor $\gamma$
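A minimal sketch of distributed execution: each local policy is a lookup from the agent's own observation history to an action, and the joint action is formed without any agent seeing the other's history. The action and observation names and the helper functions are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple

History = Tuple[str, ...]
LocalPolicy = Dict[History, str]  # observation sequence -> action

# Hypothetical local policies for two agents.
policy1: LocalPolicy = {(): "listen", ("quiet",): "listen", ("noise",): "open"}
policy2: LocalPolicy = {(): "listen", ("quiet",): "open", ("noise",): "listen"}


def act(policy: LocalPolicy, history: Sequence[str]) -> str:
    """Choose an action from the agent's own observation history only."""
    return policy[tuple(history)]


def joint_action(policies: List[LocalPolicy],
                 histories: List[Sequence[str]]) -> Tuple[str, ...]:
    """Each agent acts on its private history; no information is shared."""
    return tuple(act(p, h) for p, h in zip(policies, histories))
```

For example, `joint_action([policy1, policy2], [["noise"], ["quiet"]])` pairs each agent's private history with its own table.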


Achieving goals

- If problem terminates after goal is achieved, how do we model it?
- Unclear how many steps are needed until termination
- Want to avoid a discount factor: value is often arbitrary and can change the solution
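A small worked example of how the discount factor can change the solution. The two reward sequences are made up for illustration: one plan pays step costs before collecting a large goal reward, the other collects a small reward immediately.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over the reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical plans: wait four steps for a reward of 20, or take 5 now.
PLAN_WAIT = [-1.0, -1.0, -1.0, -1.0, 20.0]
PLAN_NOW = [5.0]


def preferred(gamma: float) -> str:
    """Name of the plan with the higher discounted return."""
    wait = discounted_return(PLAN_WAIT, gamma)
    now = discounted_return(PLAN_NOW, gamma)
    return "wait" if wait > now else "now"
```

With these numbers the preferred plan flips between a discount factor of 0.9 and one of 0.5, although the undiscounted totals (16 versus 5) clearly favor waiting.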



Previous work

- Some in POMDPs, but for DEC-POMDPs only Goldman and Zilberstein 04
- Modeled problems with goals as finite horizon and studied the complexity
- Same complexity unless agents have independent transitions and observations and one goal is always better
- This assumes negative rewards for non-goal states and no-op available at goal


Indefinite-horizon DEC-POMDPs

- Extend POMDP assumptions
  - Patek 01 and Hansen 07
- Our assumptions
  - Each agent possesses a set of terminal actions
  - Negative rewards for non-terminal actions
  - Problem stops when a terminal action is taken by each agent simultaneously
- Can capture uncertainty about reaching goal
- Many problems can be modeled this way
- Example: Capturing a Target
  - All (or a subset) of agents must simultaneously attack
  - Or agents and targets must meet at the same location
  - Agents are unsure when goal is reached, but must choose when to terminate problem
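The stopping rule under these assumptions can be sketched as follows. The fixed step cost, the fixed terminal reward and the action names are simplifications invented for the example; in the model itself rewards depend on the state and the joint action.

```python
from typing import Optional, Sequence, Set, Tuple


def all_terminal(joint_action: Sequence[str], terminal_sets: Sequence[Set[str]]) -> bool:
    """True only if every agent takes one of its own terminal actions."""
    return all(a in terminals for a, terminals in zip(joint_action, terminal_sets))


def run_episode(joint_actions: Sequence[Sequence[str]],
                terminal_sets: Sequence[Set[str]],
                step_reward: float = -1.0,
                terminal_reward: float = 10.0) -> Tuple[float, Optional[int]]:
    """Accumulate reward until all agents terminate simultaneously.

    Returns (total reward, stopping step), with None as the step if the
    sequence ends without a simultaneous terminal action.
    """
    total = 0.0
    for step, joint_action in enumerate(joint_actions, start=1):
        if all_terminal(joint_action, terminal_sets):
            return total + terminal_reward, step
        total += step_reward  # assumed negative for non-terminating steps
    return total, None
```

In a capture scenario, a lone attack by one agent does not end the problem; only the step where both attack does.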


Optimal solution

Lemma 3.1. An optimal set of indefinite-horizon policy trees must have horizon less than

$$k_{\max} = \frac{R_{now} - R_{\max}^{T}}{R_{\max}^{NT}}$$

where $R_{\max}^{T}$ is the value of the best combination of terminal actions, $R_{\max}^{NT}$ is the value of the best combination of non-terminal actions and $R_{now}$ is the maximum value attained by choosing a set of terminal actions on the first step given the initial state distribution.

Theorem 3.2. Our dynamic programming algorithm for indefinite-horizon POMDPs returns an optimal set of policy trees for the given initial state distribution.
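The horizon bound of Lemma 3.1, k_max = (R_now - R_max^T) / R_max^NT, can be evaluated directly once the three quantities are known. The function below is a sketch with made-up numbers; it assumes non-terminal rewards are negative, so that R_max^NT < 0, as in the indefinite-horizon assumptions.

```python
def max_horizon(r_now: float, r_max_t: float, r_max_nt: float) -> float:
    """Horizon bound k_max = (R_now - R_max^T) / R_max^NT.

    r_now:    best value of terminating on the first step from b0
    r_max_t:  value of the best combination of terminal actions
    r_max_nt: value of the best combination of non-terminal actions (negative)
    """
    if r_max_nt >= 0:
        raise ValueError("non-terminal rewards are assumed to be negative")
    return (r_now - r_max_t) / r_max_nt


# Hypothetical values: terminating immediately is worth -5, the best
# terminal reward is 10, and every non-terminal step costs at least 1.
k_max = max_horizon(r_now=-5.0, r_max_t=10.0, r_max_nt=-1.0)
```

One way to read the bound: a policy that runs longer than k_max steps pays so much in step costs that even the best terminal reward cannot beat terminating immediately.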


Goal-directed DEC-POMDPs

- Relax assumptions, but still have goal
- Problem terminates when:
  - The set of agents reach a global goal state
  - A single agent or set of agents reach local goal states
  - Any chosen combination of actions and observations is taken or seen by the set of agents
- Can no longer guarantee termination, so becomes subclass of infinite-horizon
- More problems fall into this class (can terminate without agent knowledge)
- Example: Completing a set of experiments
  - Robots must travel to different sites and perform different experiments at each
  - Some require cooperation (simultaneous action) while some can be completed independently
  - Problem ends when all necessary experiments are completed


Sample-based approach

- Use sampling to generate agent trajectories
  - From the known initial state until goal conditions are met
- Produces only action and observation sequences that lead to goal
  - This reduces the number of policies to consider
- We prove a bound on the number of samples required to approach optimality (extended from Kearns, Mansour and Ng 99)
- Showed that the probability that the value attained is at least $\varepsilon$ from optimal is at most $\delta$ given the bounded number of samples:

$$P\left[ V^{\pi^*}(s_0) - V^{\tilde{\pi}}(s_0) \geq \varepsilon \right] \leq \delta$$
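A toy sketch of the sampling step, not the algorithm from the talk: two agents random-walk on a short line from a known start, and only the trajectories that reach the goal (both agents in the same cell) are kept. The problem, the uniform random action choice, the step limit and all function names are assumptions for illustration.

```python
import random
from typing import List, Optional, Tuple

Step = Tuple[int, int]                      # (action, observation)
JointTrajectory = Tuple[List[Step], List[Step]]


def sample_trajectory(rng: random.Random, size: int = 4,
                      max_steps: int = 20) -> Optional[JointTrajectory]:
    """Sample one joint trajectory; return None if the goal is not reached."""
    pos = [0, size - 1]                     # known initial state: opposite ends
    traj: JointTrajectory = ([], [])
    for _ in range(max_steps):
        for i in range(2):
            move = rng.choice([-1, 0, 1])   # action: left, stay, right
            pos[i] = min(size - 1, max(0, pos[i] + move))
            traj[i].append((move, pos[i]))  # observation: own position only
        if pos[0] == pos[1]:                # goal condition: agents meet
            return traj
    return None


def sample_goal_trajectories(n_samples: int, seed: int = 0) -> List[JointTrajectory]:
    """Keep only the sampled trajectories that end at the goal."""
    rng = random.Random(seed)
    samples = (sample_trajectory(rng) for _ in range(n_samples))
    return [t for t in samples if t is not None]
```

Discarding the trajectories that never reach the goal is what shrinks the set of action and observation sequences that later steps need to consider.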


Getting more from fewer samples

- Optimize a finite-state controller
  - Use trajectories to create a controller
  - Ensures a valid DEC-POMDP policy
  - Allows solution to be more compact
  - Choose actions and adjust resulting transitions (permitting possibilities that were not sampled)
  - Optimize in the context of the other agents
- Trajectories create an initial controller which is then optimized to produce a high-valued policy


Generating controllers from trajectories

Trajectories:

- a1-o1 g
- a1-o3 a1-o1 g
- a1-o3 a1-o3 a1-o1 g
- a4-o4 a1-o2 a3-o1 g
- a4-o3 a1-o1 g

[Figure: the initial controller built from these trajectories, the reduced controller, and two optimized controllers, a) and b).]
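One plausible way to build an initial controller from the five trajectories on this slide (a1-o1 g; a1-o3 a1-o1 g; a1-o3 a1-o3 a1-o1 g; a4-o4 a1-o2 a3-o1 g; a4-o3 a1-o1 g) is to merge shared prefixes into a tree whose nodes carry actions and whose edges carry observations. This is a sketch under that assumption; the reduction and optimization steps are not shown, and the sketch simply keeps one root per distinct first action.

```python
from typing import Dict, List, Tuple, Union

Step = Tuple[str, str]  # (action, observation)
GOAL = "g"

TRAJECTORIES: List[List[Step]] = [
    [("a1", "o1")],
    [("a1", "o3"), ("a1", "o1")],
    [("a1", "o3"), ("a1", "o3"), ("a1", "o1")],
    [("a4", "o4"), ("a1", "o2"), ("a3", "o1")],
    [("a4", "o3"), ("a1", "o1")],
]


def build_initial_controller(trajectories: List[List[Step]]):
    """Merge trajectories with a common prefix into one tree.

    Returns (actions, edges): actions[node] is the action taken at that
    node, and edges[(node, observation)] is the next node or GOAL.
    Trajectories are assumed to be non-empty.
    """
    actions: Dict[int, str] = {}
    edges: Dict[Tuple[int, str], Union[int, str]] = {}
    node_of: Dict[Tuple[Tuple[Step, ...], str], int] = {}
    for traj in trajectories:
        pending = None  # (node, observation) edge waiting for its target
        for i, (action, obs) in enumerate(traj):
            key = (tuple(traj[:i]), action)  # history so far plus next action
            if key not in node_of:
                node_of[key] = len(node_of)
                actions[node_of[key]] = action
            node = node_of[key]
            if pending is not None:
                edges[pending] = node
            pending = (node, obs)
        edges[pending] = GOAL  # the last observation leads to the goal
    return actions, edges


actions, edges = build_initial_controller(TRAJECTORIES)
```

On these trajectories the merge yields seven action nodes (five with a1, one with a3, one with a4) and five edges into the goal, one per trajectory.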


Experiments

- Compared our goal-directed approach with leading approximate infinite-horizon algorithms
  - BFS: Szer and Charpillet 05
  - DEC-BPI: Bernstein, Hansen and Zilberstein 05
  - NLP: Amato, Bernstein and Zilberstein 07
- Each approach was run with larger controllers until resources were exhausted (2GB or 4 hours)
- BFS provides an optimal deterministic controller for a given size
- Other algorithms were run 10 times and mean times and values are reported


Experimental results

We built controllers from a small number of the highest valued trajectories

Our sample-based approach outperforms other methods on these problems

[Results tables not recovered. Surviving annotations, one per table: # samples = 1000000, 10; # samples = 5000000, 25; # samples = 1000000, 10; # samples = 500000, 5]


Conclusions

- Make use of goal structure, when present, to improve efficiency and solution quality
- Indefinite-horizon approach
  - Created model for DEC-POMDPs
  - Developed algorithm and proved optimality
- Goal-directed problems
  - Described more general goal model
  - Developed sample-based algorithm and demonstrated high quality results
  - Proved a bound on the number of samples needed to approach optimality
- Future: can extend this work to general finite and infinite-horizon problems


Thank you
