Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up


Page 1: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs:

Scaling Up

Ekhlas Sonu, Prashant Doshi
Dept. of Computer Science

University of Georgia

AAMAS 2012

Page 2: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Overview

We generalize Bounded Policy Iteration for POMDPs to the multiagent decision-making framework of Interactive POMDPs
We discuss the challenges associated with this generalization
Substantial scalability is achieved using the generalized approach

Page 3: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Introduction: Interactive POMDP

Interactive POMDP (Gmytrasiewicz & Doshi, 05):
Generalization of POMDP to multiagent settings

Applications:
Money Laundering (Ng et al., 10), Lemonade stand game (Wunder et al., 11), Modeling human behavior (Doshi et al., 10), and more…

Differs from Dec-POMDP:
Dec-POMDP: team of agents
I-POMDP: individual agent in the presence of other agents – cooperative, competitive or neutral settings

Page 4: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Introduction: I-POMDP(Finitely-nested and 2 agents)

I-POMDPi,l = < ISi,l, A, Wi, Ti, Oi, Ri, γ >

[Figure: agents i and j both act on the physical states S; for agent i, transitions Ti(s, ai, aj, s'), observations Oi(s', ai, aj, oi) and rewards Ri(s, ai, aj) depend on the joint action, and similarly Tj, Oj, Rj for agent j.]

ISi,l = S X Θj,l-1 (the interactive states)
  S: set of physical states
  Θj,l-1: set of intentional models of j at level l-1
A = Ai X Aj: set of joint actions
Wi: set of observations of i
Ti: S X Ai X Aj → Δ(S), the transition function
Oi: S X Ai X Aj → Δ(Wi), the observation function
Ri: S X Ai X Aj → R, the reward function

Page 5: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

I-POMDP Belief Update and Value Function

Belief Update: An agent must predict the other agent's actions by anticipating its updated beliefs over time. Therefore, the belief update consists of:

Updating the distribution over physical states: transition function and observation function of agent i

Updating the distribution over dynamic models: the other agent's belief update and its observation function

Value Function: Must incorporate the I-POMDP belief update in computing long-term rewards
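To make the update concrete, here is a minimal Python sketch, assuming for illustration that agent j's model is summarized by a node of a finite state controller (as in the nested controllers introduced later in the talk), so that an interactive state is (s, nj). All array layouts and names are hypothetical, not the paper's notation.

```python
import numpy as np

def ipomdp_belief_update(b, ai, oi, T, Oi, Oj, psi_j, eta_j):
    """Update i's belief b[s, nj] over interactive states after i takes
    action ai and observes oi.  Hypothetical layouts: T[s,ai,aj,s'],
    Oi[s',ai,aj,oi], Oj[s',ai,aj,oj], psi_j[nj,aj] is j's action
    distribution, eta_j[nj,aj,oj,nj'] its node transition."""
    S, Nj = b.shape
    b_next = np.zeros((S, Nj))
    for s in range(S):
        for nj in range(Nj):
            if b[s, nj] == 0.0:
                continue
            for aj in range(psi_j.shape[1]):        # predict j's action from its model
                weight = b[s, nj] * psi_j[nj, aj]
                if weight == 0.0:
                    continue
                for s2 in range(S):
                    core = weight * T[s, ai, aj, s2] * Oi[s2, ai, aj, oi]
                    for oj in range(Oj.shape[3]):    # j's possible observations
                        for nj2 in range(Nj):        # j's anticipated model update
                            b_next[s2, nj2] += (core * Oj[s2, ai, aj, oj]
                                                * eta_j[nj, aj, oj, nj2])
    return b_next / b_next.sum()                     # normalize
```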

Page 6: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Solving I-POMDP (Related Work)

Previous work: value iteration algorithms

Interactive particle filtering (I-PF) (Doshi & Gmytrasiewicz, 09): nested particle filter, a sampled recursive representation of the agent's nested belief

Interactive point-based value iteration (I-PBVI) (Doshi & Perez, 08): point-based domination check

Iteratively apply the backup operator: expensive operator; scales only to toy problems

Over multiple time steps: curse of history and curse of dimensionality

[Figure: belief b = Δ(ISi,l) maintained over the interactive states built on the physical states S]

Page 7: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Background: Policy Iteration

Class of solution algorithms that search the policy space; exponential growth in solution size

Bounded Policy Iteration (BPI) (Poupart & Boutilier, 03)

Fixed solution size (controlled growth); applied to POMDPs and Dec-POMDPs

Dec-BPI (Bernstein, Hansen & Zilberstein, 05) -- the optional correlation device may not be feasible in non-cooperative settings

Contribution: We present the first (approximate) policy iteration algorithm for I-POMDPs, a generalization of BPI, and show scalability to larger problems

Page 8: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Policy Representation

Possible representations of a policy: tree representation, or finite state controllers (Hansen, 1998), where each node is labeled with an action and each edge with an observation

A node has an infinite-horizon policy rooted at it
A node has a value vector associated with it, which is linear over the entire belief space
Beliefs are mapped to the node (n) that optimizes the expected reward from that belief:

i.e. argmax_n b ∙ Vn
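A small sketch of this mapping from beliefs to controller nodes; the node names and value vectors below are illustrative only.

```python
import numpy as np

def best_node(belief, node_values):
    """Return the node n maximizing b . V_n for the given belief."""
    return max(node_values, key=lambda n: float(np.dot(belief, node_values[n])))

# Usage: two nodes over a two-state belief space
node_values = {"n0": np.array([1.0, 0.0]), "n1": np.array([0.2, 0.9])}
print(best_node(np.array([0.3, 0.7]), node_values))   # -> n1 (0.69 vs. 0.30)
```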

Page 9: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Finite State Controller

A finite state controller may be defined as a tuple where:
… is the set of nodes in the FSC of agent i
… is the set of edge labels (Wi)

The nodes' value vectors partition the entire belief space: each belief is mapped to the node that maximizes its expected value.

Page 10: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Policy Iteration

Starting with an initial controller, iterate over two steps until convergence:

Policy Evaluation: evaluate Vn for each node by solving a system of linear equations (a sketch follows below)

Policy Improvement: construct a better controller, possibly by adding new nodes
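As a concrete illustration of the evaluation step, the sketch below builds and solves the linear system for a stochastic finite state controller in an ordinary POMDP (the interactive generalization appears later in the talk). Array names and layouts are assumptions, not the paper's notation.

```python
import numpy as np

def evaluate_fsc(T, O, R, psi, eta, gamma=0.95):
    """Solve V(n,s) = sum_a psi[n,a] * ( R[s,a] + gamma * sum_{s',z,n'}
    T[s,a,s'] O[s',a,z] eta[n,a,z,n'] V(n',s') ) as one linear system."""
    N, A = psi.shape
    S, Z = R.shape[0], O.shape[2]
    dim = N * S
    M = np.eye(dim)                 # coefficient matrix (identity for the V(n,s) terms)
    c = np.zeros(dim)               # expected immediate rewards
    for n in range(N):
        for s in range(S):
            row = n * S + s
            for a in range(A):
                c[row] += psi[n, a] * R[s, a]
                for s2 in range(S):
                    for z in range(Z):
                        for n2 in range(N):
                            M[row, n2 * S + s2] -= (gamma * psi[n, a] * T[s, a, s2]
                                                    * O[s2, a, z] * eta[n, a, z, n2])
    return np.linalg.solve(M, c).reshape(N, S)    # V[n, s]
```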

Page 11: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Policy Improvement (Hansen, 98)

Apply the backup operator, i.e. construct new nodes for all possible combinations of action and transition on observation

|A||N|^|W| new nodes; add them to the controller

Prune all dominated nodes
Drawback: leads to exponential growth in controller size

[Figure: value vectors V over P(s) ∈ [0, 1], an example of policy iteration for a POMDP]
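The exponential blow-up is easy to see by enumerating the backed-up nodes directly; a toy sketch with made-up action, node and observation sets:

```python
from itertools import product

# Each backed-up node pairs an action with a choice of successor node for
# every observation, giving |A| * |N|**|W| candidates.
actions = ["a0", "a1"]
nodes = ["n0", "n1", "n2"]
observations = ["w0", "w1"]

backed_up = [(a, dict(zip(observations, succ)))
             for a in actions
             for succ in product(nodes, repeat=len(observations))]
print(len(backed_up))   # 2 * 3**2 = 18
```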

Page 12: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Bounded Policy Iteration (BPI) (Poupart & Boutilier, 03)

Instead of performing a complete backup, replace a node with a better node

Linear program for a partial backup
The new node is a convex combination of two backed-up nodes
Change in controller value: ε

Variables of the LP: a stochastic action policy and a stochastic observation (node-transition) policy

Page 13: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Local Optima

This form of policy improvement is prone to converging to local optima
When all nodes are tangent to the backed-up nodes: ε = 0, no improvement
Escape technique suggested by Poupart & Boutilier (2003) in BPI

[Figure: value vectors over P(s) ∈ [0, 1] illustrating a controller stuck at a local optimum]

Page 14: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

I-POMDP Generalization: Nested Controllers

Nested controllers: analogous to nested beliefs
Embed recursive reasoning

Starting from level 0 upwards, for each level l, construct a finite state controller for each frame of each agent

For convenience of representation, assume two agents and one frame per agent at each level

Agent i’s level 2 controller:

Agent j’s level 1 controller:

Agent i’s level 0 controller:

Page 15: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Interactive BPI: Policy Evaluation

Compute the value vector of each node, using the estimate of the other agent's model, by solving a system of linear equations. For each node ni,l and interactive state is = (s, nj,l-1), solve the corresponding fixed-point equation (a sketch follows below).
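A rough sketch of this evaluation in Python, with hypothetical array layouts; for brevity it computes the fixed point by successive approximation rather than assembling the linear system explicitly, which yields the same values.

```python
import numpy as np

def evaluate_nested(Ri, T, Oi, Oj, psi_i, eta_i, psi_j, eta_j,
                    gamma=0.95, tol=1e-8):
    """Return V[ni, s, nj]: value of i's node ni at interactive state (s, nj).

    Assumed layouts: Ri[s,ai,aj], T[s,ai,aj,s'], Oi[s',ai,aj,oi],
    Oj[s',ai,aj,oj], psi_i[ni,ai], eta_i[ni,ai,oi,ni'],
    psi_j[nj,aj], eta_j[nj,aj,oj,nj'].
    """
    Ni, Nj, S = psi_i.shape[0], psi_j.shape[0], Ri.shape[0]
    V = np.zeros((Ni, S, Nj))
    while True:
        # Expected future value for each (ni, s, nj, ai, aj): sum over next
        # state, both observations, and both agents' node transitions.
        future = np.einsum('sabt,tabo,tabp,naoq,mbpr,qtr->nsmab',
                           T, Oi, Oj, eta_i, eta_j, V)
        Q = Ri[None, :, None, :, :] + gamma * future
        # Average over both agents' stochastic action policies.
        V_new = np.einsum('na,mb,nsmab->nsm', psi_i, psi_j, Q)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```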

Page 16: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

I-BPI: Policy Improvement

Pick a node (ni,l) and perform a partial backup using an LP to construct another node (n'i,l) that pointwise dominates ni,l by some ε > 0

The new vector dominates the old vector by ε and hence replaces it

[Figure: the new value vector lies at least ε above the old vector everywhere over P(s) ∈ [0, 1]]

Page 17: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

I-BPI: Policy Improvement

Pick a node (ni,l) and perform a partial backup using an LP to construct another node that pointwise dominates ni,l by some ε > 0

Objective function: maximize ε
Variables: ε, together with the new node's stochastic action policy and stochastic observation (node-transition) policy
Constraints: the new node's value must dominate ni,l's value by ε at every interactive state, and the policies must be valid probability distributions

A sketch of this linear program follows below.
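The sketch below sets up such a partial-backup LP with scipy, assuming the other agent's controller has already been folded into an expected reward Rx[x, ai] and a transition/observation kernel P[x, ai, oi, x'] over interactive states x = (s, nj,l-1); all names and layouts are illustrative rather than the paper's.

```python
import numpy as np
from scipy.optimize import linprog

def partial_backup_lp(node, V, Rx, P, gamma=0.95):
    """V[n, x]: current node values over interactive states.
    Returns (eps, c_a, c_aon) on success, else None."""
    N, X = V.shape
    A, O = Rx.shape[1], P.shape[2]
    n_vars = 1 + A + A * O * N            # variable order: [eps | c_a | c_{a,o,n'}]
    col_a = lambda a: 1 + a
    col_aon = lambda a, o, n2: 1 + A + (a * O + o) * N + n2

    # Inequalities, one per interactive state x:
    #   eps - sum_a Rx[x,a] c_a
    #       - gamma * sum_{a,o,n'} (P[x,a,o,:] . V[n']) c_{a,o,n'}  <=  -V[node, x]
    A_ub = np.zeros((X, n_vars)); b_ub = -V[node]
    A_ub[:, 0] = 1.0
    for a in range(A):
        A_ub[:, col_a(a)] = -Rx[:, a]
        for o in range(O):
            for n2 in range(N):
                A_ub[:, col_aon(a, o, n2)] = -gamma * (P[:, a, o, :] @ V[n2])

    # Equalities: sum_a c_a = 1; for each (a,o): sum_{n'} c_{a,o,n'} = c_a
    A_eq = np.zeros((1 + A * O, n_vars)); b_eq = np.zeros(1 + A * O)
    A_eq[0, 1:1 + A] = 1.0; b_eq[0] = 1.0
    for a in range(A):
        for o in range(O):
            row = 1 + a * O + o
            A_eq[row, col_a(a)] = -1.0
            for n2 in range(N):
                A_eq[row, col_aon(a, o, n2)] = 1.0

    bounds = [(None, None)] + [(0, None)] * (n_vars - 1)
    obj = np.zeros(n_vars); obj[0] = -1.0          # maximize eps
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if not res.success:
        return None
    return res.x[0], res.x[1:1 + A], res.x[1 + A:].reshape(A, O, N)
```

If the returned ε is positive, the node's action distribution can be replaced by c_a and its node-transition distribution by c_{a,o,n'} / c_a (for actions with c_a > 0).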

Page 18: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Escaping Local Optima

[Figure: value vectors over P(s) ∈ [0, 1], showing a tangent belief bT and reachable beliefs bR1, bR2 used to escape the local optimum]

Analogous to escaping for POMDPs

Page 19: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Algorithm: I-BPI

1. Starting from level 0 up to level l, construct a one-node controller for each level, with a random action and a transition to itself (see the sketch below).
2. Reformulate the interactive state space and evaluate.

[Figure: one-node controllers at levels L0, L1, …, Ll, shown over time]
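A tiny sketch of step 1, with hypothetical data structures: each level starts from a single node that takes a random action and returns to itself on every observation.

```python
import numpy as np

def initial_controller(num_actions, num_observations, rng):
    """psi[n, a]: action distribution; eta[n, a, o, n']: node transitions."""
    psi = np.zeros((1, num_actions))
    psi[0, rng.integers(num_actions)] = 1.0                 # one random deterministic action
    eta = np.ones((1, num_actions, num_observations, 1))    # self-loop on every observation
    return psi, eta

# One controller per level, level 0 up to level l (here l = 2 for illustration)
rng = np.random.default_rng(0)
controllers = [initial_controller(5, 4, rng) for _ in range(3)]
```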

Page 20: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Algorithm: I-BPI

3. Starting from level 0 up to level l, perform one step of the backup operator: at most |Ai(j)| new nodes per controller.


Page 21: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Algorithm: I-BPI

4. Starting from level 0 up to level l, reformulate the interactive state space, then perform policy evaluation followed by policy improvement at each level.


Page 22: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Algorithm: I-BPI

5. Repeat step 4 until convergence.
6. If converged, push the nested controllers out of local optima by adding new nodes.


Page 23: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Evaluation

Runtime of the algorithm and the average rewards from simulations (* represents expected rewards obtained from the value vectors)

AUAV: 81 states, 5 actions, 4 observations
Money Laundering: 99 states, 11 actions, 9 observations
Scales to larger problems…

Page 24: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Evaluation

Simulation results for the multiagent tiger problem, obtained by simulating the performance of agent controllers of various sizes for levels 1–4

Page 25: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Discussion

Advantages of I-BPI:
Significantly quicker and scales to large problems (hundreds of states, tens of actions and observations)
Mitigates the curses of history and dimensionality
Improved solution quality

Limitations:
Prone to local optima
The escape technique may not work for certain local optima
Not entirely free from the curses of history and dimensionality

Future Work:
Scale to even larger problems and more agents
Mealy machine implementation for controllers (Amato et al., 2011)

Page 26: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Thank you…

Poster #731 today at 16:00-17:00 (Panel 98)

Acknowledgement: This research is partially supported by an NSF CAREER grant, #IIS-0845036

Page 27: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Policy Improvement

Apply the backup operator, i.e. construct new nodes for all possible combinations of action and transition to nodes in the current controller

|A||N|^|Z| new nodes; add them to the controller

[Figure: each backed-up node pairs one of the |A| actions with a choice among the |N| nodes for each of the observations Z1, Z2, …, Z|Z|]

Page 28: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Introduction: POMDP

POMDP: framework for optimal sequential decision making under uncertainty in single agent settings

< S, A, Z, T, O, R, γ >

[Figure: the agent acts on the physical states S with action a via T(s, a, s'), and receives observation z via O(s', a, z) and reward R(s, a); it maintains a belief b = Δ(S)]

S: set of states
A: set of actions
Z: set of observations
T: S X A → Δ(S)
O: S X A → Δ(Z)
R: S X A → R

The objective is to find a policy π that maximizes long-term expected reward: ER = immediate reward + discounted future reward

The agent maintains a belief (b) over physical states
Policy π: b → A

γ: discount factor
h: horizon
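For completeness, a minimal sketch of the standard POMDP belief update that the policy reasons over, with assumed array layouts: b'(s') ∝ O(s', a, z) Σ_s T(s, a, s') b(s).

```python
import numpy as np

def pomdp_belief_update(b, a, z, T, O):
    """b[s] belief; T[s, a, s']; O[s', a, z]."""
    b_next = O[:, a, z] * (b @ T[:, a, :])   # predict with T, correct with O
    return b_next / b_next.sum()             # normalize
```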

Page 29: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Future Work

Extend the approach to problems with even larger dimensions

Extend to problems with more than two agents

Mealy machine implementation of finite state controllers (Amato et al., 2011)

Page 30: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

I-POMDP Belief Update and Value Function

Belief Update:
An agent must predict the other agent's actions by anticipating its updated beliefs over time. Therefore, the belief update consists of:

Updating the distribution over physical states: transition function and observation function of agent i

Updating the distribution over dynamic models: the other agent's belief update and its observation function

Value Function:

Page 31: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

Solving I-POMDP (Related Work)

Previous work: value iteration algorithms

I-PF (Doshi & Gmytrasiewicz, 2009): nested particle filter, a sampled recursive representation of the agent's nested belief

I-PBVI (Doshi & Perez, 2008): point-based domination check

Iteratively apply the backup operator: expensive operator

Over multiple time steps: curse of history and curse of dimensionality

[Figure: the agent maintains a belief b = Δ(ISi,l) over the interactive states built on the physical states S, with transitions s, a/T(s, a, s') and observations s'/O(s', a, z), R(s, a)]

Page 32: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

I-POMDP Generalization: Nested Controllers

Embed recursive reasoning
Starting from level 0 upwards, for each level l, construct a finite state controller for each frame of each agent

For convenience of representation, assume two agents and one frame per agent at each level

[Figure: nested controllers for levels L0, L1, …, Ll]

Page 33: Generalized and Bounded Policy Iteration for Finitely Nested Interactive POMDPs: Scaling Up

I-POMDP Generalization: Nested Controllers

Embed recursive reasoning
Starting from level 0 upwards, for each level l, construct a finite state controller for each frame of each agent

For convenience of representation, assume two agents and one frame per agent at each level