Logistics

(c) 2002-3, C. Boutilier, E. Hansen, D. Weld

Logistics

Reading for Mon
No class Wed 11/26


Active research area

Classical AI planning: no uncertainty; achieve goals; heuristic search; knowledge-based representation

Operations Research: uncertainty; maximize utility; dynamic programming; Markov decision process


Review

MDPs
Bayesian Networks
DBNs
Factored MDPs
BDDs & ADDs


Markov Decision Processes

S = set of states (|S| = n)

A = set of actions (|A| = m)

Pr = transition function Pr(s,a,s’)
• represented by a set of m n x n stochastic matrices (one per action)
• each row of a matrix defines a distribution over S

R(s) = bounded, real-valued reward function
• represented by an n-vector
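As a minimal sketch of this explicit representation, the m transition matrices and the reward n-vector can be stored as nested lists and checked for row-stochasticity. The class name `ExplicitMDP`, its fields, and the two-state numbers are illustrative assumptions, not part of the lecture.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExplicitMDP:
    """Flat MDP: P[a][s][t] = Pr(s, a, t), R[s] = reward of state s."""
    P: List[List[List[float]]]  # m matrices, each n x n
    R: List[float]              # n-vector

    def is_valid(self, tol: float = 1e-9) -> bool:
        n = len(self.R)
        for matrix in self.P:
            if len(matrix) != n:
                return False
            for row in matrix:
                # Each row must be a probability distribution over S.
                if len(row) != n or any(p < 0 for p in row):
                    return False
                if abs(sum(row) - 1.0) > tol:
                    return False
        return True


# Hypothetical two-state, two-action example.
mdp = ExplicitMDP(
    P=[[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
       [[0.1, 0.9], [0.0, 1.0]]],  # action 1: try to move to state 1
    R=[0.0, 1.0],
)
print(mdp.is_valid())
```

Storage grows as m x n x n entries, which is the cost the factored representations later in the lecture try to avoid.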


Planning

Plan?
• Objective?

Policy?
• Objective?


Dynamic programming (DP)

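Value iteration [Bellman, 1957]

$$f(i) := \max_{a \in A} \Big[ C_i(a) + \sum_{j \in S} P_{ij}(a)\, f(j) \Big]$$

Initial value function → DP improves value function → ε-optimal value function

Policy iteration [Howard, 1960]

$$\pi(i) := \arg\max_{a \in A} \Big[ C_i(a) + \sum_{j \in S} P_{ij}(a)\, f(j) \Big]$$

Initial policy → Evaluate policy → DP improves policy → ε-optimal policy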
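The value iteration loop can be sketched as follows. This is a hedged illustration, not the lecture's code: it assumes a state-based reward R(s) as in the MDP definition, a discount factor `gamma` (not shown on the slide), and a hypothetical two-state problem.

```python
def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """P[a][s][t] = Pr(s, a, t); R[s] = reward. Returns (V, greedy policy)."""
    n = len(R)
    V = [0.0] * n
    while True:
        # One Bellman backup for every state and action.
        Q = [[R[s] + gamma * sum(P[a][s][t] * V[t] for t in range(n))
              for a in range(len(P))]
             for s in range(n)]
        V_new = [max(q) for q in Q]
        if max(abs(x - y) for x, y in zip(V_new, V)) < eps:
            policy = [q.index(max(q)) for q in Q]
            return V_new, policy
        V = V_new


# Hypothetical example: action 0 stays put, action 1 moves 0 -> 1 w.p. 0.9.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.1, 0.9], [0.0, 1.0]]]
R = [0.0, 1.0]
V, policy = value_iteration(P, R)
print(V, policy)
```

Every sweep touches all n states and all m actions, which is the per-iteration cost the rest of the lecture tries to reduce.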


Bellman’s Curse of Dimensionality


Bayes Nets
Compact rep’n of joint prob. distribution

[Figure: network over Earthquake, Burglary, Alarm, Radio, Nbr1Calls, Nbr2Calls]

Pr(B=t) = 0.05, Pr(B=f) = 0.95

Pr(A|E,B)   Pr(A=t)  (Pr(A=f))
e, b        0.9      (0.1)
e, -b       0.2      (0.8)
-e, b       0.85     (0.15)
-e, -b      0.01     (0.99)


DBN Representation: DelC

[Figure: two-slice DBN for action DelC over T, L, CR, RHC, RHM, M at times t and t+1]

fCR(Lt,CRt,RHCt,CRt+1)

L   CR  RHC  CR(t+1)  -CR(t+1)
O   T   T    0.2      0.8
E   T   T    1.0      0.0
O   F   T    0.0      1.0
E   F   T    0.0      1.0
O   T   F    1.0      0.0
E   T   F    1.0      0.0
O   F   F    0.0      1.0
E   F   F    0.0      1.0

fT(Tt,Tt+1)

T   T(t+1)  -T(t+1)
T   0.91    0.09
F   0.0     1.0

fRHM(RHMt,RHMt+1)

RHM  RHM(t+1)  -RHM(t+1)
T    1.0       0.0
F    0.0       1.0
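A minimal sketch of how the DBN factors combine: the probability of a joint outcome is the product of the per-variable CPT entries. The dictionaries below copy the first numeric column of each DelC table, read as Pr(variable is true at t+1), which is an assumption about the column layout; the function names are illustrative.

```python
# Pr(CR(t+1) = True | L, CR, RHC) for action DelC.
F_CR = {
    ("O", True, True): 0.2, ("E", True, True): 1.0,
    ("O", False, True): 0.0, ("E", False, True): 0.0,
    ("O", True, False): 1.0, ("E", True, False): 1.0,
    ("O", False, False): 0.0, ("E", False, False): 0.0,
}
F_T = {True: 0.91, False: 0.0}    # Pr(T(t+1) = True | T)
F_RHM = {True: 1.0, False: 0.0}   # Pr(RHM(t+1) = True | RHM)


def bernoulli(p_true, value):
    """Probability of `value` for a binary variable with Pr(True) = p_true."""
    return p_true if value else 1.0 - p_true


def transition_prob(state, next_state):
    """Pr(CR', T', RHM' | state) as a product of the three local factors."""
    p_cr = bernoulli(F_CR[(state["L"], state["CR"], state["RHC"])], next_state["CR"])
    p_t = bernoulli(F_T[state["T"]], next_state["T"])
    p_rhm = bernoulli(F_RHM[state["RHM"]], next_state["RHM"])
    return p_cr * p_t * p_rhm


s = {"L": "O", "CR": True, "RHC": True, "T": True, "RHM": False}
s_next = {"CR": False, "T": True, "RHM": False}
print(transition_prob(s, s_next))
```

No row of the full state-by-state matrix is ever built; each lookup touches only the parents of one variable.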


Benefits of DBN Representation

- Only 48 parameters vs. 25440 for matrix (160 states, so 160 x 159 free entries)

[Figure: explicit transition matrix for one action vs. the DBN over T, L, CR, RHC, RHM, M]

       s1    s2    ...  s160
s1     0.9   0.05  ...  0.0
s2     0.0   0.20  ...  0.1
...
s160   0.1   0.0   ...  0.0


Example: (x3 and x2) or not x1

[Figure: the function drawn as a binary decision tree over x1, x2, x3 with 0/1 leaves, and as the equivalent, smaller OBDD]
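The size difference between the two diagrams can be checked with a small sketch. It builds a reduced OBDD for the example function by Shannon expansion, dropping redundant tests and sharing identical subgraphs. The variable order x1, x2, x3 and all names are assumptions for illustration.

```python
def f(x1, x2, x3):
    return (x3 and x2) or (not x1)


def build_obdd(func, n_vars):
    """Return (root, unique_table) for the reduced OBDD of func."""
    unique = {}  # (level, low_child, high_child) -> node handle

    def build(assignment):
        level = len(assignment)
        if level == n_vars:
            return bool(func(*assignment))      # terminal 0/1
        low = build(assignment + (False,))
        high = build(assignment + (True,))
        if low == high:                          # test is redundant: skip it
            return low
        key = (level, low, high)
        if key not in unique:                    # share isomorphic subgraphs
            unique[key] = ("node", len(unique))
        return unique[key]

    return build(()), unique


root, nodes = build_obdd(f, 3)
tree_nodes = 2 ** 3 - 1   # interior nodes of the full binary decision tree
print(len(nodes), "OBDD nodes vs", tree_nodes, "tree nodes")
```

This construction still enumerates all assignments, so it only illustrates the reduced result; practical BDD packages build diagrams by composing smaller ones.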


Action Representation – DBN/ADD

Algebraic Decision Diagram (ADD) for fCR(Lt,CRt,RHCt,CRt+1)

[Figure: the DelC DBN next to an ADD that tests CR, RHC and L and has leaves 0.0, 1.0, 0.8, 0.2 for CR(t+1)]


Today – Solving the curse

Abstraction
Approximation
Reachability


Structured Computation

Given compact representation, can we solve MDP without explicit state space enumeration?

Can we avoid O(|S|)-computations by exploiting regularities made explicit by propositional or first-order representations?

Two general schemes:

• abstraction/aggregation

• decomposition


State Space Abstraction

General method: state aggregation

• group states, treat aggregate as single state

• commonly used in OR [SchPutKin85, BertCast89]

• viewed as automata minimization [DeanGivan96]

Abstraction is a specific aggregation technique

• aggregate by ignoring details (features)

• ideally, focus on relevant features


Dimensions of Abstraction

[Figure: partitions of a state space over variables A, B, C, with a value attached to each block]

• Uniform vs. nonuniform
• Exact vs. approximate
• Adaptive vs. fixed


A Fixed, Uniform Approximate Abstraction Method

Uniformly delete features from domain [BD94/AIJ97]

Ignore features based on degree of relevance
• rep’n used to determine importance to sol’n quality

Allows tradeoff between abstract MDP size and solution quality

[Figure: abstract states over A, B, C with transition probabilities 0.8/0.2 and 0.5/0.5]


Immediately Relevant Variables

Rewards determined by particular variables
• impact on reward clear from STRIPS/ADD rep’n of R
• e.g., difference between CR/-CR states is 10, while difference between T/-T states is 3, MW/-MW is 5

Approximate MDP: focus on “important” goals
• e.g., we might only plan for CR

• we call CR an immediately relevant variable (IR)

• generally, IR-set is a subset of reward variables


Relevant Variables

We want to control the IR variables
• must know which actions influence these and under what conditions

A variable is relevant if it is a parent, in the DBN for some action a, of some relevant variable
• recursive definition, grounded (fixed pt) by making IR vars relevant
• analogous def’n for PSTRIPS
• e.g., CR (directly/indirectly) influenced by L, RHC, CR

Simple “backchaining” algorithm to construct set
• linear in domain descr. size, number of relevant vars
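The backchaining step can be sketched as a fixed-point computation over DBN parent sets. The function name and the `parents` dictionary layout are assumptions; the parent sets for DelC are read off the factor signatures shown in the DBN slides.

```python
def relevant_variables(ir_vars, parents):
    """parents[action][var] = set of time-t parents of var(t+1) under action."""
    relevant = set(ir_vars)
    frontier = list(ir_vars)
    while frontier:
        var = frontier.pop()
        for action_parents in parents.values():
            for p in action_parents.get(var, ()):
                if p not in relevant:      # newly discovered relevant variable
                    relevant.add(p)
                    frontier.append(p)
    return relevant


# Parent sets for the DelC action (only one action shown here).
parents = {
    "DelC": {
        "CR": {"L", "CR", "RHC"},
        "T": {"T"}, "L": {"L"}, "RHC": {"RHC"},
        "RHM": {"RHM"}, "M": {"M"},
    }
}
print(sorted(relevant_variables({"CR"}, parents)))
```

Each variable enters the frontier at most once, so the work is bounded by the size of the action descriptions.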


Constructing an Abstract MDP

Simply delete all irrelevant atoms from domain
• state space S’: set of assts to relevant vars
• transitions: let $\Pr(s',a,t') = \sum_{t \in t'} \Pr(s,a,t)$ for any $s \in s'$
  construction ensures this is identical for all $s \in s'$
• reward: $R(s') = \big(\max\{R(s) : s \in s'\} + \min\{R(s) : s \in s'\}\big) / 2$
  midpoint gives tight error bounds

Construction of DBN/PSTRIPS with these properties involves little more than simplifying action descriptions by deletion
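A small sketch of the midpoint reward, under the assumption that the concrete reward is an additive penalty of 10 for CR, 3 for -T and 5 for MW (the differences quoted earlier in the lecture). The function names are illustrative.

```python
from itertools import product


def reward(cr, t, mw):
    """Assumed additive reward: penalties for CR, for not T, and for MW."""
    return -(10 * cr + 3 * (not t) + 5 * mw)


def abstract_reward(cr):
    """Midpoint of the concrete rewards that collapse into abstract state cr."""
    values = [reward(cr, t, mw) for t, mw in product([True, False], repeat=2)]
    return (max(values) + min(values)) / 2


print(abstract_reward(True), abstract_reward(False))
```

With the midpoint, no concrete state in an abstract state is more than half the reward span away from the abstract reward.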


Example

Abstract MDP
• only 3 variables
• 20 states instead of 160
• some actions become identical, so action space is simplified
• reward distinguishes only CR and –CR (but “averages” penalties for MW and –T)

[Figure: DBN for the DelC action restricted to L, CR, RHC]

Reward

Aspect   Condt’n   Rew
Coffee   CR        -14
         -CR       -4


Solving Abstract MDP

Abstract MDP can be solved using std methods

Error bounds on policy quality derivable
• Let δ be max reward span over abstract states
• Let V’ be optimal VF for M’, V* for original M
• Let π’ be optimal policy for M’ and π* for original M

$$|V^*(s) - V'(s')| \le \frac{\delta}{2(1-\beta)} \quad \text{for any } s \in s'$$

$$|V^*(s) - V_{\pi'}(s)| \le \frac{\delta}{1-\beta} \quad \text{for any } s \in s'$$

where β is the discount factor


FUA Abstraction: Relative Merits

FUA easily computed (fixed polynomial cost)

FUA prioritizes objectives nicely
• a priori error bounds computable (anytime tradeoffs)
• can refine online (heuristic search) [DeaBou97]

FUA is inflexible
• can’t capture conditional relevance
• approximate (may want exact solution)
• can’t be adjusted during computation
• may ignore the only achievable objectives





Constructing Abstract MDPs

Many ways to abstract an MDP
• methods will exploit the logical representation

Abstraction can be viewed as a form of automaton minimization
• general minimization schemes require state space enumeration
• instead, exploit the logical structure of the domain (state, actions, rewards) to construct logical descriptions of abstract states, avoiding state enumeration


Decision-Theoretic Regression

Abstraction based on analog of regression
• as abstraction: dynamic, nonuniform, exact/approx.
• exploits logical representation of MDP

Overview
• value iteration as variable elimination
• propositional decision-theoretic regression
• approximate decision-theoretic regression
• first-order decision-theoretic regression


Classical Regression

Goal regression is a classical abstraction method
• Regression of a logical condition/formula G through action a is a weakest logical formula C = Regr(G,a) such that: G is guaranteed to be true after doing a if C is true before doing a
• Weakest precondition for G wrt a

[Figure: do(a) maps the region C into the region G]
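For intuition, here is a minimal sketch of goal regression for deterministic STRIPS-style actions with precondition, add and delete sets. The action language is an assumption (the slide does not fix one), goals are restricted to conjunctions of positive atoms, and the `deliver` action is hypothetical.

```python
def regress(goal, action):
    """Weakest precondition of a conjunctive goal through a STRIPS action.

    goal: frozenset of atoms that must hold after the action.
    Returns a frozenset of atoms, or None if the action destroys the goal.
    """
    if goal & action["del"]:
        return None                      # some goal atom is deleted by the action
    # Atoms the action adds need not hold before; its preconditions must.
    return (goal - action["add"]) | action["pre"]


# Hypothetical action, for illustration only.
deliver = {
    "pre": frozenset({"robot_has_coffee", "at_office"}),
    "add": frozenset({"user_has_coffee"}),
    "del": frozenset({"robot_has_coffee"}),
}
goal = frozenset({"user_has_coffee", "at_office"})
print(sorted(regress(goal, deliver)))
```

The result describes a whole region of states by a formula, which is the sense in which regression acts as an abstraction.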


Example: Regression in SitCalc

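For the situation calculus
• Regr(G(do(a,s))): logical condition C(s) under which a leads to G (aggregates C states and ~C states)

Regression in sitcalc is straightforward
• $Regr(F(x, do(a,s))) \equiv \Phi_F(x,a,s)$
• $Regr(\neg\phi_1) \equiv \neg Regr(\phi_1)$
• $Regr(\phi_1 \wedge \phi_2) \equiv Regr(\phi_1) \wedge Regr(\phi_2)$
• $Regr(\exists x.\phi_1) \equiv \exists x.\, Regr(\phi_1)$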



In MDPs, we don’t have goals, but regions of distinct value

Decision-theoretic analog: given “logical description” of Vt+1, produce such a description of Vt or optimal policy (e.g., using ADDs)

Cluster together states at any point in calculation with same best action (policy), or with same value (VF)



Decision-theoretic complications:
• multiple formulae G describe fixed value partitions
• a can lead to multiple partitions (stochastically)

[Figure: from region C1 of Qt+1(a), action a reaches value partitions G1, G2, G3 of Vt with probabilities p1, p2, p3]


Functional View of DTR

Generally, Vt+1 depends on only a subset of variables @ t (usually in a structured way)

What is value of action a at time t (at any s)?

[Figure: Vt+1 as a tree over CR and M with leaves -10 and 0, next to the two-slice DBN over T, L, CR, RHC, RHM, M with the factors below]

fRm(Rmt,Rmt+1)
fM(Mt,Mt+1)
fT(Tt,Tt+1)
fL(Lt,Lt+1)
fCr(Lt,Crt,Rct,Crt+1)
fRc(Rct,Rct+1)


Bellman Backup (Regression)

[Figure: a value tree over JC, CP, CC, JP, BC with leaves 10, 0, 12, 9, shown with the fCR ADD (tests on CR, RHC, L) and the Vt+1 tree over CR and M with leaves -10 and 0]


A Simple Action/Reward Example

[Figure: network rep’n for action A (nodes X, Y, Z at t and t+1, with CPT trees over W, X, Y, Z) and reward function R (tree on Z with leaves 10 and 0)]


Example: Generation of V1

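V0 = R: 10 if Z; 0 otherwise

P(Z|a,s): 1.0 if Z; 0.9 if -Z and Y; 0.0 if -Z and -Y

Σ P(Z|a,s) V0: 10.0 if Z; 9.0 if -Z and Y; 0.0 if -Z and -Y

Discounted (the figures imply a factor of 0.9), then Maxa …: 9.0 if Z; 8.1 if -Z and Y; 0.0 if -Z and -Y

R(s) + Maxa … = V1: 19.0 if Z; 8.1 if -Z and Y; 0.0 if -Z and -Y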
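The V1 numbers can be reproduced with a short sketch. It assumes a discount factor of 0.9 (inferred from 9.0 becoming 8.1), a single action a, and the branch reading Z: 1.0, -Z and Y: 0.9, -Z and -Y: 0.0 for P(Z|a,s); the function names are illustrative.

```python
from itertools import product

GAMMA = 0.9  # assumed discount factor, inferred from the slide's numbers


def reward(z):
    return 10.0 if z else 0.0


def p_z_next(y, z):
    """Pr(Z true at the next step | Y, Z) under action a (assumed branch labels)."""
    if z:
        return 1.0
    return 0.9 if y else 0.0


def backup(v, y, z):
    """One regression step: v is a function of Z only, as V0 = R is."""
    p = p_z_next(y, z)
    expected = p * v(True) + (1.0 - p) * v(False)
    return reward(z) + GAMMA * expected


v1 = {(y, z): backup(reward, y, z) for y, z in product([True, False], repeat=2)}
print(v1)
```

Although there are four (Y, Z) states here, only three distinct values arise, matching the three leaves of the V1 tree.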


Example: Generation of V2

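[Figure: regressing V1 (a tree over Y and Z) through action a. P(Y|a,s) is a tree over X and Y with leaves Y: 0.9, Y: 0.0, Y: 1.0; P(Z|a,s) is the tree over Y and Z with leaves Z: 0.9, Z: 0.0, Z: 1.0; combining them gives a tree over X, Y, Z whose leaves hold joint distributions over Y and Z]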


Some Results: Natural Examples


Some Results: Worst-case


Some Results: Best-case


DTR: Relative Merits

Adaptive, nonuniform, exact abstraction method
• provides exact solution to MDP
• much more efficient on certain problems (time/space)
• 400 million state problems (ADDs) in a couple hrs

Some drawbacks
• produces piecewise constant VF
• some problems admit no compact solution representation (though ADD overhead “minimal”)
• approximation may be desirable or necessary


Criticisms of SPUDD


Future Work





Approximate DTR

Easy to approximate solution using DTR

Simple pruning of value function

• Can prune trees [BouDearden96] or ADDs [StaubinHoeyBou00]

Gives regions of approximately same value


A Pruned Value ADD

[Figure: a value ADD over Loc, HCR, HCU, W, R, U before and after pruning; leaves are merged into ranges]
• 9.00, 10.00 → [9.00, 10.00]
• 7.45, 8.36, 8.45 → [7.45, 8.45]
• 6.64, 6.81, 7.64 → [6.64, 7.64]
• 5.19, 5.62, 6.19 → [5.19, 6.19]


Approximate Structured VI

Run normal SVI using ADDs/DTs
• at each leaf, record range of values

At each stage, prune interior nodes whose leaves all have values within some threshold
• tolerance can be chosen to minimize error or size
• tolerance can be adjusted to magnitude of VF

Convergence requires some care
• if the max span over leaves and the termination tolerance are both bounded, the distance between the computed VF and V* is bounded as well

[Equation: bound on the distance between V and V*]
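The pruning step can be sketched on a plain decision tree (not a real ADD package). Interior nodes are `(variable, false_branch, true_branch)` tuples, leaves are numbers, and a pruned subtree becomes a `(low, high)` range leaf. The tree shape below is hypothetical; only the leaf values are taken from the pruned value ADD slide.

```python
def prune(node, tol):
    """Return (pruned_node, min_value, max_value) for the subtree at node."""
    if isinstance(node, (int, float)):
        return node, node, node
    var, false_branch, true_branch = node
    f_node, f_min, f_max = prune(false_branch, tol)
    t_node, t_min, t_max = prune(true_branch, tol)
    lo, hi = min(f_min, t_min), max(f_max, t_max)
    if hi - lo <= tol + 1e-9:        # all leaves below agree to within tol
        return (lo, hi), lo, hi      # collapse the subtree into a range leaf
    return (var, f_node, t_node), lo, hi


# Hypothetical tree shape using leaf values from the slide.
low = ("R", 5.19, ("U", 5.62, 6.19))
mid = ("R", 6.64, ("U", 6.81, 7.64))
tree = ("HCU", ("HCR", low, mid), ("W", 9.00, 10.00))

pruned, v_min, v_max = prune(tree, tol=1.0)
print(pruned)
```

Raising `tol` collapses more of the tree, trading value-function accuracy for size.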


Approximate DTR: Relative Merits

Relative merits of ADTR
• fewer regions implies faster computation
• can provide leverage for optimal computation
• 30-40 billion state problems in a couple hours
• allows fine-grained control of time vs. solution quality with dynamic (a posteriori) error bounds
• technical challenges: variable ordering, convergence, fixed vs. adaptive tolerance, etc.

Some drawbacks
• (still) produces piecewise constant VF
• doesn’t exploit additive structure of VF at all


Reachability


DP vs. heuristic search

Given a start state, heuristic search can find an optimal solution without evaluating all states.

[Figure: nested regions around the start state]
• Solution graph: all states reachable by optimal solution
• Explicit graph: states evaluated during search
• Implicit graph: all states

Each iteration, DP improves solution for each state
DP solves problem for all possible starting states.


Solution structures

[Figure: solution path, acyclic solution graph, cyclic solution graph]

DP vs. heuristic search

Solution structure   Dynamic programming            Heuristic search
Simple path          Forward DP (Dijkstra’s alg.)   A*
Acyclic graph        Backwards induction            AO*
Cyclic graph         Policy (Value) iteration       LAO*

Heuristic search = dynamic programming + starting state + forward expansion of solution + admissible heuristic


AO* [Nilsson 1971; Martelli & Montanari 1973]

Initialize partial solution graph to start state

Repeat until a complete solution is found:
• Expand some nonterminal state on the fringe of the best partial solution graph
• Use backwards induction to update the costs of all ancestor states of the expanded state and possibly change their selected action.

LAO* [Hansen & Zilberstein AAAI-98, AIJ]

Same as AO*, except allow solution to contain loops, and use value iteration (or policy iteration) instead of backwards induction


Heuristic evaluation function

h(i) is a heuristic estimate of the minimal-cost solution for every non-terminal tip state.

h(i) is admissible if h(i) ≤ f*(i).

An admissible heuristic estimate f(i) for any state in the explicit graph is defined as follows:

$$f(i) = \begin{cases} 0 & \text{if } i \text{ is a goal state} \\ h(i) & \text{if } i \text{ is a non-terminal tip state} \\ \min_{a \in A_i} \Big[ c_i(a) + \sum_{j \in S} p(i,a,j)\, f(j) \Big] & \text{else} \end{cases}$$

Underestimate cost (overestimate reward)


Example: Path-finding

[Figure: grid of cells numbered 1 to 8; cell 1 is the start and cell 4 is the goal]

Actions: Move one cell to the North, East, South, or West
Each action succeeds with probability 0.5 if there is a cell in the intended direction of movement, and otherwise fails
Each action has a cost of one

Heuristic?


Start of search

[Figure: grid as before; the explicit graph contains only state 1 (Start), with f = 3.0]


After first node expansion

[Figure: explicit graph after expanding state 1 (actions N, E, S): f(1) = 4.0, f(2) = 2.0, f(8) = 4.0]


After second node expansion

[Figure: explicit graph after expanding state 2 (actions E, N, S): f(1) = 5.0, f(2) = 3.0, f(3) = 1.0, f(8) = 4.0]

After third node expansion

[Figure: explicit graph after expanding state 3 (actions E, N, S): f(1) = 6.0, f(2) = 4.0, f(3) = 2.0, f(4) = 0.0, f(8) = 4.0]
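The f values in the three expansion steps can be reproduced with a short sketch of the expand-then-update loop: after each expansion, value iteration is run over the expanded states while tip states keep their heuristic value. The successor table is an assumption covering only the moves that matter in the figures (1 to 2 and 8, 2 to 3, 3 to 4), and the heuristic values are the ones shown at the tips.

```python
COST = 1.0        # every action costs one
P_SUCCESS = 0.5   # a move succeeds with probability 0.5, else the agent stays put

# Heuristic values shown at the tip states in the figures.
H = {1: 3.0, 2: 2.0, 3: 1.0, 4: 0.0, 8: 4.0}

# Assumed successor cells, only for the moves that matter in the figures.
MOVES = {1: [2, 8], 2: [3], 3: [4]}


def evaluate(expanded, sweeps=200):
    """Value iteration over the expanded states; tip states keep their h value."""
    f = dict(H)
    for _ in range(sweeps):
        for i in expanded:
            f[i] = min(COST + P_SUCCESS * f[j] + (1 - P_SUCCESS) * f[i]
                       for j in MOVES[i])
    return f


# f(Start) after the first, second and third node expansion.
history = [round(evaluate(exp)[1], 6) for exp in ([1], [1, 2], [1, 2, 3])]
print(history)
```

Because a failed move leaves the agent in place, each step toward the goal costs 2 in expectation, which is why f(1) grows by one per expansion as the tip heuristic is replaced by backed-up values.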


Theoretical Properties

Theorem 1: Using an admissible heuristic, LAO* converges to an optimal solution without (necessarily) expanding/evaluating all states.

Theorem 2: If h2(i) is a more informative heuristic than h1(i) (i.e., h1(i) ≤ h2(i) ≤ f*(i)), LAO* using h2(i) expands a subset of the worst case set of states expanded using h1(i).


Today’s paper

LAO* search over what?


Results

States reachable by opt

States explored during search


Criticisms?


Future work?


Off-line vs. on-line search

                      Deterministic state transitions   Stochastic state transitions (MDPs)
Off-line              A*                                LAO*
On-line (real-time)   LRTA* (Korf)                      RTDP (Barto et al.)
