Logistics

1 (c) 2002-3, C. Boutilier, E. Hansen, D. Weld Logistics Reading for Mon No class Wed 11/26

Page 1: Logistics

Logistics

Reading for Mon

No class Wed 11/26

Page 2: Logistics

Classical AI planning: no uncertainty; achieve goals; heuristic search; knowledge-based representation

Operations Research: uncertainty; maximize utility; dynamic programming; Markov decision process

Active research area

Page 3: Logistics

Review

MDPs

Bayesian Networks

DBNs

Factored MDPs

BDDs & ADDs

Page 4: Logistics

Markov Decision Processes

S = set of states (|S| = n)

A = set of actions (|A| = m)

Pr = transition function Pr(s,a,s’)

• represented by a set of m n x n stochastic matrices

• each row of a matrix is a distribution over S

R(s) = bounded, real-valued reward function

• represented by an n-vector
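A minimal sketch of this flat representation may help fix the notation. The class name `MDP`, its field names, and the two-state toy numbers are assumptions made for illustration only; they are not from the slides.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    """Flat MDP: m stochastic n x n matrices plus an n-vector of rewards."""
    n_states: int
    n_actions: int
    P: list  # P[a][s][t] = Pr(s, a, t)
    R: list  # R[s] = reward in state s

    def validate(self, tol=1e-9):
        # One n x n matrix per action.
        assert len(self.P) == self.n_actions
        for matrix in self.P:
            assert len(matrix) == self.n_states
            for row in matrix:
                # Each row must be a probability distribution over S.
                assert len(row) == self.n_states
                assert all(p >= 0.0 for p in row)
                assert abs(sum(row) - 1.0) <= tol
        assert len(self.R) == self.n_states
        return True


# Hypothetical toy instance: 2 states, 2 actions.
toy = MDP(
    n_states=2,
    n_actions=2,
    P=[[[0.9, 0.1], [0.0, 1.0]],
       [[0.5, 0.5], [0.5, 0.5]]],
    R=[0.0, 1.0],
)
```

Storage grows as m·n² for the transitions, which is the cost the later slides try to avoid.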

Page 5: Logistics

Planning

Plan?

• Objective?

Policy?

• Objective?

Page 6: Logistics

Dynamic programming (DP)

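Value iteration [Bellman, 1957]

$f(i) := \max_{a \in A} \left[ C_i(a) + \sum_{j \in S} P_{ij}(a)\, f(j) \right]$

Initial value function -> DP improves value function -> ε-optimal value function

Policy iteration [Howard, 1960]

$\pi(i) := \arg\max_{a \in A} \left[ C_i(a) + \sum_{j \in S} P_{ij}(a)\, f(j) \right]$

Initial policy -> Evaluate policy -> DP improves policy (repeat) -> ε-optimal policy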
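The value-iteration loop can be sketched as follows. This is an illustrative sketch, not the authors' code: the discount factor `beta`, the stopping threshold `eps`, and the two-state toy problem are assumptions added here (the slide's update shows no discount).

```python
def value_iteration(P, C, beta=0.9, eps=1e-6):
    """P[a][i][j]: transition probability; C[i][a]: immediate reward.

    beta (discount) and eps (stopping threshold) are assumed parameters.
    """
    n, m = len(C), len(P)

    def backup(i, a, f):
        # One-step lookahead value of doing a in state i.
        return C[i][a] + beta * sum(P[a][i][j] * f[j] for j in range(n))

    f = [0.0] * n  # initial value function
    while True:
        new_f = [max(backup(i, a, f) for a in range(m)) for i in range(n)]
        residual = max(abs(x - y) for x, y in zip(new_f, f))
        f = new_f
        if residual < eps:
            break
    # Greedy policy with respect to the final value function.
    policy = [max(range(m), key=lambda a: backup(i, a, f)) for i in range(n)]
    return f, policy


# Hypothetical toy problem: action 0 stays put, action 1 switches state;
# being in state 1 pays reward 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
C = [[0.0, 0.0], [1.0, 1.0]]
values, policy = value_iteration(P, C)
```

In this toy problem the values approach 9 and 10: state 1 stays put and collects 1/(1 - 0.9), and state 0 switches into it.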

Page 7: Logistics

Bellman’s Curse of Dimensionality

Page 8: Logistics

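Bayes Nets

Compact rep’n of joint prob. distribution

[Figure: Bayes net with nodes Earthquake, Burglary, Alarm, Radio, Nbr1Calls, Nbr2Calls]

Pr(B=t) = 0.05, Pr(B=f) = 0.95

Pr(A|E,B), with Pr(not A|E,B) in parentheses

e, b: 0.9 (0.1)

e, not b: 0.2 (0.8)

not e, b: 0.85 (0.15)

not e, not b: 0.01 (0.99)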

Page 9: Logistics

DBN Representation: DelC

[Figure: two-slice DBN for DelC over the variables T, L, CR, RHC, RHM, M at times t and t+1]

fCR(Lt,CRt,RHCt,CRt+1)

L   CR  RHC  CR(t+1)=T  CR(t+1)=F
O   T   T    0.2        0.8
E   T   T    1.0        0.0
O   F   T    0.0        1.0
E   F   T    0.0        1.0
O   T   F    1.0        0.0
E   T   F    1.0        0.0
O   F   F    0.0        1.0
E   F   F    0.0        1.0

fT(Tt,Tt+1)

T   T(t+1)=T  T(t+1)=F
T   0.91      0.09
F   0.0       1.0

fRHM(RHMt,RHMt+1)

RHM  RHM(t+1)=T  RHM(t+1)=F
T    1.0         0.0
F    0.0         1.0

Page 10: Logistics

Benefits of DBN Representation

- Only 48 parameters vs. 25440 for matrix

[Figure: explicit 160 x 160 transition matrix over states s1 ... s160 (sample rows: s1: 0.9 0.05 ... 0.0; s2: 0.0 0.20 ... 0.1; s160: 0.1 0.0 ... 0.0) shown beside the two-slice DBN over T, L, CR, RHC, RHM, M]
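The matrix figure quoted above can be checked with a little arithmetic. The sketch below assumes that L takes five values and the other five variables are Boolean; that domain size is inferred from the state counts (160 here, 20 for the three-variable abstraction later in the deck) and is not stated on the slide. It does not attempt to re-derive the 48 DBN parameters.

```python
from math import prod

# Assumed domain sizes: L has 5 locations, the rest are Boolean.
domain_sizes = {"T": 2, "L": 5, "CR": 2, "RHC": 2, "RHM": 2, "M": 2}

n_states = prod(domain_sizes.values())

# Each row of a stochastic matrix sums to 1, so it has n - 1 free entries.
matrix_params = n_states * (n_states - 1)

# State count after keeping only L, CR and RHC.
abstract_states = prod(domain_sizes[v] for v in ("L", "CR", "RHC"))
```

Under these assumptions the explicit matrix for one action needs 160 x 159 free parameters, matching the count on the slide.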

Page 11: Logistics

Example: (x3 and x2) or not x1

[Figure: binary decision tree over x3, x2, x1 with eight 0/1 leaves, next to the equivalent OBDD over x1, x2, x3 with terminals 0 and 1]
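To make the size difference concrete, here is a small sketch that builds a reduced OBDD for this formula by Shannon expansion. The function names and the node encoding are assumptions for illustration; the construction enumerates all assignments, so it only suits tiny examples and is not how BDD packages work internally.

```python
def build_robdd(func, order):
    """Build a reduced OBDD for func over the given variable order."""
    unique = {}  # (var, low, high) -> node id; ids 0 and 1 are the terminals

    def make_node(var, low, high):
        if low == high:          # redundant test: skip the node entirely
            return low
        key = (var, low, high)
        if key not in unique:    # share isomorphic subgraphs
            unique[key] = len(unique) + 2
        return unique[key]

    def expand(depth, env):
        if depth == len(order):
            return int(bool(func(**env)))
        var = order[depth]
        low = expand(depth + 1, {**env, var: False})
        high = expand(depth + 1, {**env, var: True})
        return make_node(var, low, high)

    return expand(0, {}), unique


def formula(x1, x2, x3):
    return (x3 and x2) or (not x1)


root, nodes = build_robdd(formula, ["x1", "x2", "x3"])
```

With the order x1, x2, x3 the diagram needs one internal node per variable, versus seven internal nodes in the full binary decision tree.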

Page 12: Logistics

Action Representation – DBN/ADD

[Figure: two-slice DBN over T, L, CR, RHC, RHM, M, with the CPT fCR(Lt,CRt,RHCt,CRt+1) drawn as an Algebraic Decision Diagram (ADD): internal nodes CR, RHC, L; leaves 0.0, 1.0, 0.8, 0.2 giving the distribution over CR(t+1)]

Page 13: Logistics

Today – Solving the curse

Abstraction

Approximation

Reachability

Page 14: Logistics

Structured Computation

Given compact representation, can we solve MDP without explicit state space enumeration?

Can we avoid O(|S|)-computations by exploiting regularities made explicit by propositional or first-order representations?

Two general schemes:

• abstraction/aggregation

• decomposition

Page 15: Logistics

State Space Abstraction

General method: state aggregation

• group states, treat aggregate as single state

• commonly used in OR [SchPutKin85, BertCast89]

• viewed as automata minimization [DeanGivan96]

Abstraction is a specific aggregation technique

• aggregate by ignoring details (features)

• ideally, focus on relevant features

Page 16: Logistics

Dimensions of Abstraction

[Figure: partitions of a state space over features A, B, C illustrating three dimensions of abstraction: Uniform vs. Nonuniform, Exact vs. Approximate, Adaptive vs. Fixed. Exact partitions carry values 5.3, 2.9, 9.3; approximate ones carry values such as 5.2, 5.5, 2.7, 9.0]

Page 17: Logistics

A Fixed, Uniform Approximate Abstraction Method

Uniformly delete features from domain [BD94/AIJ97]

Ignore features based on degree of relevance

• rep’n used to determine importance to sol’n quality

Allows tradeoff between abstract MDP size and solution quality

[Figure: states over features A, B, C aggregated by deleting features, with transition probabilities 0.8/0.2 and 0.5/0.5]

Page 18: Logistics

Immediately Relevant Variables

Rewards determined by particular variables

• impact on reward clear from STRIPS/ADD rep’n of R

• e.g., difference between CR/-CR states is 10, while difference between T/-T states is 3, MW/-MW is 5

Approximate MDP: focus on “important” goals

• e.g., we might only plan for CR

• we call CR an immediately relevant variable (IR)

• generally, IR-set is a subset of reward variables

Page 19: Logistics

Relevant Variables

We want to control the IR variables

• must know which actions influence these and under what conditions

A variable is relevant if it is a parent, in the DBN for some action a, of some relevant variable

• grounded (fixed-point) definition obtained by making IR vars relevant

• analogous def’n for PSTRIPS

• e.g., CR (directly/indirectly) influenced by L, RHC, CR

Simple “backchaining” algorithm to construct set

• linear in domain descr. size, number of relevant vars
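The backchaining construction can be sketched as a fixed-point computation over DBN parent sets. The function name and the dictionary layout are hypothetical; the parent sets for DelC follow the CPT signatures shown elsewhere in the deck (for example fCR depends on L, CR and RHC), and treating every other variable as depending only on itself is an assumption.

```python
def relevant_variables(ir_vars, parents_by_action):
    """Close the IR set under 'is a parent of a relevant variable'.

    parents_by_action maps action name -> {variable: parents at time t}.
    """
    relevant = set(ir_vars)
    frontier = list(ir_vars)
    while frontier:
        var = frontier.pop()
        for parents in parents_by_action.values():
            for parent in parents.get(var, ()):
                if parent not in relevant:
                    relevant.add(parent)
                    frontier.append(parent)
    return relevant


# Parent sets for the DelC action (assumed from the CPT signatures).
delc_parents = {
    "CR": ["L", "CR", "RHC"],
    "T": ["T"],
    "L": ["L"],
    "RHC": ["RHC"],
    "RHM": ["RHM"],
    "M": ["M"],
}
result = relevant_variables({"CR"}, {"DelC": delc_parents})
```

Each variable enters the frontier at most once, which is where the linear running time comes from.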

Page 20: Logistics

Constructing an Abstract MDP

Simply delete all irrelevant atoms from domain

• state space S’: set of assignments to relevant vars

• transitions: let $\Pr(s', a, t') = \sum_{t \in t'} \Pr(s, a, t)$ for any $s \in s'$; the construction ensures this sum is identical for all $s \in s'$

• reward: $R(s') = \frac{\max\{R(s) : s \in s'\} + \min\{R(s) : s \in s'\}}{2}$; the midpoint gives tight error bounds

Construction of DBN/PSTRIPS with these properties involves little more than simplifying action descriptions by deletion
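As a worked instance of the midpoint reward, the sketch below assumes an additive reward with penalties of 10 for CR, 3 for -T and 5 for MW (the differences quoted earlier in the deck), and abstracts away everything except CR. The additive form and the function names are assumptions for illustration.

```python
from itertools import product


def reward(cr, t, mw):
    # Assumed additive reward: penalty 10 for an outstanding coffee request,
    # 3 when T is false, 5 when MW is true.
    return -(10 * cr + 3 * (not t) + 5 * mw)


def abstract_rewards():
    """Midpoint reward and reward span for each abstract state (CR, -CR)."""
    out = {}
    for cr in (True, False):
        values = [reward(cr, t, mw)
                  for t, mw in product((True, False), repeat=2)]
        midpoint = (max(values) + min(values)) / 2
        span = max(values) - min(values)
        out[cr] = (midpoint, span)
    return out


rewards = abstract_rewards()
```

Under these assumptions the abstract rewards come out as -14 for CR and -4 for -CR, with a reward span of 8 in both abstract states.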

Page 21: Logistics

Example

Abstract MDP

• only 3 variables

• 20 states instead of 160

• some actions become identical, so action space is simplified

• reward distinguishes only CR and –CR (but “averages” penalties for MW and –T)

[Figure: DelC action DBN restricted to L, CR, RHC at times t and t+1]

Reward

Aspect   Condt’n   Rew
Coffee   CR        -14
         -CR       -4

Page 22: Logistics

Solving Abstract MDP

Abstract MDP can be solved using std methods

Error bounds on policy quality derivable

• Let δ be max reward span over abstract states

• Let V’ be optimal VF for M’, V* for original M

• Let π’ be optimal policy for M’ and π* for original M

$|V^*(s) - V'(s')| \le \frac{\delta}{2(1-\beta)}$ for any $s \in s'$

$|V^*(s) - V_{\pi'}(s')| \le \frac{\delta}{1-\beta}$ for any $s \in s'$

(β denotes the discount factor)

Page 23: Logistics

FUA Abstraction: Relative Merits

FUA easily computed (fixed polynomial cost)

FUA prioritizes objectives nicely

• a priori error bounds computable (anytime tradeoffs)

• can refine online (heuristic search) [DeaBou97]

FUA is inflexible

• can’t capture conditional relevance

• approximate (may want exact solution)

• can’t be adjusted during computation

• may ignore the only achievable objectives

Page 24: Logistics

Dimensions of Abstraction

[Figure: partitions of a state space over features A, B, C illustrating three dimensions of abstraction: Uniform vs. Nonuniform, Exact vs. Approximate, Adaptive vs. Fixed. Exact partitions carry values 5.3, 2.9, 9.3; approximate ones carry values such as 5.2, 5.5, 2.7, 9.0]

Page 25: Logistics

Constructing Abstract MDPs

Many ways to abstract an MDP

• methods will exploit the logical representation

Abstraction can be viewed as a form of automaton minimization

• general minimization schemes require state space enumeration

• Instead, exploit the logical structure of the domain (state, actions, rewards) to construct logical descriptions of abstract states, avoiding state enumeration

Page 26: Logistics

Decision-Theoretic Regression

Abstraction based on analog of regression

• as abstraction: dynamic, nonuniform, exact/approx.

• exploits logical representation of MDP

Overview

• value iteration as variable elimination

• propositional decision-theoretic regression

• approximate decision-theoretic regression

• first-order decision-theoretic regression

Page 27: Logistics

Classical Regression

Goal regression: a classical abstraction method

• Regression of a logical condition/formula G through action a is a weakest logical formula C = Regr(G,a) such that: G is guaranteed to be true after doing a if C is true before doing a

• Weakest precondition for G wrt a

[Figure: the region C is mapped by do(a) into the region G]
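For intuition, regression is easy to write down for STRIPS-style actions with conjunctive goals. This is a sketch under that assumption only: the slides do not commit to STRIPS here, and the action `deliver_coffee` with its literals is a hypothetical example.

```python
def regress(goal, action):
    """Regress a conjunctive goal (set of atoms) through a STRIPS action.

    Returns the weakest conjunctive condition that guarantees the goal after
    the action, or None if the action deletes part of the goal.
    """
    pre, add, delete = action["pre"], action["add"], action["del"]
    if goal & delete:
        return None  # the action destroys a goal atom
    # Atoms the action adds need not hold beforehand; preconditions must.
    return (goal - add) | pre


# Hypothetical action, named for illustration only.
deliver_coffee = {
    "pre": frozenset({"at_office", "has_coffee"}),
    "add": frozenset({"coffee_delivered"}),
    "del": frozenset({"has_coffee"}),
}
condition = regress(frozenset({"coffee_delivered"}), deliver_coffee)
```

All states satisfying the returned condition are treated alike, which is the sense in which regression acts as an abstraction.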

Page 28: Logistics

Example: Regression in SitCalc

For the situation calculus

• Regr(G(do(a,s))): logical condition C(s) under which a leads to G (aggregates C states and ~C states)

Regression in sitcalc straightforward

• $Regr(F(x, do(a,s))) \equiv \Phi_F(x,a,s)$

• $Regr(\neg\psi_1) \equiv \neg Regr(\psi_1)$

• $Regr(\psi_1 \wedge \psi_2) \equiv Regr(\psi_1) \wedge Regr(\psi_2)$

• $Regr(\exists x.\psi_1) \equiv \exists x.Regr(\psi_1)$

Page 29: Logistics

Decision-Theoretic Regression

In MDPs, we don’t have goals, but regions of distinct value

Decision-theoretic analog: given “logical description” of Vt+1, produce such a description of Vt or optimal policy (e.g., using ADDs)

Cluster together states at any point in calculation with same best action (policy), or with same value (VF)

Page 30: Logistics

Decision-Theoretic Regression

Decision-theoretic complications:

• multiple formulae G describe fixed value partitions

• a can lead to multiple partitions (stochastically)

[Figure: under action a, region C1 of Qt+1(a) leads to regions G1, G2, G3 of Vt with probabilities p1, p2, p3]

Page 31: Logistics

Functional View of DTR

Generally, Vt+1 depends on only a subset of variables @ t (usually in a structured way)

What is value of action a at time t (at any s)?

[Figure: Vt+1 as a tree over CR and M with leaves -10 and 0, beside the two-slice DBN over T, L, CR, RHC, RHM, M with CPTs fRm(Rmt,Rmt+1), fM(Mt,Mt+1), fT(Tt,Tt+1), fL(Lt,Lt+1), fCr(Lt,Crt,Rct,Crt+1), fRc(Rct,Rct+1)]

Page 32: Logistics

Bellman Backup (Regression)

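[Figure: Bellman backup (regression) example: a value tree over JC, CP, CC, JP, BC with leaves 10, 0, 12, 9; the CR(t+1) ADD over CR, RHC, L with leaves 0.0, 1.0, 0.8, 0.2; and a value tree over CR and M with leaves -10, 0]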

Page 33: Logistics

A Simple Action/Reward Example

[Figure: Network Rep’n for Action A over X, Y, Z (two time slices), with CPT trees whose leaves are 0.9, 0.0 and 1.0; Reward Function R as a tree over Z with leaves 10 and 0]

Page 34: Logistics

Example: Generation of V1

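[Figure: generation of V1 by regressing V0 = R through action a, drawn as trees over Y and Z]

• V0 = R: tree over Z with leaves 10, 0

• P(Z|a,s): leaves Z: 0.9, Z: 0.0, Z: 1.0

• expected value of V0 under P(Z|a,s): leaves 9.0, 0.0, 10.0

• the same quantity after discounting (Max_a ...): leaves 8.1, 0.0, 9.0

• R(s) + Max_a ... = V1: leaves 8.1, 0.0, 19.0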
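The V1 numbers above can be reproduced by a flat enumeration over Y and Z. Several things in this sketch are assumptions read off the leaf values rather than stated in the transcript: a discount of 0.9, Z persisting once true, and Z becoming true with probability 0.9 when Y holds. With a single action the maximization is trivial.

```python
from itertools import product

BETA = 0.9  # assumed discount factor (9.0 becomes 8.1 in the figure)


def reward(z):
    return 10.0 if z else 0.0


def prob_z_next(y, z):
    # Assumed CPT for Z under action a: Z persists; otherwise Y makes it
    # true with probability 0.9.
    if z:
        return 1.0
    return 0.9 if y else 0.0


def backup(v_next):
    """One Bellman backup; v_next maps the next value of Z to a value."""
    values = {}
    for y, z in product((True, False), repeat=2):
        p = prob_z_next(y, z)
        expected = p * v_next(True) + (1 - p) * v_next(False)
        values[(y, z)] = reward(z) + BETA * expected
    return values


v1 = backup(reward)  # V0 = R
regions = sorted(set(round(v, 6) for v in v1.values()))
```

Four flat states collapse into three regions of equal value, which is what the tree representation captures without enumerating states.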

Page 35: Logistics

Example: Generation of V2

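[Figure: generation of V2. V1 is shown as a tree over Y and Z with leaves 8.1, 0.0, 9.0. P(Y|a,s) is a tree over X and Y with leaves Y: 0.9, Y: 0.0, Y: 1.0. The combined tree P(Y|a,s) P(Z|a,s) over X, Y, Z has leaves (Y: 0.9, Z: 0.9), (Y: 0.9, Z: 0.0), (Y: 0.9, Z: 1.0), (Y: 1.0), (Y: 0.0, Z: 0.0), (Y: 0.0, Z: 1.0)]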

Page 36: Logistics

Some Results: Natural Examples

Page 37: Logistics

Some Results: Worst-case

Page 38: Logistics

Some Results: Best-case

Page 39: Logistics

DTR: Relative Merits

Adaptive, nonuniform, exact abstraction method

• provides exact solution to MDP

• much more efficient on certain problems (time/space)

• 400 million state problems (ADDs) in a couple hrs

Some drawbacks

• produces piecewise constant VF

• some problems admit no compact solution representation (though ADD overhead “minimal”)

• approximation may be desirable or necessary

Page 40: Logistics

Criticisms of SPUDD

Page 41: Logistics

Future Work

Page 42: Logistics

Dimensions of Abstraction

[Figure: partitions of a state space over features A, B, C illustrating three dimensions of abstraction: Uniform vs. Nonuniform, Exact vs. Approximate, Adaptive vs. Fixed. Exact partitions carry values 5.3, 2.9, 9.3; approximate ones carry values such as 5.2, 5.5, 2.7, 9.0]

Page 43: Logistics

Approximate DTR

Easy to approximate solution using DTR

Simple pruning of value function

• Can prune trees [BouDearden96] or ADDs [StaubinHoeyBou00]

Gives regions of approximately same value

Page 44: Logistics

A Pruned Value ADD

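[Figure: a value ADD over Loc, HCR, HCU, W, R, U with leaves 5.19, 5.62, 6.19, 6.64, 6.81, 7.64, 7.45, 8.36, 8.45, 9.00, 10.00, and its pruned version over Loc, HCR, HCU with range-valued leaves [5.19, 6.19], [6.64, 7.64], [7.45, 8.45], [9.00, 10.00]]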

Page 45: Logistics

Approximate Structured VI

Run normal SVI using ADDs/DTs

• at each leaf, record range of values

At each stage, prune interior nodes whose leaves all have values within some threshold δ

• tolerance can be chosen to minimize error or size

• tolerance can be adjusted to magnitude of VF

Convergence requires some care

If max span over leaves < δ and term. tol. < ε: [bound on ||V − V*||, garbled in extraction]
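The pruning step can be sketched on a plain decision tree, which is simpler than an ADD but shows the same idea. The node encoding, the function names and the shape of the example tree are assumptions; only the leaf values are taken from the pruned-ADD figure.

```python
def span(node):
    """Return (min, max) over all leaf values below node."""
    if isinstance(node, tuple):      # internal node: (var, low, high)
        _, low, high = node
        a, b = span(low), span(high)
        return min(a[0], b[0]), max(a[1], b[1])
    if isinstance(node, list):       # already a range leaf [lo, hi]
        return node[0], node[1]
    return node, node                # plain numeric leaf


def prune(node, tol):
    """Collapse any subtree whose leaf values lie within tol of each other."""
    if not isinstance(node, tuple):
        return node
    lo, hi = span(node)
    if hi - lo <= tol + 1e-9:        # small slack for float rounding
        return [lo, hi]
    var, low, high = node
    return (var, prune(low, tol), prune(high, tol))


# Hypothetical tree shape using leaf values from the figure.
tree = ("HCU", ("W", 9.0, 10.0), ("U", 8.36, ("R", 8.45, 7.45)))
pruned = prune(tree, 1.0)
```

A larger tolerance gives a smaller tree and a wider range at each leaf, which is the size/error tradeoff the slide describes.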

Page 46: Logistics

Approximate DTR: Relative Merits

Relative merits of ADTR

• fewer regions implies faster computation

• can provide leverage for optimal computation

• 30-40 billion state problems in a couple hours

• allows fine-grained control of time vs. solution quality with dynamic (a posteriori) error bounds

• technical challenges: variable ordering, convergence, fixed vs. adaptive tolerance, etc.

Some drawbacks

• (still) produces piecewise constant VF

• doesn’t exploit additive structure of VF at all

Page 47: Logistics

Reachability

Page 48: Logistics

DP vs. heuristic search

Given a start state, heuristic search can find an optimal solution without evaluating all states.

[Figure: nested graphs rooted at the start state. Implicit graph: all states. Explicit graph: states evaluated during search. Solution graph: all states reachable by optimal solution]

Each iteration, DP improves solution for each state

DP solves problem for all possible starting states.

Page 49: Logistics

Solution structures

Solution path

Acyclic solution graph

Cyclic solution graph

Page 50: Logistics

DP vs. heuristic search

Solution structure     Simple path                    Acyclic graph         Cyclic graph
Dynamic programming    Forward DP (Dijkstra’s alg.)   Backwards induction   Policy (Value) iteration
Heuristic search       A*                             AO*                   LAO*

Heuristic search = dynamic programming + starting state + forward expansion of solution + admissible heuristic

Page 51: Logistics

AO* [Nilsson 1971; Martelli & Montanari 1973]

Initialize partial solution graph to start state

Repeat until a complete solution is found:

• Expand some nonterminal state on the fringe of the best partial solution graph

• Use backwards induction to update the costs of all ancestor states of the expanded state and possibly change their selected action.

LAO* [Hansen & Zilberstein AAAI-98, AIJ]

Same except allow solution to contain loops, and use value iteration (or policy iteration) instead of backwards induction

Page 52: Logistics

Heuristic evaluation function

h(i) is a heuristic estimate of the minimal-cost solution for every non-terminal tip state.

h(i) is admissible if h(i) ≤ f*(i).

An admissible heuristic estimate f(i) for any state in the explicit graph is defined as follows:

$f(i) = \begin{cases} 0 & \text{if } i \text{ is a goal state} \\ h(i) & \text{if } i \text{ is a non-terminal tip state} \\ \min_{a \in A_i} \left[ c_i(a) + \sum_{j \in S} p(i,a,j)\, f(j) \right] & \text{else} \end{cases}$

Underestimate cost (overestimate reward)

Page 53: Logistics

Example: Path-finding

[Figure: grid of eight cells numbered 1-8; cell 1 is Start, cell 4 is Goal]

Actions: Move one cell to the North, East, South, or West

Each action succeeds with probability 0.5 if there is a cell in the intended direction of movement, and otherwise fails

Each action has cost of one

Heuristic?

Page 54: Logistics

Start of search

[Figure: the grid (cells 1-8, 1 = Start, 4 = Goal) beside the explicit graph, which contains only state 1 (Start) with f = 3.0]

Page 55: Logistics

After first node expansion

[Figure: the grid beside the explicit graph after expanding state 1; arcs labeled N, E, S. Values: 1 (Start) = 4.0, 2 = 2.0, 8 = 4.0]

Page 56: Logistics

After second node expansion

[Figure: the grid beside the explicit graph after expanding state 2; arcs labeled N, E, S. Values: 1 (Start) = 5.0, 2 = 3.0, 8 = 4.0, 3 = 1.0]

Page 57: Logistics

After third node expansion

[Figure: the grid beside the explicit graph after expanding state 3; arcs labeled N, E, S. Values: 1 (Start) = 6.0, 2 = 4.0, 8 = 4.0, 3 = 2.0, 4 (Goal) = 0.0]
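The sequence of expansions traced above can be reproduced with a simplified LAO*-style loop. This is a sketch under several assumptions not confirmed by the transcript: the eight cells are treated as a ring (1-2-...-8-1), the heuristic is the ring distance to the goal, a failed move leaves the agent in place, and value iteration is rerun over all expanded states rather than only the ancestors of the expanded state.

```python
def lao_star(start, goal, actions, transitions, cost, h, eps=1e-9):
    """Simplified LAO*: expand tips of the greedy solution, then re-solve."""
    f = {start: 0.0 if start == goal else h(start)}
    expanded, policy = set(), {}

    def q(s, a):
        return cost + sum(p * f[t] for t, p in transitions(s, a))

    def greedy_reachable():
        # States reachable from start by following the current best actions.
        seen, stack = set(), [start]
        while stack:
            s = stack.pop()
            if s in seen:
                continue
            seen.add(s)
            if s in expanded:
                stack.extend(t for t, _ in transitions(s, policy[s]))
        return seen

    while True:
        tips = [s for s in greedy_reachable()
                if s not in expanded and s != goal]
        if not tips:
            return f, policy
        tip = tips[0]
        expanded.add(tip)
        for a in actions:            # new states start at their heuristic
            for t, _ in transitions(tip, a):
                f.setdefault(t, 0.0 if t == goal else h(t))
        while True:                  # value iteration on expanded states
            residual = 0.0
            for s in expanded:
                best = min(actions, key=lambda a: q(s, a))
                value = q(s, best)
                residual = max(residual, abs(value - f[s]))
                f[s], policy[s] = value, best
            if residual < eps:
                break


def ring_transitions(s, a):
    # Assumed layout: cells 1..8 on a ring; a is +1 or -1; success prob 0.5.
    nxt = (s - 1 + a) % 8 + 1
    return [(nxt, 0.5), (s, 0.5)]


def ring_distance(s):
    return min((4 - s) % 8, (s - 4) % 8)


f, policy = lao_star(1, 4, [1, -1], ring_transitions, 1.0, ring_distance)
```

Because each move succeeds half the time, every step costs 2 in expectation, so the start value climbs from 3.0 through 4.0 and 5.0 to 6.0 as states 1, 2 and 3 are expanded. Cells 5, 6 and 7 are never generated.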

Page 58: Logistics

Theoretical Properties

Theorem 1: Using an admissible heuristic, LAO* converges to an optimal solution without (necessarily) expanding/evaluating all states.

Theorem 2: If h2(i) is a more informative heuristic than h1(i) (i.e., h1(i) ≤ h2(i) ≤ f*(i)), LAO* using h2(i) expands a subset of the worst case set of states expanded using h1(i).

Page 59: Logistics

Today’s paper

LAO* search over what?

Page 60: Logistics

Results

[Figure labels: States reachable by opt; States explored during search]

Page 61: Logistics

Criticisms?

Page 62: Logistics

Future work?

Page 63: Logistics

Off-line vs. on-line search

                      Deterministic state transitions   Stochastic state transitions (MDPs)
Off-line              A*                                LAO*
On-line (real-time)   LRTA* (Korf)                      RTDP (Barto et al.)