Batch RL Via Least Squares Policy Iteration


Page 1: Batch RL Via Least Squares Policy Iteration

Batch RL Via Least Squares Policy Iteration

Alan Fern

* Based in part on slides by Ronald Parr

Page 2: Batch RL Via Least Squares Policy Iteration

Overview

- Motivation

- LSPI
  - Derivation from LSTD
  - Experimental results

Page 3: Batch RL Via Least Squares Policy Iteration

Online versus Batch RL

- Online RL: integrates data collection and optimization
  - Select actions in the environment and, at the same time, update parameters based on each observed (s,a,s',r)

- Batch RL: decouples data collection and optimization
  - First generate/collect experience in the environment, giving a data set of state-action-reward-state tuples {(si,ai,ri,si')}
  - We may not even know where the data came from
  - Use the fixed set of experience to optimize/learn a policy

- Online vs. Batch:
  - Batch algorithms are often more "data efficient" and stable
  - Batch algorithms ignore the exploration-exploitation problem, and do their best with the data they have

Page 4: Batch RL Via Least Squares Policy Iteration

Batch RL Motivation

- There are many applications that naturally fit the batch RL model

- Medical Treatment Optimization:
  - Input: a collection of treatment episodes for an ailment, giving sequences of observations and actions, including outcomes
  - Output: a treatment policy, ideally better than current practice

- Emergency Response Optimization:
  - Input: a collection of emergency response episodes giving the movement of emergency resources before, during, and after 911 calls
  - Output: an emergency response policy

Page 5: Batch RL Via Least Squares Policy Iteration

LSPI

- LSPI is a model-free batch RL algorithm
  - Learns a linear approximation of the Q-function
  - Stable and efficient
  - Never diverges or gives meaningless answers

- LSPI can be applied to a dataset regardless of how it was collected
  - But: garbage in, garbage out.

Least-Squares Policy Iteration, Michail Lagoudakis and Ronald Parr, Journal of Machine Learning Research (JMLR), Vol. 4, 2003, pp. 1107-1149.

Page 6: Batch RL Via Least Squares Policy Iteration

Notation

- S: state space; s: individual states

- R(s,a): reward for taking action a in state s

- γ: discount factor

- V(s): state value function

- P(s' | s,a) = T(s,a,s'): transition function

- Q(s,a): state-action value function

- Policy: π(s) = a

Objective: maximize the expected, total, discounted reward

  E[ Σ_{t=0}^∞ γ^t r_t ]
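As a quick illustration of this objective (my own sketch, not from the slides), the following computes the total discounted reward of one sampled trajectory; the reward values are invented:

```python
# Hypothetical example: total discounted reward of one trajectory.
gamma = 0.9                      # discount factor
rewards = [0.0, 0.0, 1.0, 5.0]   # r_0, r_1, r_2, r_3 from one made-up episode

# E[ sum_t gamma^t r_t ] would be estimated by averaging this over many trajectories.
ret = sum(gamma**t * r for t, r in enumerate(rewards))
print(ret)  # 0.9^2 * 1.0 + 0.9^3 * 5.0 = 4.455
```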

Page 7: Batch RL Via Least Squares Policy Iteration

Projection Approach to Approximation

- Recall the standard Bellman equation:

    V*(s) = max_a [ R(s,a) + γ Σ_{s'} P(s' | s,a) V*(s') ]

  or equivalently V* = T[V*], where T[·] is the Bellman operator

- Recall from value iteration that the sub-optimality of a value function can be bounded in terms of the Bellman error:

    || V − T[V] ||∞

- This motivates trying to find an approximate value function with a small Bellman error
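To make T[·] and the Bellman error concrete, here is a minimal sketch (all MDP numbers invented) that applies one exact Bellman backup on a two-state, two-action MDP and measures ||V − T[V]||∞:

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP: R[s, a] = reward, P[s, a, s'] = transition probability.
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [1.0, 0.0]]])

def bellman_backup(V):
    # T[V](s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    return np.max(R + gamma * P @ V, axis=1)

V = np.zeros(2)                               # a candidate value function
print(np.max(np.abs(V - bellman_backup(V))))  # its Bellman error (max-norm)
```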

Page 8: Batch RL Via Least Squares Policy Iteration

Projection Approach to Approximation

- Suppose that we have a space of representable value functions
  - E.g. the space of linear functions over given features

- Let Π be a projection operator for that space
  - It projects any value function (in or outside of the space) to the "closest" value function in the space

- "Fixed-point" Bellman equation with approximation:

    V̂* = Π[ T[V̂*] ]

  - Depending on the space, this will have a small Bellman error

- LSPI will attempt to arrive at such a value function
  - It assumes a linear approximation and least-squares projection

Page 9: Batch RL Via Least Squares Policy Iteration
Page 10: Batch RL Via Least Squares Policy Iteration

Projected Value Iteration

- Naïve idea: try computing the projected fixed point using VI

- Exact VI (iterate Bellman backups):

    V_{i+1} = T[V_i]    (the exact Bellman backup produces the next value function)

- Projected VI (iterate projected Bellman backups):

    V̂_{i+1} = Π[ T[V̂_i] ]    (projects the exact Bellman backup to the closest function in our restricted function space)
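A sketch of projected VI under these definitions (my own toy example: a three-state chain and a single hand-picked feature; Π is the least-squares projection onto the feature's span):

```python
import numpy as np

gamma = 0.9
R = np.array([0.0, 1.0, 0.0])          # made-up rewards, single action
P = np.array([[0.0, 1.0, 0.0],         # made-up deterministic transitions
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
Phi = np.array([[1.0], [2.0], [3.0]])  # single assumed feature: phi(s_i) = i

Pi = Phi @ np.linalg.pinv(Phi)         # least-squares projection onto span(Phi)

V = np.zeros(3)
for i in range(5):
    V = Pi @ (R + gamma * P @ V)       # projected Bellman backup
    print(i, V)
```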

Page 11: Batch RL Via Least Squares Policy Iteration

Example: Projected Bellman Backup

Restrict the space to linear functions over a single feature φ:

  V̂(s) = w·φ(s)

Suppose just two states s1 and s2, with φ(s1) = 1, φ(s2) = 2.

Suppose the exact backup of V̂_i gives:

  T[V̂_i](s1) = 2,  T[V̂_i](s2) = 2

Can we represent this exact backup in our linear space? No

Page 12: Batch RL Via Least Squares Policy Iteration

Example: Projected Bellman Backup

Restrict the space to linear functions over a single feature φ:

  V̂(s) = w·φ(s)

Suppose just two states s1 and s2, with φ(s1) = 1, φ(s2) = 2.

Suppose the exact backup of V̂_i gives:

  T[V̂_i](s1) = 2,  T[V̂_i](s2) = 2

The backup can't be represented via our linear function, so the projected backup V̂_{i+1} = Π[T[V̂_i]] is just the least-squares fit to the exact backup:

  V̂_{i+1}(s) = 1.333 · φ(s)

Page 13: Batch RL Via Least Squares Policy Iteration

Problem: Stability

- Exact value iteration's stability is ensured by the contraction property of Bellman backups:

    V_{i+1} = T[V_i]

- Is the "projected" Bellman backup a contraction?

    V̂_{i+1} = Π[ T[V̂_i] ]  ?

Page 14: Batch RL Via Least Squares Policy Iteration

Example: Stability Problem [Bertsekas & Tsitsiklis 1996]

Problem: Most projections lead to backups that are not contractions, and are unstable.

Two states, s1 and s2. Rewards are all zero, there is a single action, and γ = 0.9, so V* = 0.

Consider a linear approximation with a single feature φ and weight w:

  V̂(s) = w·φ(s)

The optimal weight is w = 0, since V* = 0.

Page 15: Batch RL Via Least Squares Policy Iteration

Example: Stability Problem

Setup: φ(s1) = 1, so V̂_i(s1) = w_i; φ(s2) = 2, so V̂_i(s2) = 2·w_i (w_i is the weight value at iteration i).

From V̂_i, perform the projected backup for each state:

  T[V̂_i](s1) = γ V̂_i(s2) = 1.8 w_i
  T[V̂_i](s2) = γ V̂_i(s2) = 1.8 w_i

This can't be represented in our space, so find the w_{i+1} that gives the least-squares approximation to the exact backup.

After some math we get: w_{i+1} = 1.2 γ w_i = 1.08 w_i

What does this mean?

Page 16: Batch RL Via Least Squares Policy Iteration

Example: Stability Problem

[Figure: the successive approximations V0, V1, V2, V3 plotted as V(x) over the states, growing with each iteration.]

Each iteration of the Bellman backup makes the approximation worse! Even for this trivial problem, "projected" VI diverges.
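The divergence can be checked numerically. A sketch (assuming, per the backup equations above, that both states transition deterministically to s2 with zero reward):

```python
import numpy as np

gamma = 0.9
Phi = np.array([[1.0], [2.0]])        # phi(s1) = 1, phi(s2) = 2
P = np.array([[0.0, 1.0],             # both states go to s2; rewards all zero
              [0.0, 1.0]])

w = np.array([1.0])                   # any nonzero initial weight
for i in range(10):
    target = gamma * P @ (Phi @ w)    # exact backup T[V_i] at each state
    w = np.linalg.pinv(Phi) @ target  # projected backup = least-squares fit
    print(i, w)                       # |w| grows by a factor of 1.08 per step
```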

Page 17: Batch RL Via Least Squares Policy Iteration

Understanding the Problem

- What went wrong?
  - Exact Bellman backups reduce error in the max-norm
  - Least squares (= projection) is non-expansive in the L2 norm
    - But it may increase the max-norm distance!

- Conclusion: Alternating Bellman backups and projection is risky business

Page 18: Batch RL Via Least Squares Policy Iteration

Overview

- Motivation

- LSPI
  - Derivation from Least-Squares Temporal Difference Learning
  - Experimental results

Page 19: Batch RL Via Least Squares Policy Iteration

How does LSPI fix these?

- LSPI performs approximate policy iteration
  - PI involves policy evaluation and policy improvement
  - It uses a variant of least-squares temporal difference learning (LSTD) for approximate policy evaluation [Bradtke & Barto '96]

- Stability:
  - LSTD directly solves for the fixed point of the approximate Bellman equation for policy values
  - With singular-value decomposition (SVD), this is always well defined

- Data efficiency:
  - LSTD finds the best approximation for any finite data set
  - It makes a single pass over the data for each policy
  - It can be implemented incrementally

Page 20: Batch RL Via Least Squares Policy Iteration

OK, What's LSTD?

- LSTD approximates the value function V^π of a policy π, given trajectories of π

- It assumes a linear approximation of V^π, denoted V̂^π:

    V̂^π(s) = Σ_k w_k φ_k(s)

- The φ_k are arbitrary feature functions of states

- Some vector notation:

    V̂ = [ V̂(s1), …, V̂(sn) ]ᵀ         (one entry per state)
    w = [ w1, …, wK ]ᵀ
    φ_k = [ φ_k(s1), …, φ_k(sn) ]ᵀ,    Φ = [ φ_1 … φ_K ]

Page 21: Batch RL Via Least Squares Policy Iteration

Deriving LSTD

V̂ = Φw is a linear function in the column space of φ_1 … φ_K, that is,

  V̂ = w_1 φ_1 + … + w_K φ_K

where Φ is the (# states) × (K basis functions) matrix

  Φ = [ φ_1(s1)  φ_2(s1)  …
        φ_1(s2)  φ_2(s2)  …
           ⋮        ⋮        ]

so V̂ = Φw assigns a value to every state.

Page 22: Batch RL Via Least Squares Policy Iteration

Suppose we know the true value of the policy

- We would like: V̂ = Φw ≈ V^π

- The least-squares weights minimize the squared error:

    w = (ΦᵀΦ)⁻¹ Φᵀ V^π

  ((ΦᵀΦ)⁻¹Φᵀ is sometimes called the pseudoinverse)

- The least-squares projection is then:

    V̂ = Φw = Φ (ΦᵀΦ)⁻¹ Φᵀ V^π

  where Φ(ΦᵀΦ)⁻¹Φᵀ is the textbook least-squares projection operator.
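A minimal numpy sketch of this textbook projection, with an invented feature matrix and an invented "known" value vector:

```python
import numpy as np

Phi = np.array([[1.0, 0.0],       # made-up features: 3 states, K = 2
                [1.0, 1.0],
                [1.0, 2.0]])
V_pi = np.array([1.0, 3.0, 4.0])  # pretend we know the policy's true values

w = np.linalg.pinv(Phi) @ V_pi    # least-squares weights (Phi^T Phi)^-1 Phi^T V
V_hat = Phi @ w                   # projection of V onto the span of Phi
print(w, V_hat)
```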

Page 23: Batch RL Via Least Squares Policy Iteration

But we don't know V^π…

- Recall the fixed-point equation for policies:

    V^π(s) = R(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) V^π(s')

  or in matrix form V^π = R^π + γ P^π V^π, where

    R^π = [ R(s1, π(s1)), …, R(sn, π(sn)) ]ᵀ

    P^π = [ P(s1 | s1, π(s1))  …  P(sn | s1, π(s1))
               ⋮                      ⋮
            P(s1 | sn, π(sn))  …  P(sn | sn, π(sn)) ]

- We will solve a projected fixed-point equation:

    V̂ = Π[ R^π + γ P^π V̂ ]

- Substituting the least-squares projection into this gives:

    Φw = Φ (ΦᵀΦ)⁻¹ Φᵀ ( R^π + γ P^π Φw )

- Solving for w:

    w = ( ΦᵀΦ − γ ΦᵀP^πΦ )⁻¹ Φᵀ R^π
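If the model (P^π, R^π) were known, this w could be computed directly; a sketch with invented numbers (this is the "model-based answer" that the sampled version below approaches):

```python
import numpy as np

gamma = 0.9
# Made-up policy-induced model on 3 states:
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])
R_pi = np.array([1.0, 0.0, 0.0])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])

A = Phi.T @ (Phi - gamma * P_pi @ Phi)  # the K x K matrix to invert
b = Phi.T @ R_pi
w = np.linalg.solve(A, b)               # w = (Phi^T Phi - gamma Phi^T P Phi)^-1 Phi^T R
print(w)
```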

Page 24: Batch RL Via Least Squares Policy Iteration

Almost there…

    w = ( ΦᵀΦ − γ ΦᵀP^πΦ )⁻¹ Φᵀ R^π

- The matrix to invert is only K × K

- But…
  - It is expensive to construct the matrix (e.g. P^π is |S| × |S|)
    - Presumably we are using LSPI because |S| is enormous
  - We don't know P^π
  - We don't know R^π

Page 25: Batch RL Via Least Squares Policy Iteration

Using Samples for Φ

Suppose we have state transition samples of the policy running in the MDP: {(si, ai, ri, si')}

Idea: Replace the enumeration of states with sampled states:

  Φ̂ = [ φ_1(s1)  φ_2(s1)  …      (K basis functions;
        φ_1(s2)  φ_2(s2)  …       one row per sample,
           ⋮        ⋮        ]      rather than per state)

Page 26: Batch RL Via Least Squares Policy Iteration

Using Samples for R

Suppose we have state transition samples of the policy running in the MDP: {(si, ai, ri, si')}

Idea: Replace the enumeration of rewards with sampled rewards:

  R̂ = [ r1, r2, … ]ᵀ    (one entry per sample)

Page 27: Batch RL Via Least Squares Policy Iteration

Using Samples for P

Idea: Replace the expectation over next states with sampled next states:

  P̂Φ = [ φ_1(s1')  φ_2(s1')  …      (K basis functions;
         φ_1(s2')  φ_2(s2')  …       each s' comes from a
            ⋮         ⋮        ]       sample (s,a,r,s'))
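Putting the three sampled pieces together, here is a sketch that builds Φ̂, R̂, and the next-state feature matrix from a batch of (s,a,r,s') tuples; the feature map and samples are hypothetical:

```python
import numpy as np

def phi(s):
    # Hypothetical state features (K = 2): a bias term and the state itself.
    return np.array([1.0, float(s)])

# Made-up batch of (s, a, r, s') samples:
samples = [(0, 0, 1.0, 1), (1, 0, 0.0, 2), (2, 0, 0.0, 2)]

Phi      = np.array([phi(s)  for s, a, r, s2 in samples])  # one row per sample
R        = np.array([r       for s, a, r, s2 in samples])  # sampled rewards
Phi_next = np.array([phi(s2) for s, a, r, s2 in samples])  # stands in for P*Phi
```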

Page 28: Batch RL Via Least Squares Policy Iteration

Putting it Together

- LSTD needs to compute:

    w = ( ΦᵀΦ − γ ΦᵀP^πΦ )⁻¹ Φᵀ R^π = B⁻¹ b

  where b = ΦᵀR^π, from the previous slide.

- The hard part is B, the K × K matrix:

    B = Φᵀ( Φ − γ P^πΦ )

- Both B and b can be computed incrementally from each (s,a,r,s') sample (initialize both to zero):

    B_ij ← B_ij + φ_i(s) φ_j(s) − γ φ_i(s) φ_j(s')
    b_i ← b_i + φ_i(s) r

Page 29: Batch RL Via Least Squares Policy Iteration

LSTD Algorithm

- Collect data by executing trajectories of the current policy

- For each (s,a,r,s') sample:

    B_ij ← B_ij + φ_i(s) φ_j(s) − γ φ_i(s) φ_j(s')
    b_i ← b_i + φ_i(s) r

- Then solve: w = B⁻¹ b
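A compact sketch of the whole LSTD loop (reusing the hypothetical phi and samples from the earlier sketch); solving with the pseudoinverse echoes the SVD remark above:

```python
import numpy as np

def lstd(samples, phi, k, gamma=0.9):
    """LSTD sketch: accumulate B and b over (s,a,r,s') samples, then w = B^-1 b."""
    B = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f, f_next = phi(s), phi(s_next)
        B += np.outer(f, f - gamma * f_next)  # B_ij += phi_i(s)(phi_j(s) - gamma phi_j(s'))
        b += f * r                            # b_i  += phi_i(s) r
    return np.linalg.pinv(B) @ b              # pseudoinverse is well defined via SVD

# e.g. w = lstd(samples, phi, k=2) with the phi/samples sketched earlier
```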

Page 30: Batch RL Via Least Squares Policy Iteration

LSTD Summary

- Does O(K²) work per datum
  - Linear in the amount of data

- Approaches the model-based answer in the limit

- Finding the fixed point requires inverting a matrix
  - The fixed point almost always exists

- Stable and efficient

Page 31: Batch RL Via Least Squares Policy Iteration

Approximate Policy Iteration with LSTD

Policy Iteration: iterates between policy improvement and policy evaluation.

Idea: use LSTD for approximate policy evaluation in PI.

Start with random weights w (i.e., a random value function)

Repeat until convergence:
  1. Evaluate π using LSTD
     - Generate sample trajectories of π
     - Use LSTD to produce new weights w
       (w gives an approximate value function V̂(s; w) of π)
  2. π(s) = greedy(V̂(s; w))    // policy improvement

Page 32: Batch RL Via Least Squares Policy Iteration

What Breaks?

- There is no way to execute the greedy policy without a model

- The approximation is biased by the current policy
  - We only approximate values of states we see when executing the current policy
  - LSTD is a weighted approximation toward those states

- This can result in a learn-forget cycle of policy iteration
  - Drive off the road; learn that it's bad
  - The new policy never does this; forgets that it's bad

- Not truly a batch method
  - Data must be collected from the current policy for LSTD

Page 33: Batch RL Via Least Squares Policy Iteration

LSPI

- LSPI is similar to the previous loop, but replaces LSTD with a new algorithm, LSTDQ

- LSTD: produces a value function
  - Requires samples from the policy under consideration

- LSTDQ: produces a Q-function for the current policy
  - Can learn the Q-function for a policy from any (reasonable) set of samples --- sometimes called an off-policy method
  - No need to collect samples from the current policy

- Disconnects policy evaluation from data collection
  - Permits reuse of data across iterations!
  - Truly a batch method.

Page 34: Batch RL Via Least Squares Policy Iteration

Implementing LSTDQ

- Both LSTD and LSTDQ compute:

    B = Φᵀ( Φ − γ P^πΦ ),    w = B⁻¹ b

- But LSTDQ's basis functions are indexed by actions, giving a Q-function

    Q̂_w(s,a) = Σ_k w_k φ_k(s,a)

  which defines the greedy policy:

    π_w(s) = argmax_a Q̂_w(s,a)

- For each (s,a,r,s') sample:

    B_ij ← B_ij + φ_i(s,a) φ_j(s,a) − γ φ_i(s,a) φ_j(s', π_w(s'))
    b_i ← b_i + φ_i(s,a) r

  then w = B⁻¹ b
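A sketch of LSTDQ along these lines (my own code, with a hypothetical state-action feature map phi_sa and a finite action list): the only changes from LSTD are the action-indexed features and the greedy action at s':

```python
import numpy as np

def lstdq(samples, phi_sa, actions, w, k, gamma=0.9):
    """LSTDQ sketch: evaluate the policy that is greedy in the current w."""
    def greedy(s):
        return max(actions, key=lambda a: phi_sa(s, a) @ w)  # pi_w(s)

    B = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi_sa(s, a)
        f_next = phi_sa(s_next, greedy(s_next))  # phi(s', pi_w(s'))
        B += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.pinv(B) @ b
```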

Page 35: Batch RL Via Least Squares Policy Iteration

Running LSPI

There is a Matlab implementation available!

1. Collect a database of (s,a,r,s') experiences (this is the magic step)

2. Start with random weights (= random policy)

3. Repeat until convergence:
   - Evaluate the current policy against the database
     - Run LSTDQ to generate a new set of weights
     - The new weights imply a new Q-function, and hence a new policy
   - Replace the current weights with the new weights
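The outer loop is then just repeated LSTDQ over the fixed database until the weights stop changing; a sketch reusing the hypothetical lstdq above:

```python
import numpy as np

def lspi(samples, phi_sa, actions, k, gamma=0.9, tol=1e-6, max_iters=20):
    """LSPI sketch: iterate LSTDQ on a fixed batch until (approximate) convergence."""
    w = np.random.randn(k)                    # random weights = random policy
    for _ in range(max_iters):
        w_new = lstdq(samples, phi_sa, actions, w, k, gamma)
        if np.linalg.norm(w_new - w) < tol:   # weights stopped changing
            return w_new
        w = w_new
    return w
```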

Page 36: Batch RL Via Least Squares Policy Iteration

Results: Bicycle Riding

- Watch a random controller operate the simulated bike

- Collect ~40,000 (s,a,r,s') samples

- Pick 20 simple feature functions (5 actions)

- Make 5-10 passes over the data (PI steps)

- Reward was based on distance to goal + goal achievement

- Result: a controller that balances and rides to the goal

Page 37: Batch RL Via Least Squares Policy Iteration

Bicycle Trajectories

Page 38: Batch RL Via Least Squares Policy Iteration

What about Q-learning?

- Ran Q-learning with the same features

- Used experience replay for data efficiency

Page 39: Batch RL Via Least Squares Policy Iteration

Q-learning Results

Page 40: Batch RL Via Least Squares Policy Iteration

LSPI Robustness

Page 41: Batch RL Via Least Squares Policy Iteration

Some key points

- LSPI is a batch RL algorithm
  - You can generate trajectory data any way you want
  - It induces a policy based on global optimization over the full dataset

- Very stable, with no parameters that need tweaking

Page 42: Batch RL Via Least Squares Policy Iteration

So, what's the bad news?

- LSPI does not address the exploration problem
  - It decouples data collection from policy optimization
  - This is often not a major issue, but it can be in some cases

- K² can sometimes be big
  - Lots of storage
  - Matrix inversion can be expensive

- The bicycle task needed "shaping" rewards

- Still haven't solved:
  - Feature selection (an issue for all machine learning, but RL seems even more sensitive)