
CSE 190: Reinforcement Learning: An Introduction

Chapter 7: Eligibility Traces

Acknowledgment: A good number of these slides are cribbed from Rich Sutton.


The Book: Where we are and where we're going

• Part I: The Problem
  • Introduction
  • Evaluative Feedback
  • The Reinforcement Learning Problem

• Part II: Elementary Solution Methods
  • Dynamic Programming
  • Monte Carlo Methods
  • Temporal Difference Learning

• Part III: A Unified View
  • Eligibility Traces
  • Generalization and Function Approximation
  • Planning and Learning
  • Dimensions of Reinforcement Learning
  • Case Studies


Chapter 7: Eligibility Traces


Simple Monte Carlo

[Backup diagram: the Monte Carlo backup runs from state s_t all the way down to the terminal state T, backing up the complete return R_t.]

$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$

where $R_t$ is the actual return following state $s_t$.


Simplest TD Method

[Backup diagram: the TD(0) backup is one step deep, from s_t to s_{t+1} with reward r_{t+1}; the rest of the trajectory is replaced by the estimate V(s_{t+1}).]

$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$
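A minimal Python sketch of these two updates (not from the slides; it assumes a tabular V stored in something like collections.defaultdict(float), a hypothetical "episode" given as a list of (state, reward) pairs where the reward is the one received on leaving that state, and a terminal value of 0):

def mc_update(V, episode, alpha=0.1, gamma=1.0):
    # Constant-alpha Monte Carlo: V(s_t) <- V(s_t) + alpha * [R_t - V(s_t)]
    G = 0.0
    for s, r in reversed(episode):      # work backwards so G is the return following s
        G = r + gamma * G
        V[s] += alpha * (G - V[s])

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    # TD(0): V(s_t) <- V(s_t) + alpha * [r_{t+1} + gamma*V(s_{t+1}) - V(s_t)]
    V[s] += alpha * (r + gamma * V[s_next] - V[s])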


Is there something in between?

[Backup diagrams: the one-step TD backup (s_t, r_{t+1}, s_{t+1}) and the full Monte Carlo backup shown side by side.]


N-step TD Prediction

• Idea: Look farther into the future when you do the TD backup (1, 2, 3, …, n steps)


Mathematics of N-step TD Prediction

• Monte Carlo:

  $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$

  (note the "T": the rewards are data, sampled all the way to the end of the episode)

• TD:

  $R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$

  i.e., use V to estimate the remaining return

• n-step TD:
  • 2-step return:

    $R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$

  • n-step return:

    $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$

    (the n sampled rewards are data; the final term bootstraps from the current estimate $V_t$)
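A minimal sketch (not from the slides) of the n-step return computed from a recorded episode; states[k] and rewards[k] are assumed to hold s_k and r_{k+1}, and V is a tabular estimate such as a defaultdict(float):

def n_step_return(rewards, states, t, n, V, gamma=1.0):
    # R_t^(n) = r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n} + gamma^n * V(s_{t+n})
    T = len(rewards)                      # the episode terminates after T rewards
    G = 0.0
    for k in range(t, min(t + n, T)):     # up to n sampled rewards (the data)
        G += gamma ** (k - t) * rewards[k]
    if t + n < T:                         # bootstrap from the current estimate
        G += gamma ** n * V[states[t + n]]
    return G

If n ≥ T − t the bootstrap term is dropped and this is just the complete (Monte Carlo) return.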


Note the Notation

• Monte Carlo: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$ (note the "T")
• TD: $R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$, i.e., use V to estimate the remaining return
• n-step TD, n-step return: $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$
• An "(n)" at the top of the "R" means the n-step return (easy to miss!); $R_t$ with no superscript is basically the return carried all the way to termination (an n-step return with n ≥ T − t)


Learning with N-step Backups

• Backup (on-line or off-line):

  $\Delta V_t(s_t) = \alpha \left[ R_t^{(n)} - V_t(s_t) \right]$

• Error reduction property of n-step returns:

  $\max_s \left| E_\pi\{ R_t^{(n)} \mid s_t = s \} - V^\pi(s) \right| \le \gamma^n \max_s \left| V(s) - V^\pi(s) \right|$

  (left side: maximum error using the n-step return; right side: maximum error using V)

• Using this, you can show that n-step methods converge
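A minimal sketch (not from the slides) of the corresponding off-line n-step backup applied to every state of a recorded episode, recomputing the n-step return inline:

def n_step_td_episode(V, states, rewards, n, alpha=0.1, gamma=1.0):
    # DeltaV(s_t) = alpha * [R_t^(n) - V(s_t)] for every time step t of the episode
    T = len(rewards)
    for t in range(T):
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, min(t + n, T)))
        if t + n < T:                      # bootstrap unless the return runs off the end
            G += gamma ** n * V[states[t + n]]
        V[states[t]] += alpha * (G - V[states[t]])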



Random Walk Examples

• How does 2-step TD work here?

• How about 3-step TD?


A Larger Example

• Task: 19-state random walk
• Each curve corresponds to the error after 10 episodes.
• Each curve uses a different n.
• The x-axis is the learning rate.
• Do you think there is an optimal n? For everything?


Averaging N-step Returns

• n-step methods were introduced to help with understanding TD(λ) (next!)
• Idea: backup an average of several returns
  • e.g. backup half of the 2-step and half of the 4-step return:

    $R_t^{\text{avg}} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}$

• Called a complex backup
• To draw the backup diagram (this is all one backup):
  • Draw each component
  • Label it with the weight for that component
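A one-line sketch of this particular complex backup target (not from the slides), reusing the n_step_return helper sketched earlier:

def averaged_return(rewards, states, t, V, gamma=1.0):
    # R_t^avg = 1/2 * R_t^(2) + 1/2 * R_t^(4)
    return 0.5 * n_step_return(rewards, states, t, 2, V, gamma) \
         + 0.5 * n_step_return(rewards, states, t, 4, V, gamma)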


Forward View of TD(λ)

• TD(λ) is a method for averaging all n-step backups
  • weighted by λ^(n−1) (time since visitation)
• λ-return:

  $R_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$

• Backup using the λ-return:

  $\Delta V_t(s_t) = \alpha \left[ R_t^\lambda - V_t(s_t) \right]$



• Note: since the weights add up to 1, $r_{t+1}$ is not "overemphasized."


λ-return Weighting Function

For an episode that terminates at time T:

$R_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$

(first term: weights until termination; second term: all remaining weight after termination goes to the complete return $R_t$)
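A minimal sketch (not from the slides) of the λ-return for a terminating episode, again reusing the n_step_return helper; the complete return R_t is obtained as the n-step return with n = T − t:

def lambda_return(rewards, states, t, V, lam, gamma=1.0):
    T = len(rewards)
    G = 0.0
    for n in range(1, T - t):             # weights (1-lambda)*lambda^(n-1) until termination
        G += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, states, t, n, V, gamma)
    # all remaining weight lambda^(T-t-1) goes to the complete return R_t
    G += lam ** (T - t - 1) * n_step_return(rewards, states, t, T - t, V, gamma)
    return G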


Relation to TD(0) and MC

• The λ-return can be rewritten as:

  $R_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$

  (until termination / after termination)

• If λ = 1, you get MC:

  $R_t^\lambda = (1-1) \sum_{n=1}^{T-t-1} 1^{n-1} R_t^{(n)} + 1^{T-t-1} R_t = R_t$

• If λ = 0, you get TD(0):

  $R_t^\lambda = (1-0) \sum_{n=1}^{T-t-1} 0^{n-1} R_t^{(n)} + 0^{T-t-1} R_t = R_t^{(1)}$


Forward View of TD(λ) II

• Look forward from each state to determine the update from future states and rewards:


λ-return on the Random Walk

• Same 19-state random walk as before
• Why do you think intermediate values of λ are best?


Backward View

• Shout $\delta_t$ backwards over time
• The strength of your voice decreases with temporal distance by $\gamma\lambda$:

  $\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$


Backward View of TD(λ)

• The forward view was for theory
• The backward view is for mechanism
• New variable called the eligibility trace, $e_t(s) \in \mathbb{R}^+$
  • On each step, decay all traces by $\gamma\lambda$ and increment the trace for the current state by 1:

    $e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$

• This gives the equivalent weighting as the forward algorithm when used incrementally


Policy Evaluation Using TD(λ)

Initialize V(s) arbitrarily
Repeat (for each episode):
    e(s) = 0, for all s ∈ S
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s′
        δ ← r + γV(s′) − V(s)              /* compute TD error */
        e(s) ← e(s) + 1                    /* increment trace for current state */
        For all s:                         /* update ALL s's - really only ones we have visited */
            V(s) ← V(s) + αδe(s)           /* incrementally update V(s) according to TD(λ) */
            e(s) ← γλe(s)                  /* decay trace */
        s ← s′                             /* move to next state */
    Until s is terminal
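A direct Python rendering of this algorithm (a sketch, not the book's code; it assumes a hypothetical tabular environment with env.reset() -> s and env.step(a) -> (s_next, r, done), and a fixed policy(s) to evaluate):

from collections import defaultdict

def td_lambda(env, policy, num_episodes, alpha=0.1, gamma=1.0, lam=0.9):
    V = defaultdict(float)                       # initialize V(s) arbitrarily (here: 0)
    for _ in range(num_episodes):
        e = defaultdict(float)                   # e(s) = 0 for all s
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                        # action given by pi for s
            s_next, r, done = env.step(a)
            v_next = 0.0 if done else V[s_next]  # terminal state has value 0
            delta = r + gamma * v_next - V[s]    # compute TD error
            e[s] += 1.0                          # increment trace for current state
            for x in list(e):                    # update the states we have visited
                V[x] += alpha * delta * e[x]     # V(s) <- V(s) + alpha*delta*e(s)
                e[x] *= gamma * lam              # decay trace
            s = s_next                           # move to next state
    return V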


Relation of Backwards View to MC & TD(0)

• Using the update rule:

  $\Delta V_t(s) = \alpha\, \delta_t\, e_t(s)$

• As before, if you set λ to 0, you get TD(0)
• If you set λ to 1, you get MC, but in a better way
  • Can apply TD(1) to continuing tasks
  • Works incrementally and on-line (instead of waiting until the end of the episode)
• Again, by the time you get to the terminal state, these increments add up to the one you would have made via the forward (theoretical) view.


Forward View = Backward View

• The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating
• The book shows (algebra shown in book):

  Backward updates:  $\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \alpha\, I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t}\, \delta_k$

  Forward updates:   $\sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t)\, I_{s s_t} = \sum_{t=0}^{T-1} \alpha\, I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t}\, \delta_k$

  Hence:  $\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t)\, I_{s s_t}$

  (where $I_{s s_t} = 1$ if $s = s_t$ and 0 otherwise)

• On-line updating with small α is similar


On-line versus Off-line on Random Walk

• Same 19 state random walk

• On-line performs better over a broader range of parameters


Control: Sarsa(λ)

• Standard control idea: learn Q values, so save an eligibility trace for each state-action pair instead of just for states:

  $e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$

  $Q_{t+1}(s,a) = Q_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$

  $\delta_t = r_{t+1} + \gamma\, Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$


Sarsa(λ) Algorithm

Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    e(s,a) = 0, for all s, a
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
        δ ← r + γQ(s′,a′) − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            e(s,a) ← γλe(s,a)
        s ← s′; a ← a′
    Until s is terminal
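A Python sketch of Sarsa(λ) with accumulating traces (not the book's code; it assumes the same hypothetical env.reset()/env.step() interface, a finite list of actions, and an ε-greedy behaviour policy):

import random
from collections import defaultdict

def sarsa_lambda(env, actions, num_episodes, alpha=0.1, gamma=1.0, lam=0.9, eps=0.05):
    Q = defaultdict(float)                        # Q[(s, a)], initialized to 0

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)         # explore
        return max(actions, key=lambda a: Q[(s, a)])  # greedy w.r.t. current Q

    for _ in range(num_episodes):
        e = defaultdict(float)                    # e(s, a) = 0 for all s, a
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            q_next = 0.0 if done else Q[(s_next, a_next)]  # terminal value taken as 0
            delta = r + gamma * q_next - Q[(s, a)]
            e[(s, a)] += 1.0                      # accumulate trace for the taken pair
            for key in list(e):
                Q[key] += alpha * delta * e[key]
                e[key] *= gamma * lam             # decay all traces
            s, a = s_next, a_next
    return Q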


Sarsa(λ) Gridworld Example

• With one trial, the agent has much more information about how to get to the goal

• not necessarily the best way

• Can considerably accelerate learning


Three Approaches to Q(λ)

• How can we extend this to Q-learning?
• Recall Q-learning is an off-policy method to learn Q*, and it uses the max of the Q values for a state in its backup
• What happens if we make an exploratory move? This is NOT the right thing to back up over then…

• What to do?

• Three answers: Watkins, Peng, and the “naïve” answer.


Three Approaches to Q(λ)

• If you mark every state-action pair as eligible, you back up over the non-greedy policy
• Watkins: zero out the eligibility trace after a non-greedy action. Do the max when backing up at the first non-greedy choice.

  $e_t(s,a) = \begin{cases} 1 + \gamma\lambda\, e_{t-1}(s,a) & \text{if } s = s_t,\ a = a_t,\ Q_{t-1}(s_t,a_t) = \max_a Q_{t-1}(s_t,a) \\ 0 & \text{if } Q_{t-1}(s_t,a_t) \ne \max_a Q_{t-1}(s_t,a) \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$

  $Q_{t+1}(s,a) = Q_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$

  $\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)$


Watkins's Q(λ)

Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    e(s,a) = 0, for all s, a
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
        a* ← argmax_b Q(s′,b)   (if a′ ties for the max, then a* ← a′)
        δ ← r + γQ(s′,a*) − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            If a′ = a*, then e(s,a) ← γλe(s,a)
            else e(s,a) ← 0
        s ← s′; a ← a′
    Until s is terminal
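A sketch of the inner step of Watkins's Q(λ) (not the book's code; Q and e are assumed to be defaultdict(float) tables keyed by (state, action), and actions is the hypothetical list of available actions):

def watkins_q_lambda_step(Q, e, s, a, r, s_next, a_next, actions,
                          alpha=0.05, gamma=0.9, lam=0.9):
    a_star = max(actions, key=lambda b: Q[(s_next, b)])
    if Q[(s_next, a_next)] == Q[(s_next, a_star)]:
        a_star = a_next                          # if a' ties for the max, take a* = a'
    delta = r + gamma * Q[(s_next, a_star)] - Q[(s, a)]
    e[(s, a)] += 1.0
    for key in list(e):
        Q[key] += alpha * delta * e[key]
        if a_next == a_star:
            e[key] *= gamma * lam                # greedy action taken: decay traces
        else:
            e[key] = 0.0                         # exploratory action: cut all traces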


Peng's Q(λ)

• Disadvantage of Watkins's method:
  • Early in learning, the eligibility trace will be "cut" (zeroed out) frequently, resulting in little advantage from traces
• Peng:
  • Back up the max action except at the end
  • Never cut traces
• Disadvantage:
  • Complicated to implement


Naïve Q(λ)

• Idea: is it really a problem to back up exploratory actions?

• Never zero traces

• Always back up the max at the current action (unlike Peng's or Watkins's)

• Is this truly naïve?

• Well, it works well in some empirical studies

What is the backup diagram?


Comparison Task

From McGovern and Sutton (1997), Towards a Better Q(λ)

• Compared Watkins's, Peng's, and naïve (called McGovern's here) Q(λ) on several tasks
  • See McGovern and Sutton (1997), Towards a Better Q(λ), for other tasks and results (stochastic tasks, continuing tasks, etc.)
• Deterministic gridworld with obstacles
  • 10x10 gridworld
  • 25 randomly generated obstacles
  • 30 runs
  • α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces


Comparison Results

From McGovern and Sutton (1997), Towards a Better Q(λ)

Convergence of the Q(λ)'s

• None of the methods is proven to converge.
  • Much extra credit if you can prove any of them.
• Watkins's is thought to converge to Q*
• Peng's is thought to converge to a mixture of Q^π and Q*
• Naïve: Q*?


Eligibility Traces for Actor-Critic Methods

• Critic: on-policy learning of $V^\pi$. Use TD(λ) as described before.
• Actor: needs eligibility traces for each state-action pair.
• We change the update equation

  $p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha\, \delta_t & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$

  to

  $p_{t+1}(s,a) = p_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$


Eligibility Traces for Actor-Critic Methods

• Can change the other actor-critic update

  $p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha\, \delta_t \left[ 1 - \pi(s,a) \right] & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$

  to

  $p_{t+1}(s,a) = p_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$

  where

  $e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 - \pi_t(s_t,a_t) & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$
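A minimal sketch (not from the slides) of the actor's per-step preference update for this second variant; p and e are assumed to be defaultdict(float) tables keyed by (state, action), and pi_sa is the current probability π_t(s_t, a_t):

def actor_trace_update(p, e, s_t, a_t, delta, pi_sa, alpha=0.1, gamma=1.0, lam=0.9):
    for key in list(e):
        e[key] *= gamma * lam            # decay all traces by gamma*lambda
    e[(s_t, a_t)] += 1.0 - pi_sa         # increment the taken pair by 1 - pi_t(s_t, a_t)
    for key in list(e):
        p[key] += alpha * delta * e[key] # p_{t+1}(s,a) = p_t(s,a) + alpha*delta_t*e_t(s,a)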


Replacing Traces

• Using accumulating traces, frequently visited states can have eligibilities greater than 1
  • This can be a problem for convergence
• Replacing traces: instead of adding 1 when you visit a state, set that trace to 1:

  $e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \ne s_t \\ 1 & \text{if } s = s_t \end{cases}$
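A minimal sketch (not from the slides) of the per-step trace update in both flavours, for state traces kept in a defaultdict(float):

def update_state_traces(e, s_t, gamma=1.0, lam=0.9, replacing=True):
    for s in list(e):
        e[s] *= gamma * lam          # decay every trace by gamma*lambda
    if replacing:
        e[s_t] = 1.0                 # replacing trace: reset the visited state's trace to 1
    else:
        e[s_t] += 1.0                # accumulating trace: add 1 (can grow beyond 1)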


Replacing Traces Example

• Same 19-state random walk task as before
• Replacing traces perform better than accumulating traces over more values of λ


Why Replacing Traces?

• Replacing traces can significantly speed learning

• They can make the system perform well for a broader set of parameters

• Accumulating traces can do poorly on certain types of tasks

Why is this task particularly onerous for accumulating traces?


More Replacing Traces

• Off-line replacing trace TD(1) is identical to first-visit MC

• Extension to action values (Q values):
  • When you revisit a state, what should you do with the traces for the actions you didn't take?
  • Singh and Sutton say to set them to zero:

  $e_t(s,a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ 0 & \text{if } s = s_t \text{ and } a \ne a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{if } s \ne s_t \end{cases}$


Implementation Issues with Traces

• Could require much more computation
  • But most eligibility traces are VERY close to zero
• If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrices)
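The slides mention Matlab; the analogous vectorized backup in Python/NumPy is similarly short (a sketch, assuming V and e are 1-D NumPy arrays indexed by an integer state, e.g. created with numpy.zeros(num_states)):

def td_lambda_backup(V, e, s, r, s_next, alpha, gamma, lam):
    delta = r + gamma * V[s_next] - V[s]   # TD error for this transition
    e[s] += 1.0                            # accumulating trace for the current state
    V += alpha * delta * e                 # back up every state in one vector operation
    e *= gamma * lam                       # decay every trace in one vector operation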


Variable λ

• Can generalize to a variable λ
• Here λ is a function of time
• Could define

  $e_t(s) = \begin{cases} \gamma\lambda_t\, e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda_t\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$

  with, for example, $\lambda_t = \lambda(s_t)$ (λ as a function of the state visited at time t), or some other time-dependent schedule for $\lambda_t$


Conclusions

• Provides an efficient, incremental way to combine MC and TD
  • Includes the advantages of MC (can deal with lack of the Markov property)
  • Includes the advantages of TD (using the TD error, bootstrapping)

• Can significantly speed learning

• Does have a cost in computation


The two views


END

Unified View