Chapter 8: Generalization and Function Approximation

Objectives of this chapter:
- Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part.
- Overview of function approximation (FA) methods and how they can be adapted to RL.
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
Value Prediction with FA

As usual: Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.

In earlier chapters, value functions were stored in lookup tables. Here, the value function estimate at time t, V_t, depends on a parameter vector θ_t, and only the parameter vector is updated.

e.g., θ_t could be the vector of connection weights of a neural network.
Adapt Supervised Learning Algorithms

Training Info = desired (target) outputs

    Inputs → [Supervised Learning System] → Outputs

Error = (target output – actual output)
Training example = {input, target output}
Backups as Training Examples

e.g., the TD(0) backup:

    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

As a training example:

    input: description of s_t        target output: r_{t+1} + γ V(s_{t+1})
Any Function Approximation Method?

In principle, yes:
- artificial neural networks
- decision trees
- multivariate regression methods
- etc.

But RL has some special requirements:
- usually want to learn while interacting
- ability to handle nonstationarity
- other?
Gradient Descent Methods

    θ_t = ( θ_t(1), θ_t(2), …, θ_t(n) )^T        (T = transpose)

Assume V_t is a (sufficiently smooth) differentiable function of θ_t, for all s ∈ S.

Assume, for now, training examples of this form:

    ( description of s_t , V^π(s_t) )
Performance Measures

Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution P to weight the various errors:

    MSE(θ_t) = Σ_{s∈S} P(s) [ V^π(s) − V_t(s) ]²

Why P? Why minimize MSE? Let us assume that P is always the distribution of states at which backups are done. The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.
Gradient Descent

Let f be any function of the parameter space. Its gradient at any point θ_t in this space is:

    ∇_θ f(θ_t) = ( ∂f(θ_t)/∂θ(1), ∂f(θ_t)/∂θ(2), …, ∂f(θ_t)/∂θ(n) )^T

Iteratively move down the gradient:

    θ_{t+1} = θ_t − α ∇_θ f(θ_t)
Gradient Descent Cont.

For the MSE given above, and using the chain rule:

    θ_{t+1} = θ_t − (1/2) α ∇_θ MSE(θ_t)
            = θ_t − (1/2) α ∇_θ Σ_{s∈S} P(s) [ V^π(s) − V_t(s) ]²
            = θ_t + α Σ_{s∈S} P(s) [ V^π(s) − V_t(s) ] ∇_θ V_t(s)
Gradient Descent Cont.

Assume that states appear with distribution P. Use just the sample gradient instead:

    θ_{t+1} = θ_t − (1/2) α ∇_θ [ V^π(s_t) − V_t(s_t) ]²
            = θ_t + α [ V^π(s_t) − V_t(s_t) ] ∇_θ V_t(s_t)

Since each sample gradient is an unbiased estimate of the true gradient,

    E{ [ V^π(s_t) − V_t(s_t) ] ∇_θ V_t(s_t) } = Σ_{s∈S} P(s) [ V^π(s) − V_t(s) ] ∇_θ V_t(s),

this converges to a local minimum of the MSE if α decreases appropriately with t.
But We Don’t Have These Targets

Suppose we just have targets v_t instead:

    θ_{t+1} = θ_t + α [ v_t − V_t(s_t) ] ∇_θ V_t(s_t)

If each v_t is an unbiased estimate of V^π(s_t), i.e., E{v_t} = V^π(s_t), then gradient descent converges to a local minimum (provided α decreases appropriately).

e.g., the Monte Carlo target v_t = R_t:

    θ_{t+1} = θ_t + α [ R_t − V_t(s_t) ] ∇_θ V_t(s_t)
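As a concrete illustration (not from the slides), here is a minimal sketch of this update with the Monte Carlo target, assuming a linear value function so that ∇_θ V_t(s_t) is just the feature vector; the features and return below are made-up numbers:

```python
import numpy as np

def mc_gradient_update(theta, phi, mc_return, alpha=0.1):
    """One gradient-descent step toward the Monte Carlo target R_t:
    theta <- theta + alpha * (R_t - V_t(s_t)) * grad V_t(s_t).
    Assumes a linear value function V(s) = theta . phi(s), so the
    gradient with respect to theta is simply the feature vector phi."""
    v = theta @ phi                            # current estimate V_t(s_t)
    return theta + alpha * (mc_return - v) * phi

# Repeated updates on one state pull its estimate toward the observed return:
theta = np.zeros(3)
phi = np.array([1.0, 0.5, 0.0])                # hypothetical features for s_t
for _ in range(200):
    theta = mc_gradient_update(theta, phi, mc_return=4.0)
# theta @ phi is now close to 4.0
```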
What about TD(λ) Targets?

    θ_{t+1} = θ_t + α [ R_t^λ − V_t(s_t) ] ∇_θ V_t(s_t)

Not an unbiased estimate for λ < 1. But we do it anyway, using the backward view:

    θ_{t+1} = θ_t + α δ_t e_t,

where:

    δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t), as usual, and
    e_t = γ λ e_{t−1} + ∇_θ V_t(s_t)
On-Line Gradient-Descent TD(λ)
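The backward-view updates above can be sketched in code. This is a hedged sketch of on-line gradient-descent TD(λ) with a linear value function, demonstrated on a small random-walk task of my own choosing (one-hot features, so ∇_θ V_t(s_t) is the one-hot vector for s_t):

```python
import numpy as np

def td_lambda(num_states=5, episodes=2000, alpha=0.05, gamma=1.0, lam=0.8, seed=0):
    """On-line gradient-descent TD(lambda), backward view, with linear
    (one-hot) features on a random walk: states 0..num_states-1, start
    in the middle, step left/right uniformly, terminate off either end,
    reward +1 only for stepping off the right end."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(num_states)                 # parameter vector
    for _ in range(episodes):
        e = np.zeros(num_states)                 # eligibility trace
        s = num_states // 2
        while True:
            s2 = s + (1 if rng.random() < 0.5 else -1)
            terminal = s2 < 0 or s2 >= num_states
            r = 1.0 if s2 == num_states else 0.0
            v_next = 0.0 if terminal else theta[s2]
            delta = r + gamma * v_next - theta[s]   # TD error delta_t
            e *= gamma * lam                         # decay traces
            e[s] += 1.0                              # add grad V_t(s_t) (one-hot)
            theta += alpha * delta * e
            if terminal:
                break
            s = s2
    return theta

# True values for this walk are i/6 for states i = 1..5, left to right.
values = td_lambda()
```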
Linear Methods

Represent states as feature vectors: for each s ∈ S:

    φ_s = ( φ_s(1), φ_s(2), …, φ_s(n) )^T

    V_t(s) = θ_t^T φ_s = Σ_{i=1}^{n} θ_t(i) φ_s(i)

    ∇_θ V_t(s) = ?
Nice Properties of Linear FA Methods

- The gradient is very simple: ∇_θ V_t(s) = φ_s
- For MSE, the error surface is simple: a quadratic surface with a single minimum.
- Linear gradient-descent TD(λ) converges:
  - Step size decreases appropriately
  - On-line sampling (states sampled from the on-policy distribution)
  - Converges to a parameter vector θ_∞ with the property:

        MSE(θ_∞) ≤ [ (1 − γλ) / (1 − γ) ] MSE(θ*)

    where θ* is the best parameter vector (Tsitsiklis & Van Roy, 1997).
Coarse Coding
Learning and Coarse Coding
Tile Coding

- Binary feature for each tile
- Number of features present at any one time is constant
- Binary features means the weighted sum is easy to compute
- Easy to compute the indices of the features present
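A minimal sketch of how such tile features might be computed in 2-D (an illustrative grid-and-offset scheme of my own, not the book's hashed CMAC):

```python
def tile_indices(x, y, num_tilings=8, tiles_per_dim=10, lo=0.0, hi=1.0):
    """Indices of the active (binary) features for input (x, y).
    Each tiling is a uniform grid shifted by a fraction of a tile width;
    exactly one tile per tiling contains the input, so exactly
    num_tilings features are active for any input."""
    width = (hi - lo) / tiles_per_dim
    active = []
    for t in range(num_tilings):
        off = t * width / num_tilings          # per-tiling offset
        ix = int((x - lo + off) / width)
        iy = int((y - lo + off) / width)
        n = tiles_per_dim + 1                  # +1 cell to absorb the offset
        active.append(t * n * n + iy * n + ix) # flat feature index
    return active

# Any point activates exactly one tile per tiling; distant points
# activate disjoint feature sets.
a = tile_indices(0.3, 0.7)
b = tile_indices(0.9, 0.1)
```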
Tile Coding Cont.

- Irregular tilings
- Hashing
- CMAC: "Cerebellar Model Arithmetic Computer" (Albus, 1971)
Radial Basis Functions (RBFs)

e.g., Gaussians:

    φ_s(i) = exp( − ‖s − c_i‖² / (2 σ_i²) )
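A minimal sketch of computing such Gaussian RBF features, assuming a shared width σ and illustrative centers of my own choosing:

```python
import numpy as np

def rbf_features(s, centers, sigma=0.2):
    """Gaussian RBF features phi_s(i) = exp(-||s - c_i||^2 / (2 sigma^2)).
    s is a point in R^d; centers is a list of d-dimensional centers c_i;
    a shared width sigma is used here instead of per-feature sigma_i."""
    s = np.asarray(s, dtype=float)
    centers = np.asarray(centers, dtype=float)
    d2 = np.sum((centers - s) ** 2, axis=1)    # squared distances ||s - c_i||^2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# A feature equals 1 at its own center and falls off smoothly with distance:
centers = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
phi = rbf_features([0.0, 0.0], centers)
```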
Can you beat the “curse of dimensionality”?

Can you keep the number of features from going up exponentially with the dimension? Function complexity, not dimensionality, is the problem.

Kanerva coding:
- Select a bunch of binary prototypes
- Use Hamming distance as the distance measure
- Dimensionality is no longer a problem, only complexity

“Lazy learning” schemes:
- Remember all the data
- To get a new value, find nearest neighbors and interpolate
- e.g., locally-weighted regression
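The Kanerva-coding idea above can be sketched directly; the prototype count, input dimension, and radius below are arbitrary choices for illustration:

```python
import numpy as np

def kanerva_features(s, prototypes, radius):
    """Binary features over binary prototypes: feature i is active when
    the Hamming distance from s to prototype i is at most radius. The
    number of features equals the number of prototypes, independent of
    the input dimension."""
    dists = np.sum(prototypes != s, axis=1)    # Hamming distances
    return (dists <= radius).astype(float)

rng = np.random.default_rng(0)
protos = rng.integers(0, 2, size=(50, 100))    # 50 prototypes in {0,1}^100
phi = kanerva_features(protos[0], protos, radius=40)
# feature 0 is active: its prototype is at Hamming distance 0 from the input
```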
Control with Function Approximation

Learning state-action values. Training examples of the form:

    ( description of (s_t, a_t) , v_t )

The general gradient-descent rule:

    θ_{t+1} = θ_t + α [ v_t − Q_t(s_t, a_t) ] ∇_θ Q_t(s_t, a_t)

Gradient-descent Sarsa(λ) (backward view):

    θ_{t+1} = θ_t + α δ_t e_t

where:

    δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t), and
    e_t = γ λ e_{t−1} + ∇_θ Q_t(s_t, a_t)
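The two update equations above can be sketched as a single step, assuming a linear action-value function Q(s, a) = θᵀφ(s, a) so that ∇_θ Q = φ; the one-hot feature vectors in the usage example are hypothetical:

```python
import numpy as np

def sarsa_lambda_step(theta, e, phi, phi_next, r,
                      alpha=0.1, gamma=0.9, lam=0.8, terminal=False):
    """One backward-view gradient-descent Sarsa(lambda) step with a
    linear Q, so grad Q_t(s_t, a_t) = phi. phi and phi_next are feature
    vectors for (s_t, a_t) and (s_{t+1}, a_{t+1}); terminal zeroes the
    bootstrap term."""
    q = theta @ phi
    q_next = 0.0 if terminal else theta @ phi_next
    delta = r + gamma * q_next - q             # TD error delta_t
    e = gamma * lam * e + phi                  # accumulating trace e_t
    theta = theta + alpha * delta * e          # theta_{t+1}
    return theta, e

# A single terminal transition with reward 1 moves theta toward phi:
theta, e = np.zeros(4), np.zeros(4)
phi = np.array([1.0, 0.0, 0.0, 0.0])
phi2 = np.array([0.0, 1.0, 0.0, 0.0])
theta, e = sarsa_lambda_step(theta, e, phi, phi2, r=1.0, terminal=True)
```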
GPI with Linear Gradient Descent Sarsa(λ)
GPI Linear Gradient Descent Watkins’ Q(λ)
Mountain-Car Task
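For reference, the standard mountain-car dynamics (as commonly specified for this task: actions are full throttle forward, reverse, or zero; reward is −1 per step until the goal at x ≥ 0.5) can be sketched as:

```python
import math

X_MIN, X_MAX = -1.2, 0.5       # position bounds; goal at X_MAX
V_MIN, V_MAX = -0.07, 0.07     # velocity bounds

def mountain_car_step(x, v, a):
    """One step of the mountain-car dynamics for action a in {-1, 0, +1}.
    Returns (new_x, new_v, reward, done)."""
    v = min(max(v + 0.001 * a - 0.0025 * math.cos(3 * x), V_MIN), V_MAX)
    x = min(max(x + v, X_MIN), X_MAX)
    if x <= X_MIN:             # hitting the left wall zeroes the velocity
        v = max(v, 0.0)
    done = x >= X_MAX
    return x, v, -1.0, done

# From rest in the valley, full throttle alone barely moves the car,
# which is why the task requires building momentum:
x, v, r, done = mountain_car_step(-0.5, 0.0, 1)
```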
Mountain-Car Results
Baird’s Counterexample

Reward is always zero, so the true value function is zero for all s. The approximate state-value function is shown in each state. Updating with the expected TD(0) backup,

    θ_{k+1} = θ_k + α [ E{ r_{t+1} + γ V_k(s_{t+1}) | s_t = s } − V_k(s) ] ∇_θ V_k(s),

is unstable from some initial conditions.
Baird’s Counterexample Cont.
Should We Bootstrap?
Summary

- Generalization
- Adapting supervised-learning function approximation methods
- Gradient-descent methods
- Linear gradient-descent methods
  - Radial basis functions
  - Tile coding
  - Kanerva coding
- Nonlinear gradient-descent methods? Backpropagation?
- Subtleties involving function approximation, bootstrapping, and the on-policy/off-policy distinction