TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the...
Transcript of TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the...
![Page 1: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/1.jpg)
TD learning
Biology:
dompanine
![Page 2: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/2.jpg)
TD(λ)
Using a longer trajectory rather than single step:
For a single step:
The expected total Return from S on is:
R(S) = r + γV(S’)
For two steps:
![Page 3: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/3.jpg)
TD(λ)
n-step return at time t: Using a trajectory of length n
The value V(S) can be updated following n
steps from S by:
Estimation of the total return based on n steps
![Page 4: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/4.jpg)
Summary: generalizes the 1-step
update
n-step return at time t:
The value V(S) can be updated following n steps from
S by:
ΔVt(St) = α[rt+1 + γVt(St+1) - Vt(St)]
Generalizes the 1-step learning :
![Page 5: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/5.jpg)
Averaging trajectories: • It is also possible to average trajectories; we can use the sub-trajectories of
the full length-n trajectory to update V(S).
• A particular averaging (particular weights) is the TD(λ) weights:
• The weights are 1, λ, λ2,… with all this multiplied by (1-λ) since a weighted average needs the sum of weights to be 1.
•
![Page 6: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/6.jpg)
λ – Return
Using the single long trajectory we had:
The λ –return is the weighted average of all lengths:
![Page 7: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/7.jpg)
TD(λ)
And the learning rules:
TD(λ) learning:
Singe long trajectory:
![Page 8: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/8.jpg)
Eligibility traces
To compute this at time t, we need the n next steps which we still do not have. We want at time t to update back, the previous n visited states. This can be done with ‘eligibility trace Each visited state becomes ‘eligible’ for update, updates take place later:
TD(λ) learning:
![Page 9: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/9.jpg)
Implementing TD(λ) with Eligibility Traces
A memory called 'eligibility trace' is added to each state et(S) It is updated by:
The trace of S is incremented by 1 when S is visited, and decays by γλ at each step. Here γ is the discount factor and λ is the decay parameter.
![Page 10: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/10.jpg)
Learning with eligibility traces
Update V(S):
Take a step, compute a singel-step TD error:
V(S) is updated at each step, although the current step is different. If S was visited, then S1, S2, S3, then V(S) will be updated with the error of each of them. δ
![Page 11: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/11.jpg)
The full TD(λ) Algorithm:
V(S) is updated at each step, although the current step is different from S. If S was visited, then S1, S2, S3, then V(S) will be updated with the error of each of them.
![Page 12: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/12.jpg)
Eligibility traces
Updating state values V(S) by eligibility traces is mathematically identical to the ‘forward’ TD(λ) learning:
The update does not rely on future values, and has plausible biological models.
![Page 13: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/13.jpg)
SARSA (λ)
![Page 14: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/14.jpg)
Eligibility traces – biology
![Page 15: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/15.jpg)
SDTP
![Page 16: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/16.jpg)
Eligibility
![Page 17: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/17.jpg)
Synaptic Reinforcement
![Page 18: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/18.jpg)
Dopamine story
![Page 19: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/19.jpg)
Behavioral support for ‘prediction error’
Associating light cue with food
![Page 20: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/20.jpg)
‘Blocking’
No response to the bell The bell and food were consistently associated There was no prediction error, prediction error, not association, drives learning
![Page 21: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/21.jpg)
Rescola - Wagner
Associative learning occurs not because two events co-occur but because that co-occurrence is unanticipated on the basis of current associative strength.
Α, β are rate parameters. Vtot is the total association from all cues on this trial. λ is the currently expected value. Learning occurs if the current value Vtot is different from expectation.
Still no action selection, policy for behavior, long sequences
![Page 22: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/22.jpg)
Iterative solution for V(S)
Vπ(S) = < r1 + γ Vπ (S') >
V(S) ← V(S) + α [ (r + γV(S’)) – V(S) ]
Error
Prediction error, TD error
![Page 23: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/23.jpg)
• Learning is driven by the prediction error:
• δ(t) = r + γV(S’)) – V(S)
• Computed by the dopamine system
• (Here too, if there is no error, no learning will take place)
![Page 24: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/24.jpg)
Domaminergic neurons
• Dopamine is a neuro-modulator
• In the:
• VTA (ventral tegmental area)
• Substantia Nigra
• These neurons send their axons to brain structures involved in motivation and goal-directed behavior, for example, the striatum, nucleus accumbens, and frontal cortex.
![Page 25: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/25.jpg)
Major players in RL
![Page 26: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/26.jpg)
Effects of dopamine, why it is associated with reward and reward related learning
• drugs like amphetamine and cocaine exert their addictive actions in part by prolonging the influence of dopamine on target neurons
• Second, neural pathways associated with dopamine neurons are among the best targets for electrical self-stimulation.
• animals treated with dopamine receptor blockers learn less rapidly to press a bar for a reward pellet
![Page 27: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/27.jpg)
Self stimulation
![Page 28: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/28.jpg)
• You can put a stimulating electrode in various places. In
the Dopamine system (e.g. VTA), the animal will continue stimulating.
• In the Orbital cortex for example you can put the electrode in a taste-related sub-region, activated by food. The animal will stimulate the electrode when it is hungry, but will stop activating when he is not.
![Page 29: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/29.jpg)
Dopamine and prediction error
The animal (rat, monkey) gets a cue (visual, or auditory). A reward after a delay (1 sec below)
![Page 30: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/30.jpg)
Dopamine and prediction error
![Page 31: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/31.jpg)
TD, prediction error Conclusion of the biological study
![Page 32: TD learning Biology: dompanine9.54/fall14/slides/RL4 -- Computation and... · 2014-11-17 · the Dopamine system (e.g. VTA), the animal will continue stimulating. •In the Orbital](https://reader034.fdocuments.in/reader034/viewer/2022042322/5f0c13b47e708231d433a11b/html5/thumbnails/32.jpg)
Computational TD learning is similar:
Update V(S):
Take a step, compute a TD error:
V(S) is updated at each step, although the current step is different. If S was visited, then S1, S2, S3, then V(S) will be updated with the error of each of them. δ