Summary of part I: prediction and RL


Transcript of Summary of part I: prediction and RL

Summary of part I: prediction and RL

Prediction is important for action selection

• The problem: prediction of future reward

• The algorithm: temporal difference learning

• Neural implementation: dopamine dependent learning in BG

⇒ A precise computational model of learning allows one to look in the brain for “hidden variables” postulated by the model

⇒ Precise (normative!) theory for generation of dopamine firing patterns

⇒ Explains anticipatory dopaminergic responding, second order conditioning

⇒ Compelling account for the role of dopamine in classical conditioning: prediction error acts as signal driving learning in prediction areas

prediction error hypothesis of dopamine

[figure: measured firing rate plotted against model prediction error]

Bayer & Glimcher (2005)

at end of trial: δt = rt − Vt (just like R-W)

Vt = η Σ_{i=1}^{t} (1−η)^(t−i) ri
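A minimal sketch of this running estimate (Python; the learning rate and the Bernoulli reward stream are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.1                                   # learning rate (assumed)
rewards = rng.binomial(1, 0.5, size=200)    # hypothetical trial-by-trial rewards

V = 0.0
for r in rewards:
    delta = r - V        # end-of-trial prediction error, as in R-W
    V += eta * delta     # incremental form of Vt = eta * sum_i (1-eta)^(t-i) * ri
```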

Global plan

• Reinforcement learning I:

– prediction

– classical conditioning

– dopamine

• Reinforcement learning II:

– dynamic programming; action selection

– Pavlovian misbehaviour

– vigor

• Chapter 9 of Theoretical Neuroscience

Action Selection

• Evolutionary specification

• Immediate reinforcement:

– leg flexion

– Thorndike puzzle box

– pigeon; rat; human matching

• Delayed reinforcement:

– these tasks

– mazes

– chess

Bandler; Blanchard

Immediate Reinforcement

• stochastic policy: P[L; m] = σ(β(mL − mR))

• based on action values: mL, mR

Indirect Actor: use RW rule

rL = 0.05; rR = 0.25 (switch every 100 trials)
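A minimal simulation sketch of the indirect actor (Python; the softmax temperature, learning rate, trial count, and treating rL, rR as Bernoulli reward probabilities are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
beta, eps = 2.0, 0.1               # softmax temperature and RW learning rate (assumed)
p_reward = {'L': 0.05, 'R': 0.25}  # rL, rR read as reward probabilities (assumed)
m = {'L': 0.0, 'R': 0.0}           # action values

for t in range(400):
    if t > 0 and t % 100 == 0:                               # contingencies switch every 100 trials
        p_reward['L'], p_reward['R'] = p_reward['R'], p_reward['L']
    p_L = 1.0 / (1.0 + np.exp(-beta * (m['L'] - m['R'])))    # stochastic policy P[L; m]
    a = 'L' if rng.random() < p_L else 'R'
    r = float(rng.random() < p_reward[a])
    m[a] += eps * (r - m[a])                                  # RW rule on the chosen action's value
```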

Direct Actor

E(m) = P[L] rL + P[R] rR

∂P[L]/∂mL = β P[L] P[R] = −∂P[R]/∂mL

∂E(m)/∂mL = β P[L] ( rL − ( P[L] rL + P[R] rR ) )

∂E(m)/∂mL = β P[L] ( rL − E(m) )

∂E(m)/∂mL ≈ β ( rL − E(m) )   if L is chosen

(mL − mR) → (1−ε)(mL − mR) + ε (ra − E(m)) (1[a=L] − 1[a=R])
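A corresponding sketch of the direct-actor update (Python; the parameters are assumed; only the difference mL − mR and a running estimate of E(m) are tracked):

```python
import numpy as np

rng = np.random.default_rng(2)
beta, eps = 2.0, 0.05              # assumed parameters
p_reward = {'L': 0.05, 'R': 0.25}  # same bandit as above
diff, r_bar = 0.0, 0.0             # mL - mR, and a running estimate of E(m)

for t in range(1000):
    p_L = 1.0 / (1.0 + np.exp(-beta * diff))
    a = 'L' if rng.random() < p_L else 'R'
    r = float(rng.random() < p_reward[a])
    sign = 1.0 if a == 'L' else -1.0
    diff = (1 - eps) * diff + eps * (r - r_bar) * sign   # the update rule above
    r_bar += eps * (r - r_bar)                           # track the average reward
```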


Could we Tell?

• correlate past rewards, actions with present choice

• indirect actor (separate clocks):

• direct actor (single clock):

Matching: Concurrent VI-VI

Lau, Glimcher, Corrado, Sugrue, Newsome

Matching

• income not return

• approximately exponential in r

• alternation choice kernel

Action at a (Temporal) Distance

[figure: states x=1, x=2, x=3]

• learning an appropriate action at x=1:

– depends on the actions at x=2 and x=3

– gains no immediate feedback

• idea: use prediction as surrogate feedback

Action Selection

start with policy: P[L; x] = σ(mL(x) − mR(x))

evaluate it: V(1), V(2), V(3)


improve it:

Δm ∝ α δ

thus choose R more frequently than L

[figure: example values 0.025, −0.175, −0.125, 0.125 at states x=1, x=2, x=3]

Policy

if δ > 0:

• value is too pessimistic ⇒ Δv

• action is better than average ⇒ ΔP

actor/critic

[figure: actor/critic architecture with policy parameters m1, m2, m3, …, mn]

dopamine signals to both motivational & motor striatum appear, surprisingly, to be the same

suggestion: training both values & policies
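A minimal actor/critic sketch on the three-state example above (Python; the tree's terminal payoffs and both learning rates are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, eps = 0.1, 0.05                      # critic and actor learning rates (assumed)
# hypothetical tree: at x=1, L -> x=2 and R -> x=3; actions at x=2, x=3 pay terminal rewards
r_term = {(2, 'L'): 0.0, (2, 'R'): 1.0, (3, 'L'): 2.0, (3, 'R'): 0.0}   # assumed payoffs
V = {1: 0.0, 2: 0.0, 3: 0.0}                # critic
m = {x: {'L': 0.0, 'R': 0.0} for x in V}    # actor propensities

def act(x):
    p_L = 1.0 / (1.0 + np.exp(-(m[x]['L'] - m[x]['R'])))   # P[L; x] = sigma(mL(x) - mR(x))
    return 'L' if rng.random() < p_L else 'R'

for episode in range(2000):
    a = act(1)
    x2 = 2 if a == 'L' else 3
    delta = V[x2] - V[1]              # TD error at x=1 (prediction as surrogate feedback)
    V[1] += alpha * delta             # critic: evaluate the current policy
    m[1][a] += eps * delta            # actor: the same delta improves the policy
    a2 = act(x2)
    delta = r_term[(x2, a2)] - V[x2]  # terminal step
    V[x2] += alpha * delta
    m[x2][a2] += eps * delta
```

The same δ plays both roles here: when δ > 0 it raises the value of the state and the propensity of the action just taken.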

Formally: Dynamic Programming

Variants: SARSA

Q*(1, C) = E[ rt + V*(xt+1) | xt = 1, ut = C ]

Q(1, C) → Q(1, C) + ε ( rt + Q(2, u_actual) − Q(1, C) )

Morris et al, 2006

Variants: Q learning

Q*(1, C) = E[ rt + V*(xt+1) | xt = 1, ut = C ]

Q(1, C) → Q(1, C) + ε ( rt + max_u Q(2, u) − Q(1, C) )

Roesch et al, 2007
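A side-by-side sketch of the two updates (Python; Q is a dict keyed by (state, action), ε is the learning rate, and the state and action names are illustrative):

```python
def sarsa_update(Q, r_t, u_actual, eps):
    # SARSA: bootstrap on the action actually taken at state 2
    Q[(1, 'C')] += eps * (r_t + Q[(2, u_actual)] - Q[(1, 'C')])

def q_learning_update(Q, r_t, actions, eps):
    # Q-learning: bootstrap on the best available action at state 2
    Q[(1, 'C')] += eps * (r_t + max(Q[(2, u)] for u in actions) - Q[(1, 'C')])

Q = {(1, 'C'): 0.0, (2, 'go'): 0.5, (2, 'nogo'): 0.1}   # toy values
sarsa_update(Q, r_t=1.0, u_actual='go', eps=0.1)
q_learning_update(Q, r_t=1.0, actions=('go', 'nogo'), eps=0.1)
```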

Summary

• prediction learning
– Bellman evaluation

• actor-critic
– asynchronous policy iteration

• indirect method (Q learning)
– asynchronous value iteration

V*(1) = E[ rt + V*(xt+1) | xt = 1 ]

Q*(1, C) = E[ rt + V*(xt+1) | xt = 1, ut = C ]
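A sketch of asynchronous value iteration, the indirect route listed above (Python; the toy MDP, its rewards, and the discount factor are assumptions for illustration):

```python
# T[x][u] = (immediate reward, next state); deterministic transitions for brevity
T = {1: {'L': (0.0, 2), 'R': (0.0, 3)},
     2: {'L': (1.0, 1), 'R': (0.0, 1)},
     3: {'L': (2.0, 1), 'R': (0.0, 1)}}
gamma = 0.9                                  # discount factor (assumed)
V = {x: 0.0 for x in T}

for sweep in range(100):
    for x in T:                              # asynchronous: states are updated in place
        V[x] = max(r + gamma * V[x2] for r, x2 in T[x].values())   # Bellman optimality backup
```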

Impulsivity & Hyperbolic Discounting

• humans (and animals) show impulsivity in:
– diets
– addiction
– spending, …

• intertemporal conflict between short and long term choices

• often explained via hyperbolic discount functions

• alternative is Pavlovian imperative to an immediate reinforcer

• framing, trolley dilemmas, etc

Direct/Indirect Pathways

• direct: D1: GO; learn from DA increase

• indirect: D2: noGO; learn from DA decrease

• hyperdirect (STN): delay actions given strongly attractive choices

Frank

• DARPP-32: D1 effect

• DRD2: D2 effect

Three Decision Makers

• tree search

• position evaluation

• situation memory

Multiple Systems in RL

• model-based RL
– build a forward model of the task, outcomes
– search in the forward model (online DP)
• optimal use of information
• computationally ruinous

• cache-based RL
– learn Q values, which summarize future worth
• computationally trivial
• bootstrap-based; so statistically inefficient

• learn both – select according to uncertainty

Animal Canary

• OFC; dlPFC; dorsomedial striatum; BLA?

• dorsolateral striatum, amygdala

Two Systems:

Behavioural Effects

Effects of Learning

• distributional value iteration

• (Bayesian Q learning)

• fixed additional uncertainty per step

One Outcome

shallow tree implies goal-directed control wins

Human Canary

• if a → c and c → £££, then do more of a or b?
– MB: b
– MF: a (or even no effect)

Behaviour

• action values depend on both systems:

• expect that β will vary by subject (but be fixed)

Qtot(x, u) = β Q_MB(x, u) + (1 − β) Q_MF(x, u)
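A toy sketch of this combination, assuming the convex-combination reading of the equation above (Python; the weight and both sets of values are invented for illustration):

```python
import numpy as np

beta = 0.7                         # MB weight for one hypothetical subject
Q_MF = {'L': 0.2, 'R': 0.5}        # cached (model-free) values, illustrative
Q_MB = {'L': 0.6, 'R': 0.1}        # values from searching a forward model, illustrative

Q_tot = {u: beta * Q_MB[u] + (1 - beta) * Q_MF[u] for u in ('L', 'R')}
prefs = np.array([Q_tot['L'], Q_tot['R']])
p_choice = np.exp(prefs) / np.exp(prefs).sum()     # softmax choice over the combined values
```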

Neural Prediction Errors (1→2)

R ventral striatum (anatomical definition)

• note that MB RL does not use this prediction error – training signal?

Neural Prediction Errors (1)

• right nucleus accumbens

[figure: behaviour; prediction error 1→2, not 1]

Vigour

• Two components to choice:

– what:
• lever pressing
• direction to run
• meal to choose

– when/how fast/how vigorous:
• free operant tasks
• real-valued DP

The model

[figure: task schematic with states S0, S1, S2; at each state choose (action, τ), e.g. (LP, τ1) then (LP, τ2), or NP, or Other; each choice incurs a vigour cost (VC) and a unit cost (UC), and rewards (UR, with probability PR) and costs accrue over time]

The model

Goal: Choose actions and latencies to maximize the average rate of return (rewards minus costs per time)


Average Reward RL

(Extension of Schwartz 1993)

• steady state behavior (not learning dynamics)

Compute differential values of actions

Differential value of taking action L with latency τ when in state x:

QL,τ(x) = Rewards – Costs + Future Returns
(costs: Cu + Cv/τ; opportunity cost: ρτ; future return: V(x′))

ρ = average rewards minus costs, per unit time

Average Reward Cost/benefit Tradeoffs

1. Which action to take?

⇒ Choose action with largest expected reward minus cost

2. How fast to perform it?

• slow → less costly (vigour cost)

• slow → delays (all) rewards

• net rate of rewards = cost of delay (opportunity cost of time)

⇒ Choose rate that balances vigour and opportunity costs

explains faster (irrelevant) actions under hunger, etc

masochism

Optimal response rates

[figure: experimental data (Niv, Dayan, Joel, unpublished) and model simulation; probability of the 1st nose poke and lever-press rate per minute as a function of seconds since reinforcement]

[figure: matching; model simulation (% responses on lever A vs. % reinforcements on lever A) and experimental data (Herrnstein 1961: % responses on key A vs. % reinforcements on key A, Pigeons A and B), both close to perfect matching]

More:
• # responses
• interval length
• amount of reward
• ratio vs. interval
• breaking point
• temporal structure
• etc.

Effects of motivation (in the model)

RR25

Q(x, u, τ) = PR·UR − Cu − Cv/τ − ρτ + V(x′)

∂Q(x, u, τ)/∂τ = Cv/τ² − ρ = 0

⇒ τopt = √(Cv/ρ)
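A numeric illustration of the resulting energizing effect (Python; the vigour cost and the two reward rates are made up):

```python
import numpy as np

C_v = 1.0                          # vigour cost (illustrative)
for rho in (4.0, 16.0):            # average net reward rate: low vs. high motivation
    tau_opt = np.sqrt(C_v / rho)
    print(f"rho = {rho:>4}:  tau_opt = {tau_opt:.2f} s")
# a higher average reward rate gives a shorter optimal latency, i.e. faster responding
```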

[figure: mean latency of LP vs. Other responses under low vs. high utility; energizing effect]

Effects of motivation (in the model)

[figure: RR25, UR 50%; response rate per minute vs. seconds from reinforcement, and mean latency of LP vs. Other under low vs. high utility; directing effect (1) and energizing effect (2)]

Relation to Dopamine

Phasic dopamine firing = reward prediction error

What about tonic dopamine?

Tonic dopamine = Average reward rate

1. explains pharmacological manipulations

2. dopamine control of vigour through BG pathways

NB. phasic signal: RPE for choice/value learning

[figure: experimental data (Aberman and Salamone 1999) and model simulation; # LPs in 30 minutes across ratio requirements, Control vs. DA depleted]

• eating time confound
• context/state dependence (motivation & drugs?)
• less switching = perseveration

Tonic dopamine hypothesis

…also explains effects of phasic dopamine on response times

[figure: firing rate and reaction time; Satoh and Kimura 2003; Ljungberg, Apicella and Schultz 1992]

Sensory Decisions as Optimal Stopping

• consider listening to:

• decision: choose, or sample

Optimal Stopping

• equivalent of state u=1 is u1 = n1

• and states u=2,3 is u2 = ½ (n1 + n2)

σ = 2.5; Cr = −0.1

Transition Probabilities

Computational Neuromodulation

• dopamine
– phasic: prediction error for reward
– tonic: average reward (vigour)

• serotonin
– phasic: prediction error for punishment?

• acetylcholine
– expected uncertainty?

• norepinephrine
– unexpected uncertainty; neural interrupt?

Conditioning

prediction: of important events
control: in the light of those predictions

• Ethology
– optimality
– appropriateness

• Psychology
– classical/operant conditioning

• Computation
– dynamic progr.
– Kalman filtering

• Algorithm
– TD/delta rules
– simple weights

• Neurobiology
– neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum

class of stylized tasks with states, actions & rewards

– at each timestep t the world takes on state st and delivers reward rt, and the agent chooses an action at

Markov Decision Process

World: You are in state 34. Your immediate reward is 3. You have 3 actions.

Markov Decision Process

Your immediate reward is 3. You have 3 actions.

Robot: I'll take action 2.

World: You are in state 77. Your immediate reward is -7. You have 2 actions.

Robot: I’ll take action 1.

World: You’re in state 34 (again).Your immediate reward is 3. You have 3 actions.

Markov Decision Process

Stochastic process defined by:

– reward function: rt ~ P(rt | st)

– transition function: st+1 ~ P(st+1 | st, at)

Markov Decision Process

Stochastic process defined by:

– reward function: rt ~ P(rt | st)

– transition function: st+1 ~ P(st+1 | st, at)

Markov property:

– future conditionally independent of past, given st
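A minimal encoding of such a process (Python; the states and rewards echo the dialogue above, while the actions and transition probabilities are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
# P[s][a] = list of (next state, probability); R[s] = immediate reward in state s
P = {34: {0: [(34, 1.0)], 1: [(77, 1.0)], 2: [(77, 0.5), (34, 0.5)]},
     77: {0: [(34, 1.0)], 1: [(77, 1.0)]}}
R = {34: 3.0, 77: -7.0}

s = 34
for t in range(5):
    r = R[s]                                    # r_t ~ P(r_t | s_t), deterministic here
    a = int(rng.choice(list(P[s])))             # the agent picks one of the available actions
    nxt, probs = zip(*P[s][a])
    s = int(rng.choice(nxt, p=probs))           # s_{t+1} ~ P(s_{t+1} | s_t, a_t)
```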

The optimal policy

Definition: a policy such that at every state, its expected value is better than (or equal to) that of all other policies

Theorem: For every MDP there exists (at least) one deterministic optimal policy.

• by the way, why is the optimal policy just a mapping from states to actions? couldn't you earn more reward by choosing a different action depending on the last 2 states?

Pavlovian & Instrumental Conditioning

• Pavlovian
– learning values and predictions
– using TD error

• Instrumental
– learning actions:
• by reinforcement (leg flexion)
• by (TD) critic

– (actually different forms: goal directed & habitual)

Pavlovian-Instrumental Interactions

• synergistic
– conditioned reinforcement
– Pavlovian-instrumental transfer
• Pavlovian cue predicts the instrumental outcome
– behavioural inhibition to avoid aversive outcomes

• neutral
– Pavlovian-instrumental transfer

• Pavlovian cue predicts outcome with same motivational valence

• opponent
– Pavlovian-instrumental transfer

• Pavlovian cue predicts opposite motivational valence

– negative automaintenance

-ve Automaintenance in Autoshaping

• simple choice task
– N: nogo gives reward r=1
– G: go gives reward r=0

• learn three quantities
– average value
– Q value for N
– Q value for G

• instrumental propensity is

-ve Automaintenance in Autoshaping

• Pavlovian action
– assert: Pavlovian impetus towards G is v(t)
– weight Pavlovian and instrumental advantages by ω – competitive reliability of Pavlov

• new propensities

• new action choice

-ve Automaintenance in Autoshaping

• basic –ve automaintenance effect (µ=5)

• lines are theoretical asymptotes

• equilibrium probabilities of action