Kenji Doya doya@oist.jp
Neural Computation Unit Okinawa Institute of Science and Technology
MLSS 2012 in Kyoto
Brain and Reinforcement Learning

Location of Okinawa
[Map: flight times to Okinawa - Taipei 1.5 h, Tokyo 2.5 h, Seoul 2.5 h, Shanghai 2 h, Manila 2.5 h, Beijing 3 h]
Okinawa Institute of Science & Technology
! Apr. 2004: Initial research (President: Sydney Brenner)
! Nov. 2011: Graduate university (President: Jonathan Dorfan)
! Sept. 2012: Ph.D. course (20 students/year)
Our Research Interests
! How to build adaptive, autonomous systems → robot experiments
! How the brain realizes robust, flexible adaptation → neurobiology
Learning to Walk (Doya & Nakano, 1985)
! Action: cycle of 4 postures
! Reward: speed sensor output
! Problem: a long jump followed by a fall
! Need for long-term evaluation of action
Reinforcement Learning
! Learn an action policy s → a to maximize rewards
! Value function: expected future rewards
V(s(t)) = E[ r(t) + γr(t+1) + γ²r(t+2) + γ³r(t+3) + … ], 0 ≤ γ ≤ 1: discount factor
! Temporal difference (TD) error:
δ(t) = r(t) + γV(s(t+1)) - V(s(t))
[Diagram: agent-environment loop; the agent sends action a, the environment returns state s and reward r; γV(s(t+1)) feeds the TD error]
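A concrete check of these definitions, with my own toy numbers (not from the slides): the discounted return and the TD error for a short episode.

```python
# Toy numbers illustrating the definitions above: the discounted return
# and the TD error delta(t) = r(t) + gamma*V(s(t+1)) - V(s(t)).
gamma = 0.9
rewards = [0.0, 0.0, 1.0]            # r(t), r(t+1), r(t+2)

# Discounted return from time t: sum_k gamma^k * r(t+k)
G = sum(gamma ** k * r for k, r in enumerate(rewards))

# TD errors along the way, for some illustrative value estimates V(s(t))
V = [0.5, 0.6, 0.8, 0.0]             # V at each step; terminal value is 0
deltas = [rewards[t] + gamma * V[t + 1] - V[t] for t in range(3)]
print(G, deltas)   # G = 0.81; a positive delta marks a better-than-predicted step
```

A positive δ(t) says the outcome was better than the current value estimate predicted; TD methods use exactly this signal to update V.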
Example: Grid World
! Reward field
[Figure: reward field on a grid (color scale -2 to 2) and the value functions computed with γ = 0.9 and γ = 0.3]
Cart-Pole Swing-Up
! Reward: height of pole
! Punishment: collision
! Value in 4D state space
Learning to Stand Up (Morimoto & Doya, 2000)
! State: joint/head angles, angular velocity
! Action: torques to motors
! Reward: head height - tumble
Learning to Survive and Reproduce
! Catch battery packs → survival
! Copy 'genes' by IR ports → reproduction, evolution
Markov Decision Process (MDP)
! state s ∈ S; action a ∈ A; policy p(a|s); reward r(s,a); dynamics p(s'|s,a)
! Optimal policy: maximize cumulative reward
! finite horizon: E[ r(1) + r(2) + r(3) + ... + r(T) ]
! infinite horizon: E[ r(1) + γr(2) + γ²r(3) + … ], 0 ≤ γ ≤ 1: temporal discount factor
! average reward: E[ r(1) + r(2) + ... + r(T) ]/T, T → ∞
[Diagram: agent-environment loop, as in the earlier slide]
Solving MDPs: Dynamic Programming
! p(s'|s,a) and r(s,a) are known
! Solve the Bellman equation: V*(s) = max_a E[ r(s,a) + γV*(s') ]
! V(s): value function, the expected future reward from state s
! Apply the optimal policy: a = argmax_a E[ r(s,a) + γV*(s') ]
! Value iteration
! Policy iteration
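The Bellman equation above can be solved by value iteration; a minimal sketch on a hypothetical 1-D grid of my own (not the slides' 2-D example):

```python
import numpy as np

# Minimal value-iteration sketch for a known MDP: solve
# V(s) = max_a E[r(s,a) + gamma*V(s')] on a 5-state 1-D grid with a
# reward for sitting at the right end (an illustrative toy world).
n_states, gamma = 5, 0.9

def step(s, a):
    """Deterministic dynamics: a = 0 moves left, a = 1 moves right."""
    s2 = min(n_states - 1, s + 1) if a == 1 else max(0, s - 1)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r

V = np.zeros(n_states)
for _ in range(500):                     # repeated Bellman backups
    V_new = np.array([max(r + gamma * V[s2]
                          for s2, r in (step(s, a) for a in (0, 1)))
                      for s in range(n_states)])
    done = np.max(np.abs(V_new - V)) < 1e-10
    V = V_new
    if done:
        break

# Greedy policy from the converged values: a = argmax_a [r + gamma*V(s')]
policy = [max((0, 1), key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
          for s in range(n_states)]
print(V, policy)
```

Value iteration converges because the Bellman backup is a γ-contraction; policy iteration alternates full policy evaluation with greedy improvement instead.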
Reinforcement Learning
! p(s'|s,a) and r(s,a) are unknown
! Learn from actual experience {s, a, r, s, a, r, …}
! Monte Carlo
! SARSA
! Q-learning
! Actor-Critic
! Policy gradient
! Model-based: learn p(s'|s,a), r(s,a) and do DP
Actor-Critic and TD Learning
! Actor: parameterized policy P(a|s; w)
! Critic: learned value function
V(s(t)) = E[ r(t) + γr(t+1) + γ²r(t+2) + … ]
! stored in a table or a neural network
! Temporal difference (TD) error:
δ(t) = r(t) + γV(s(t+1)) - V(s(t))
! Updates
! Critic: ΔV(s(t)) = α δ(t)
! Actor: Δw = α δ(t) ∂P(a(t)|s(t); w)/∂w
… reinforce a(t) in proportion to δ(t)
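The updates above can be sketched on a toy one-state, two-action problem of my own (a bandit, not the slides' task); note this sketch uses the log-policy gradient (∂log P/∂w) variant of the actor update.

```python
import math
import random

# Toy actor-critic: one state, two actions with reward probabilities 0.8 / 0.2
# (my own setup). Actor: softmax policy over preferences w[a]; critic: scalar V.
random.seed(0)
p_reward = [0.8, 0.2]
w = [0.0, 0.0]        # actor parameters (action preferences)
V = 0.0               # critic's value estimate
alpha = 0.1           # learning rate for both actor and critic

for _ in range(5000):
    z = [math.exp(x) for x in w]
    p = [x / sum(z) for x in z]               # softmax policy P(a)
    a = 0 if random.random() < p[0] else 1
    r = 1.0 if random.random() < p_reward[a] else 0.0
    delta = r - V                             # TD error (single state, no successor)
    V += alpha * delta                        # critic update
    for b in (0, 1):                          # actor update (log-softmax gradient)
        grad = (1.0 - p[b]) if b == a else -p[b]
        w[b] += alpha * delta * grad

print(p, V)   # the policy should come to prefer the richer action 0
```

Actions followed by positive δ are made more probable, exactly the "reinforce a(t) by δ(t)" rule on the slide.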
SARSA and Q-Learning
! Action value function:
Q(s,a) = E[ r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s, a(t)=a ]
! Action selection
! ε-greedy: a = argmax_a Q(s,a) with probability 1-ε
! Boltzmann: P(a_i|s) = exp[βQ(s,a_i)] / Σ_j exp[βQ(s,a_j)]
! SARSA: on-policy update
ΔQ(s(t),a(t)) = α{ r(t) + γQ(s(t+1),a(t+1)) - Q(s(t),a(t)) }
! Q-learning: off-policy update
ΔQ(s(t),a(t)) = α{ r(t) + γ max_a' Q(s(t+1),a') - Q(s(t),a(t)) }
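The two updates differ only in their bootstrap target; a minimal sketch on one hypothetical transition (the states and values are made up for illustration):

```python
# One hypothetical transition (s, a, r, s', a') contrasting the SARSA and
# Q-learning targets above.
alpha, gamma = 0.5, 0.9
Q = {("s1", "L"): 0.0, ("s1", "R"): 0.0,
     ("s2", "L"): 1.0, ("s2", "R"): 3.0}

s, a, r, s2, a2 = "s1", "L", 1.0, "s2", "L"   # the behavior policy picked L next

# SARSA (on-policy): bootstrap from the action actually taken, a'
q_sarsa = Q[(s, a)] + alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

# Q-learning (off-policy): bootstrap from the greedy action max_a' Q(s', a')
q_qlearn = Q[(s, a)] + alpha * (r + gamma * max(Q[(s2, b)] for b in ("L", "R")) - Q[(s, a)])

print(q_sarsa, q_qlearn)   # 0.95 vs. 1.85: Q-learning backs up the greedy value
```

Because Q-learning bootstraps from the greedy action regardless of what the behavior policy actually did, it can learn the optimal values while exploring.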
"Lose to Gain" Task
! N states, 2 actions
! if r2 >> r1, then it is better to take a2
[Diagram: chain of states s1..s4; action a1 yields the small rewards ±r1, action a2 the large rewards -r2/+r2]
Reinforcement Learning
! Predict reward: value function
V(s) = E[ r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s ]
Q(s,a) = E[ r(t) + γr(t+1) + γ²r(t+2) + … | s(t)=s, a(t)=a ]
! Select action
! greedy: a = argmax_a Q(s,a)
! Boltzmann: P(a|s) ∝ exp[ βQ(s,a) ]
! Update prediction: TD error
δ(t) = r(t) + γV(s(t+1)) - V(s(t))
ΔV(s(t)) = α δ(t); ΔQ(s(t),a(t)) = α δ(t)
How to implement these steps?
How to tune these parameters?
[Figure: three schematic circuits, each relating reward r, value V, and TD error δ]
Dopamine Neurons Code TD Error
δ(t) = r(t) + γV(s(t+1)) - V(s(t))
! unpredicted reward → activation
! predicted reward → no response at reward time
! omitted reward → depression
(Schultz et al. 1997)
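A toy TD simulation (my own sketch, not Schultz's data or analysis) reproduces this pattern: with learning, the TD error moves from reward time to cue time, and it turns negative when a predicted reward is omitted.

```python
# Toy TD simulation of the dopamine response pattern: a cue at step 0
# predicts a reward at step 3; delta(t) = r(t) + gamma*V(t+1) - V(t)
# is the putative dopamine signal.
gamma, alpha, n = 1.0, 0.1, 4
V = [0.0] * (n + 1)   # value of each within-trial time step

def run_trial(V, rewarded=True, learn=True):
    deltas = []
    for t in range(n):
        r = 1.0 if (t == n - 1 and rewarded) else 0.0
        delta = r + gamma * V[t + 1] - V[t]
        if learn and t > 0:   # pre-cue state keeps value 0: the cue itself is unpredicted
            V[t] += alpha * delta
        deltas.append(delta)
    return deltas

first = run_trial(V)                                  # naive: delta peaks at reward time
for _ in range(500):
    run_trial(V)                                      # training trials
after = run_trial(V, learn=False)                     # trained: delta has moved to the cue
omitted = run_trial(V, rewarded=False, learn=False)   # omission: dip at reward time
print(first, after, omitted)
```

Before learning the error sits at the reward; after learning it sits at the cue; an omitted reward produces a negative error exactly at the usual reward time, matching the depressions described in the excerpt below.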
(Excerpt: W. Schultz, Journal of Neurophysiology)

…together, the occurrence of reward, including its time, must be unpredicted to activate dopamine neurons.

Depression by omission of predicted reward

Dopamine neurons are depressed exactly at the time of the usual occurrence of reward when a fully predicted reward fails to occur, even in the absence of an immediately preceding stimulus (Fig. 2, bottom). This is observed when animals fail to obtain reward because of erroneous behavior, when liquid flow is stopped by the experimenter despite correct behavior, or when a valve opens audibly without delivering liquid (Hollerman and Schultz 1996; Ljungberg et al. 1991; Schultz et al. 1993). When reward delivery is delayed for 0.5 or 1.0 s, a depression of neuronal activity occurs at the regular time of the reward, and an activation follows the reward at the new time (Hollerman and Schultz 1996). Both responses occur only during a few repetitions until the new time of reward delivery becomes predicted again. By contrast, delivering reward earlier than habitual results in an activation at the new time of reward but fails to induce a depression at the habitual time. This suggests that unusually early reward delivery cancels the reward prediction for the habitual time. Thus dopamine neurons monitor both the occurrence and the time of reward. In the absence of stimuli immediately preceding the omitted reward, the depressions do not constitute a simple neuronal response but reflect an expectation process based on an internal clock tracking the precise time of predicted reward.

Activation by conditioned, reward-predicting stimuli

About 55-70% of dopamine neurons are activated by conditioned visual and auditory stimuli in the various classically or instrumentally conditioned tasks described earlier (Fig. 2, middle and bottom) (Hollerman and Schultz 1996; Ljungberg et al. 1991, 1992; Mirenowicz and Schultz 1994; Schultz 1986; Schultz and Romo 1990; P. Waelti, J. Mirenowicz, and W. Schultz, unpublished data). The first dopamine responses to conditioned light were reported by Miller et al. (1981) in rats treated with haloperidol, which increased the incidence and spontaneous activity of dopamine neurons but resulted in more sustained responses than in undrugged animals. Although responses occur close to behavioral reactions (Nishino et al. 1987), they are unrelated to arm and eye movements themselves, as they occur also ipsilateral to the moving arm and in trials without arm or eye movements (Schultz and Romo 1990). Conditioned stimuli are somewhat less effective than primary rewards in terms of response magnitude and fractions of neurons activated. Dopamine neurons respond only to the onset of conditioned stimuli and not to their offset, even if stimulus offset predicts the reward (Schultz and Romo 1990). Dopamine neurons do not distinguish between visual and auditory modalities of conditioned appetitive stimuli. However, they discriminate between appetitive and neutral or aversive stimuli as long as they are physically sufficiently dissimilar (Ljungberg et al. 1992; P. Waelti, J. Mirenowicz, and W. Schultz, unpublished data). Only 11% of dopamine neurons, most of them with appetitive responses, show the typical phasic activations also in response to conditioned aversive visual or auditory stimuli in active avoidance tasks in which animals release a key to avoid an air puff or a drop of hypertonic saline (Mirenowicz and Schultz 1996), although such avoidance may be viewed as "rewarding." These few activations are not sufficiently strong to induce an average population response. Thus the phasic responses of dopamine neurons preferentially report environmental stimuli with appetitive motivational value but without discriminating between different sensory modalities.

FIG. 2. Dopamine neurons report rewards according to an error in reward prediction. Top: drop of liquid occurs although no reward is predicted at this time. Occurrence of reward thus constitutes a positive error in the prediction of reward. Dopamine neuron is activated by the unpredicted occurrence of the liquid. Middle: conditioned stimulus predicts a reward, and the reward occurs according to the prediction, hence no error in the prediction of reward. Dopamine neuron fails to be activated by the predicted reward (right). It also shows an activation after the reward-predicting stimulus, which occurs irrespective of an error in the prediction of the later reward (left). Bottom: conditioned stimulus predicts a reward, but the reward fails to occur because of lack of reaction by the animal. Activity of the dopamine neuron is depressed exactly at the time when the reward would have occurred. Note the depression occurring >1 s after the conditioned stimulus without any intervening stimuli, revealing an internal process of reward expectation. Neuronal activity in the 3 graphs follows the equation: dopamine response (Reward) = reward occurred - reward predicted. CS, conditioned stimulus; R, primary reward. Reprinted from Schultz et al. (1997) with permission by American Association for the Advancement of Science.
Basal Ganglia for Reinforcement Learning? (Doya 2000, 2007)
! Cerebral cortex: state/action coding
! Striatum: reward prediction, V(s) and Q(s,a)
! Pallidum: action selection
! Dopamine neurons: TD signal δ
! Thalamus
[Diagram: cortico-basal ganglia loop with state and action channels]
Monkey Free Choice Task (Samejima et al., 2005)
! P(reward | Left) = QL; P(reward | Right) = QR
! Reward-probability settings (%, Left-Right): 90-50, 50-90, 50-10, 10-50, 50-50
! Dissociation of action and reward
[Figure: choice behavior as a function of the reward probabilities 10/50/90%]
Action Value Coding in Striatum (Samejima et al., 2005)
! QL neuron
! -QR neuron
[Figure: firing during the -1 to 0 sec epoch, sorted by QL and QR]
Forced and Free Choice Task (Makoto Ito)
! Trial: poking the Center hole (cue tone, 0.5-1 s), then a Left or Right choice (1-2 s); reward tone or no-reward tone; pellet delivered at the pellet dish
! Cue tones and reward probabilities (L, R):
! Left tone (900 Hz): fixed (50%, 0%)
! Right tone (6500 Hz): fixed (0%, 50%)
! Free-choice tone (white noise): varied (90%, 50%), (50%, 90%), (50%, 10%), (10%, 50%)
Time Course of Choice
[Figure: probability of choosing left (PL) across trials; block schedule 10-50, 10-50, 10-50, 50-10, 90-50, 50-90, 90-50, 50-10, 50-90, 50-10; P(r|a=L) and P(r|a=R) step among 0.9, 0.5, 0.1; forced trials (left or right for tone A) and tone-B trials marked]
Generalized Q-Learning Model (Ito & Doya, 2009)
! Action selection: P(a(t)=L) = exp QL(t) / (exp QL(t) + exp QR(t))
! Action value update, for i ∈ {L, R}:
Qi(t+1) = (1-α1)Qi(t) + α1κ1   if a(t)=i, r(t)=1
Qi(t+1) = (1-α1)Qi(t) - α1κ2   if a(t)=i, r(t)=0
Qi(t+1) = (1-α2)Qi(t)          if a(t)≠i
! Parameters: α1: learning rate; α2: forgetting rate; κ1: reward reinforcement; κ2: no-reward aversion
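A sketch of this model in code; the parameter values and reward schedule below are my own illustrative choices, not the fitted values from the paper.

```python
import math
import random

# Sketch of the generalized Q-learning model above on one free-choice block.
random.seed(1)
alpha1, alpha2 = 0.3, 0.05          # learning and forgetting rates
kappa1, kappa2 = 1.0, 0.5           # reward reinforcement, no-reward aversion
p_reward = {"L": 0.9, "R": 0.5}     # a (90%, 50%) block
Q = {"L": 0.0, "R": 0.0}
traceL, traceR = [], []

for _ in range(1000):
    # Softmax choice: P(L) = exp(QL) / (exp(QL) + exp(QR))
    pL = math.exp(Q["L"]) / (math.exp(Q["L"]) + math.exp(Q["R"]))
    a = "L" if random.random() < pL else "R"
    r = 1 if random.random() < p_reward[a] else 0
    for i in ("L", "R"):
        if i == a:   # chosen action: reinforced or penalized
            Q[i] = (1 - alpha1) * Q[i] + (alpha1 * kappa1 if r else -alpha1 * kappa2)
        else:        # unchosen action: value decays toward 0
            Q[i] = (1 - alpha2) * Q[i]
    traceL.append(Q["L"])
    traceR.append(Q["R"])

meanL = sum(traceL[500:]) / 500
meanR = sum(traceR[500:]) / 500
print(meanL, meanR)   # QL should settle above QR in the (90%, 50%) block
```

The forgetting term (α2) lets the value of an unchosen action decay, which is what distinguishes this model from standard Q-learning in the model comparison below.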
Model Fitting by Particle Filter
[Figure: trial-by-trial estimates of QL and QR and of the parameters α1 and α2 across blocks (90 50), (50 90), (50 10); trial outcomes marked as left/right × reward/no-reward]
Model Fitting
! Generalized Q-learning: α1 learning, α2 forgetting, κ1 reinforcement, κ2 aversion
! standard Q-learning: α2 = κ2 = 0
! forgetting Q-learning (F-Q): κ2 = 0
[Figure: normalized likelihood of candidate models (number of parameters in parentheses): 1st-4th order Markov models (4, 16, 64, 256); standard Q (const) (2); F-Q (const) (3); DF-Q (const) (4); local matching law (1); standard Q (variable) (2); F-Q (variable) (2); DF-Q (variable) (2); asterisks mark significant differences]
Neural Activity in the Striatum
! Dorsolateral (DL: 122 neurons)
! Dorsomedial (DM: 56)
! Ventral (NA: 59)
Information of Action and Reward
[Figure: information (bit/sec) about action (out of center hole) and reward (into choice hole) over time, aligned to poking C, tone, poking L/R, and pellet dish]
Action value coded by a DLS neuron
[Figure: firing rate during tone presentation (blue in left panel) across trials, compared with the action value for left estimated by FQ-learning; panels: Q, FSA, DLS]
State value coded by a VS neuron
[Figure: firing rate during tone presentation (blue in left panel) across trials, compared with the value estimated by FQ-learning; panels: Q, FSA, VS]
Hierarchy in Cortico-Striatal Network
! Dorsolateral striatum (motor): early action coding; what action to take?
! Dorsomedial striatum (frontal): action value; in what context?
! Ventral striatum (limbic): state value; whether worth doing?
(Voorn et al., 2004)
Specialization by Learning Algorithms (Doya, 1999)
! Cerebellum: supervised learning; input-output mapping trained by an error signal against a target (via the inferior olive, IO)
! Basal ganglia: reinforcement learning; input-output mapping trained by reward (via SN)
! Cerebral cortex: unsupervised learning of input-output representations
[Diagram: cortex, basal ganglia, and cerebellum connected through the thalamus]
Multiple Action Selection Schemes
! Model-free: a = argmax_a Q(s,a)
! Model-based: a = argmax_a [ r + V(f(s,a)) ], using a forward model f(s,a)
! Lookup table: a = g(s)
[Diagram: the three schemes as networks: (s,a) → Q; (s,a) → f → (s', V) for each candidate a_i; s → g → a]
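The three schemes can be put side by side in code; the chain world, values, and cached policy below are my own toy constructions (the model-based line mirrors the slide's argmax_a [r + V(f(s,a))] expression).

```python
# Side-by-side sketch of the three action selection schemes on a toy
# deterministic chain: states 0..3, actions -1/+1, reward on reaching state 3.
n_states, gamma = 4, 0.9

def f(s, a):
    """Forward model f(s,a): known deterministic dynamics."""
    return max(0, min(n_states - 1, s + a))

def reward(s):
    return 1.0 if s == n_states - 1 else 0.0

# Illustrative pre-learned values, decreasing with distance from the goal.
V = [gamma ** (n_states - 1 - s) for s in range(n_states)]
Q = {(s, a): reward(f(s, a)) + gamma * V[f(s, a)]
     for s in range(n_states) for a in (-1, +1)}

s = 1
# Model-free: a = argmax_a Q(s,a); needs only the cached action values.
a_mf = max((-1, +1), key=lambda a: Q[(s, a)])
# Model-based: a = argmax_a [r + V(f(s,a))]; one-step lookahead through f.
a_mb = max((-1, +1), key=lambda a: reward(f(s, a)) + V[f(s, a)])
# Lookup table (habit): a = g(s); a stored policy, no evaluation at all.
g = {0: +1, 1: +1, 2: +1, 3: +1}
a_tab = g[s]
print(a_mf, a_mb, a_tab)
```

The schemes trade storage for computation: the table is fastest but least flexible, the forward model is slowest but adapts immediately when the goal changes.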
‘Grid Sailing’ Task (Alan Fermin)
! Move a cursor to the goal
! 100 points for the shortest path; -5 points per excess step
! Key map: only 3 movement directions, so path planning is non-trivial
! Immediate or delayed start: 4 to 6 sec for planning; timeout in 6 sec
[Figure: grid with start, goal, and the 3-key mapping (keys 1, 2, 3)]
Task Conditions
[Poster] NEURAL MECHANISMS FOR MODEL-FREE AND MODEL-BASED STRATEGIES IN HUMANS PERFORMING A MULTI-STEP NAVIGATION TASK
Alan Fermin 1,2, Takehiko Yoshida 1,2, Makoto Ito 2, Junichiro Yoshimoto 1,2, Kenji Doya 1,2,3
1 Nara Institute of Science and Technology, 2 Okinawa Institute of Science and Technology, 3 ATR Computational Neuroscience Laboratories
INTRODUCTION
Humans learn to select actions from scratch, by trial-and-error, or by using knowledge from past experiences. As learning goes on, action selection may rely on predictive mechanisms, such as retrieval of motor memories from well established sensorimotor maps, or construction and use of internal models in newly encountered environments. Reinforcement Learning (RL), a computational theory of adaptive optimal control, suggests two action selection methods that resemble real human behavior: the Model-Free (MF) method uses action value functions to predict future rewards based on current states and available actions, and the Model-Based (MB) method uses a forward model to predict the future states reached by hypothetical actions. Recent computational models (Doya, 1999; Daw et al., 2005) have hypothesized that MB and MF are implemented in the human brain through cortico-basal ganglia loops, including the prefrontal cortex and dorsal striatum. To test whether and how humans utilize MB and MF strategies, we performed an fMRI experiment using a "grid sailing task".
METHODS
Subjects: 18 (2 female) healthy, non-musician, right-handed, age 25±3 yo. Training (3 blocks, 240 trials) and Test (2 blocks, 180 trials) sessions were separated by a night sleep interval. Subjects were randomly assigned to one of three groups with the same task conditions. Subjects had to move a cursor (triangle) from the start (red) to the target goal (blue) by sequentially pressing three keys. Instruction was to learn a single optimal action sequence (shortest pathway) for each of the KM-SG pairs, which were constant throughout the experiment. The start position color specified whether to start a response immediately or after a delay (6 sec).
[Task figures]
! Buttons & fingers: INDEX, MIDDLE, RING
! Key-maps: KM1, KM2, KM3; start-goal positions: SG1-SG5
! Training session and test session use different KM-SG combinations
! Test conditions: Cond 1: new key-map; Cond 2: learned key-maps; Cond 3: learned KM-SG
! Sequence of trial events: ITI 3~5 sec; DELAY 4~6 sec or IMMEDIATE start; up to 6 sec to reach the goal; reward feedback 2 sec
! Scoring: Time-over: 0; Fail: -5; Optimal: 100
! Training schedule: blocks TB1, TB2, 240 trials; immediate and delayed starts randomized in sets of 4 trials. Test schedule: TB1, TB2, 180 trials; sets of 9 trials.
[Table: group assignments (6 subjects each). Group 1: KM2, KM3 with SG2-SG5; Group 2: KM1, KM3 with SG1, SG3-SG5; Group 3: KM1-SG1, KM2-SG2, KM3-SG3; test block TB3 uses KM1, KM2, KM3]
Learning stages and computational strategies:
! MODEL-FREE (Q-LEARNING): trial-and-error learning of the key-map and action sequence; starts from a RANDOM policy (EXPLORATION); expected: SLOW LEARNING
! MODEL-BASED (PLANNING with learned KNOWLEDGE, i.e. EXPLOITATION): forward model for planning future actions, using Q(s,a), P(s',r'|s,a), and P(a|s); expected: BOOSTED LEARNING
! SEQUENCE MEMORY (Habit): sequential motor memory for fast action selection via a LEARNED POLICY; expected: HIGH PERFORMANCE
Task conditions:
! Condition 1: New Key-Map
! Condition 2: Known Key-Map, paired with new start-goals (KM1-SG2, KM1-SG3, KM2-SG1, KM2-SG3, KM3-SG1, KM3-SG2)
! Condition 3: Learned KM-SG (KM1-SG1, KM2-SG2, KM3-SG3)
[Figure: expected reward(t) curves over time for the three conditions]
RESULTS - TRAINING
[Figures: distribution of trial types (optimal / suboptimal / fail, %) for immediate vs. delay starts across blocks 1-3; reaction times (ms) for the action sequences A-2, A-1, A-1, A-3, A-3 across trials]
fMRI - DELAY PERIOD ACTIVITY
! Contrasts: Condition 1 > Condition 3; Condition 2 > Condition 3; Condition 3 > [Condition 1, Condition 2]
! REGION OF INTEREST ANALYSIS (left and right hemispheres): DLPFC (BA46), ventral PMC (BA44/45), lateral PMC, pre-SMA, SMA proper, superior parietal cortex, somatosensory cortex, anterior and posterior basal ganglia, anterior and posterior cerebellum
! Condition 2 (blue) vs. Condition 1 (red)
http://www.nc.irp.oist.jp/
We confirmed the learning related improvements and formation of habit sequences by significant improvements in all performance measures, with an increase in reward acquisition and a decrease in the number of moves to reach the target goal, as well as in the time to start and finish an action sequence. The stable performance was also observed in the short practice before the fMRI test, and after one night sleep.

A series of 3-way ANOVA showed a significant effect of Start Time, Task Conditions, and Test Blocks (p<0.001). A multicomparison analysis showed that Condition 1 had the lowest performance, Condition 2 was intermediate, and Condition 3 was the best.
Also, Condition 2 showed a significant increase in reward acquisition after the delay period, which might indicate a beneficial effect of the delay period on the implementation of a forward model mechanism for action selection. Performance in Condition 1 remained constant and high irrespective of the start time condition.
CONCLUSION: We have developed a "grid-sailing task" to test how humans utilize different strategies for the learning of action selection. Distinct performance profiles for each task condition were found, and each also required specific neural networks for solving the task. Our next step is to perform a reinforcement learning model-based analysis to seek whether subjects' behavior under each task condition fits well to specific action selection strategies (MF, MB) and the brain areas implementing them.
RESULTS - TEST
Task conditions differentially modulated the BOLD signal during the delay period. Condition 2 showed sustained and stronger activation in DLPFC, VPMC, LPMC, anterior BG and posterior CB. Condition 3 activated more motor related structures, and Condition 1 showed the lowest, but near to baseline activation patterns.
Effect of Pre-start Delay Time (Condition 1: new key-map; Condition 2: learned key-maps; Condition 3: learned KM-SG)
[Figures: (A, B) mean reward gain per test block for the three conditions, with significant differences (**) for Condition 1 vs. 3 and Condition 2 vs. 3; (C, D) % optimal goal reach over trials 1-10 and % reward score by trial type (error / suboptimal / optimal)]
Delay Period Activity
! Condition 2
! DLPFC
! PMC
! parietal
! anterior striatum
! lateral cerebellum
POMDP by Cortex-Basal Ganglia, Rao (2011)
! Belief state update by the cortex
Frontiers in Computational Neuroscience www.frontiersin.org November 2010 | Volume 4 | Article 146 | 3
Rao Decision making under uncertainty
where v denotes the vector of output firing rates, o denotes the input observation vector, f1 is a potentially non-linear function describing the feedforward transformation of the input, M is the matrix of recurrent synaptic weights, and g1 is a dendritic filtering function. The above differential equation can be rewritten in discrete form as:

vt(i) = f(ot(i)) + g(Σj M(i,j) vt-1(j))   (4)
where vt(i) is the ith component of the vector vt, f and g are functions derived from f1 and g1, and M(i,j) is the synaptic weight value in the ith row and jth column of M. To make the connection between Eq. (4) and Bayesian inference, note that the belief update Eq. (2) requires a product of two sources of information (current observation and feedback) whereas Eq. (4) involves a sum of observation- and feedback-related terms. This apparent divide can be bridged by performing belief updates in the log domain:

log bt(i) = log P(ot | st = i) + log Σj T(j, at-1, i) bt-1(j) + log k
This suggests that Eq. (4) could neurally implement Bayesian inference over time as follows: the log likelihood log P(ot | st = i) is computed by the feedforward term f(ot), while the feedback g(Σj M(i,j) vt-1(j)) conveys the log of the predicted distribution, i.e., log Σj T(j, at-1, i) bt-1(j), for the current time step. The latter is computed from the activities vt-1(j) from the previous time step and the recurrent weights M(i,j), which are defined for each action a. The divisive normalization in Eq. (2) reduces in the equation above to the log k term, which is subtractive and could therefore be implemented via inhibition.
A neural model as sketched above for approximate Bayesian inference but using a linear recurrent network was first explored in (Rao, 2004). Here we have followed the slightly different implementation in (Rao, 2005) that uses the non-linear network given by Eq. (3). As shown in (Rao, 2005), if one interprets Eq. (3) as the membrane potential dynamics in a stochastic integrate-and-fire neuron model,
Before proceeding to the model, we note that the space of beliefs is continuous (each component of the belief state vector is a probability between 0 and 1) and typically high-dimensional (number of dimensions is one less than the number of states). This makes the problem of finding optimal policies very difficult. In fact, finding exact solutions to general POMDP problems has been proved to be a computationally hard problem (e.g., the finite-horizon case is "PSPACE-hard"; Papadimitriou and Tsitsiklis, 1987). However, one can typically find approximate solutions, many of which work well in practice. Our model is most closely related to a popular class of approximation algorithms known as point-based POMDP solvers (Hauskrecht, 2000; Pineau et al., 2003; Spaan and Vlassis, 2005; Kurniawati et al., 2008). The idea is to discretize the belief space with a finite set of belief points and compute value for these belief points rather than the entire belief space. For learning value, our model relies on the temporal-difference (TD) framework (Sutton and Barto, 1981; Sutton, 1988; Sutton and Barto, 1998) in reinforcement learning theory, a framework that has also proved useful in understanding dopaminergic responses in the primate brain (Schultz et al., 1997).
Neural computation of belief
A prerequisite for a neural POMDP model is being able to compute the belief state bt in neural circuitry. Several models have been proposed for neural implementation of Bayesian inference (see Rao, 2007 for a review). We focus here on one potential implementation. Recall that the belief state is updated at each time step according to the following equation:

bt(i) = k P(ot | st = i) Σj T(j, at-1, i) bt-1(j)   (2)
where bt(i) is the ith component of the belief vector bt and represents the posterior probability of state i. Equation (2) combines information from the current observation (P(ot|st)) with feedback from the past time step (bt-1), suggesting a neural implementation based on a recurrent network, for example, a leaky integrator network:

dv/dt = -v + f1(o) + g1(Mv)   (3)
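The belief update of Eq. (2) can be sketched directly; the transition matrix and likelihood values below are illustrative stand-ins of my own, not those of the experiment.

```python
import numpy as np

# Sketch of the belief update in Eq. (2):
# b_t(i) proportional to P(o_t|s_t=i) * sum_j T(j,a,i) * b_{t-1}(j),
# here for a two-state problem with a persistent hidden state.
T = np.array([[1.0, 0.0],           # T[j, i]: the hidden state persists within a trial
              [0.0, 1.0]])
likelihood = np.array([0.6, 0.4])   # P(o_t | s_t = i) for one noisy observation

b = np.array([0.5, 0.5])            # uniform prior over the two hidden states
for _ in range(5):                  # five observations, each mildly favoring state 0
    b = likelihood * (T.T @ b)      # multiply the prediction by the likelihood
    b /= b.sum()                    # normalization: the factor k in Eq. (2)
print(b)                            # belief gradually concentrates on state 0
```

Each weak observation multiplies in a small amount of evidence; accumulating them over time is exactly the evidence integration the random-dot task below requires.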
FIGURE 1 | The POMDP model. (A) When the animal executes an action a in the state s', the environment ("World") generates a new state s according to the transition probability T(s',a,s). The animal receives an observation o of the new state according to P(o|s) and a reward r = R(s',a). (B) In order to solve the POMDP problem, the animal maintains a belief b_t, which is a probability distribution over states of the world. This belief is computed iteratively using Bayesian inference by the belief state estimator SE. An action for the current time step is provided by the learned policy π, which maps belief states to actions.
Frontiers in Computational Neuroscience www.frontiersin.org November 2010 | Volume 4 | Article 146 | 6
Rao Decision making under uncertainty
We model the task using a POMDP as follows: there are two underlying hidden states representing the two possible directions of coherent motion (leftward or rightward). In each trial, the experimenter chooses one of these hidden states (either leftward or rightward) and provides the animal with observations of this hidden state in the form of an image sequence of random dots at the chosen coherence. Note that the hidden state remains the same until the end of the trial. Using only the sequence of observed images seen so far, the animal must choose one of the following actions: sample one more time step (to reduce uncertainty), make a leftward eye movement (indicating choice of leftward motion), or make a rightward eye movement (indicating choice of rightward motion).
We use the notation S_L to represent the state corresponding to leftward motion and S_R to represent rightward motion. Thus, at any given time t, the state s_t can be either S_L or S_R (although within a trial, the state once selected remains unchanged). The animal receives noisy measurements or observations o_t of the hidden state based on P(o_t|s_t, c_t), where c_t is the current coherence value. We assume the coherence value is randomly chosen for each trial from a set {C_1, C_2, …, C_Q} of possible coherence values, and remains the same within a trial. At each time step t, the animal must choose from one of three actions {A_S, A_L, A_R} denoting sample, leftward eye movement, and rightward eye movement respectively.

The animal receives a reward for choosing the correct action, i.e., action A_L when the true state is S_L and action A_R when the true state is S_R. We model this reward as a positive number (e.g., between 10 and 30; here, 20). An incorrect choice produces a large penalty (e.g., between -100 and -400; here, -400), simulating the time-out used for errors in monkey experiments (Roitman and Shadlen, 2002). We assume the animal is motivated by hunger or thirst to make a decision as quickly as possible. This is modeled using a small negative reward (penalty of -1) for each time step spent sampling. We have experimented with a range of reward/punishment values and found that the results remain qualitatively the same as long as there is a large penalty for incorrect decisions, a moderate positive reward for correct decisions, and a small penalty for each time step spent sampling.
The transition probabilities P(s_t|s_{t-1}, a_{t-1}) for the task are as follows: the state remains unchanged (self-transitions have probability 1) as long as the sample action A_S is executed. Likewise, P(c_t|c_{t-1}, A_S) = 1. When the animal chooses A_L or A_R, a new trial begins, with a new state (S_L or S_R) and a new coherence C_k (from {C_1, C_2, …, C_Q}) chosen uniformly at random.
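The reward and transition structure just described can be summarized in code. Reward values follow the text (+20 correct, -400 error, -1 per sample); the particular coherence set is an assumed example standing in for {C_1, …, C_Q}.

```python
import random

# Sketch of the task structure described above. Reward values follow the
# text (+20 correct, -400 error, -1 per sample); the coherence set is an
# assumed example for {C_1, ..., C_Q}.
STATES = ["S_L", "S_R"]
ACTIONS = ["A_S", "A_L", "A_R"]
COHERENCES = [0.1, 0.3, 0.6]

def reward(state, action):
    if action == "A_S":
        return -1                              # cost of one more sample
    correct = (state == "S_L" and action == "A_L") or \
              (state == "S_R" and action == "A_R")
    return 20 if correct else -400

def step(state, coherence, action):
    """A_S keeps state and coherence fixed; A_L/A_R start a new trial."""
    if action == "A_S":
        return state, coherence
    return random.choice(STATES), random.choice(COHERENCES)

print(reward("S_L", "A_L"), reward("S_L", "A_R"), reward("S_L", "A_S"))  # → 20 -400 -1
```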
In the first set of experiments, we trained the model on 6000 trials of leftward or rightward motion. Inputs o_t were generated according to P(o_t|s_t, c_t) based on the current coherence value and state (direction). For these simulations, o_t was one of two values O_L and O_R corresponding to observing leftward and rightward motion respectively. The probability P(o = O_L|s = S_L, c = C_k) was fixed to a value between 0.5 and 1 based on the coherence value C_k, with 0.5 corresponding to 0% coherence and 1 corresponding to 100%. The probability P(o = O_R|s = S_R, c = C_k) was defined similarly.

The belief state b_t over the unknown direction of motion was computed using a slight variant of Eq. (2) using the current input o_t, the known coherence value c_t = C_k, and the known transition probabilities. […]

Interestingly, results from such an experiment have recently been published by Nomoto et al. (2010). We compare their results to the model's predictions in a section below.
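For a known coherence and a state that is fixed within a trial, the belief update reduces to iterated Bayes' rule over the two directions. The sketch below assumes a linear map from coherence c to P(O_L|S_L, c) = 0.5 + c/2, which is consistent with the endpoints stated in the text but is otherwise an assumption.

```python
# Sketch of the observation model and belief update for a known coherence.
# The linear map P(O_L | S_L, c) = 0.5 + c/2 is an assumption consistent
# with the endpoints stated in the text (0.5 at 0% coherence, 1 at 100%).
def p_obs(o, s, c):
    p_match = 0.5 + c / 2.0
    match = (o == "O_L") == (s == "S_L")
    return p_match if match else 1.0 - p_match

def update_belief(b_L, o, c):
    """Posterior P(S_L) after observation o; the state is fixed within a trial."""
    num = p_obs(o, "S_L", c) * b_L
    return num / (num + p_obs(o, "S_R", c) * (1.0 - b_L))

b, c = 0.5, 0.6                      # flat prior; 60% coherence ("Easy")
for o in ["O_L", "O_L", "O_R", "O_L", "O_L"]:
    b = update_belief(b, o, c)
print(round(b, 3))                   # belief in leftward motion after 5 samples
```

Mostly-leftward observations drive the belief toward S_L; a single O_R observation pulls it back by the same likelihood ratio.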
RESULTS

THE RANDOM DOTS TASK
We tested the neural POMDP model derived above in the well-known random dots motion discrimination task used to study decision making in primates (Shadlen and Newsome, 2001). We focus specifically on the reaction-time version of the task (Roitman and Shadlen, 2002) where the animal can choose to make a decision at any time. In this task, the stimulus consists of an image sequence showing a group of moving dots, a fixed fraction of which are randomly selected at each frame and moved in a fixed direction (for example, either left or right). The rest of the dots are moved in random directions. The fraction of dots moving in the same direction is called the motion strength or coherence of the stimulus.
The animal’s task is to decide the direction of motion of the coher-ently moving dots for a given input sequence. The animal learns the task by being rewarded if it makes an eye movement to a target on the left side of its fixation point if the motion is to the left, and to a target on the right if the motion is to the right. A wealth of data exists on the psychophysical performance of humans and monkeys on this task, as well as the neural responses observed in brain areas such as MT and LIP in monkeys performing this task (see Roitman and Shadlen, 2002; Shadlen and Newsome, 2001 and references therein).
EXAMPLE I: RANDOM DOTS WITH KNOWN COHERENCE
In the first set of experiments, we illustrate the model using a simplified version of the random dots task where the coherence value chosen at the beginning of the trial is known. This reduces the problem to that of deciding from noisy observations the underlying direction of coherent motion, given a fixed known coherence. We tackle the case of unknown coherence in a later section.
FIGURE 3 | Suggested mapping of elements of the model to components of the cortex-basal ganglia network. STN, subthalamic nucleus; GPe, globus pallidus, external segment; GPi, Globus pallidus, internal segment; SNc, substantia nigra pars compacta; SNr, substantia nigra pars reticulata; VTA, ventral tegmental area; DA, dopaminergic responses. The model suggests that cortical circuits compute belief states, which are provided as input to the striatum (and STN). Cortico-striatal synapses are assumed to maintain a compact learned representation of cortical belief space (“basis belief points”). The dopaminergic responses of SNc/VTA neurons are assumed to represent reward prediction (TD) errors based on striatal activations. The GPe-GPi/SNr network, in conjunction with the thalamus, is assumed to implement action selection based on the compact belief representation in the striatum/STN.
Finally, in the case of an error trial, the model predicts that the absence of reward (or presence of a negative reward/penalty as in the simulations) should cause a negative reward prediction error, and this error should be slightly larger for the higher coherence case due to its higher expected value (see Figure 13). This prediction is shown in Figure 17C, which compares average reward prediction (TD) error at the end of error trials for the 60% coherence case (left) and the 8% coherence case (right). The population dopamine responses for error trials are depicted by red traces in Figure 17B.

EXAMPLE III: DECISION MAKING UNDER A DEADLINE
Our final set of results illustrates how the model can be extended to learn time-varying policies for tasks with a deadline. Suppose a task has to be solved by time T (otherwise, a large penalty is incurred). We will examine this situation in the context of a random dots task where the animal has to make a decision by time T in each trial (in contrast, both the experimental data and simulations discussed above involved the dots task with no deadline).

Figure 18A shows the network used for learning the value function. Note the additional input node representing elapsed time t, measured from the start of the trial. The network includes basis neurons for elapsed time, each neuron preferring a particular time t_i*. The activation function is the same as before:

g_i(t) = exp( -(t - t_i*)^2 / (2σ_i^2) )

where σ_i^2 is a variance parameter. In the simulations, σ_i^2 was set to progressively larger values for larger t_i, loosely inspired by the fact that an animal's uncertainty about time increases with elapsed time (Leon and Shadlen, 2003). Specifically, both t_i* and σ_i^2 were set arbitrarily to 1.25i for i = 1, …, 65. The number of hidden units for direction, coherence, and time was 65. The deadline T was set to 20 time steps, with a large penalty (-2000) if a left/right decision was not reached before the deadline. Other parameters included: a reward of 20 for correct decisions, -400 for errors, -1 for each sampling action, learning rates of 2.5 × 10^-5, 4 × 10^-8, and 1 × 10^-5, and remaining parameters set to 1, 1.5, and 0.08. The model was trained on 6000 trials, with motion direction (Left/Right) and coherence (Easy/Hard) selected uniformly at random for each trial.

Figures 18B,C show the learned value function for the beginning (t = 1) and near the end of a trial (t = 19) before the deadline at t = 20. The shape of the value function remains approximately the same, but the overall value drops noticeably over time. Figure 18D illustrates this progressive drop in value over time for a slice through the value function at Belief(E) = 1.
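The Gaussian temporal basis g_i(t) can be sketched as follows. This is a toy reconstruction (not the paper's code), with centers and variances both set to 1.25i as stated in the text.

```python
import numpy as np

# Toy reconstruction of the Gaussian temporal basis
# g_i(t) = exp(-(t - t_i*)**2 / (2 * sigma_i**2)),
# with centers and variances both set to 1.25 * i as stated in the text.
i = np.arange(1, 66)           # 65 time-basis neurons
centers = 1.25 * i             # preferred times t_i*
variances = 1.25 * i           # sigma_i^2: broader tuning at later times

def time_basis(t):
    """Activations of all time-basis neurons at elapsed time t."""
    return np.exp(-(t - centers) ** 2 / (2 * variances))

g = time_basis(10.0)
print(int(np.argmax(g)) + 1)   # most active neuron: i = 8 (center 1.25 * 8 = 10)
```

Because the variance grows with the center, later-preferring neurons respond over a wider window, mirroring the increasing uncertainty about elapsed time.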
FIGURE 16 | Reward prediction error and dopamine responses in the random dots task. (A) The plot on the left shows the temporal evolution of reward prediction (TD) error in the model, averaged over trials with “Easy” motion coherence (coherence = 60%). The dotted line shows the average reaction time. The plot on the right shows the average firing rate of dopamine neurons in SNc in a monkey performing the random dots task at 50% motion coherence. The dashed line shows average time of saccade onset. The gray trace and line show the case where the amount of reward was reduced (not modeled here; see Nomoto et al., 2010). (B) (Left) Reward prediction (TD) error in the model averaged over trials with “Hard” motion coherence (coherence = 8%). (Right) Average firing rate of the same dopamine neurons as in (A) but for trials with 5% motion coherence. (Dopamine plots adapted from Nomoto et al., 2010).
FIGURE 17 | Reward prediction error at the end of a trial. (A) Average reward prediction (TD) error at the end of correct trials for the 60% coherence case (left) and the 8% coherence case (right). The vertical line denotes the time of reward delivery. Right after reward delivery, TD error is larger for the 8% coherence case due to its smaller expected value (see Figure 13). (B) Population dopamine responses of SNc neurons in the same monkey as in Figure 16 but after the monkey has made a decision and a feedback tone indicating reward or no reward is presented. (Plots adapted from Nomoto et al., 2010). The black and red traces show dopamine response for correct and error trials respectively. The gray traces show the case where the amount of reward was reduced (see Nomoto et al., 2010 for details). (C) Average reward prediction (TD) error at the end of error trials for the 60% coherence case (left) and the 8% coherence case (right). The absence of reward (or negative reward/penalty in the current model) causes a negative reward prediction error, with a slightly larger error for the higher coherence case due to its higher expected value (see Figure 13).
Embodied Evolution (Elfwing et al., 2011)
! Population: physical robots + virtual agents (15-25)
! Genes:
! weights for top layer NN: w1, w2, …, wn
! weights shaping rewards: v1, v2, …, vn
! meta-parameters: α, γ, λ, τ0, kτ
Figure 1. Two physical robots with six energy sources and the neural network controller. (a) The Cyber Rodent robots used in the experiments were equipped with infrared communication for the exchange of genotypes and cameras for visual detection of energy sources (blue), tail-lamps of other robots (green), and faces of other robots (red). (b) The control architecture consisted of a linear artificial neural network. The output of the network was the weighted sum (Σi wi xi) of the five network inputs (xi) and the five evolutionarily tuned neural network weights (wi). In each time step, if the output was less than or equal to zero then the foraging module was selected; otherwise the mating module was selected. The basic behaviors were learned by reinforcement learning with the aid of evolutionarily tuned additional reward signals and meta-parameters. The foraging module learned a foraging behavior for capturing energy sources. The mating module learned both a mating behavior for the exchange of genotypes, when a face of another robot was visible, and a waiting behavior, when no face was visible.
Temporal Discount Factor γ
! Large γ → reach for far reward
! Small γ → only to near reward
! V(t) = E[ r(t) + γr(t+1) + γ²r(t+2) + γ³r(t+3) + …]
! controls the 'character' of an agent
(Illustration: two 4-step reward sequences evaluated with large vs. small γ)
! 'no pain, no gain': -20, -20, -20, +100 → V = 18.7 with large γ
! 'better stay idle': the same sequence → V = -25.1 with small γ (Depression?)
! 'stay away from danger': +50 now, -100 later → V = -22.9 with large γ
! 'can't resist temptation': the same sequence → V = 47.3 with small γ (Impulsivity?)
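The four values on this slide follow from V = Σ_k γ^k r(t+k) with γ = 0.9 ("large") and γ = 0.3 ("small"), the same two values as the earlier grid-world example. The two reward sequences below are reconstructions chosen because they reproduce the slide's numbers exactly.

```python
# Reproducing the slide's four values from V = sum_k gamma^k r(t+k),
# with gamma = 0.9 ("large") and gamma = 0.3 ("small") as in the earlier
# grid-world example. The two reward sequences are reconstructions that
# reproduce the slide's numbers exactly.
def discounted_value(rewards, gamma):
    return sum(r * gamma ** k for k, r in enumerate(rewards))

no_pain_no_gain = [-20, -20, -20, +100]   # endure small losses for a big reward
temptation = [+50, 0, 0, -100]            # immediate gain, delayed big loss

for gamma in (0.9, 0.3):
    print(gamma,
          round(discounted_value(no_pain_no_gain, gamma), 1),
          round(discounted_value(temptation, gamma), 1))
# gamma = 0.9: V = 18.7 and -22.9 (endure the pain; avoid the danger)
# gamma = 0.3: V = -25.1 and 47.3 (stay idle; take the temptation)
```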
Neuromodulators for Metalearning (Doya, 2002)
! Metaparameter tuning is critical in RL
! How does the brain tune them?
! Dopamine: TD error δ
! Acetylcholine: learning rate α
! Noradrenaline: exploration β
! Serotonin: temporal discount γ
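Where each of these quantities enters a TD learner can be made concrete with a toy two-state chain. All numeric values below are examples; the neuromodulator mapping is the hypothesis above, not an established mechanism.

```python
import math

# Toy sketch of where each quantity in the Doya (2002) metalearning
# hypothesis enters a TD learner (all numeric values are examples):
#   delta: TD error (dopamine), alpha: learning rate (acetylcholine),
#   beta: exploration sharpness (noradrenaline), gamma: discounting (serotonin).
alpha, beta, gamma = 0.1, 2.0, 0.9

# Two-state chain s1 -> s2 -> end with rewards r(s1) = 0, r(s2) = 1.
V = {"s1": 0.0, "s2": 0.0}
for _ in range(500):
    delta1 = 0.0 + gamma * V["s2"] - V["s1"]   # TD error at s1 (dopamine)
    V["s1"] += alpha * delta1                  # learning rate (acetylcholine)
    delta2 = 1.0 + gamma * 0.0 - V["s2"]       # TD error at terminal state s2
    V["s2"] += alpha * delta2

# Softmax readout: beta (noradrenaline) sharpens the choice between
# entering the chain (value V[s1]) and a default option of value 0.
p_enter = math.exp(beta * V["s1"]) / (math.exp(beta * V["s1"]) + 1.0)
print(round(V["s1"], 2), round(V["s2"], 2), round(p_enter, 2))  # → 0.9 1.0 0.86
```

Raising γ increases V(s1) and hence the tendency to work toward the delayed reward; raising β makes the softmax choice more deterministic, which is the exploration-exploitation axis in the mapping above.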
Markov Decision Task (Tanaka et al., 2004)
! Stimulus and response
! State transition and reward functions
(Diagrams: trial timing with 0.5-2 s stimulus/response events and '+100 yen' feedback; two state-transition maps over states s1-s3 with actions a1/a2, yielding small rewards/penalties of +20/-20 at each step and, in one condition, large outcomes of +100/-100.)
Block-Design Analysis
! Different areas for immediate/future reward prediction
! SHORT vs. NO reward (p < 0.001 uncorrected): OFC, Insula, Striatum, Cerebellum
! LONG vs. SHORT (p < 0.0001 uncorrected): Cerebellum, Striatum, Dorsal raphe, DLPFC, VLPFC, IPC, PMd
Model-based Explanatory Variables
! Reward prediction V(t)
! Reward prediction error δ(t)
(Plots: simulated time courses of V(t) and δ(t) over trials 1-312 for γ = 0, 0.3, 0.6, 0.8, 0.9, 0.99.)
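Regressors like V(t) and δ(t) for a given γ can be generated by running TD learning over repeated presentations of a reward sequence and recording the TD errors. The sketch below uses an arbitrary example sequence and learning settings, not the actual task schedule.

```python
# Sketch of how model-based regressors V(t) and delta(t) can be generated
# for analysis: learn V by TD over repeated presentations of a reward
# sequence and record the TD errors. The reward sequence and learning
# settings here are arbitrary examples, not the actual task.
def td_regressors(rewards, gamma, alpha=0.1, n_trials=200):
    """Return V(t) and delta(t) (from the final trial) for one reward sequence."""
    n = len(rewards)
    V = [0.0] * (n + 1)                                    # V[n] = 0: end of trial
    for _ in range(n_trials):
        deltas = []
        for t in range(n):
            delta = rewards[t] + gamma * V[t + 1] - V[t]   # TD error
            V[t] += alpha * delta                          # value update
            deltas.append(delta)
    return V[:n], deltas

rewards = [20, -20, 20, 100]
for gamma in (0, 0.3, 0.6, 0.9, 0.99):
    V, _ = td_regressors(rewards, gamma)
    print(gamma, [round(v, 1) for v in V])
```

At γ = 0, V(t) simply tracks the immediate reward; as γ grows, V(t) increasingly reflects the discounted sum of future rewards, which is why different γ values yield distinct regressors for the fMRI analysis.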
Regression Analysis
! Reward prediction V(t): mPFC (x = -2 mm), Insula (x = -42 mm)
! Reward prediction error δ(t): Striatum (z = 2)
Tryptophan Depletion/Loading (Tanaka et al., 2007)
! Tryptophan: precursor of serotonin
! depletion/loading affect central serotonin levels (e.g. Bjork et al. 2001, Luciana et al. 2001)
! 100 g of amino acid mixture; experiments after 6 hours
! Day 1 (Tr-): no tryptophan (Depletion)
! Day 2 (Tr0): 2.3 g of tryptophan (Control)
! Day 3 (Tr+): 10.3 g of tryptophan (Loading)
Behavioral Result (Schweighofer et al., 2008)
! Extended sessions outside scanner
! Conditions: Depletion, Control, Loading
Modulation by Tryptophan Levels
Changes in Correlation Coefficient
! ROI (region of interest) analysis
! Tr- > Tr+: correlation with V at small γ (γ = 0.6; 28, 0, -4) in ventral putamen
! Tr- < Tr+: correlation with V at large γ (γ = 0.99; 16, 2, 28) in dorsal putamen
Microdialysis Experiment (Miyazaki et al., 2011, Eur J Neurosci)
Dopamine and Serotonin Responses
! Average in 30 minutes
! Serotonin (n=10)
! Dopamine (n=8)
Serotonin vs. Dopamine
! No significant positive or negative correlation
! Conditions: Immediate, Intermittent, Delayed
Effect of Serotonin Suppression (Miyazaki et al., 2012, J Neurosci)
! 5-HT1A agonist in DRN → wait errors in long-delayed reward condition
! (Plot: choice errors vs. wait errors)
Dorsal Raphe Neuron Recording (Miyazaki et al., 2011, J Neurosci)
! Putative 5-HT neurons:
! wider spikes
! low firing rate
! suppression by 5-HT1A agonist
Delayed Tone-Food-Tone-Water Task
! Tone → Food, Tone → Water
! 2 ~ 20 sec delays
(Raster plots and firing-rate histograms at the food and water sites: tone before food, tone before water.)
Error Trials in Extended Delay
! Sustained firing during waiting
! Firing drops before wait error
(Figure panels a-f: firing rates (spikes s-1) aligned to reward-site entry and to the end of the reward wait; rasters at tone, food, and water sites; average rates for success vs. error trials in the first and last 2 s of the reward wait; n = 23-43 neurons, with significant differences on error trials.)
Serotonin Experiments: Summary
! Microdialysis
! higher release for delayed reward
! no apparent opponency with dopamine
! lower release causes waiting errors
→ consistent with discounting hypothesis
! Serotonin neuron recording
! higher firing during waiting
! firing stops before giving up
→ more dynamic variable than a 'parameter'
! Question: regulation of serotonin neurons
! algorithm for regulation of patience
Collaborators
! ATR → Tamagawa U: Kazuyuki Samejima
! NAIST → Caltech → ATR → Osaka U: Saori Tanaka
! CREST → USC: Nicolas Schweighofer
! OIST: Makoto Ito, Kayoko Miyazaki, Katsuhiko Miyazaki, Takashi Nakano, Jun Yoshimoto, Eiji Uchibe, Stefan Elfwing
! Kyoto PUM: Minoru Kimura, Yasumasa Ueda
! Hiroshima U: Shigeto Yamawaki, Yasumasa Okamoto, Go Okada, Kazutaka Ueda, Shuji Asahi, Kazuhiro Shishida
! U Otago → OIST: Jeff Wickens