Welcome!
NIPS 2007 Workshop
Hierarchical organization of behavior
• Thank you for coming
• Apologies to the skiers…
• Why we will be strict about timing
• Why we want the workshop to be interactive
RL: Decision making
Goal: maximize reward (minimize punishment)
Rewards/punishments may be delayed
Outcomes may depend on a sequence of actions
→ Credit assignment problem
RL in a nutshell: formalization
states - actions - transitions - rewards - policy - long-term values
Components of an RL task
Policy: π(S,a)
State values: V(S)
State-action values: Q(S,a)
[Figure: toy task. From S1, actions L and R lead to S2 and S3; the four leaf rewards are 4, 0, 2, 2.]
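To make this concrete, here is a minimal Python encoding of the toy task in the figure (the state names and the 4/0/2/2 rewards come from the slide; the dictionary layout is our own assumption):

```python
# The slide's toy task: from S1 the actions L/R lead to S2/S3;
# at S2 and S3 the actions L/R pay the terminal rewards 4, 0, 2, 2.
T = {  # T(ransitions): (state, action) -> next state (None = terminal)
    ("S1", "L"): "S2", ("S1", "R"): "S3",
    ("S2", "L"): None, ("S2", "R"): None,
    ("S3", "L"): None, ("S3", "R"): None,
}
R = {  # R(ewards): (state, action) -> immediate reward
    ("S1", "L"): 0, ("S1", "R"): 0,
    ("S2", "L"): 4, ("S2", "R"): 0,
    ("S3", "L"): 2, ("S3", "R"): 2,
}
```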
RL in a nutshell: forward search
[Figure: forward-search tree. From S1, actions L and R lead to S2 and S3, whose own L/R branches are worth 4, 0, 2, 2.]
Model-based RL
Model = T(ransitions) and R(ewards)
learn the model through experience (cognitive map)
choosing actions is hard
goal-directed behavior; cortical
Trick #1: Long-term values are recursive
Q(S,a) = r(S,a) + V(S_next)
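As code, Trick #1 is a two-line recursion over the model; a sketch reusing T and R from the earlier snippet (the function names and the zero value at terminal states are our choices):

```python
def V(s):
    # State value: the best long-term value attainable from s
    return max(Q(s, a) for a in ("L", "R"))

def Q(s, a):
    # Trick #1: Q(S,a) = r(S,a) + V(S_next); the recursion ends at leaves
    s_next = T[(s, a)]
    return R[(s, a)] + (V(s_next) if s_next is not None else 0)

print(Q("S1", "L"), Q("S1", "R"))  # forward search from the root: 4 2
```

Every call expands the tree below the current state, which is exactly the forward search of the previous slide.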
RL in a nutshell: cached values
Model-free RL: temporal difference learning
Q(S,a) = r(S,a) + max_a' Q(S',a')
TD learning: start with an initial (wrong) Q(S,a)
PE = r(S,a) + max_a' Q(S',a') - Q(S,a)
Q(S,a)_new = Q(S,a)_old + PE
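A minimal sketch of the TD procedure, again reusing T and R; the random exploration, episode count, and learning rate alpha are our additions (the slide's update applies the PE at full strength):

```python
import random

Q_hat = {sa: 0.0 for sa in R}   # start with initial (wrong) Q(S,a)
alpha = 0.5                     # learning rate, left implicit on the slide

for episode in range(500):
    s = "S1"
    while s is not None:
        a = random.choice(("L", "R"))                  # explore at random
        s_next = T[(s, a)]
        v_next = (max(Q_hat[(s_next, b)] for b in ("L", "R"))
                  if s_next is not None else 0.0)      # max_a' Q(S',a')
        pe = R[(s, a)] + v_next - Q_hat[(s, a)]        # prediction error
        Q_hat[(s, a)] += alpha * pe                    # Q_new = Q_old + PE
        s = s_next

print(Q_hat)  # approaches the cached table shown below
```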
RL in a nutshell: cached values
Model-free RL: temporal difference learning
choosing actions is easy (but it takes lots of practice to learn)
habitual behavior; basal ganglia
Trick #2: Can learn values without a model
Q(S1,L) = 4    Q(S1,R) = 2
Q(S2,L) = 4    Q(S2,R) = 0
Q(S3,L) = 2    Q(S3,R) = 2
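Once this table is cached, choosing an action is a single lookup rather than a tree search; a sketch using Q_hat from the TD snippet above:

```python
def greedy(s):
    # Habitual control: pick the action with the highest cached value
    return max(("L", "R"), key=lambda a: Q_hat[(s, a)])

print(greedy("S1"))  # -> 'L' (cached value 4 beats 2)
```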
RL in real-world tasks…
model-based vs. model-free learning and control
[Figure: the same toy task solved both ways: the forward-search tree (model-based) alongside the cached Q table (model-free)]
Scaling problem!
Hierarchical RL: What is it?
Real-world behavior is hierarchical
Example: taking a shower
1. set water temp  2. get wet  3. shampoo  4. soap  5. turn off water  6. dry off
The "set water temp" step is itself a little policy: too cold → add hot; too hot → add cold; after each change → wait 5 sec; just right → success
simplified control, disambiguation, encapsulation
Example: making coffee
1. pour coffee  2. add sugar  3. add milk  4. stir
HRL: (in)formal framework
Termination condition = (sub)goal state
Option policy learning: via pseudo-reward (model-based or model-free)
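One way to read "option policy learning via pseudo-reward" in model-free terms: apply the same TD update inside the option, but pay a reward only at the option's (sub)goal state, which also terminates it. A hedged sketch; the function signature and the pseudo-reward magnitude of 1 are our assumptions:

```python
def option_td_update(Q_opt, s, a, s_next, actions, subgoal, alpha=0.5):
    # Same prediction-error update as ordinary TD learning, but driven by
    # a pseudo-reward that pays off only at the option's (sub)goal state.
    pseudo_r = 1.0 if s_next == subgoal else 0.0   # assumed magnitude
    done = (s_next == subgoal)                     # termination = subgoal
    v_next = 0.0 if done else max(Q_opt[(s_next, b)] for b in actions)
    Q_opt[(s, a)] += alpha * (pseudo_r + v_next - Q_opt[(s, a)])
    return done
```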
options - skills - macros - temporally abstract actions
(Sutton, McGovern, Dietterich, Barto, Precup, Singh, Parr…)
Option: set water temperature
initiation set: S1, S2, S8, …
policy: π(S1) = (0.8, 0.1, 0.1), π(S2) = (0.1, 0.1, 0.8), π(S3) = (0, 1, 0)
termination conditions: S1 (0.1), S2 (0.1), S3 (0.9)
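These three ingredients map directly onto a small data structure; in this sketch the probabilities are copied from the slide, while the field names and the action labels (guessed from the shower example) are ours:

```python
import random
from dataclasses import dataclass

@dataclass
class Option:
    initiation_set: set   # states in which the option may be invoked
    policy: dict          # state -> {action: probability}
    termination: dict     # state -> probability of terminating there

    def act(self, s):
        acts, probs = zip(*self.policy[s].items())
        return random.choices(acts, weights=probs)[0]

    def terminates(self, s):
        return random.random() < self.termination.get(s, 0.0)

set_temp = Option(
    initiation_set={"S1", "S2", "S8"},
    policy={"S1": {"add hot": 0.8, "add cold": 0.1, "wait": 0.1},
            "S2": {"add hot": 0.1, "add cold": 0.1, "wait": 0.8},
            "S3": {"add hot": 0.0, "add cold": 1.0, "wait": 0.0}},
    termination={"S1": 0.1, "S2": 0.1, "S3": 0.9},
)
```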
HRL: a toy example
S: start, G: goal
Options: going to doors
Actions: single-step moves + 2 door options
Advantages of HRL
1. Faster learning (mitigates the scaling problem)
RL: no longer 'tabula rasa'
2. Transfer of knowledge from previous tasks (generalization, shaping)
Disadvantages (or: the cost) of HRL
1. Need the 'right' options - how to learn them?
2. Suboptimal behavior ("negative transfer"; habits)
3. More complex learning/control structure
no free lunches…