Exploration in Reinforcement Learning
Jeremy Wyatt
Intelligent Robotics Lab
School of Computer Science
University of Birmingham, UK
www.cs.bham.ac.uk/research/robotics
www.cs.bham.ac.uk/~jlw
The talk in one slide
• Optimal learning problems: how to act while learning how to act
• We’re going to look at this while learning from rewards
• Old heuristic: be optimistic in the face of uncertainty
• Our method: apply the principle of optimism directly to a model of how the world works
• The approach is Bayesian
Plan
• Reinforcement Learning
• How to act while learning from rewards
• An approximate algorithm
• Results
• Learning with structure
Reinforcement Learning (RL)
• Learning from punishments and rewards
• Agent moves through world, observing states and rewards
• Adapts its behaviour to maximise some function of reward
[Figure: agent trajectory s1, s2, s3, ..., s9 with actions a1, a2, ..., a9 and rewards r1 = +3, r4 = -1, r5 = -1, r9 = +50]
Reinforcement Learning (RL)
• Let's assume our agent acts according to some rules, called a policy, π
• The return R_t is a measure of long term reward collected after time t
[Figure: example trajectory with rewards r_1 = +3, r_4 = -1, r_5 = -1, r_9 = +50]

R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma \le 1

R_0 = 3 - \gamma^3 - \gamma^4 + 50\gamma^8
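As a quick sanity check of the worked example above, here is a minimal Python sketch; the reward indexing and the value of γ are illustrative assumptions, not part of the slides.

```python
# Minimal sketch: cumulative discounted return for the example trajectory.
# r[k] is the reward r_{k+1} received k steps after t = 0.
gamma = 0.9  # illustrative discount factor, 0 <= gamma <= 1

r = [0.0] * 9
r[0] = +3    # r_1
r[3] = -1    # r_4
r[4] = -1    # r_5
r[8] = +50   # r_9

R0 = sum(gamma**k * r[k] for k in range(len(r)))
print(R0)  # equals 3 - gamma**3 - gamma**4 + 50 * gamma**8
```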
Reinforcement Learning (RL)
• Rt is a random variable
• So it has an expected value in a state under a given policy
• The RL problem is to find an optimal policy that maximises the expected value in every state
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

V^{\pi}(s) = E\{ R_t \mid s_t = s, \pi \} = E\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, \pi \Big\}
Markov Decision Processes (MDPs)
• The transitions between states are uncertain
• The probabilities depend only on the current state
• Transition matrix P, and reward function R

[Figure: two-state MDP with r = 0 in state 1 and r = 2 in state 2, actions a1 and a2, and transition probabilities p^1_{11}, p^1_{12}, p^2_{11}, p^2_{12}]

p^3_{12} = \Pr(s_{t+1} = 2 \mid s_t = 1, a_t = 3)

P = \begin{pmatrix} p^1_{11} & p^1_{12} \\ p^2_{11} & p^2_{12} \end{pmatrix}, \qquad
R = \begin{pmatrix} 0 \\ 2 \end{pmatrix}
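For concreteness, the two-state example can be written down numerically. A minimal sketch; the probability values below are invented purely for illustration.

```python
import numpy as np

# P[a, i, j] = Pr(s_{t+1} = j | s_t = i, a_t = a).  The slide's matrix shows the
# rows for one start state under each action; here we store all of them.
# The numbers are made up for illustration only.
P = np.array([
    [[0.7, 0.3],   # action a1: from state 1, from state 2
     [0.4, 0.6]],
    [[0.2, 0.8],   # action a2: from state 1, from state 2
     [0.5, 0.5]],
])

R = np.array([0.0, 2.0])  # reward received in state 1 and state 2

assert np.allclose(P.sum(axis=2), 1.0)  # each row is a probability distribution
```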
Bellman equations and bootstrapping
• Conditional independence allows us to define the expected return V* for the optimal policy in terms of a recurrence relation:

Q^*(i, a) = \sum_{j \in S} p^a_{ij} \{ R^a_{ij} + \gamma V^*(j) \}

where

V^*(i) = \max_{a \in A(i)} Q^*(i, a)

• We can use the recurrence relation to bootstrap our estimate of V in two ways

[Figure: one-step backup from state i via action a to successor states 3, 4, 5]
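When P and R are known, the recurrence can be bootstrapped directly by repeated sweeps. A minimal value-iteration sketch, assuming the simpler case where the reward depends only on the state entered (R^a_{ij} = R_j), as in the two-state example:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=1000):
    """Bootstrap V* with the Bellman recurrence.
    P[a, i, j] = p^a_ij, R[j] = reward on entering state j (a simplifying assumption)."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    Q = np.zeros((n_actions, n_states))
    for _ in range(iters):
        # Q[a, i] = sum_j p^a_ij (R_j + gamma V_j)
        Q = P @ (R + gamma * V)   # shape (n_actions, n_states)
        V = Q.max(axis=0)         # V*(i) = max_a Q*(i, a)
    return V, Q
```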
Two types of bootstrapping
• We can bootstrap using explicit knowledge of P and R (Dynamic Programming):

Q^*_{n+1}(i, a) = \sum_{j \in S} p^a_{ij} \{ R^a_{ij} + \gamma V^*_n(j) \}

[Figure: full backup from state i via action a over successor states 3, 4, 5]

• Or we can bootstrap using samples from P and R (temporal difference learning):

\hat{Q}^*(s_t, a_t) \leftarrow \hat{Q}^*(s_t, a_t) + \alpha_t \big[ r_{t+1} + \gamma \max_{b \in A} \hat{Q}^*(s_{t+1}, b) - \hat{Q}^*(s_t, a_t) \big]

[Figure: sample backup along the observed transition s_t, a_t, r_{t+1}, s_{t+1}]
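The sample-based backup is the familiar tabular Q-learning rule. A minimal sketch; the table shape and step size are assumptions for illustration:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference backup along an observed transition (s, a, r, s_next).
    Q is a (n_states, n_actions) array-like table."""
    td_target = r + gamma * Q[s_next].max()     # bootstrap from the next state
    Q[s, a] += alpha * (td_target - Q[s, a])    # move the estimate toward the target
    return Q
```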
Multi-agent RL: Learning to play football
• Learning to play in a team
• Too time consuming to do on real robots
• There is a well-established simulator league
• We can learn effectively from reinforcement
Learning to play backgammon
• TD(λ) learning and a backprop net with one hidden layer
• 1,500,000 training games (self play)
• Equivalent in skill to the top dozen human players
• Backgammon has ~10^20 states, so it can't be solved using DP
The Exploration problem: intuition
• We are learning to maximise performance
• But how should we act while learning?
• Trade-off: exploit what we know or explore to gain new information?
• Optimal learning: maximise performance while learning given your imperfect knowledge
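The crudest answer to "how should we act while learning?" is simply to randomise occasionally. The ε-greedy sketch below is the standard baseline for this trade-off; it is not the Bayesian approach developed in this talk, and the parameter values are illustrative.

```python
import random

def epsilon_greedy_action(Q, s, n_actions, epsilon=0.1):
    """Exploit the current Q estimate most of the time, explore at random otherwise."""
    if random.random() < epsilon:
        return random.randrange(n_actions)                    # explore
    return max(range(n_actions), key=lambda a: Q[s, a])       # exploit
```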
The optimal learning problem
• If we knew P it would be easy
• However …
– We estimate P from observations
– P is a random variable
– There is a density f(P) over the space of possible MDPs
– What is it? How does it help us solve our problem?
Q^*(i, a) = \sum_{j \in S} p^a_{ij} \{ R^a_{ij} + \gamma \max_{b \in A} Q^*(j, b) \}
A density over MDPs
• Suppose we've wandered around for a while
• We have a matrix M containing the transition counts
• The density over possible P depends on M: f(P|M) is a product of Dirichlet densities
[Figure: the two-state MDP (r = 0 in state 1, r = 2 in state 2, actions a1, a2) with observed transition counts m^1_{11} = 2, m^1_{12} = 4, m^2_{11} = 4, m^2_{12} = 2, and a plot of f(P|M) over (p^1_{12}, p^2_{12}) on [0, 1] x [0, 1]]

M = \begin{pmatrix} m^1_{11} & m^1_{12} \\ m^2_{11} & m^2_{12} \end{pmatrix}, \qquad
P = \begin{pmatrix} p^1_{11} & p^1_{12} \\ p^2_{11} & p^2_{12} \end{pmatrix}
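Because each row of P is a multinomial, the counts in M induce a product of Dirichlet densities. A sketch of drawing candidate transition models from f(P|M); the uniform (add-one) Dirichlet prior is an assumption for illustration:

```python
import numpy as np

def sample_P(M, rng=None):
    """Draw one transition model P ~ f(P | M).
    M[a, i, j] = number of observed transitions i -> j under action a.
    A Dirichlet(1, ..., 1) prior is assumed, so the posterior parameters are M + 1."""
    rng = np.random.default_rng() if rng is None else rng
    n_actions, n_states, _ = M.shape
    P = np.empty(M.shape, dtype=float)
    for a in range(n_actions):
        for i in range(n_states):
            P[a, i] = rng.dirichlet(M[a, i] + 1.0)   # one Dirichlet per (state, action) row
    return P
```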
A density over multinomials
[Figure: state 1 under action a1, with counts m^1_{11} = 2, m^1_{12} = 4; the count vector m_1 = (m^1_{11}, m^1_{12}), the probability vector p_1 = (p^1_{11}, p^1_{12}), and plots of the density f(p^1_{12} | m_1), which sharpens as the counts grow (e.g. m_1 = (8, 16))]
Optimal learning formally stated
• Given f(P|M), find the policy that maximises

Q^*(i, a, M) = \int Q^*(i, a, P) \, f(P \mid M) \, dP

[Figure: the density f(P|M) over (p^1_{12}, p^2_{12}) and the two-state MDP (r = 0, r = 2, actions a1, a2) with transition probabilities p^1_{11}, p^1_{12}, p^2_{11}, p^2_{12}]
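One simple way to approximate the integral is Monte Carlo: sample transition models from f(P|M), solve each one, and average the resulting Q-values. This estimates the expectation of the per-model optimal values, a common approximation rather than the full information-state solution described on the next slide. The sketch reuses the hypothetical sample_P and value_iteration helpers from the earlier sketches.

```python
def bayes_q_values(M, R, gamma=0.9, n_samples=50):
    """Monte Carlo estimate of  integral of Q*(i, a, P) f(P|M) dP.
    Relies on sample_P and value_iteration defined in the sketches above."""
    q_sum = None
    for _ in range(n_samples):
        P = sample_P(M)                        # P ~ f(P | M)
        _, Q = value_iteration(P, R, gamma)    # Q*(a, i) for this sampled model
        q_sum = Q if q_sum is None else q_sum + Q
    return q_sum / n_samples
```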
Transforming the problem
• When we evaluate the integral we get another MDP!
• This one is defined over the space of information states
• This space grows exponentially in the depth of the look-ahead

Q^*(i, a, M) = \sum_j p^a_{ij}(M) \{ R^a_{ij} + \gamma V^*(j, T^a_{ij}(M)) \}

[Figure: information-state tree rooted at (1, M); action a1 leads to the information state (2, T^1_{12}(M)) and action a2 to (3, T^2_{13}(M))]
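To see why the look-ahead blows up, here is a sketch of expanding one information state (s, M): every action/successor-state pair spawns a fresh count matrix, so the tree branches |A| x |S| ways per level. The reading of T^a_{ij}(M) as "M with the corresponding count incremented" is an assumption based on the counting model above.

```python
def successors(s, M):
    """Enumerate successor information states of (s, M).
    Assumes T^a_{sj}(M) is M with the count for transition s -> j under a incremented."""
    n_actions, n_states, _ = M.shape
    children = []
    for a in range(n_actions):
        for j in range(n_states):
            M_next = M.copy()
            M_next[a, s, j] += 1              # the posterior update T^a_{sj}(M)
            children.append((a, j, M_next))   # branch factor |A| * |S| per node
    return children
```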
A heuristic: Optimistic Model Selection
• Solving the information state space MDP is intractable
• An old heuristic is to be optimistic in the face of uncertainty
• So here we pick an optimistic P
• Find V* for that P only
• How do we pick P optimistically?
Optimistic Model Selection
• Do some DP-style bootstrapping to improve the estimated V
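The slides leave the selection step compressed, so the sketch below is only one plausible reading of the optimism idea: sample a handful of models from the posterior and keep the most optimistic one. It is an illustration, not necessarily the exact procedure used in the experiments, and it reuses the hypothetical sample_P and value_iteration helpers from earlier.

```python
import numpy as np

def optimistic_model(M, R, gamma=0.9, n_candidates=10):
    """Pick an optimistic P: sample candidates from f(P|M) and keep the one
    whose optimal value (here summed over states, an arbitrary choice) is largest."""
    best_P, best_value = None, -np.inf
    for _ in range(n_candidates):
        P = sample_P(M)                        # candidate model from the posterior
        V, _ = value_iteration(P, R, gamma)    # solve the candidate model
        if V.sum() > best_value:
            best_P, best_value = P, V.sum()
    return best_P
```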
Experimental results
Bayesian view: performance while learning
Bayesian view: policy quality
Do we really care?
• Why solve for MDPs? While challenging, they are too simple to be useful
• Structured representations are more powerful
Model-based RL: structured models
• Transition model P is represented compactly using a Dynamic Bayes Net (or factored MDP)
• V is represented as a tree
• Backups look like goal regression operators
• Converging with the AI planning community
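As a flavour of what "structured" means here, a DBN transition model factors the next state over variables, each with a small conditional probability table over a few parents, rather than one flat transition matrix. A toy sketch; the variable names and numbers are invented for illustration.

```python
# Toy factored (DBN) transition model: each state variable at time t+1 depends
# only on a few parent variables at time t, so the model is exponentially
# smaller than a flat transition matrix.  All names and numbers are illustrative.
dbn = {
    "has_coffee": {            # Pr(has_coffee' = True | parents)
        "parents": ["has_coffee", "buy_coffee"],
        "cpt": {(True, False): 0.9, (True, True): 1.0,
                (False, True): 0.8, (False, False): 0.0},
    },
    "wet": {                   # Pr(wet' = True | parents)
        "parents": ["wet", "umbrella"],
        "cpt": {(True, False): 1.0, (True, True): 0.2,
                (False, True): 0.0, (False, False): 0.3},
    },
}

def transition_prob(var, next_true, state):
    """Probability that `var` is True at t+1, given the current factored state (a dict)."""
    node = dbn[var]
    p_true = node["cpt"][tuple(state[p] for p in node["parents"])]
    return p_true if next_true else 1.0 - p_true
```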
Structured Exploration: results
Challenge: Learning with Hidden State
• Learning in a POMDP, or k-Markov environment
• Planning in POMDPs is intractable
• Factored POMDPs look promising
• POMDPs are the basis of the state of the art in mobile robotics
[Figure: POMDP as a dynamic Bayes net over time — actions a_t, a_{t+1}, a_{t+2}, rewards r_{t+1}, r_{t+2}, hidden states s_t, s_{t+1}, s_{t+2}, and observations o_t, o_{t+1}, o_{t+2}]
Wrap up
• RL is a class of problems
• We can pose some optimal learning problems elegantly in this framework
• Can’t be perfect, but we can do alright
• BUT: probabilistic representations, while very useful in many fields, are a frequent source of intractability
• General probabilistic representations are best avoided
• How?
Cumulative Discounted Return
[Results plots]

Policy Quality
[Results plots]