Exploration in Reinforcement Learning
Jeremy Wyatt
Intelligent Robotics Lab
School of Computer Science
University of Birmingham, UK
www.cs.bham.ac.uk/research/robotics
www.cs.bham.ac.uk/~jlw
2
The talk in one slide
• Optimal learning problems: how to act while learning how to act
• We’re going to look at this while learning from rewards
• Old heuristic: be optimistic in the face of uncertainty
• Our method: apply the principle of optimism directly to a model of how the world works
• Bayesian
3
Plan
• Reinforcement Learning
• How to act while learning from rewards
• An approximate algorithm
• Results
• Learning with structure
4
Reinforcement Learning (RL)
• Learning from punishments and rewards
• Agent moves through world, observing states and rewards
• Adapts its behaviour to maximise some function of reward
[Figure: an agent's trajectory through states s1, s2, s3, ..., taking actions a1, a2, a3, ... and receiving rewards r1 = +3, r4 = -1, r5 = -1, r9 = +50]
5
Reinforcement Learning (RL)
• Let's assume our agent acts according to some rules, called a policy, π
• The return Rt is a measure of long term reward collected after time t
[Figure: the same trajectory, with rewards r1 = +3, r4 = -1, r5 = -1, r9 = +50]

R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

R_0 = 3 - \gamma^3 - \gamma^4 + 50\gamma^8, \qquad 0 \le \gamma \le 1
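As a concrete check, the example return above can be computed numerically. This is a minimal sketch: the rewards come from the example trajectory, and the choice γ = 0.9 is arbitrary.

```python
# Discounted return R_t = sum_k gamma^k * r_{t+k+1}.
# Rewards from the example trajectory: r1 = +3, r4 = -1, r5 = -1, r9 = +50
# (all other rewards are zero).

def discounted_return(rewards, gamma):
    """Sum gamma^k * r_{k+1} over the trajectory, with k starting at 0."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [3, 0, 0, -1, -1, 0, 0, 0, 50]  # r1 ... r9
gamma = 0.9
R0 = discounted_return(rewards, gamma)
# Matches the closed form 3 - gamma^3 - gamma^4 + 50 * gamma^8
assert abs(R0 - (3 - gamma**3 - gamma**4 + 50 * gamma**8)) < 1e-12
```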
6
Reinforcement Learning (RL)
• Rt is a random variable
• So it has an expected value in a state under a given policy
• RL problem is to find optimal policy that maximises the expected value in every state
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

V^{\pi}(s) = E\{R_t \mid s_t = s, \pi\} = E\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, \pi \right\}
7
Markov Decision Processes (MDPs)
• The transitions between states are uncertain
• The probabilities depend only on the current state
• Transition matrix P, and reward function R
[Figure: a two-state MDP with actions a1 and a2; state 1 has reward r = 0, state 2 has reward r = 2, and the arcs are labelled with transition probabilities p^1_{11}, p^1_{12}, p^2_{11}, p^2_{12}]

p^3_{12} = \Pr(s_{t+1} = 2 \mid s_t = 1, a_t = 3)

P = \begin{pmatrix} p^1_{11} & p^1_{12} \\ p^2_{11} & p^2_{12} \end{pmatrix}, \qquad R = \begin{pmatrix} 0 \\ 2 \end{pmatrix}
8
Bellman equations and bootstrapping
• Conditional independence allows us to define the expected return V* for the optimal policy in terms of a recurrence relation:

Q^*(i,a) = \sum_{j \in S} p^a_{ij} \{ R^a_{ij} + \gamma V^*(j) \}

where

V^*(i) = \max_{a \in A(i)} Q^*(i,a)

• We can use the recurrence relation to bootstrap our estimate of V in two ways

[Figure: a one-step backup diagram — action a from a state leads to successor states 3, 4 and 5]
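The recurrence above can be turned into a small value-iteration sketch on the two-state example MDP. The transition probabilities below are illustrative (the slides leave them symbolic); rewards are attached to the arrival state, as in the example.

```python
# Value iteration: bootstrap V using explicit knowledge of P and R.
# P[a][i][j] = Pr(next state j | state i, action a); numbers are illustrative.
P = [
    [[0.8, 0.2], [0.3, 0.7]],  # action a1
    [[0.4, 0.6], [0.9, 0.1]],  # action a2
]
R = [0.0, 2.0]  # reward received on arriving in each state
gamma = 0.9

V = [0.0, 0.0]
for _ in range(1000):  # repeated Bellman backups converge to V*
    V = [
        max(
            sum(P[a][i][j] * (R[j] + gamma * V[j]) for j in range(2))
            for a in range(2)
        )
        for i in range(2)
    ]
# V now approximates V*(1), V*(2); the rewarding state has the higher value
```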
9
Two types of bootstrapping
• We can bootstrap using explicit knowledge of P and R (Dynamic Programming)
• Or we can bootstrap using samples from P and R (temporal difference learning)
Q^*_{n+1}(i,a) = \sum_{j \in S} p^a_{ij} \{ R^a_{ij} + \gamma V^*_n(j) \}

[Figure: the same backup diagram — action a leading to successor states 3, 4 and 5]

\hat{Q}^*_{t+1}(s_t,a_t) = \hat{Q}^*_t(s_t,a_t) + \alpha_t \left[ r_{t+1} + \gamma \max_{b \in A} \hat{Q}^*_t(s_{t+1},b) - \hat{Q}^*_t(s_t,a_t) \right]

[Figure: a single sampled transition — state s_t, action a_t, reward r_{t+1}, next state s_{t+1}]
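The sampled (temporal-difference) backup can be sketched as a one-line tabular Q-learning update; the transition fed in below is hand-coded for illustration rather than drawn from a real environment.

```python
# Q-learning: bootstrap from a sampled transition (s_t, a_t, r_{t+1}, s_{t+1})
# instead of from explicit knowledge of P and R.

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """One temporal-difference backup on the tabular estimate Q."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])

# Tiny example: 2 states, 2 actions, Q stored as a list of lists.
Q = [[0.0, 0.0], [0.0, 0.0]]
q_update(Q, s=0, a=1, r=2.0, s_next=1, alpha=0.1, gamma=0.9)
# Q[0][1] moved 10% of the way towards r + gamma * max_b Q[1][b] = 2.0,
# so it is now 0.2; all other entries are untouched
```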
10
Multi-agent RL: Learning to play football
• Learning to play in a team
• Too time consuming to do on real robots
• There is a well established simulator league
• We can learn effectively from reinforcement
11
Learning to play backgammon
• TD(λ) learning and a Backprop net with one hidden layer
• 1,500,000 training games (self play)
• Equivalent in skill to the top dozen human players
• Backgammon has ~10^20 states, so can't be solved using DP
12
The Exploration problem: intuition
• We are learning to maximise performance
• But how should we act while learning?
• Trade-off: exploit what we know or explore to gain new information?
• Optimal learning: maximise performance while learning given your imperfect knowledge
13
The optimal learning problem
• If we knew P it would be easy
• However …
– We estimate P from observations
– P is a random variable
– There is a density f(P) over the space of possible MDPs
– What is it? How does it help us solve our problem?
Q^*(i,a) = \sum_{j \in S} p^a_{ij} \{ R^a_{ij} + \gamma \max_{b \in A(j)} Q^*(j,b) \}
14
A density over MDPs
• Suppose we've wandered around for a while
• We have a matrix M containing the transition counts
• The density over possible P depends on M: f(P|M) is a product of Dirichlet densities
[Figure: the two-state MDP (r = 0 in state 1, r = 2 in state 2) annotated with transition counts m^1_{11} = 2, m^1_{12} = 4, m^2_{11} = 4, m^2_{12} = 2, and a density plot over (p^1_{12}, p^2_{12}) on [0,1]^2]

M = \begin{pmatrix} m^1_{11} & m^1_{12} \\ m^2_{11} & m^2_{12} \end{pmatrix}, \qquad P = \begin{pmatrix} p^1_{11} & p^1_{12} \\ p^2_{11} & p^2_{12} \end{pmatrix}
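The Dirichlet bookkeeping can be sketched with the standard library alone. This assumes a uniform Dirichlet(1, ..., 1) prior, which the slides leave implicit; the counts are those of the example above.

```python
# Maintaining f(P | M): for each (state, action) row, the posterior over the
# next-state distribution is Dirichlet(prior + counts).

import random

def dirichlet_sample(alphas, rng=random):
    """Sample a probability vector from Dirichlet(alphas) via Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

# Counts from the example: action a1 from state 1 went to state 1 twice
# and to state 2 four times.
m = [2, 4]
prior = [1, 1]                                   # assumed uniform prior
posterior = [p + c for p, c in zip(prior, m)]    # Dirichlet(3, 5)

p_mean = [a / sum(posterior) for a in posterior]  # posterior mean of the row
p_draw = dirichlet_sample(posterior)              # one plausible transition row
```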
15
A density over multinomials
[Figure: for state 1 under action a1, with counts m^1 = (m^1_{11}, m^1_{12}) = (2, 4) and p^1_{11} + p^1_{12} = 1, the Dirichlet density f(p^1_{12} | m^1) over p^1_{12} ∈ [0, 1]; a second plot with larger counts shows the density sharpening as more data arrive]
16
Optimal learning formally stated
• Given f(P|M), find the action a that maximises

\bar{Q}^*(i, a, M) = \int Q^*(i, a, P) \, f(P \mid M) \, dP

[Figure: the two-state MDP and the density over (p^1_{12}, p^2_{12})]
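The integral over models can be approximated by Monte Carlo: sample MDPs from the posterior, solve each, and average. This is an illustrative sketch, not the talk's method; the posterior counts, rewards, and problem size are all made up for the example.

```python
# Estimate Q-bar*(i, a, M) = ∫ Q*(i, a, P) f(P | M) dP by averaging over
# models P sampled from per-row Dirichlet posteriors.

import random

def dirichlet_sample(alphas):
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(draws)
    return [d / s for d in draws]

def solve_q(P, R, gamma, iters=500):
    """Value iteration; returns Q*(i, a) for the sampled model P."""
    n, na = len(R), len(P)
    V = [0.0] * n
    for _ in range(iters):
        Q = [[sum(P[a][i][j] * (R[j] + gamma * V[j]) for j in range(n))
              for a in range(na)] for i in range(n)]
        V = [max(row) for row in Q]
    return Q

# Posterior counts (prior + observations), one row per action and state.
M = [[[3, 5], [5, 3]],   # action a1: rows for states 1 and 2
     [[2, 2], [4, 2]]]   # action a2
R, gamma, n_samples = [0.0, 2.0], 0.9, 200

acc = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(n_samples):
    P = [[dirichlet_sample(row) for row in action] for action in M]
    Q = solve_q(P, R, gamma)
    for i in range(2):
        for a in range(2):
            acc[i][a] += Q[i][a] / n_samples
# acc[i][a] now estimates Q-bar*(i, a, M)
```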
17
Transforming the problem
• When we evaluate the integral we get another MDP!
• This one is defined over the space of information states
• This space grows exponentially in the depth of the look-ahead

Q^*(i, a, M) = \sum_{j} \bar{p}^a_{ij}(M) \{ R^a_{ij} + \gamma V^*(j, T^a_{ij}(M)) \}

where T^a_{ij}(M) is the count matrix M updated after observing a transition from i to j under action a

[Figure: a look-ahead tree of information states — from the root (1, M), action a1 leads to (2, T^1_{12}(M)) and action a2 leads to (3, T^2_{13}(M))]
18
A heuristic: Optimistic Model Selection
• Solving the information state space MDP is intractable
• An old heuristic is to be optimistic in the face of uncertainty
• So here we pick an optimistic P
• Find V* for that P only
• How do we pick P optimistically?
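The slides that answer this question are not in this excerpt. One simple illustrative scheme, in the spirit of the heuristic, is to draw several models from the Dirichlet posterior, solve each, and keep the one whose optimal value is highest; every number below is made up for the example.

```python
# Picking an optimistic P by posterior sampling (illustrative only):
# sample models from f(P | M), value-iterate each, keep the best.

import random

def dirichlet_sample(alphas):
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(draws)
    return [d / s for d in draws]

def solve_v(P, R, gamma, iters=500):
    """Value iteration; returns V* for the model P."""
    n = len(R)
    V = [0.0] * n
    for _ in range(iters):
        V = [max(sum(P[a][i][j] * (R[j] + gamma * V[j]) for j in range(n))
                 for a in range(len(P)))
             for i in range(n)]
    return V

M = [[[3, 5], [5, 3]], [[2, 2], [4, 2]]]  # posterior counts per action/state
R, gamma = [0.0, 2.0], 0.9

best_P, best_value = None, float("-inf")
for _ in range(50):
    P = [[dirichlet_sample(row) for row in action] for action in M]
    v = sum(solve_v(P, R, gamma))  # optimism score: total value over states
    if v > best_value:
        best_P, best_value = P, v
# best_P is the most optimistic of the sampled models; acting greedily with
# respect to its V* is one way to act optimistically under uncertainty
```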
23
Do we really care?
• Why solve for MDPs? While challenging, they are too simple to be useful
• Structured representations are more powerful
24
Model-based RL: structured models
• Transition model P is represented compactly using a Dynamic Bayes Net (or factored MDP)
• V is represented as a tree
• Backups look like goal regression operators
• Converging with the AI planning community
26
Challenge: Learning with Hidden State
• Learning in a POMDP, or k-Markov environment
• Planning in POMDPs is intractable
• Factored POMDPs look promising
• POMDPs are the basis of the state of the art in mobile robotics
[Figure: a POMDP as a Dynamic Bayes Net — hidden states s_t, s_{t+1}, s_{t+2}, observations o_t, o_{t+1}, o_{t+2}, actions a_t, a_{t+1}, a_{t+2}, and rewards r_{t+1}, r_{t+2}]
27
Wrap up
• RL is a class of problems
• We can pose some optimal learning problems elegantly in this framework
• Can’t be perfect, but we can do alright
• BUT: probabilistic representations, while very useful in many fields, are a frequent source of intractability
• General probabilistic representations are best avoided
• How?