Exploration in Reinforcement Learning
Jeremy Wyatt
Intelligent Robotics Lab, School of Computer Science, University of Birmingham, UK
jlw@cs.bham.ac.uk
www.cs.bham.ac.uk/research/robotics
www.cs.bham.ac.uk/~jlw

2

The talk in one slide

• Optimal learning problems: how to act while learning how to act

• We’re going to look at this while learning from rewards

• Old heuristic: be optimistic in the face of uncertainty

• Our method: apply principle of optimism directly to model of how the world works

• Bayesian

3

Plan

• Reinforcement Learning

• How to act while learning from rewards

• An approximate algorithm

• Results

• Learning with structure

4

Reinforcement Learning (RL)

• Learning from punishments and rewards

• Agent moves through world, observing states and rewards

• Adapts its behaviour to maximise some function of reward

[Figure: a trajectory through the world: states s1, s2, s3, s4, s5, …, s9, actions a1, a2, a3, a4, a5, …, a9, and rewards r1 = +3, r4 = -1, r5 = -1, r9 = +50]

5

Reinforcement Learning (RL)

• Let's assume our agent acts according to some rules, called a policy, π

• The return Rt is a measure of long term reward collected after time t

[Figure: the same trajectory, with rewards r1 = +3, r4 = -1, r5 = -1, r9 = +50]

$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$

$R_0 = 3 - \gamma^3 - \gamma^4 + \gamma^8 \cdot 50, \qquad 0 \le \gamma \le 1$
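As a concrete illustration (mine, not from the slides), a minimal Python sketch that evaluates a discounted return for the reward sequence in the trajectory figure; the zero-padded reward list and the choice γ = 0.9 are assumptions made purely for the example.

```python
# Minimal sketch: discounted return R_0 = sum_k gamma^k * r_{k+1}.
# The reward sequence mirrors the trajectory figure (r1 = +3, r4 = -1, r5 = -1,
# r9 = +50, all other rewards 0); gamma = 0.9 is an arbitrary illustrative choice.

def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] for k = 0, 1, 2, ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [3, 0, 0, -1, -1, 0, 0, 0, 50]      # r_1 ... r_9
print(discounted_return(rewards, gamma=0.9))  # = 3 - 0.9**3 - 0.9**4 + 0.9**8 * 50
```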

6

Reinforcement Learning (RL)

• Rt is a random variable

• So it has an expected value in a state under a given policy

• RL problem is to find optimal policy that maximises the expected value in every state

$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$

$V^{\pi}(s) = E\{ R_t \mid s_t = s, \pi \} = E\{ \textstyle\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, \pi \}$

7

Markov Decision Processes (MDPs)

• The transitions between states are uncertain

• The probabilities depend only on the current state

• Transition matrix P, and reward function R

[Figure: a two-state MDP; from state 1, action a1 has reward r = 0 and action a2 has reward r = 2, with transition probabilities p^1_11, p^1_12, p^2_11, p^2_12]

$p^a_{ij} = \Pr(s_{t+1} = j \mid s_t = i, a_t = a)$

$P = \begin{pmatrix} p^1_{11} & p^1_{12} \\ p^2_{11} & p^2_{12} \end{pmatrix}, \qquad R = \begin{pmatrix} 0 \\ 2 \end{pmatrix}$
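As an aside (not part of the slides), a minimal Python sketch of how such an MDP can be held in arrays; the array layout P[a, i, j], the concrete probability values, and the action-indexed reward vector are assumptions for illustration.

```python
import numpy as np

# Transition tensor P[a, i, j] = Pr(s' = j | s = i, action = a) for the
# two-state, two-action MDP in the figure. The probability values are made up;
# only the row-stochastic structure matters here.
P = np.array([
    [[0.7, 0.3],    # action a1 from state 1
     [0.4, 0.6]],   # action a1 from state 2
    [[0.2, 0.8],    # action a2 from state 1
     [0.5, 0.5]],   # action a2 from state 2
])

# Reward for taking each action (r = 0 for a1, r = 2 for a2), as in the figure.
R = np.array([0.0, 2.0])

assert np.allclose(P.sum(axis=2), 1.0)  # every (a, i) row is a distribution
```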

8

Bellman equations and bootstrapping

• Conditional independence allows us to define the expected return V* for the optimal policy in terms of a recurrence relation:

$Q^*(i,a) = \sum_{j \in S} p^a_{ij} \{ R^a_{ij} + \gamma V^*(j) \}$

where

$V^*(i) = \max_{a \in A(i)} Q^*(i,a)$

• We can use the recurrence relation to bootstrap our estimate of V in two ways
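As an illustration of the first kind of bootstrapping (my sketch, not code from the talk), value iteration applies the recurrence repeatedly when P and R are known; the toy arrays and γ = 0.9 are assumptions carried over from the two-state example.

```python
import numpy as np

# Value iteration: repeatedly apply Q(i,a) = sum_j P[a,i,j] * (R[a] + gamma * V(j)),
# then V(i) = max_a Q(i,a). P, R and gamma are illustrative assumptions
# (rewards depend only on the action here, a simplification of R^a_ij).
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.2, 0.8], [0.5, 0.5]]])   # P[a, i, j]
R = np.array([0.0, 2.0])                   # reward for action a
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    Q = np.array([[sum(P[a, i, j] * (R[a] + gamma * V[j]) for j in range(2))
                   for a in range(2)] for i in range(2)])   # Q[i, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-9:
        break
    V = V_new

print(V)   # converged estimate of V*
```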

9

Two types of bootstrapping

• We can bootstrap using explicit knowledge of P and R (Dynamic Programming)

• Or we can bootstrap using samples from P and R (temporal difference learning)

$Q^*_{n+1}(i,a) = \sum_{j \in S} p^a_{ij} \{ R^a_{ij} + \gamma V^*_n(j) \}$

$\hat{Q}^*_{t+1}(s_t,a_t) = \hat{Q}^*_t(s_t,a_t) + \alpha \big[ r_{t+1} + \gamma \max_{b \in A} \hat{Q}^*_t(s_{t+1},b) - \hat{Q}^*_t(s_t,a_t) \big]$

[Figure: a sampled transition from s_t under a_t to s_t+1 with reward r_t+1]
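For the sample-based case, a minimal tabular Q-learning sketch (my illustration): it applies the temporal difference update above to samples drawn from the same toy model. The ε-greedy behaviour policy, the step size α = 0.1, and the run length are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-state MDP used only as a sampler (assumed for illustration).
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.2, 0.8], [0.5, 0.5]]])   # P[a, s, s']
R = np.array([0.0, 2.0])                   # reward for action a
gamma, alpha, eps = 0.9, 0.1, 0.1

Q = np.zeros((2, 2))                       # Q[s, a]
s = 0
for t in range(20000):
    # epsilon-greedy action selection
    a = int(rng.integers(2)) if rng.random() < eps else int(Q[s].argmax())
    s_next = int(rng.choice(2, p=P[a, s]))
    r = R[a]
    # TD bootstrap: move Q(s,a) toward r + gamma * max_b Q(s', b)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(Q)   # tabular estimate of Q*
```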

10

Multi-agent RL: Learning to play football

• Learning to play in a team

• Too time consuming to do on real robots

• There is a well established simulator league

• We can learn effectively from reinforcement

11

Learning to play backgammon

• TD(λ) learning and a Backprop net with one hidden layer

• 1,500,000 training games (self play)

• Equivalent in skill to the top dozen human players

• Backgammon has ~10^20 states, so can't be solved using DP

12

The Exploration problem: intuition

• We are learning to maximise performance

• But how should we act while learning?

• Trade-off: exploit what we know or explore to gain new information?

• Optimal learning: maximise performance while learning given your imperfect knowledge

13

The optimal learning problem

• If we knew P it would be easy

• However …

– We estimate P from observations

– P is a random variable

– There is a density f(P) over the space of possible MDPs

– What is it? How does it help us solve our problem?

$Q^*(i,a) = \sum_{j \in S} p^a_{ij} \{ R^a_{ij} + \gamma \max_{b \in A} Q^*(j,b) \}$

14

A density over MDPs

• Suppose we've wandered around for a while

• We have a matrix M containing the transition counts

• The density over possible P depends on M: f(P|M) is a product of Dirichlet densities

[Figure: the two-state MDP annotated with transition counts m^1_11 = 2, m^1_12 = 4, m^2_11 = 4, m^2_12 = 2; the count matrix M = (m^1_11 m^1_12; m^2_11 m^2_12); and the density f(P|M) plotted over (p^1_12, p^2_12) on [0,1] x [0,1]]
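A minimal sketch (my addition) of maintaining the counts M and drawing plausible transition models from the product-of-Dirichlets density f(P|M); the uniform prior pseudo-count of 1 and the zero counts for state 2 are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Transition counts M[a, i, j]: how often action a taken in state i led to state j.
# The state-1 rows match the figure (m^1_11 = 2, m^1_12 = 4, m^2_11 = 4, m^2_12 = 2);
# state-2 counts are not shown on the slide, so they are left at zero here.
M = np.array([[[2., 4.], [0., 0.]],
              [[4., 2.], [0., 0.]]])

prior = 1.0  # uniform Dirichlet prior pseudo-count (an assumption)

def sample_P(M, prior, rng):
    """Draw one transition model P ~ f(P | M), a product of Dirichlet densities."""
    P = np.empty_like(M)
    for a in range(M.shape[0]):
        for i in range(M.shape[1]):
            P[a, i] = rng.dirichlet(M[a, i] + prior)
    return P

print(sample_P(M, prior, rng))   # one plausible MDP under the current counts
```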

15

A density over multinomials

[Figure: counts for a single state-action pair, m^1 = (m^1_11, m^1_12) = (2, 4); the corresponding multinomial parameters p^1 = (p^1_11, p^1_12); and the Dirichlet density f(p^1_12 | m^1) on [0,1], which sharpens as the counts grow, e.g. to m^1 = (8, 16)]

16

Optimal learning formally stated

• Given f(P|M), find the action a that maximises

$Q^*(i,a,M) = \int Q^*(i,a,P) \, f(P \mid M) \, dP$

[Figure: the two-state MDP and the density f(P|M) plotted over (p^1_12, p^2_12)]
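One straightforward way to approximate this integral (my illustration, not necessarily the approach taken in the talk) is Monte Carlo: sample transition models from f(P|M), solve each sampled MDP exactly, and average the resulting Q-values. The counts, the action-indexed rewards, γ, and the sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

M = np.array([[[2., 4.], [3., 3.]],
              [[4., 2.], [1., 5.]]])      # transition counts M[a, i, j] (assumed)
R = np.array([0.0, 2.0])                  # reward for each action (as in the toy MDP)
gamma, n_samples = 0.9, 500

def solve_Q(P, R, gamma, iters=300):
    """Q* for a known model P via value iteration."""
    V = np.zeros(P.shape[1])
    for _ in range(iters):
        Q = gamma * np.einsum('aij,j->ia', P, V) + R[None, :]   # Q[i, a]
        V = Q.max(axis=1)
    return Q

# Monte Carlo estimate of Q*(i, a, M) = integral of Q*(i, a, P) f(P | M) dP
Q_sum = np.zeros((2, 2))
for _ in range(n_samples):
    P = np.array([[rng.dirichlet(M[a, i] + 1.0) for i in range(2)]
                  for a in range(2)])      # one sample from the Dirichlet product
    Q_sum += solve_Q(P, R, gamma)

print(Q_sum / n_samples)                  # Bayes-expected Q-values, indexed [i, a]
```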

17

Transforming the problem

• When we evaluate the integral we get another MDP!

• This one is defined over the space of information states

• This space grows exponentially in the depth of the look ahead

$Q^*(i,a,M) = \sum_{j} p^a_{ij}(M) \{ R^a_{ij} + \gamma V^*(j, T^a_{ij}(M)) \}$

[Figure: an information-state tree rooted at (1, M); action a1 leads to (2, T^1_12(M)) and action a2 leads to (3, T^2_13(M))]

18

A heuristic: Optimistic Model Selection

• Solving the information state space MDP is intractable

• An old heuristic is to be optimistic in the face of uncertainty

• So here we pick an optimistic P

• Find V* for that P only

• How do we pick P optimistically?

19

Optimistic Model Selection

• Do some DP-style bootstrapping to improve the estimated V
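A hedged sketch of one way "pick an optimistic P" could be realised (my guess at a plausible rule, not necessarily the selection rule used in the talk): start from the posterior counts behind f(P|M) and shift a bounded amount of probability mass toward the successor state with the highest current value estimate. The bonus pseudo-count and prior are assumed tuning parameters.

```python
import numpy as np

def optimistic_model(M, V, prior=1.0, bonus=1.0):
    """Build an optimistic transition model from counts M[a, i, j] and the current
    value estimates V: add 'bonus' pseudo-counts in favour of the best-looking
    successor, then normalise. The optimism fades as the real counts grow."""
    counts = M + prior
    P_opt = np.empty_like(counts)
    best_j = int(np.argmax(V))             # successor state we are optimistic about
    for a in range(M.shape[0]):
        for i in range(M.shape[1]):
            c = counts[a, i].copy()
            c[best_j] += bonus
            P_opt[a, i] = c / c.sum()
    return P_opt

# Usage: pick the optimistic model, then do DP-style bootstrapping (e.g. value
# iteration) on it to improve the estimated V, and repeat as counts accumulate.
M = np.array([[[2., 4.], [3., 3.]], [[4., 2.], [1., 5.]]])   # counts (assumed)
V = np.array([0.0, 1.0])                                     # current value estimates
print(optimistic_model(M, V))
```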

20

Experimental results

21

Bayesian view: performance while learning

22

Bayesian view: policy quality

23

Do we really care?

• Why solve for MDPs? While challenging, they are too simple to be useful

• Structured representations are more powerful

24

Model-based RL: structured models

• Transition model P is represented compactly using a Dynamic Bayes Net (or factored MDP)

• V is represented as a tree

• Backups look like goal regression operators

• Converging with the AI planning community
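To make the factored idea concrete, a minimal sketch (mine, not the talk's code): each state variable's next value depends on a few parent variables only, so the model for an action is a handful of small conditional probability tables rather than one large transition matrix. The variables, action name, and CPT values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Factored state = (has_key, door_open), each binary. Under the action "open_door",
# has_key persists and door_open depends only on (has_key, door_open): two tiny
# CPTs instead of a 4 x 4 transition matrix.
# cpt_door[k, d] = Pr(door_open' = 1 | has_key = k, door_open = d)
cpt_door = np.array([[0.0, 1.0],    # no key: a closed door stays closed, an open one stays open
                     [0.9, 1.0]])   # with key: usually opens, stays open once open

def step_open_door(state, rng):
    """Sample the next factored state under the 'open_door' action."""
    has_key, door_open = state
    door_next = int(rng.random() < cpt_door[has_key, door_open])
    return (has_key, door_next)

print(step_open_door((1, 0), rng))  # (1, 1) with probability 0.9
```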

25

Structured Exploration: results

26

Challenge: Learning with Hidden State

• Learning in a POMDP, or k-Markov environment

• Planning in POMDPs is intractable

• Factored POMDPs look promising

• POMDPs are the basis of the state of the art in mobile robotics

[Figure: a POMDP drawn as a Dynamic Bayes Net: hidden states s_t, s_t+1, s_t+2; observations o_t, o_t+1, o_t+2; actions a_t, a_t+1, a_t+2; rewards r_t+1, r_t+2]

27

Wrap up

• RL is a class of problems

• We can pose some optimal learning problems elegantly in this framework

• Can’t be perfect, but we can do alright

• BUT: probabilistic representations, while very useful in many fields, are a frequent source of intractability

• General probabilistic representations are best avoided

• How?

28

Cumulative Discounted Return

29

Cumulative Discounted Return

30

Cumulative Discounted Return

31

Policy Quality

32

Policy Quality

33

Policy Quality