RL for Large State Spaces: Policy Gradient

1

RL for Large State Spaces: Policy Gradient

Alan Fern

2

RL via Policy Gradient Searchh So far all of our RL techniques have tried to learn an exact

or approximate value function or Q-function5 Learn optimal value of being in a state, or taking an action from state.

h Value functions can often be much more complex to represent than the corresponding policy5 Do we really care about knowing Q(s,left) = 0.3554, Q(s,right) = 0.5335 Or just that “right is better than left in state s”

h Motivates searching directly in a parameterized policy space5 Bypass learning value function and “directly” optimize the value of a policy

3

Reminder: Gradient Ascenth Given a function f(1,…, n) of n real values =(1,…, n)

suppose we want to maximize f with respect to

h A common approach to doing this is gradient ascenth The gradient of f at point , denoted by f(), is an

n-dimensional vector that points in the direction where f increases most steeply at point

h Vector calculus tells us that f() is just a vector of partial derivatives

where

)(),,,,,(lim)( 111

0

fff niii

i

n

fff

)(,,)()(1

4

Aside: Gradient Ascenth Gradient ascent iteratively follows the gradient direction

starting at some initial point

5 Initialize to a random value

5 Repeat until stopping condition

)( f

𝜃1

𝜃2

Local optima of f

With proper decay of learning rate gradient descent is guaranteed to converge to local optima.

5

RL via Policy Gradient Ascenth The policy gradient approach has the following schema:

1. Select a space of parameterized policies

2. Compute the gradient of the value of current policy wrt parameters

3. Move parameters in the direction of the gradient

4. Repeat these steps until we reach a local maxima

5. Possibly also add in tricks for dealing with bad local maxima (e.g.

random restarts)

h So we must answer the following questions:5 How should we represent parameterized policies?

5 How can we compute the gradient?

6

Parameterized Policiesh One example of a space of parametric policies is:

where may be a linear function, e.g.

h The goal is to learn parameters that give a good policyh Note that it is not important that be close to the

actual Q-function5 Rather we only require is good at ranking actions in order of

goodness

),(...),(),(),(ˆ 22110 asfasfasfasQ nn

),(ˆmaxarg)( asQsa

),(ˆ asQ

),(ˆ asQ

),(ˆ asQ

7

Policy Gradient Ascenth For simplicity we will make the following assumptions:

5 Each run/trajectory of a policy starts from a fixed initial state5 Each run/trajectory always reaches a terminal state in a finite

number of steps (alternatively we fix the horizon to H)

h Let () be expected value of policy at initial state5 () is just the expected discounted total reward of a trajectory of

h Our objective is to find a that maximizes ()

8

Policy Gradient Ascent

h Policy gradient ascent tells us to iteratively update parameters via:

h Problem: () is generally very complex and it is rare

that we can compute a closed form for the gradient of () even if we have an exact model of the system.

h Key idea: estimate the gradient based on experience

)(

9

Gradient Estimationh Concern: Computing or estimating the gradient of

discontinuous functions can be problematic.

h For our example parametric policy

is () continuous? h No.

5 There are values of where arbitrarily small changes, cause the policy to change.

5 Since different policies can have different values this means that changing can cause discontinuous jump of ().

),(ˆmaxarg)( asQsa

10

Example: Discontinous ()

h Consider a problem with initial state s and two actions a1 and a2 5 a1 leads to a very large terminal reward R1 5 a2 leads to a very small terminal reward R2

h Fixing 2 to a constant we can plot the ranking assigned to each action by Q and the corresponding value ()

),(),(maxarg),(ˆmaxarg)( 2211 asfasfasQsaa

1

)1,(ˆ asQ

)2,(ˆ asQ

1

()

R1

R2

Discontinuity in () when ordering of a1 and a2 change

11

Probabilistic Policiesh We would like to avoid policies that drastically change with

small parameter changes, leading to discontinuities

h A probabilistic policy takes a state as input and returns a distribution over actions 5 Given a state s (s,a) returns the probability that selects action a in s

h Note that () is still well defined for probabilistic policies5 Now uncertainty of trajectories comes from environment and policy5 Importantly if (s,a) is continuous relative to changing then () is also

continuous relative to changing

h A common form for probabilistic policies is the softmax function or Boltzmann exploration function

Aa

asQasQsaas

'

)',(ˆexp),(ˆexp)|Pr(),(

12

Empirical Gradient Estimationh Our first (naïve) approach to estimating () is to

simply compute empirical gradient estimatesh Recall that and

so we can compute the gradient by empirically estimating each partial derivative

h So for small we can estimate the partial derivatives by

h This requires estimating n+1 values:

)(),,,,,(lim)( 111

0

niii

i

n

)(,,)()(1

)(),,,,,( 111 niii

niniii ,...,1|),,,,,(),( 111

13

Empirical Gradient Estimationh How do we estimate the quantities

h For each set of parameters, simply execute the policy for N trials/episodes and average the values achieved across the trials

h This requires a total of N(n+1) episodes to get gradient estimate5 For stochastic environments and policies the value of N must be

relatively large to get good estimates of the true value5 Often we want to use a relatively large number of parameters5 Often it is expensive to run episodes of the policy

h So while this can work well in many situations, it is often not a practical approach computationally

niniii ,...,1|),,,,,(),( 111

14

Likelihood Ratio Gradient Estimation

h The empirical gradient method can be applied even when the functional form of the policy is a black box (i.e. don’t know mapping from to action distribution)

h If we know the functional form of the policy and can compute its gradient with respect to , we can do better.5 Possible to estimate () directly from trajectories of just the current

policy

h We will start with a general approach of likelihood ratio gradient estimation and then show how it applied to policy gradient.

15

General Likelihood Ratio Gradient Estimate

h Let F be a real-valued function over a finite domain D 5 Everything generalizes to continuous domains

h Let X be a random variable over D distributed according to P(x) 5 is the parameter vector of this distribution

h Consider the expectation of F(X) conditioned on

h We wish to estimate () given by:

Dx

xFxPXFE )()()()(

DxDx

xFxPxFxP )()()()()(

Often no closed form for sum(|D| can be huge)

16

h Rewriting

h So () is just the expected value of z(X)F(X) 5 Get unbiased estimate of () by averaging over N samples of X

xj is the j’th sample of X

h Only requires ability to sample X and to compute z(x)Does not depend on how big D is!

)()()()(log)(

)()()()(

)()()(

XFXzExFxPxP

xFxPxPxP

xFxP

Dx

Dx

Dx

z(x)

)()(1)(1

j

N

jj xFxz

N

General Likelihood Ratio Gradient Estimate

X∼𝑃 𝜃

h Define as sequence of H statesand H-1 actions generated for single episode of

5 In general H will differ across episodes.

h X is random due to policy and environment, distributed as:

h Define as the total reward of X

h is expected total reward of

5 We want to estimate the gradient of 5 Apply likelihood ratio method!

17

Application to Policy Gradient

),,,,,( 2211 HsasasX

1

1111 ),,(),(),,,(

H

ttttttH sasTassasP

H

ttsRXF

1

)()(

H

ttsREXFE

1

)()()(

h Recall, for random variable X we have unbiased estimate

h We can generate samples ofby running policy from the start state until a terminal state5 is i’th sampled episode

5 is sum of observed rewards during i’th episode

5

19


)()(1)(1

j

N

jj xFxz

N

),,,,,( 2211 HsasasX

),,,,,( ,2,2,1,1, Hiiiiii sasasx

H

tisRxF

1

)()(

1

1

1

11

1

11

),(log

),,(log),(log

),,(),(log)(log)(

H

ttt

H

tttttt

H

tttttt

as

sasTas

sasTasxPxz

Does not depend on knowing model!Allows model-free implementation.

h Recall, for random variable X we have unbiased estimate

h Consider a single term

h Since action at time t does not influence rewards before time t+1, we can derive the following result: (this is non-trivial to derive)

h This justifies using a modified computation for each term:

20


)()(1)(1

j

N

jj xFxz

N

Total reward after time t.

H

kk

H

ttt sRasxFxz

1

1

1

)(),(log)()(

H

tkk

H

ttt sRasExFxzE

1

1

1

)(),(log)()(

H

tkk

H

ttt sRas

1

1

1

)(),(log Estimate is still unbiased but generally has smaller variance.

h Putting everything together we get:

h Interpretation: each episode contributes weighted sum of gradient directions5 Gradient direction for increasing probability of aj,t in sj,t is weighted by

sum of rewards observed after taking aj,t in sj,t

h Intuitively this increases/decreases probability of taking actions that are typically followed by good/bad reward sequences

21


N

j

H

t

H

tkkjtjtj

j j

sRasN 1 1 1

,,, )(),(log1)(

Observed reward after taking ajt in state sjt

length of trajectory j

# of sampled trajectoriesof current policy

Direction to move parameters in order to increase the probability that policy selectsajt in state sjt

22

Basic Policy Gradient Algorithmh Repeat until stopping condition

1. Execute for N episodes to get set of state, action, reward sequences

2.

3.

h Unnecessary to store N episodes (use online mean estimate in step 2)

h Disadvantage: small # of updates per # episodes5 Also is not well defined for (non-episodic) infinite horizon problems

h Online policy gradient algorithms perform updates after each step in environment (often learn faster)

N

j

H

t

H

tkkjtjtj

j j

sRasN 1 1 1

,,, )(),(log1

23

Toward Online Algorithmh Consider the computation for a single episode

h Notice that we can compute zt in an online way

h We can now incrementally compute ΔT for each episode

Storage requirement depends only on # of policy parameters

H

ttt

H

t

t

kttt

H

t

H

tkkttH

zsR

assR

sRas

2

2

1

1

1 1

)(

),(log)(

)(),(log

),(log;0 11 tttt aszzz

1110 )(;0 tttt zsR

Just reorganize terms

24

Toward Online Algorithmh So the overall gradient estimate can be done by incrementally

computing ΔH for each of N episodes and computing their mean5 The mean of the ΔH across episodes can be computed online5 Total memory requirements still depends only on # parameters 5 Independent of length and number of episodes!

h But what if episodes go on forever? 5 We could continually maintain ΔT (using a constant amount of

memory) but we would never actually do a parameter update5 Also ΔT can have infinite variance in this setting (we will not show this)

h Solution:5 Update policy parameters after each reward is observed

(rather than simply update gradient estimate ΔT )5 Introduce discounting 5 Results in OLPOMDP algorithm

h Notice that we can compute zt in an online way

h We can now incrementally compute ΔT for each episode

Storage requirement is only # of parameters + 1

Online Policy Gradient (OLPOMDP)Repeat forever

1. Observe state s

2. Draw action a according to distribution (s)

3. Execute a and observe reward r4. ;; discounted sum of

;; gradient directions5.

h Performs policy update at each time step and executes indefinitely5 This is the OLPOMDP algorithm [Baxter & Bartlett, 2000]

),(log aszz

zr

InterpretationRepeat forever

1. Observe state s

2. Draw action a according to distribution (s)

3. Execute a and observe reward r4. ;; discounted sum of

;; gradient directions5.

h Step 4 computes an “eligibility trace” z5 Discounted sum of gradients over previous state-action pairs5 Points in direction of parameter space that increases probability of

taking more recent actions in more recent states

h For positive rewards step 5 will increase probability of recent actions and decrease for negative rewards.

zr

),(log aszz

27

Computing the Gradient of Policyh Both algorithms require computation of

h For the Boltzmann distribution with linear approximation we have:

where

h Here the partial derivatives composing the gradient are:

),(log as

Aa

asQasQas

'

)',(ˆexp),(ˆexp),(

),(...),(),(),(ˆ 22110 asfasfasfasQ nn

'

)',()',(),(),(loga

iii

asfasasfas

28

Controlling Helicoptersh Policy gradient techniques have been used to

create controllers for difficult helicopter maneuversh For example, inverted helicopter flight.

29

Quadruped Locomotion h Optimize gait of 4-legged robots over rough terrain

Proactive SecurityIntelligent Botnet Controller

• Used OLPOMDP to proactively discover maximally damaging botnet attacks in peer-to-peer networks

31

Policy Gradient Recaph When policies have much simpler representations

than the corresponding value functions, direct search in policy space can be a good idea5 Or if we already have a complex parametric controllers, policy gradient

allows us to focus on optimizing parameter settings

h For baseline algorithm the gradient estimates are unbiased (i.e. they will converge to the right value) but have high variance for large T5 Can require a large N to get reliable estimates

h OLPOMDP can trade-off bias and variance via the discount parameter and does not require notion of episode

h Can be prone to finding local maxima 5 Many ways of dealing with this, e.g. random restarts or

intelligent initialization.

RL for Large State Spaces: Policy Gradient

Documents

Transcript of RL for Large State Spaces: Policy Gradient