
Reinforcement Learning

Donglin Zeng, Department of Biostatistics, University of North Carolina


Introduction

- Unsupervised learning has no outcome (no feedback).
- Supervised learning has an outcome, so we know what to predict.
- Reinforcement learning lies in between: it has no explicit supervision, so it uses a reward system to learn the feature-outcome relationship.
- The crucial advantage of reinforcement learning is its non-greedy nature: we do not need to improve performance in the short term but rather optimize a long-term achievement.


RL terminology

- Reinforcement learning is a dynamic process where, at each step, a new decision rule or policy is updated based on new data and the reward system.
- Terminology used in reinforcement learning:
  – Agent: whoever carries out the learned decisions during the process (e.g., a robot in AI).
  – Action (A): a decision to be taken during the process.
  – State (S): environment variables that may interact with the action.
  – Reward (R): a value system to evaluate the action given the state.
- Note that (A, S, R) is time-step dependent, so we write $(A_t, S_t, R_t)$ to reflect time step $t$.


Reinforcement learning diagram


Maze example


Maze example (continued)


Maze example (continued)


Mountain car problem


RL Notation

- At time step $t$, the agent observes a state $S_t$ from a state space $\mathcal{S}$ and selects an action $A_t$ from an action space $\mathcal{A}$.
- The action and state together result in a transition to a new state $S_{t+1}$.
- Given $(A_t, S_t, S_{t+1})$, the agent receives an immediate reward
$$R_t = r_t(S_t, A_t, S_{t+1}) \in \mathbb{R},$$
where $r_t(\cdot,\cdot,\cdot)$ is called the immediate reward function.


RL mathematical formulation

- At time $t$, we assume a transition probability function from $(S_t = s, A_t = a)$ to $S_{t+1} = s'$: $p_t(s'|s,a) \ge 0$, $\int_{s'} p_t(s'|s,a)\,ds' = 1$.
- We also assume $A_t$ given $S_t$ follows a probability distribution: $\pi_t(a|s) \ge 0$, $\int_a \pi_t(a|s)\,da = 1$.
- A trajectory (training sample) $(s_1, a_1, s_2, \ldots, s_T, a_T, s_{T+1})$ is generated as follows (see the sketch after this list):
  – start from an initial state $s_1$ drawn from a probability distribution $p(s)$;
  – for $t = 1, 2, \ldots, T$ ($T$ is the total number of steps),
  – (a) $a_t$ is drawn from $\pi_t(\cdot|s_t)$,
  – (b) the next state $s_{t+1}$ is drawn from $p_t(\cdot|s_t, a_t)$.
- The problem is called finite horizon if $T < \infty$ and infinite horizon if $T = \infty$.
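The generative scheme above is easy to simulate. Below is a minimal sketch for a toy finite MDP; the state/action space sizes, transition matrix, and policy are illustrative choices, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T = 3, 2, 5

# Hypothetical ingredients: P[a, s, s'] = p(s'|s, a), pi[s, a] = pi_t(a|s)
# (held constant over t here), and a uniform initial distribution p(s).
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
pi = rng.dirichlet(np.ones(n_actions), size=n_states)
p0 = np.ones(n_states) / n_states

# Generate one trajectory (s_1, a_1, s_2, ..., s_T, a_T, s_{T+1}).
s = rng.choice(n_states, p=p0)                # s_1 ~ p(s)
trajectory = []
for t in range(T):
    a = rng.choice(n_actions, p=pi[s])        # (a) a_t ~ pi_t(.|s_t)
    s_next = rng.choice(n_states, p=P[a, s])  # (b) s_{t+1} ~ p_t(.|s_t, a_t)
    trajectory.append((s, a, s_next))
    s = s_next
print(trajectory)
```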


Goal of RL

- Define the return at time $t$ as
$$\sum_{j=t}^{T} \gamma^{j-t}\, r(S_j, A_j, S_{j+1}),$$
where $\gamma \in [0, 1)$ is called the discount factor (it discounts rewards far along the trajectory; see the sketch below).
- An action policy, $\pi = (\pi_1, \ldots, \pi_T)$, is a sequence of probability distribution functions, where $\pi_t$ is a probability distribution for $A_t$ given $S_t$.
- The goal of RL is to learn the optimal action decision, the policy $\pi^* = (\pi^*_1, \pi^*_2, \ldots, \pi^*_T)$, that maximizes the expected return
$$E_\pi\Big[\sum_{j=1}^{T} \gamma^{j-1}\, r(S_j, A_j, S_{j+1})\Big],$$
where $E_\pi(\cdot)$ means $A_t | S_t \sim \pi_t(\cdot|S_t)$.
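A return of this form is conveniently accumulated backward, since $\sum_{j=t}^{T} \gamma^{j-t} r_j = r_t + \gamma \sum_{j=t+1}^{T} \gamma^{j-(t+1)} r_j$. A small sketch (the reward values and $\gamma$ are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """Compute sum_{j=t}^{T} gamma^{j-t} r_j for rewards (r_t, ..., r_T)."""
    g = 0.0
    for r in reversed(rewards):  # backward recursion: g <- r + gamma * g
        g = r + gamma * g
    return g

# Example: rewards (1, 0, 2) with gamma = 0.9 give 1 + 0 + 0.81 * 2 = 2.62.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```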


Optimal policy

- RL aims to find the best action decision rules such that the average long-term reward is maximized when those rules are implemented.
- Note: $\pi^*$ is a function of states, so for any individual we only know what the action at time $t$ should be after observing the states at time $t$. This is related to so-called adaptive or dynamic decision making.


How is supervised learning framed in the RL context?

- We can imagine $S_t$ to be all data (both features and outcomes) collected by step $t$.
- Then $A_t$ is the prediction rule chosen from a class of prediction functions based on $S_t$ (it need not be a perfect prediction function and can even be a random one), so $\pi_t$ is the probabilistic selection of the prediction function at step $t$.
- Based on $(S_t, A_t)$, $S_{t+1}$ can be $S_t$ with additionally collected data, or $S_t$ with individual errors, or just $S_t$.
- $R_t$ is the prediction error evaluated on the data.
- The goal is to learn the best prediction rule; RL methods can help!


Two important concepts in RL

- State-action value function (SAV): the expected return from time $t$ onward given state $S_t = s$ and action $A_t = a$:
$$Q^\pi_t(s, a) = E_\pi\Big[\sum_{j=t}^{T} \gamma^{j-t}\, r_j(S_j, A_j, S_{j+1}) \,\Big|\, S_t = s, A_t = a\Big].$$
$Q^*_t(s, a) \equiv Q^{\pi^*}_t(s, a)$ is the optimal expected return at time $t$.
- State value function (SV): the expected return from time $t$ onward given state $S_t = s$:
$$V^\pi_t(s) = E_\pi\Big[\sum_{j=t}^{T} \gamma^{j-t}\, r_j(S_j, A_j, S_{j+1}) \,\Big|\, S_t = s\Big].$$
Similarly, $V^*_t(s) = V^{\pi^*}_t(s)$.
- Clearly, $V^\pi_t(s) = \int_a Q^\pi_t(s, a)\, \pi_t(a|s)\, da$ (illustrated in the sketch below).
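Both functions can be approximated by Monte Carlo rollouts, and the identity above follows by averaging $Q$ over the policy. A sketch on the toy MDP of the earlier trajectory example; time-homogeneous $p$, $\pi$, and $r$ are simplifying assumptions here:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, T, gamma = 3, 2, 5, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # p(s'|s,a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi(a|s)
R = rng.normal(size=(n_states, n_actions, n_states))              # r(s,a,s')

def rollout_return(s, a, t):
    """One sampled return sum_{j=t}^{T} gamma^{j-t} r(S_j, A_j, S_{j+1})."""
    g, disc = 0.0, 1.0
    for j in range(t, T + 1):
        s_next = rng.choice(n_states, p=P[a, s])
        g += disc * R[s, a, s_next]
        disc *= gamma
        s = s_next
        a = rng.choice(n_actions, p=pi[s])
    return g

# Q_t(s0, a) estimated by averaging rollouts; V_t(s0) = sum_a pi(a|s0) Q_t(s0, a).
s0, t0, n_mc = 0, 1, 5000
Q_hat = np.array([np.mean([rollout_return(s0, a, t0) for _ in range(n_mc)])
                  for a in range(n_actions)])
V_hat = float(pi[s0] @ Q_hat)
print(Q_hat, V_hat)
```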


RL methods

- Reinforcement learning methods mostly fall into two groups:
  – (policy iteration) model-based or learning methods that approximate the SAV;
  – (policy search) model-based or learning methods that directly maximize the SV to estimate $\pi^*$.


Policy iteration for value function approximation

- The Bellman equation for the SV:
$$V^\pi_t(s) = E_\pi\big[r_t(s, A_t, S_{t+1}) + \gamma V^\pi_{t+1}(S_{t+1}) \,\big|\, S_t = s\big] = \int_{s'}\!\int_a \big[r_t(s, a, s') + \gamma V^\pi_{t+1}(s')\big]\, \pi_t(a|s)\, p_t(s'|s, a)\, da\, ds'.$$
- The Bellman equation for the SAV:
$$Q^\pi_t(s, a) = E_\pi\big[r_t(s, a, S_{t+1}) + \gamma Q^\pi_{t+1}(S_{t+1}, A_{t+1}) \,\big|\, S_t = s, A_t = a\big] = \int_{s'}\!\int_{a'} \big[r_t(s, a, s') + \gamma Q^\pi_{t+1}(s', a')\big]\, \pi_{t+1}(a'|s')\, p_t(s'|s, a)\, da'\, ds'.$$


Value function learning for finite horizon

- For finite $T$, the Bellman equations suggest a backward procedure to evaluate the value function associated with a particular policy:
  – start from time $T$: we can learn $Q^\pi_T(s, a) = E[R_T | S_T = s, A_T = a]$;
  – at time $T - 1$, we learn $Q^\pi_{T-1}(s, a)$ as
$$E\big[R_{T-1} + \gamma Q^\pi_T(S_T, A_T) \,\big|\, S_{T-1} = s, A_{T-1} = a\big];$$
  – ...
  – we continue the learning backwards until time 1.


Optimal policy learning for finite horizon (Q-learning)

- Start from time $T$: we learn $Q^\pi_T(s, a) = E[R_T | S_T = s, A_T = a]$ and take $\pi^*_T(s)$ to put probability 1 on $a^* = \arg\max_a Q^\pi_T(s, a)$.
- At time $T - 1$, we learn $Q^{\pi^*}_{T-1}(s, a)$ as
$$E\big[R_{T-1} + \gamma \max_{a'} Q^{\pi^*}_T(S_T, a') \,\big|\, S_{T-1} = s, A_{T-1} = a\big].$$
We obtain $\pi^*_{T-1}$ as the policy putting probability 1 on $a^* = \arg\max_a Q^{\pi^*}_{T-1}(s, a)$.
- We perform the same learning procedure backwards until time 1 to learn all the optimal policies (a sketch follows below).
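When the transition and reward models are known, this backward recursion can be written directly. A tabular sketch, assuming for brevity a time-homogeneous $p(s'|s,a)$ and $r(s,a,s')$ (the slides allow these to vary with $t$):

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, T, gamma = 3, 2, 5, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # p(s'|s,a)
R = rng.normal(size=(n_states, n_actions, n_states))              # r(s,a,s')

# Expected one-step reward: E[r(s, a, S') | s, a] = sum_{s'} p(s'|s,a) r(s,a,s').
r_bar = np.einsum('asn,san->sa', P, R)

Q = np.zeros((T + 2, n_states, n_actions))       # Q[t] holds Q*_t; Q[T+1] = 0
pi_star = np.zeros((T + 1, n_states), dtype=int)
for t in range(T, 0, -1):                        # backward from T to 1
    # Q*_t(s,a) = E[R_t + gamma * max_{a'} Q*_{t+1}(S_{t+1}, a') | s, a]
    V_next = Q[t + 1].max(axis=1)
    Q[t] = r_bar + gamma * np.einsum('asn,n->sa', P, V_next)
    pi_star[t] = Q[t].argmax(axis=1)             # optimal action at time t
print(pi_star[1:])
```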


Value function learning for infinite horizon

- When $T = \infty$ or $T$ is large, the backward Q-learning method may not be applicable.
- The salvage is to take advantage of the stability of the process when $t$ is large, so we can assume the following Markov decision process (MDP):
  – MDP assumes that the state and action spaces are constant over time;
  – MDP assumes $p_t(s'|s, a)$ to be independent of $t$;
  – the reward function $r_t(s, a, s')$ is independent of $t$.
- The MDP assumption is plausible for a long horizon and after a certain number of steps.


Bellman equations under MDP ($T = \infty$)

- Under the MDP assumptions, $Q^\pi_t(s, a) = Q^\pi(s, a)$ and $V^\pi_t(s) = V^\pi(s)$.
- The Bellman equations become (see the sketch below for a tabular solve)
$$V^\pi(s) = E_\pi\big[r(s, A_t, S_{t+1}) + \gamma V^\pi(S_{t+1}) \,\big|\, S_t = s\big],$$
$$Q^\pi(s, a) = E_\pi\big[r(s, a, S_{t+1}) + \gamma Q^\pi(S_{t+1}, A_{t+1}) \,\big|\, S_t = s, A_t = a\big].$$
- Derived equations for the optimal policy:
$$V^{\pi^*}(s) = \max_a Q^{\pi^*}(s, a),$$
$$Q^{\pi^*}(s, a) = E_{\pi^*}\big[r(s, a, S_{t+1}) + \gamma V^{\pi^*}(S_{t+1}) \,\big|\, S_t = s, A_t = a\big],$$
$$\pi^*(s) \sim I\big\{a = \arg\max_a Q^{\pi^*}(s, a)\big\}.$$
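In the tabular case the stationary Bellman equation for $Q^\pi$ is linear in the $|\mathcal{S}||\mathcal{A}|$ unknowns, so policy evaluation can be done with a single linear solve. A sketch with made-up $P$, $R$, and $\pi$:

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nA, nS))   # p(s'|s,a)
R = rng.normal(size=(nS, nA, nS))               # r(s,a,s')
pi = rng.dirichlet(np.ones(nA), size=nS)        # pi(a|s)

def evaluate_policy(pi):
    """Solve Q = r_bar + gamma * M Q, where M couples (s,a) to (s',a')."""
    r_bar = np.einsum('asn,san->sa', P, R).reshape(nS * nA)
    # M[(s,a), (s',a')] = p(s'|s,a) * pi(a'|s')
    M = np.einsum('asn,nb->sanb', P, pi).reshape(nS * nA, nS * nA)
    Q = np.linalg.solve(np.eye(nS * nA) - gamma * M, r_bar).reshape(nS, nA)
    V = (pi * Q).sum(axis=1)   # V(s) = sum_a pi(a|s) Q(s,a)
    return Q, V

Q, V = evaluate_policy(pi)
print(Q, V)
```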


Policy iteration procedure

- Start from a policy $\pi$.
- Policy evaluation: evaluate $Q^\pi(s, a)$ and thus $V^\pi(s)$.
- Policy improvement: update $\pi(a|s)$ to $I(a = a_\pi(s))$, where $a_\pi(s)$ is the action maximizing $Q^\pi(s, a)$.
- Iterate between the policy evaluation and policy improvement steps (a tabular sketch follows below).
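Combining the exact evaluation step above with greedy improvement gives a minimal tabular policy iteration; this sketch reuses evaluate_policy and the toy MDP from the previous block:

```python
def policy_iteration(max_iter=100):
    """Greedy policy iteration on the toy MDP defined above (a sketch)."""
    pi_cur = np.ones((nS, nA)) / nA            # start from a uniform policy
    for _ in range(max_iter):
        Q, _ = evaluate_policy(pi_cur)         # policy evaluation
        greedy = Q.argmax(axis=1)              # a_pi(s) = argmax_a Q(s, a)
        pi_new = np.eye(nA)[greedy]            # improvement: point mass at a_pi(s)
        if np.allclose(pi_new, pi_cur):        # policy unchanged: converged
            break
        pi_cur = pi_new
    return pi_cur, Q

pi_opt, Q_opt = policy_iteration()
print(pi_opt.argmax(axis=1))
```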


Soft policy iteration procedure

- A deterministic policy update may be too greedy if the initial policy is far from the optimal one.
- Softer policy updates include (see the sketch after this list):
  – $\pi(a|s) \propto \exp\{Q^\pi(s, a)/\tau\}$;
  – ($\varepsilon$-greedy policy improvement) $\pi(a|s)$ has probability $1 - \varepsilon + \varepsilon/m$ at $a = a_\pi(s)$ and probability $\varepsilon/m$ at the other $a$'s, where $m$ is the number of possible actions.
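Both updates are one-liners on a tabular $Q$; the function names and the max-shift in the softmax (a standard numerical-stability trick) are my additions:

```python
import numpy as np

def softmax_policy(Q, tau=1.0):
    """pi(a|s) proportional to exp{Q(s,a)/tau}, with temperature tau."""
    z = np.exp((Q - Q.max(axis=1, keepdims=True)) / tau)  # stabilized exp
    return z / z.sum(axis=1, keepdims=True)

def eps_greedy_policy(Q, eps=0.1):
    """Probability 1 - eps + eps/m on the greedy action, eps/m elsewhere."""
    nS, m = Q.shape
    pi = np.full((nS, m), eps / m)
    pi[np.arange(nS), Q.argmax(axis=1)] += 1.0 - eps
    return pi
```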


Estimation of state-value function

- One challenge in policy iteration is how to estimate $Q^\pi(s, a)$.
- This requires statistical modeling or learning algorithms.
- Parametric/semiparametric models for $Q^\pi(s, a)$ are commonly used.


Least-squares policy iteration

- We assume
$$Q^\pi(s, a) = \sum_{b=1}^{B} \theta_b\, \phi_b(s, a),$$
where $\phi_b(s, a)$ is a sequence of basis functions.
- In other words, the policy is indirectly represented by the $\theta_b$'s.
- From the Bellman equation, we note that
$$R_t - \big(Q^\pi(S_t, A_t) - \gamma E_\pi[Q^\pi(S_{t+1}, A_{t+1}) \mid S_t, A_t]\big) = R_t - \theta^T \psi(S_t, A_t)$$
has mean zero given $(S_t, A_t)$ under policy $\pi$, where $\psi(s, a) = \phi(s, a) - \gamma E_\pi[\phi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$.


Numerical implementation of least-squares policy iteration

- Suppose we have data from $n$ subjects, each with a training sample of $T$ steps, or $n$ training $T$-step samples from the same agent:
$$(S_{i1}, A_{i1}, S_{i2}, \ldots, S_{iT}, A_{iT}, S_{i,T+1}).$$
- We estimate $\psi_b(s, a)$ by
$$\hat\psi_b(s, a) = \phi_b(s, a) - \gamma\, \frac{\sum_{i=1}^n \sum_{t=1}^T I(S_{it} = s, A_{it} = a)\, E_\pi[\phi_b(S_{i,t+1}, A_{i,t+1})]}{\sum_{i=1}^n \sum_{t=1}^T I(S_{it} = s, A_{it} = a)}.$$
- We perform the least-squares estimation
$$\min_\theta \frac{1}{nT} \sum_{i=1}^n \sum_{t=1}^T I(A_{it}|S_{it} \sim \pi)\, \big[\theta^T \hat\psi(S_{it}, A_{it}) - R_{it}\big]^2,$$
where $A_{it}|S_{it} \sim \pi$ means that the data on $A_{it}$ were obtained by following the policy (a sketch of this step is given below).
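The least-squares step reduces to an ordinary linear regression of the rewards on the features $\hat\psi$. A sketch for discrete states in which, instead of the within-cell average above, the conditional expectation is approximated at the observed next state (a common simplification); phi, pi, and the data arrays are assumed inputs:

```python
import numpy as np

def lspi_theta(S, A, S_next, Rwd, phi, pi, n_actions, gamma=0.9):
    """One least-squares evaluation step of LSPI (a sketch).

    S, A, S_next, Rwd : pooled transitions (s_it, a_it, s_{i,t+1}, R_it)
                        collected while following the policy pi
    phi(s, a)         : length-B feature vector
    pi[s, a]          : the policy being evaluated (discrete states/actions)
    """
    # psi_hat(s, a) = phi(s, a) - gamma * E_pi[phi(S', A') | s, a], with the
    # conditional expectation approximated at the observed next state.
    Psi = np.stack([
        phi(s, a) - gamma * sum(pi[sn, ap] * phi(sn, ap)
                                for ap in range(n_actions))
        for s, a, sn in zip(S, A, S_next)
    ])
    theta, *_ = np.linalg.lstsq(Psi, np.asarray(Rwd), rcond=None)
    return theta   # estimated Q: Q_hat(s, a) = theta @ phi(s, a)
```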


More on numerical implementation

I Regularization may be introduced to have a more sparsesolution.

I L2-minimization can be replaced by L1-minimization togain robustness.

I Choice of basis functions: radial basis function wherekernel function can be the usual Gaussian kernel (onepossible definition of d(s, s′) is the shortest path from s to s′

in the graph defined by transition probabilities).


Robot-Arm control example


Robot-Arm control example (continued)


Robot-Arm control example (continued)


Off-policy estimation

- In the previous derivation, we essentially estimate
$$E_\pi\Big[\sum_{t=1}^T \big(\theta^T \psi(S_t, A_t) - R_t\big)^2\Big]$$
using a history sample $(S_t, A_t)$ that follows the target policy $\pi$.
- This is called on-policy reinforcement learning.
- However, not every policy has been seen in the history sample.
- An alternative method is to use importance sampling: if the history sample follows a behavior policy $\tilde\pi$, then
$$E_\pi\Big[\sum_{t=1}^T \big(\theta^T \psi(S_t, A_t) - R_t\big)^2\Big] = E_{\tilde\pi}\Big[\sum_{t=1}^T \big(\theta^T \psi(S_t, A_t) - R_t\big)^2\, w_t\Big],$$
where (see the sketch below)
$$w_t = \frac{\prod_{j=1}^t \pi(A_j|S_j)}{\prod_{j=1}^t \tilde\pi(A_j|S_j)}.$$
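The weights telescope, so they can be computed as a cumulative product of per-step likelihood ratios. A sketch for discrete states and actions (the tabular representation of the two policies is an assumption):

```python
import numpy as np

def importance_weights(S, A, pi_target, pi_behavior):
    """w_t = prod_{j<=t} pi(A_j|S_j) / prod_{j<=t} pi_tilde(A_j|S_j)."""
    ratios = np.array([pi_target[s, a] / pi_behavior[s, a]
                       for s, a in zip(S, A)])
    return np.cumprod(ratios)   # returns (w_1, w_2, ..., w_T)
```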


Off-policy iteration: more

- We need one assumption: the behavior policy $\tilde\pi$ in the history sample satisfies
$$\tilde\pi(a|s) > 0, \quad \forall (a, s).$$
- Adaptive importance weighting replaces $w_t$ by $w_t^\nu$ and chooses $\nu$ via cross-validation.
- When the history sample contains multiple behavior policies, we can obtain the importance-weighted estimate with respect to each policy and aggregate the estimates (sample-reuse policy iteration).


Mountain car example

- Action space: force applied to the car, $\{0.2, -0.2, 0\}$.
- State space: $(x, \dot x)$, where $x$ is the horizontal position ($\in [-1.2, 0.5]$) and $\dot x$ is the velocity ($\in [-1.5, 1.5]$).
- Transition (simulated in the sketch below):
$$x_{t+1} = x_t + \dot x_{t+1}\, \delta t, \qquad \dot x_{t+1} = \dot x_t + \big(-9.8\, w \cos(3 x_t) + a_t/w - k \dot x_t\big)\, \delta t,$$
where $w$ is the mass (0.2 kg), $k$ is the friction coefficient (0.3), and $\delta t$ is 0.1 second.
- Reward:
$$r(s, a, s') = \begin{cases} 1 & x_{s'} \ge 0.5, \\ -0.01 & \text{otherwise}, \end{cases}$$
where $x_{s'}$ is the position component of $s'$.
- Policy iteration uses kernels with centers at $\{-1.2, 0.35, 0.5\} \times \{-1.5, -0.5, 0.5, 1.5\}$ and $\sigma = 1$.
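The dynamics above translate directly into a one-step simulator. In this sketch, clipping the state to the stated position and velocity ranges is my assumption (the slide only gives the ranges of the state space):

```python
import numpy as np

W, K, DT = 0.2, 0.3, 0.1     # mass (kg), friction coefficient, time step (s)

def mountain_car_step(x, xdot, a):
    """One transition of the mountain car dynamics from the slide."""
    xdot_new = xdot + (-9.8 * W * np.cos(3 * x) + a / W - K * xdot) * DT
    xdot_new = float(np.clip(xdot_new, -1.5, 1.5))        # velocity range
    x_new = float(np.clip(x + xdot_new * DT, -1.2, 0.5))  # position range
    reward = 1.0 if x_new >= 0.5 else -0.01
    return x_new, xdot_new, reward

# Example: apply the forward force 0.2 from near the valley floor.
print(mountain_car_step(-0.5, 0.0, 0.2))
```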


Experiment results


Experiment results


Direct Policy Search

- The direct policy search approach aims to find the policy maximizing the expected return.
- Suppose we model the policy as $\pi(a|s; \theta)$.
- The expected return under $\pi$ is given by
$$J(\theta) = \int p(s_1) \prod_{t=1}^{T} \pi(a_t|s_t; \theta)\, p(s_{t+1}|s_t, a_t) \left\{\sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t, s_{t+1})\right\} ds_1\, da_1 \cdots ds_T\, da_T\, ds_{T+1}.$$
- We optimize $J(\theta)$ to find the optimal $\theta$.
- A gradient approach can be adopted for the optimization (see the sketch below).
- EM-based policy search can be used for the optimization.
- Importance sampling can be used for evaluating $J(\theta)$.
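Because $\nabla_\theta J(\theta) = E\big[\big(\sum_t \nabla_\theta \log \pi(a_t|s_t;\theta)\big) \sum_t \gamma^{t-1} r_t\big]$ by the score-function identity, the gradient can be estimated from sampled trajectories. A sketch with a tabular softmax policy on the toy MDP used earlier; this is a REINFORCE-style estimator, one of several possible gradient approaches, with a uniform initial distribution assumed:

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, T, gamma = 3, 2, 5, 0.9
P = rng.dirichlet(np.ones(nS), size=(nA, nS))   # p(s'|s,a)
R = rng.normal(size=(nS, nA, nS))               # r(s,a,s')

def pi_theta(theta, s):
    """Softmax policy pi(a|s; theta) with one parameter per (s, a) pair."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_J(theta, n_traj=2000):
    """Monte Carlo estimate of grad J(theta) via the score-function identity."""
    grad = np.zeros_like(theta)
    for _ in range(n_traj):
        s = rng.integers(nS)                    # s_1 ~ uniform p(s) (assumed)
        score, ret, disc = np.zeros_like(theta), 0.0, 1.0
        for t in range(T):
            p = pi_theta(theta, s)
            a = rng.choice(nA, p=p)
            score[s] -= p                       # grad log pi: indicator - pi
            score[s, a] += 1.0
            s_next = rng.choice(nS, p=P[a, s])
            ret += disc * R[s, a, s_next]
            disc *= gamma
            s = s_next
        grad += score * ret / n_traj
    return grad

theta = np.zeros((nS, nA))
theta += 0.1 * grad_J(theta)   # one gradient-ascent step on J(theta)
```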


Alternative methods

- Modelling the transition probability functions.
- Active policy iteration (active learning):
  – update the sampling policy actively.
