1 Reinforcement Learning: Learning Algorithms Csaba Szepesvári University of Alberta Kioloa,...

Post on 15-Jan-2016



Reinforcement Learning: Learning Algorithms

Csaba Szepesvári, University of Alberta

Kioloa, MLSS’08

Slides: http://www.cs.ualberta.ca/~szepesva/MLSS08/



Defining the problem(s)
Learning optimally
Learning a good policy
  Monte-Carlo
  Temporal Difference (bootstrapping)
  Batch – fitted value iteration and



The Learning Problem
The MDP is unknown, but the agent can interact with the system.
Goals:
  Learn an optimal policy
    Where do the samples come from?
      Samples are generated externally
      The agent interacts with the system to get the samples ("active learning")
    Performance measure: What is the performance of the policy obtained?
  Learn optimally: minimize regret while interacting with the system
    Performance measure: loss in rewards due to not using the optimal policy from the beginning
    Exploration vs. exploitation


Learning from Feedback

A protocol for prediction problems:
  $x_t$: situation (observed by the agent)
  $y_t \in Y$: value to be predicted
  $p_t \in Y$: predicted value (can depend on all past values $\Rightarrow$ learning!)
  $r_t(x_t, y_t, y)$: value of predicting $y$
  Loss of the learner: $\ell_t = r_t(x_t, y_t, y) - r_t(x_t, y_t, p_t)$
Supervised learning: the agent is told $y_t$ and $r_t(x_t, y_t, \cdot)$
  Regression: $r_t(x_t, y_t, y) = -(y - y_t)^2$, $\ell_t = (y_t - p_t)^2$
Full-information prediction problem: $\forall y \in Y$, $r_t(x_t, y)$ is communicated to the agent, but not $y_t$
Bandit (partial-information) problem: only $r_t(x_t, p_t)$ is communicated to the agent


Learning Optimally

Explore or exploit?
Bandit problems
  Simple schemes
  Optimism in the face of uncertainty (OFU)
  UCB
Learning optimally in MDPs with the OFU principle


Learning Optimally: Exploration vs. Exploitation

Two treatments
Unknown success probabilities
Goal: find the best treatment while losing few patients

Explore or exploit?


Exploration vs. Exploitation: Some Applications

Simple processes:
  Clinical trials
  Job shop scheduling (random jobs)
  What ad to put on a web-page
More complex processes (memory):
  Optimizing production
  Controlling an inventory
  Optimal investment
  Poker
  ..


Bernoulli Bandits

Payoff is 0 or 1

Arm 1: $R_1(1), R_2(1), R_3(1), R_4(1), \ldots$

Arm 2: $R_1(2), R_2(2), R_3(2), R_4(2), \ldots$




Some definitions

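Payoff is 0 or 1

Arm 1: $R_1(1), R_2(1), R_3(1), R_4(1), \ldots$

Arm 2: $R_1(2), R_2(2), R_3(2), R_4(2), \ldots$

Now: $t = 9$, $T_1(t-1) = 4$, $T_2(t-1) = 4$, $A_1 = 1$, $A_2 = 2$, …

Payoff of the best arm $k^*$ minus the payoff actually collected over $T$ rounds:

$\sum_{t=1}^{T} R_t(k^*) - \sum_{t=1}^{T} R_{T_{A_t}(t)}(A_t)$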




The Exploration/Exploitation Dilemma

Action values: $Q^*(a) = E[R_t(a)]$

Suppose you form estimates $Q_t(a) \approx Q^*(a)$

The greedy action at $t$ is: $A_t^* = \arg\max_a Q_t(a)$

Exploitation: when the agent chooses to follow $A_t^*$

Exploration: when the agent chooses to do something else

You can't exploit all the time; you can't explore all the time

You can never stop exploring; but you should always reduce exploring. Maybe.


Action-Value Methods

Methods that adapt action-value estimates and nothing else

How to estimate action values? Sample average:

$Q_t(a) = \frac{R_1(a) + \cdots + R_{T_t(a)}(a)}{T_t(a)}$

Claim: $\lim_{t\to\infty} Q_t(a) = Q^*(a)$ if $T_t(a) \to \infty$. Why??


$\varepsilon$-Greedy Action Selection

Greedy action selection: $A_t = A_t^* = \arg\max_a Q_t(a)$

$\varepsilon$-greedy: $A_t = A_t^*$ with probability $1 - \varepsilon$, a random action with probability $\varepsilon$

. . . the simplest way to "balance" exploration and exploitation
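The selection rule above can be sketched in a few lines of Python. This is only an illustration: the function name `epsilon_greedy`, the random tie-breaking, and the example estimates are assumptions, not part of the slides.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick an arm: uniform random w.p. epsilon, greedy w.p. 1 - epsilon."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))      # explore
    best = max(q_values)
    # exploit; ties among maximizers are broken at random (an assumption)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])

rng = random.Random(0)
q = [0.2, 0.8, 0.5]                              # hypothetical estimates Q_t(a)
picks = [epsilon_greedy(q, 0.1, rng) for _ in range(10_000)]
greedy_fraction = picks.count(1) / len(picks)
always_greedy = {epsilon_greedy(q, 0.0, rng) for _ in range(100)}
```

With three arms and $\varepsilon = 0.1$, the greedy arm should be chosen with probability $1 - \varepsilon + \varepsilon/3$, since the random draw can also land on it.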


10-Armed Testbed

$n = 10$ possible actions
Repeat 2000 times:
  $Q^*(a) \sim N(0, 1)$
  Play 1000 rounds: $R_t(a) \sim N(Q^*(a), 1)$


$\varepsilon$-Greedy Methods on the 10-Armed Testbed


Softmax Action Selection

Problem with $\varepsilon$-greedy: neglects action values

Softmax idea: grade action probabilities by estimated values.

Gibbs, or Boltzmann action selection, or exponential weights:

$P(A_t = a \mid H_t) = \frac{e^{Q_t(a)/\tau_t}}{\sum_b e^{Q_t(b)/\tau_t}}$

$\tau_t$ is the "computational temperature"
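A minimal sketch of the Boltzmann probabilities, assuming a fixed temperature `tau` and hypothetical action values. Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the probabilities.

```python
import math

def softmax_probs(q_values, tau):
    """Gibbs/Boltzmann action probabilities at temperature tau > 0."""
    m = max(q_values)                        # shift for numerical stability
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

q = [1.0, 2.0, 3.0]                          # hypothetical estimates Q_t(a)
probs_hot = softmax_probs(q, tau=100.0)      # high temperature: nearly uniform
probs_cold = softmax_probs(q, tau=0.01)      # low temperature: nearly greedy
```

A high temperature flattens the distribution towards uniform exploration, while a low temperature concentrates it on the greedy action.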


Incremental Implementation

Sample average:

$Q_t(a) = \frac{R_1(a) + \cdots + R_{T_t(a)}(a)}{T_t(a)}$

Incremental computation:

$Q_{t+1}(A_t) = Q_t(A_t) + \frac{1}{t+1}\,(R_{t+1} - Q_t(A_t))$

Common update rule form:

NewEstimate = OldEstimate + StepSize [Target – OldEstimate]
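The following sketch checks that the incremental rule reproduces the sample average for a single arm. It assumes the step size is one over the number of pulls of that arm so far; the reward list is made up for illustration.

```python
def incremental_mean(rewards):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate), StepSize = 1/n."""
    q, n = 0.0, 0
    for r in rewards:
        n += 1
        q += (r - q) / n     # no need to store past rewards
    return q

rewards = [1, 0, 1, 1, 0, 1]                 # hypothetical payoffs of one arm
q_incremental = incremental_mean(rewards)
q_batch = sum(rewards) / len(rewards)
```

The two values agree up to floating-point error, but the incremental form needs only constant memory per arm.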


UCB: Upper Confidence Bounds [Auer et al. ’02]
Principle: optimism in the face of uncertainty
Works when the environment is not adversarial
Assume rewards are in [0,1]. Let

$A_t = \arg\max_a \left\{ Q_t(a) + \sqrt{\frac{p \log(t)}{2\,T_t(a)}} \right\}$ $(p > 2)$

For a stationary environment with iid rewards, this algorithm is hard to beat!
Formally: regret in $T$ steps is $O(\log T)$
Improvement: estimate the variance, use it in place of $p$ [AuSzeMu ’07]
This principle can be used for achieving small regret in the full RL problem!
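As a rough illustration of the index rule, here is a sketch of UCB on a two-armed Bernoulli bandit. The arm means, the horizon, the choice `p = 2.1`, and the "play every arm once first" initialization are assumptions made for the example.

```python
import math
import random

def ucb_action(q, counts, t, p=2.1):
    """Arm maximizing Q_t(a) + sqrt(p * log(t) / (2 * T_t(a)))."""
    for a, n in enumerate(counts):
        if n == 0:
            return a                          # try each arm once (assumption)
    return max(range(len(q)),
               key=lambda a: q[a] + math.sqrt(p * math.log(t) / (2 * counts[a])))

def run_ucb(means, horizon, seed=0):
    rng = random.Random(seed)
    q = [0.0] * len(means)                    # sample-average estimates
    counts = [0] * len(means)                 # T_t(a): pulls per arm
    for t in range(1, horizon + 1):
        a = ucb_action(q, counts, t)
        r = 1.0 if rng.random() < means[a] else 0.0   # Bernoulli payoff
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]
    return counts

counts = run_ucb([0.2, 0.8], horizon=5000)
```

Because the exploration bonus shrinks as an arm is pulled more often, the suboptimal arm should end up with only a small share of the pulls.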


UCRL2: UCB Applied to RL [Auer, Jaksch & Ortner ’07]
Algorithm UCRL2($\delta$):
Phase initialization:
  Estimate mean model $p_0$ using maximum likelihood (counts)
  $C := \{\, p : \|p(\cdot|x,a) - p_0(\cdot|x,a)\| \le c\,|X| \log(|A|T/\delta) / N(x,a) \,\}$
  $p' := \arg\max_{p \in C} \rho^*(p)$, $\pi := \pi^*(p')$
  $N_0(x,a) := N(x,a)$, $\forall (x,a) \in X \times A$
Execution:
  Execute $\pi$ until some $(x,a)$ has been visited at least $N_0(x,a)$ times in this phase


UCRL2 Results
Def: diameter of an MDP $M$: $D(M) = \max_{x,y} \min_\pi E[\,T(x \to y; \pi)\,]$
Regret bounds
  Lower bound: $E[L_n] = \Omega\bigl((D\,|X|\,|A|\,T)^{1/2}\bigr)$
  Upper bounds:
    w.p. $1 - \delta/T$: $L_T \le O\bigl(D\,|X|\,(|A|\,T \log(|A|T/\delta))^{1/2}\bigr)$
    w.p. $1 - \delta$: $L_T \le O\bigl(D^2 |X|^2 |A| \log(|A|T/\delta)/\Delta\bigr)$
    $\Delta$ = performance gap between the best and second-best policy


Learning a Good Policy

Monte-Carlo methods
Temporal Difference methods
  Tabular case
  Function approximation
Batch learning


Learning a good policy
Model-based learning
  Learn $p$, $r$
  "Solve" the resulting MDP
Model-free learning
  Learn the optimal action-value function and (then) act greedily
  Actor-critic learning
  Policy gradient methods
Hybrid
  Learn a model and mix planning and a model-free method; e.g. Dyna


Monte-Carlo Methods [Singh & Sutton ’96]
Episodic MDPs!
Goal: learn $V^\pi(\cdot)$, where $V^\pi(x) = E_\pi[\sum_t \gamma^t R_t \mid X_0 = x]$
$(X_t, A_t, R_t)$: trajectory of $\pi$
Visits to a state:
  $f(x) = \min\{t \mid X_t = x\}$: first visit
  $E(x) = \{t \mid X_t = x\}$: every visit
Return: $S(t) = \gamma^0 R_t + \gamma^1 R_{t+1} + \ldots$
$K$ independent trajectories: $S^{(k)}, E^{(k)}, f^{(k)}$, $k = 1..K$
First-visit MC: average over $\{ S^{(k)}(f^{(k)}(x)) : k = 1..K \}$
Every-visit MC: average over $\{ S^{(k)}(t) : k = 1..K,\ t \in E^{(k)}(x) \}$
Claim: both converge to $V^\pi(\cdot)$
From now on $S_t = S(t)$
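To make the difference between the two averages concrete, here is a small sketch. The episode encoding (a list of states paired with a list of rewards) and the toy data are assumptions made for illustration.

```python
def returns_from(rewards, gamma):
    """S(t) = R_t + gamma * R_{t+1} + ... for every t, computed backwards."""
    s, out = 0.0, [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        s = rewards[t] + gamma * s
        out[t] = s
    return out

def mc_estimate(episodes, state, gamma, first_visit=True):
    """Average the returns following the first (or every) visit to `state`."""
    samples = []
    for states, rewards in episodes:
        rets = returns_from(rewards, gamma)
        visits = [t for t, x in enumerate(states) if x == state]
        if first_visit:
            visits = visits[:1]               # keep only f(x)
        samples.extend(rets[t] for t in visits)
    return sum(samples) / len(samples) if samples else None

# Two hypothetical trajectories: (visited states, rewards received)
episodes = [(["a", "b", "a"], [0.0, 0.0, 1.0]),
            (["b", "a"], [0.0, 2.0])]
first = mc_estimate(episodes, "a", gamma=1.0, first_visit=True)
every = mc_estimate(episodes, "a", gamma=1.0, first_visit=False)
```

In the first trajectory state "a" is visited twice, so the every-visit average uses two (correlated) returns from it while the first-visit average uses one.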


Learning to Control with MC
Goal: learn to behave optimally
Method: learn $Q^\pi(x,a)$, to be used in an approximate policy iteration (PI) algorithm
Idea/algorithm:
  Add randomness
    Goal: all actions are sampled eventually infinitely often
    e.g., $\varepsilon$-greedy or exploring starts
  Use the first-visit or the every-visit method to estimate $Q^\pi(x,a)$
  Update policy
    Once values converged
    .. or ..
    Always at the states visited


Monte-Carlo: Evaluation

Convergence rate: $\mathrm{Var}(S(0) \mid X = x)/N$
Advantages over DP:
  Learn from interaction with the environment
  No need for full models
  No need to learn about ALL states
  Less harm by Markovian violations (no bootstrapping)
Issue: maintaining sufficient exploration
  exploring starts, soft policies


Temporal Difference Methods [Samuel ’59], [Holland ’75], [Sutton ’88]
Every-visit Monte-Carlo: $V(X_t) \leftarrow V(X_t) + \alpha_t(X_t)\,(S_t - V(X_t))$
Bootstrapping:
  $S_t = R_t + \gamma S_{t+1}$
  $S_t' = R_t + \gamma V(X_{t+1})$
TD(0): $V(X_t) \leftarrow V(X_t) + \alpha_t(X_t)\,(S_t' - V(X_t))$
Value iteration: $V(X_t) \leftarrow E[S_t' \mid X_t]$
Theorem: Let $V_t$ be the sequence of functions generated by TD(0). Assume that $\forall x$, w.p.1, $\sum_t \alpha_t(x) = \infty$ and $\sum_t \alpha_t^2(x) < +\infty$. Then $V_t \to V^\pi$ w.p.1.
Proof: stochastic approximation: $V_{t+1} = T_t(V_t, V_t)$, $U_{t+1} = T_t(U_t, V^\pi) \to T V^\pi$. [Jaakkola et al. ’94, Tsitsiklis ’94, SzeLi99]
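A minimal tabular TD(0) sketch follows. To keep the outcome easy to verify by hand it uses a deterministic three-state chain with reward 1 per step and a constant step size; the chain, the `(x, r, x_next)` transition encoding, and the constants are all assumptions, and the theorem above concerns the more general stochastic case with decaying step sizes.

```python
def td0(episodes, gamma, alpha, n_states):
    """Tabular TD(0): V(X_t) += alpha * (R_t + gamma * V(X_{t+1}) - V(X_t))."""
    V = [0.0] * n_states
    for episode in episodes:
        for x, r, x_next in episode:
            # the value of the terminal state (encoded as None) is zero
            bootstrap = 0.0 if x_next is None else V[x_next]
            V[x] += alpha * (r + gamma * bootstrap - V[x])
    return V

# Hypothetical chain 0 -> 1 -> 2 -> terminal, reward 1 on every transition
chain = [(0, 1.0, 1), (1, 1.0, 2), (2, 1.0, None)]
V = td0([chain] * 200, gamma=0.5, alpha=0.5, n_states=3)
```

For this chain the true values are $V(2) = 1$, $V(1) = 1.5$ and $V(0) = 1.75$, which the estimates approach as the episode is replayed.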


TD or MC?
TD advantages:
  can be fully incremental, i.e., learn before knowing the final outcome
    Less memory
    Less peak computation
  learn without the final outcome
    From incomplete sequences
MC advantage:
  Less harm by Markovian violations
Convergence rate? $\mathrm{Var}(S(0) \mid X = x)$ decides!


Learning to Control with TD
Q-learning [Watkins ’90]:
  $Q(X_t,A_t) \leftarrow Q(X_t,A_t) + \alpha_t(X_t,A_t)\,\{R_t + \gamma \max_a Q(X_{t+1},a) - Q(X_t,A_t)\}$
  Theorem: converges to $Q^*$ [JJS’94, Tsi’94, SzeLi99]
SARSA [Rummery & Niranjan ’94]:
  $A_t \sim \mathrm{Greedy}_\varepsilon(Q, X_t)$
  $Q(X_t,A_t) \leftarrow Q(X_t,A_t) + \alpha_t(X_t,A_t)\,\{R_t + \gamma Q(X_{t+1},A_{t+1}) - Q(X_t,A_t)\}$
Off-policy (Q-learning) vs. on-policy (SARSA)
Expecti-SARSA
Actor-Critic [Witten ’77, Barto, Sutton & Anderson ’83, Sutton ’84]

[Figure: $\varepsilon$-greedy, $\varepsilon = 0.1$]
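The two update rules differ only in the bootstrap target, which a short sketch makes visible. The nested-dictionary table, the state and action names, and the numbers are hypothetical.

```python
def q_learning_update(Q, x, a, r, x_next, alpha, gamma):
    """Off-policy target: bootstrap from the best action in the next state."""
    target = r + gamma * max(Q[x_next].values())
    Q[x][a] += alpha * (target - Q[x][a])

def sarsa_update(Q, x, a, r, x_next, a_next, alpha, gamma):
    """On-policy target: bootstrap from the action actually taken next."""
    target = r + gamma * Q[x_next][a_next]
    Q[x][a] += alpha * (target - Q[x][a])

def fresh_table():
    # hypothetical action values for two states and two actions
    return {"s": {"l": 0.0, "r": 0.0}, "s2": {"l": 1.0, "r": 3.0}}

Q_ql, Q_sarsa = fresh_table(), fresh_table()
q_learning_update(Q_ql, "s", "l", 1.0, "s2", alpha=0.5, gamma=0.9)
sarsa_update(Q_sarsa, "s", "l", 1.0, "s2", "l", alpha=0.5, gamma=0.9)
```

Here the behaviour policy happened to pick the worse action "l" in the next state, so SARSA moves its estimate less than Q-learning does.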


N-step TD Prediction
Idea: look farther into the future when you do the TD backup (1, 2, 3, …, n steps)

Monte Carlo: $S_t = R_t + \gamma R_{t+1} + \ldots + \gamma^{T-t} R_T$

TD: $S_t^{(1)} = R_t + \gamma V(X_{t+1})$
  Use $V$ to estimate the remaining return

n-step TD:
  2-step return: $S_t^{(2)} = R_t + \gamma R_{t+1} + \gamma^2 V(X_{t+2})$
  n-step return: $S_t^{(n)} = R_t + \gamma R_{t+1} + \ldots + \gamma^n V(X_{t+n})$
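A small sketch of the n-step return for an episodic task. It assumes `rewards[k]` holds $R_k$, `values[k]` holds $V(X_k)$ with a final entry of 0 for the terminal state, and that a return reaching past the end of the episode is simply the Monte-Carlo return; the numbers are made up.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n discounted rewards starting at time t, then bootstrap from V."""
    T = len(rewards)                 # episode length
    end = min(t + n, T)              # truncate at termination
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    return g + gamma ** (end - t) * values[end]

rewards = [1.0, 2.0, 3.0]            # hypothetical R_0, R_1, R_2
values = [10.0, 20.0, 30.0, 0.0]     # hypothetical V(X_0..X_2), terminal value 0
one_step = n_step_return(rewards, values, 0, 1, gamma=0.5)
two_step = n_step_return(rewards, values, 0, 2, gamma=0.5)
mc_return = n_step_return(rewards, values, 0, 3, gamma=0.5)
```

With these numbers the 1-step return leans heavily on the (poor) value estimates, while the 3-step return ignores them entirely.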


Learning with n-step Backups

Learning with n-step backups: $V(X_t) \leftarrow V(X_t) + \alpha_t\,(S_t^{(n)} - V(X_t))$

$n$: controls how much to bootstrap


Random Walk Examples

How does 2-step TD work here? How about 3-step TD?


A Larger Example
Task: 19-state random walk

Do you think there is an optimal $n$? For everything?


Averaging N-step Returns

Idea: back up an average of several returns, e.g. back up half of the 2-step and half of the 4-step return:

$\bar S_t = \tfrac{1}{2} S_t^{(2)} + \tfrac{1}{2} S_t^{(4)}$

"Complex backup": still one backup



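Forward View of TD($\lambda$) [Sutton ’88]

Idea: average over multiple backups

$\lambda$-return: $S_t^{(\lambda)} = (1-\lambda) \sum_{n=0}^{\infty} \lambda^n S_t^{(n+1)}$

TD($\lambda$): $\Delta V(X_t) = \alpha_t\,(S_t^{(\lambda)} - V(X_t))$

Relation to TD(0) and MC:
  $\lambda = 0$: TD(0)
  $\lambda = 1$: MC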
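The $\lambda$-return can be computed directly for a finite episode. This sketch assumes that every n-step return reaching past termination equals the Monte-Carlo return, so the tail of the geometric series collapses into a single weight; the helper names and the numbers are illustrative only.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n discounted rewards from time t, then bootstrap (terminal value is 0)."""
    end = min(t + n, len(rewards))
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    return g + gamma ** (end - t) * values[end]

def lambda_return(rewards, values, t, lam, gamma):
    """(1 - lam) * sum_n lam^(n-1) * S_t^(n), with the tail mass on the MC return."""
    steps_left = len(rewards) - t
    total = 0.0
    for n in range(1, steps_left):
        total += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    # all returns with n >= steps_left coincide with the Monte-Carlo return
    total += lam ** (steps_left - 1) * n_step_return(rewards, values, t, steps_left, gamma)
    return total

rewards = [1.0, 2.0, 3.0]            # hypothetical R_0, R_1, R_2
values = [10.0, 20.0, 30.0, 0.0]     # hypothetical V(X_0..X_2), terminal value 0
td0_target = lambda_return(rewards, values, 0, lam=0.0, gamma=0.5)
mc_target = lambda_return(rewards, values, 0, lam=1.0, gamma=0.5)
mixed = lambda_return(rewards, values, 0, lam=0.5, gamma=0.5)
```

Setting `lam` to 0 or 1 recovers the TD(0) target and the Monte-Carlo return respectively, and intermediate values interpolate between them.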


$\lambda$-return on the Random Walk

Same 19-state random walk as before
Why intermediate values of $\lambda$ are



Backward View of TD($\lambda$) [Sutton ’88, Singh & Sutton ’96]

$\delta_t = R_t + \gamma V(X_{t+1}) - V(X_t)$

$V(x) \leftarrow V(x) + \alpha_t\, \delta_t\, e(x)$

$e(x) \leftarrow \gamma\lambda\, e(x) + I(x = X_t)$

Off-line updates: same as forward-view TD($\lambda$)
$e(x)$: eligibility trace
  Accumulating trace
  Replacing traces speed up convergence: $e(x) \leftarrow \max(\gamma\lambda\, e(x), I(x = X_t))$
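An on-line tabular sketch of the backward view, with both trace types. The trace is decayed and incremented before the value update (one common ordering, assumed here), traces are reset at the start of each episode, and the deterministic chain is a made-up test case whose true values are easy to compute by hand.

```python
def td_lambda(episodes, gamma, lam, alpha, n_states, replacing=False):
    """Tabular TD(lambda) with accumulating or replacing eligibility traces."""
    V = [0.0] * n_states
    for episode in episodes:
        e = [0.0] * n_states                      # traces reset per episode
        for x, r, x_next in episode:
            bootstrap = 0.0 if x_next is None else V[x_next]
            delta = r + gamma * bootstrap - V[x]  # TD error
            for y in range(n_states):
                e[y] *= gamma * lam               # decay all traces
            e[x] = max(e[x], 1.0) if replacing else e[x] + 1.0
            for y in range(n_states):
                V[y] += alpha * delta * e[y]      # credit recently visited states
    return V

# Hypothetical chain 0 -> 1 -> 2 -> terminal, reward 1 on every transition
chain = [(0, 1.0, 1), (1, 1.0, 2), (2, 1.0, None)]
V_td0 = td_lambda([chain] * 200, gamma=0.5, lam=0.0, alpha=0.5, n_states=3)
V_acc = td_lambda([chain] * 200, gamma=0.5, lam=0.8, alpha=0.5, n_states=3)
V_rep = td_lambda([chain] * 200, gamma=0.5, lam=0.8, alpha=0.5, n_states=3, replacing=True)
```

With `lam=0` only the current state has a non-zero trace, so the procedure reduces to TD(0); on this chain all three runs approach the true values 1.75, 1.5 and 1.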


Function Approximation with TD


Gradient Descent Methods


$\theta_t = (\theta_t(1), \ldots, \theta_t(n))^T$

Assume $V_t$ is a differentiable function of $\theta_t$:

$V_t(x) = V(x; \theta_t)$.

Assume, for now, training examples of the form:

$\{ (X_t, V^\pi(X_t)) \}$


Performance Measures
Many are applicable but… a common and simple one is the mean-squared error (MSE) over a distribution $P$:

$L(\theta) = \sum_{x \in X} P(x)\,(V^\pi(x) - V(x;\theta))^2$

Why $P$? Why minimize MSE?
Let us assume that $P$ is always the distribution of states at which backups are done.
The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.


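Gradient Descent
Let $L$ be any function of the parameters, e.g. $\theta_t = (\theta_t(1), \theta_t(2))^T$.

Its gradient at any point $\theta$ in this space is:

$\nabla_\theta L = \left( \frac{\partial L}{\partial \theta(1)}, \frac{\partial L}{\partial \theta(2)}, \ldots, \frac{\partial L}{\partial \theta(n)} \right)^T$

Iteratively move down the gradient:

$\theta_{t+1} = \theta_t - \alpha_t\, (\nabla_\theta L)\big|_{\theta = \theta_t}$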


Gradient Descent in RL

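Function to descend on:

$L(\theta) = \sum_{x \in X} P(x)\,(V^\pi(x) - V(x;\theta))^2$

$\nabla_\theta L(\theta) = -2 \sum_{x \in X} P(x)\,(V^\pi(x) - V(x;\theta))\, \nabla_\theta V(x;\theta)$

Gradient descent procedure (sampling $X_t \sim P$, the factor 2 absorbed into $\alpha_t$):

$\theta_{t+1} = \theta_t + \alpha_t\,(V^\pi(X_t) - V(X_t;\theta_t))\, \nabla_\theta V(X_t;\theta_t)$

Bootstrapping with $S_t'$:

$\theta_{t+1} = \theta_t + \alpha_t\,(S_t' - V(X_t;\theta_t))\, \nabla_\theta V(X_t;\theta_t)$

TD($\lambda$) (forward view):

$\theta_{t+1} = \theta_t + \alpha_t\,(S_t^{(\lambda)} - V(X_t;\theta_t))\, \nabla_\theta V(X_t;\theta_t)$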


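Linear Methods [Sutton ’84, ’88, Tsitsiklis & Van Roy ’97]
Linear function approximation (FAPP): $V(x;\theta) = \theta^T \phi(x)$, $\nabla_\theta V(x;\theta) = \phi(x)$
Tabular representation: $\phi(x)_y = I(x = y)$
Backward view:
  $\delta_t = R_t + \gamma V(X_{t+1}) - V(X_t)$
  $\theta \leftarrow \theta + \alpha_t\, \delta_t\, e$
  $e \leftarrow \gamma\lambda\, e + \nabla_\theta V(X_t;\theta)$
Theorem [TsiVaR’97]: $V_t$ converges to $V$ s.t. $\|V - V^\pi\|_{D,2} \le \|V^\pi - \Pi V^\pi\|_{D,2}/(1-\gamma)$.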


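Control with FA [Rummery & Niranjan ’94]

Learning state-action values
Training examples:

$\{ ((X_t, A_t),\ Q^*(X_t, A_t) + \text{noise}_t) \}$

The general gradient-descent rule:

$\theta_{t+1} = \theta_t + \alpha_t\,(S_t - Q(X_t,A_t;\theta_t))\, \nabla_\theta Q(X_t,A_t;\theta_t)$

Gradient-descent Sarsa($\lambda$):

$\theta_{t+1} = \theta_t + \alpha_t\, \delta_t\, e_t$, where

$\delta_t = R_t + \gamma Q(X_{t+1},A_{t+1};\theta_t) - Q(X_t,A_t;\theta_t)$

$e_t = \gamma\lambda\, e_{t-1} + \nabla_\theta Q(X_t,A_t;\theta_t)$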


Mountain-Car Task

[Sutton ’96], [Singh & Sutton ’96]


Mountain-Car Results


Baird’s Counterexample: Off-policy Updates Can Diverge

[Baird ’95]


Baird’s Counterexample Cont.


Should We Bootstrap?


Batch Reinforcement Learning


Batch RL
Goal: given the trajectory of the behavior policy $\pi_b$,

$X_1, A_1, R_1, \ldots, X_t, A_t, R_t, \ldots, X_N$

compute a good policy! ("Batch learning")
Properties:
  Data collection is not influenced
  Emphasis is on the quality of the solution
  Computational complexity plays a secondary role
Performance measures:
  $\|V^* - V^\pi\|_\infty = \sup_x |V^*(x) - V^\pi(x)| = \sup_x \,(V^*(x) - V^\pi(x))$
  $\|V^* - V^\pi\|_2^2 = \int (V^*(x) - V^\pi(x))^2\, d\mu(x)$


Solution methods

Build a model
Do not build a model, but find an approximation to $Q^*$
  using value iteration => fitted Q-iteration
  using policy iteration =>
    Policy evaluated by approximate value iteration
    Policy evaluated by Bellman-residual minimization (BRM)
    Policy evaluated by least-squares temporal difference learning (LSTD) => LSPI
Policy search

[Bradtke, Barto ’96], [Lagoudakis, Parr ’03], [AnSzeMu ’07]


Evaluating a policy: Fitted value iteration

Choose a function space $F$.
Solve for $i = 1, 2, \ldots, M$ the LS (regression) problems:

$Q_{i+1} = \arg\min_{Q \in F} \sum_t \bigl(R_t + \gamma Q_i(X_{t+1}, \pi(X_{t+1})) - Q(X_t, A_t)\bigr)^2$

Counterexamples?!?!? [Baird ’95, Tsitsiklis and van Roy ’96]

When does this work??
Requirement: if $M$ is big enough and the number of samples is big enough, $Q_M$ should be close to $Q^\pi$

We have to make some assumptions on $F$


Least-squares vs. gradient
Linear least squares (ordinary regression): $y_t = w_*^T x_t + \varepsilon_t$
  $(x_t, y_t)$ jointly distributed r.v.s, iid, $E[\varepsilon_t \mid x_t] = 0$.
  Seeing $(x_t, y_t)$, $t = 1, \ldots, T$, find out $w_*$.
  Loss function: $L(w) = E[(y_1 - w^T x_1)^2]$.
Least-squares approach: $w_T = \arg\min_w \sum_{t=1}^{T} (y_t - w^T x_t)^2$
Stochastic gradient method: $w_{t+1} = w_t + \alpha_t\,(y_t - w_t^T x_t)\, x_t$
Tradeoffs
  Sample complexity: how good is the estimate?
  Computational complexity: how expensive is the computation?
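The tradeoff can be seen on synthetic data. In the sketch below the true weights, the noise level, the uniform inputs, and the step-size schedule $\alpha_t = 3/(t+10)$ are all assumptions chosen for illustration; the least-squares fit solves the 2x2 normal equations in closed form, while the stochastic gradient method touches each sample once.

```python
import random

rng = random.Random(0)
w_star = [2.0, -1.0]                               # hypothetical true weights
data = []
for _ in range(5000):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    y = w_star[0] * x[0] + w_star[1] * x[1] + rng.gauss(0, 0.1)
    data.append((x, y))

def least_squares(data):
    """Solve the 2x2 normal equations (sum x x^T) w = sum x y by Cramer's rule."""
    a11 = sum(x[0] * x[0] for x, _ in data)
    a12 = sum(x[0] * x[1] for x, _ in data)
    a22 = sum(x[1] * x[1] for x, _ in data)
    b1 = sum(x[0] * y for x, y in data)
    b2 = sum(x[1] * y for x, y in data)
    det = a11 * a22 - a12 * a12
    return [(b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]

def sgd(data):
    """w <- w + alpha_t * (y_t - w^T x_t) * x_t with a decaying step size."""
    w = [0.0, 0.0]
    for t, (x, y) in enumerate(data):
        alpha = 3.0 / (t + 10)                     # assumed schedule
        err = y - (w[0] * x[0] + w[1] * x[1])
        w = [w[0] + alpha * err * x[0], w[1] + alpha * err * x[1]]
    return w

w_ls, w_sgd = least_squares(data), sgd(data)
```

Least squares needs to accumulate second-order statistics (quadratic cost in the dimension per sample), whereas each stochastic gradient step costs time linear in the dimension.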


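Fitted value iteration: Analysis (after [AnSzeMu ’07])
Goal: bound $\|Q_M - Q^\pi\|_\mu^2$ in terms of $\max_m \|\varepsilon_m\|_\nu^2$, where

$\|\varepsilon_m\|_\nu^2 = \int \varepsilon_m^2(x,a)\, \nu(dx,da)$, $Q_{m+1} = T^\pi Q_m + \varepsilon_m$, $\varepsilon_{-1} = Q_0 - Q^\pi$.

Notation:
  $\rho f = \int f(x)\, \rho(dx)$
  $(P f)(x) = \int f(y)\, P(dy|x)$

Let $U_m = Q_m - Q^\pi$. Then

$U_{m+1} = Q_{m+1} - Q^\pi$
$= T^\pi Q_m - Q^\pi + \varepsilon_m$
$= T^\pi Q_m - T^\pi Q^\pi + \varepsilon_m$
$= \gamma P^\pi U_m + \varepsilon_m$,

so that

$U_M = \sum_{m=0}^{M} (\gamma P^\pi)^{M-m}\, \varepsilon_{m-1}$.

By Jensen's inequality (applied to the weighted sum):

$\mu |U_M|^2 \le \left(\frac{1}{1-\gamma}\right)^2 \frac{1-\gamma}{1-\gamma^{M+1}} \sum_{m=0}^{M} \gamma^m\, \mu\bigl((P^\pi)^m \varepsilon_{M-m-1}\bigr)^2$

By Jensen applied to operators, $\mu \le C_1 \nu$ and $\forall \rho$: $\rho P^\pi \le C_1 \nu$:

$\le C_1 \left(\frac{1}{1-\gamma}\right)^2 \frac{1-\gamma}{1-\gamma^{M+1}} \sum_{m=0}^{M} \gamma^m\, \nu|\varepsilon_{M-m-1}|^2$

With $\nu|\varepsilon_m|^2 \le \varepsilon^2$ for $m \ge 0$:

$\le C_1 \left(\frac{1}{1-\gamma}\right)^2 \frac{1-\gamma}{1-\gamma^{M+1}} \left( \gamma^M\, \nu|\varepsilon_{-1}|^2 + \sum_{m=0}^{M-1} \gamma^m \varepsilon^2 \right)$

$\le C_1 \left(\frac{1}{1-\gamma}\right)^2 \varepsilon^2 + \frac{C_1\, \gamma^M\, \nu|\varepsilon_{-1}|^2}{(1-\gamma)(1-\gamma^{M+1})}$.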


Summary
If the regression errors are all small and the system is noisy ($\forall \pi, \rho$: $\rho P^\pi \le C_1 \nu$), then the final error will be small.

How to make the regression errors small?
Regression error decomposition:

$\|Q_{m+1} - T^\pi Q_m\|^2 \le \|Q_{m+1} - \Pi_F T^\pi Q_m\|^2 + \|\Pi_F T^\pi Q_m - T^\pi Q_m\|^2$

  First term: estimation error
  Second term: approximation error


Controlling the approximation error







[Figures: the function space $F$ and its image $TF$, with the distance $d_{p,\mu}(TF, F)$; the ball $B(X; R_{\max}/(1-\gamma))$ and its image $T\bigl(B(X; R_{\max}/(1-\gamma))\bigr)$]

Assume smoothness! $\mathrm{Lip}_\alpha(L)$


Learning with (lots of) historical data

Data: a long trajectory of some exploration policy

Goal: efficient algorithm to learn a policy
Idea: use fitted action-values
Algorithms:
  Bellman residual minimization, FQI [AnSzeMu ’07]
  LSPI [Lagoudakis, Parr ’03]
Bounds:
  Oracle inequalities (BRM, FQI and LSPI) $\Rightarrow$ consistency


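BRM insight [AnSzeMu ’07]
TD error: $\Delta_t = R_t + \gamma Q(X_{t+1}, \pi(X_{t+1})) - Q(X_t, A_t)$
Bellman error: $E[\, E[\Delta_t \mid X_t, A_t]^2 \,]$
What we can compute/estimate: $E[\, E[\Delta_t^2 \mid X_t, A_t] \,]$
They are different! However:

$E[\Delta_t \mid X_t, A_t]^2 = E[\Delta_t^2 \mid X_t, A_t] - \mathrm{Var}[\Delta_t \mid X_t, A_t]$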
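A tiny numeric illustration of why the two quantities differ. It assumes a single state-action pair whose TD error takes the values +1 and -1 with equal probability, which is what happens when $Q$ is exact but the next state is random; the numbers are hypothetical.

```python
from statistics import mean, pvariance

# Equally likely TD errors at one fixed (X_t, A_t): Q is exact, transitions are noisy
deltas = [1.0, -1.0]

bellman_error_sq = mean(deltas) ** 2          # E[Delta | X, A]^2, the quantity we want
mean_sq_td = mean(d * d for d in deltas)      # E[Delta^2 | X, A], what samples estimate
variance = pvariance(deltas)                  # Var[Delta | X, A]
```

The squared Bellman error is zero here, yet the mean squared TD error equals the variance of the transition noise, so minimizing the latter alone would penalize a correct $Q$.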


Loss function

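$L_{N,\pi}(Q, h) = \frac{1}{N} \sum_{t=1}^{N} w_t \Bigl\{ \bigl(R_t + \gamma Q(X_{t+1}, \pi(X_{t+1})) - Q(X_t, A_t)\bigr)^2 - \bigl(R_t + \gamma Q(X_{t+1}, \pi(X_{t+1})) - h(X_t, A_t)\bigr)^2 \Bigr\}$

$w_t = 1/\mu(A_t \mid X_t)$

$E[\Delta_t \mid X_t, A_t]^2 = E[\Delta_t^2 \mid X_t, A_t] - \mathrm{Var}[\Delta_t \mid X_t, A_t]$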


Algorithm (BRM++)

1. Choose $\pi_0$, $i := 0$

2. While ($i \le K$) do:

3. Let $Q_{i+1} = \arg\min_{Q \in F^A} \sup_{h \in F^A} L_{N,\pi_i}(Q, h)$

4. Let $\pi_{i+1}(x) = \arg\max_{a \in A} Q_{i+1}(x, a)$

5. $i := i + 1$


Do we need to reweight or throw away data?

NO! WHY?
  Intuition from regression: $m(x) = E[Y \mid X = x]$ can be learnt no matter what $p(x)$ is!
  $\pi^*(a|x)$: the same should be possible!
BUT..
  Performance might be poor! => YES!
  Like in supervised learning when training and test distributions are different



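$\|Q^* - Q^{\pi_K}\|_{2,\rho} \le \frac{2\gamma}{(1-\gamma)^2}\, C^{1/2} \Bigl( \tilde E(F) + E(F) + S_{N,x}^{1/2} \Bigr) + (2\gamma^K)^{1/2} R_{\max}$,

$S_{N,x} = c_2\, \frac{\Bigl( ([\ldots] + 1) \ln(N) + \ln(c_1) + \frac{1}{1+\kappa} \ln\bigl(\frac{b c_2^2}{4}\bigr) + x \Bigr)^{\frac{1+\kappa}{2\kappa}}}{(b^{1/\kappa} N)^{1/2}}$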


The concentration coefficients

Lyapunov exponents

$y_{t+1} = P_t y_t$

$\hat\gamma_{top} = \limsup_{t \to \infty} \frac{1}{t} \log^+(\|y_t\|_\infty)$

Our case:
  $y_t$ is infinite dimensional
  $P_t$ depends on the policy chosen
  If the top Lyapunov exponent is $\le 0$, we are good


Open question




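$f(i_1, \ldots, i_m) = \log(\|P_{i_1} P_{i_2} \cdots P_{i_m}\|)$, $i_k \in \{0,1\}$.

$f : \{0,1\}^* \to \mathbb{R}_+$, $f(x + y) \le f(x) + f(y)$, $\limsup_{m \to \infty} \frac{1}{m} f([x]_m) \le \beta$.

$\forall \{y_m\}_m$, $y_m \in \{0,1\}^m$: $\limsup_{m \to \infty} \frac{1}{m} \log f(y_m) \le \beta$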


Relation to LSTD

LSTD: linear function space; bootstrap the "normal equation"

$h^*(f) = \arg\min_{h \in F} \|h - Q f\|_n^2$

$Q_{LSTD} = \arg\min_{f \in F} \|f - h^*(f)\|_n^2$

$Q_{BRM} = \arg\min_{f \in F} \bigl( \|f - Q f\|_n^2 - \|h^*(f) - Q f\|_n^2 \bigr)$

$\|Q - Q f\|_n^2 = \|Q - h^*(Q)\|_n^2 + \|h^*(Q) - Q f\|_n^2$

[AnSzeMu ’07]


Open issues

Adaptive algorithms to take advantage of regularity when present to address the "curse of dimensionality"
  Penalized least-squares/aggregation?
  Feature relevance
  Factorization
  Manifold estimation
Abstraction – build automatically
Active learning
Optimal on-line learning for infinite problems


References
[Auer et al. ’02] P. Auer, N. Cesa-Bianchi and P. Fischer: Finite time analysis of the multiarmed bandit problem. Machine Learning, 47:235-256, 2002.
[AuSzeMu ’07] J.-Y. Audibert, R. Munos and Cs. Szepesvári: Tuning bandit algorithms in stochastic environments. ALT, 2007.
[Auer, Jaksch & Ortner ’07] P. Auer, T. Jaksch and R. Ortner: Near-optimal Regret Bounds for Reinforcement Learning, 2007. Available at http://www.unileoben.ac.at/~infotech/publications/ucrlrevised.pdf
[Singh & Sutton ’96] S.P. Singh and R.S. Sutton: Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123-158, 1996.
[Sutton ’88] R.S. Sutton: Learning to predict by the method of temporal differences. Machine Learning, 3:9-44, 1988.
[Jaakkola et al. ’94] T. Jaakkola, M.I. Jordan and S.P. Singh: On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185-1201, 1994.
[Tsitsiklis ’94] J.N. Tsitsiklis: Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185-202, 1994.
[SzeLi99] Cs. Szepesvári and M.L. Littman: A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms. Neural Computation, 11:2017-2059, 1999.
[Watkins ’90] C.J.C.H. Watkins: Learning from Delayed Rewards. PhD thesis, 1990.
[Rummery and Niranjan ’94] G.A. Rummery and M. Niranjan: On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994.
[Sutton ’84] R.S. Sutton: Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA, 1984.
[Tsitsiklis & Van Roy ’97] J.N. Tsitsiklis and B. Van Roy: An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674-690, 1997.
[Sutton ’96] R.S. Sutton: Generalization in reinforcement learning: Successful examples using sparse coarse coding. NIPS, 1996.
[Baird ’95] L.C. Baird: Residual algorithms: Reinforcement learning with function approximation. ICML, 1995.
[Bradtke, Barto ’96] S.J. Bradtke and A.G. Barto: Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33-57, 1996.
[Lagoudakis, Parr ’03] M. Lagoudakis and R. Parr: Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149, 2003.
[AnSzeMu ’07] A. Antos, Cs. Szepesvári and R. Munos: Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning Journal, 2007.