1
Reinforcement Learning: Learning Algorithms
Csaba Szepesvári, University of Alberta
Kioloa, MLSS’08
Slides: http://www.cs.ualberta.ca/~szepesva/MLSS08/
2
Contents
Defining the problem(s)
Learning optimally
Learning a good policy
  Monte-Carlo
  Temporal Difference (bootstrapping)
  Batch – fitted value iteration and relatives
3
The Learning Problem
The MDP is unknown, but the agent can interact with the system
Goals:
  Learn an optimal policy
    Where do the samples come from?
      Samples are generated externally
      The agent interacts with the system to get the samples ("active learning")
    Performance measure: What is the performance of the policy obtained?
  Learn optimally: Minimize regret while interacting with the system
    Performance measure: loss in rewards due to not using the optimal policy from the beginning
    Exploration vs. exploitation
4
Learning from Feedback
A protocol for prediction problems:
  x_t – situation (observed by the agent)
  y_t ∈ Y – value to be predicted
  p_t ∈ Y – predicted value (can depend on all past values ⇒ learning!)
  r_t(x_t, y_t, y) – value of predicting y
  loss of learner: ℓ_t = r_t(x_t, y_t, y_t) − r_t(x_t, y_t, p_t)
Supervised learning: the agent is told y_t and r_t(x_t, y_t, ·)
  Regression: r_t(x_t, y_t, y) = −(y − y_t)², so ℓ_t = (y_t − p_t)²
Full information prediction problem: ∀ y ∈ Y, r_t(x_t, y) is communicated to the agent, but not y_t
Bandit (partial information) problem: only r_t(x_t, p_t) is communicated to the agent
5
Learning Optimally
Explore or exploit? Bandit problems
Simple schemes
Optimism in the face of uncertainty (OFU): UCB
Learning optimally in MDPs with the OFU principle
6
Learning Optimally: Exploration vs. Exploitation
Two treatments
Unknown success probabilities
Goal: find the best treatment while losing few patients
Explore or exploit?
7
Exploration vs. Exploitation: Some Applications
Simple processes: clinical trials, job-shop scheduling (random jobs), what ad to put on a web page
More complex processes (memory): optimizing production, controlling an inventory, optimal investment, poker, …
8
Bernoulli Bandits
Payoff is 0 or 1
Arm 1: R_1(1), R_2(1), R_3(1), R_4(1), …
Arm 2: R_1(2), R_2(2), R_3(2), R_4(2), …
[Figure: example 0/1 payoff sequences observed for the two arms]
9
Some definitions
Payoff is 0 or 1
Arm 1: R_1(1), R_2(1), R_3(1), R_4(1), …
Arm 2: R_1(2), R_2(2), R_3(2), R_4(2), …
Now: t = 9, T_1(t−1) = 4, T_2(t−1) = 4, A_1 = 1, A_2 = 2, …
Regret after T rounds:
  L̂_T := Σ_{t=1}^{T} R_t(k*) − Σ_{t=1}^{T} R_{T_{A_t}(t)}(A_t)
10
The Exploration/Exploitation Dilemma
Action values: Q*(a) = E[R_t(a)]
Suppose you form estimates Q_t(a) ≈ Q*(a)
The greedy action at time t is: A*_t = argmax_a Q_t(a)
Exploitation: when the agent chooses to follow A*_t
Exploration: when the agent chooses to do something else
You can't exploit all the time; you can't explore all the time
You can never stop exploring; but you should always reduce exploring. Maybe.
11
Action-Value Methods
Methods that adapt action-value estimates and nothing else
How to estimate action-values?
Sample average:
  Q_t(a) = ( R_1(a) + … + R_{T_t(a)}(a) ) / T_t(a)
Claim: if T_t(a) → ∞, then lim_{t→∞} Q_t(a) = Q*(a)
Why??
12
ε-Greedy Action Selection
Greedy action selection:
  A_t = A*_t = argmax_a Q_t(a)
ε-Greedy:
  A_t = A*_t with probability 1 − ε, a random action with probability ε
… the simplest way to "balance" exploration and exploitation
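A minimal sketch of the rule above (not from the slides): `Q` is a list of current action-value estimates Q_t(a); the names are illustrative.

```python
import random

def epsilon_greedy(Q, epsilon):
    """Pick argmax_a Q[a] with probability 1-epsilon, a uniformly random arm otherwise."""
    if random.random() < epsilon:
        return random.randrange(len(Q))                 # explore
    return max(range(len(Q)), key=lambda a: Q[a])       # exploit: greedy action A*_t
```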
13
10-Armed Testbed
n = 10 possible actions
Repeat 2000 times:
  Q*(a) ~ N(0, 1)
  Play 1000 rounds, R_t(a) ~ N(Q*(a), 1)
14
ε-Greedy Methods on the 10-Armed Testbed
15
Softmax Action Selection
Problem with ε-greedy: neglects action values
Softmax idea: grade action probabilities by estimated values.
Gibbs, or Boltzmann action selection, or exponential weights:
  P(A_t = a | H_t) = e^{Q_t(a)/τ_t} / Σ_b e^{Q_t(b)/τ_t}
τ_t is the "computational temperature"
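A sketch of the Gibbs/Boltzmann rule above (not from the slides); `tau` stands for the temperature τ_t, and subtracting max(Q) is only a numerical-stability choice.

```python
import math, random

def boltzmann_action(Q, tau):
    """Sample A_t with P(A_t = a) proportional to exp(Q[a] / tau)."""
    m = max(Q)                                          # stabilise the exponentials
    prefs = [math.exp((q - m) / tau) for q in Q]
    z = sum(prefs)
    r, acc = random.random() * z, 0.0
    for a, p in enumerate(prefs):
        acc += p
        if r <= acc:
            return a
    return len(Q) - 1
```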
16
Incremental Implementation
Sample average:
  Q_t(a) = ( R_1(a) + … + R_{T_t(a)}(a) ) / T_t(a)
Incremental computation:
  Q_{t+1}(A_t) = Q_t(A_t) + (1/(t+1)) ( R_{t+1} − Q_t(A_t) )
Common update rule form:
  NewEstimate = OldEstimate + StepSize [Target – OldEstimate]
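A minimal sketch of the incremental sample-average update, assuming per-arm pull counts are kept; variable names are illustrative.

```python
def update_estimate(Q, counts, action, reward):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate), with StepSize = 1/n."""
    counts[action] += 1
    step = 1.0 / counts[action]
    Q[action] += step * (reward - Q[action])
```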
17
UCB: Upper Confidence Bounds
Principle: Optimism in the face of uncertainty
Works when the environment is not adversarial
Assume rewards are in [0,1]. Let (with p > 2)
  A_t = argmax_a { Q_t(a) + sqrt( p log(t) / (2 T_t(a)) ) }
For a stationary environment with iid rewards this algorithm is hard to beat!
Formally: regret in T steps is O(log T)
Improvement: estimate the variance and use it in place of p [AuSzeMu '07]
This principle can be used for achieving small regret in the full RL problem!
[Auer et al. '02]
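A sketch of the index rule above (with p > 2 as on the slide); pulling each unvisited arm once first is the usual convention but an assumption here.

```python
import math

def ucb_action(Q, counts, t, p=2.5):
    """Pick argmax_a Q[a] + sqrt(p * log(t) / (2 * T_t(a)))."""
    for a, n in enumerate(counts):
        if n == 0:
            return a                                    # play every arm once first (assumption)
    scores = [Q[a] + math.sqrt(p * math.log(t) / (2 * counts[a]))
              for a in range(len(Q))]
    return max(range(len(Q)), key=lambda a: scores[a])
```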
18
UCRL2: UCB Applied to RL [Auer, Jaksch & Ortner '07]
Algorithm UCRL2(δ):
Phase initialization:
  Estimate the mean model p_0 using maximum likelihood (counts)
  C := { p : ||p(·|x,a) − p_0(·|x,a)||_1 ≤ c sqrt( |X| log(|A|T/δ) / N(x,a) ) }
  p' := argmax_{p∈C} ρ*(p),  π := π*(p')
  N_0(x,a) := N(x,a), ∀ (x,a) ∈ X × A
Execution:
  Execute π until some (x,a) has been visited at least N_0(x,a) times in this phase
19
UCRL2 Results
Def: Diameter of an MDP M:
  D(M) = max_{x,y} min_π E[ T(x → y; π) ]
Regret bounds:
  Lower bound: E[L_T] = Ω( (D |X| |A| T)^{1/2} )
  Upper bounds:
    w.p. 1 − δ/T,  L_T ≤ O( D |X| (|A| T log(|A|T/δ))^{1/2} )
    w.p. 1 − δ,    L_T ≤ O( D² |X|² |A| log(|A|T/δ) / Δ )
  Δ = performance gap between the best and the second-best policy
20
Learning a Good Policy
Monte-Carlo methods
Temporal Difference methods
  Tabular case
  Function approximation
Batch learning
21
Learning a good policy
Model-based learning
  Learn p, r
  "Solve" the resulting MDP
Model-free learning
  Learn the optimal action-value function and (then) act greedily
  Actor-critic learning
  Policy gradient methods
Hybrid
  Learn a model and mix planning and a model-free method; e.g. Dyna
22
Monte-Carlo Methods
Episodic MDPs!
Goal: Learn V^π(·), where V^π(x) = E_π[ Σ_t γ^t R_t | X_0 = x ]
(X_t, A_t, R_t): trajectory of π
Visits to a state x:
  First visit: f(x) = min { t | X_t = x }
  Every visit: E(x) = { t | X_t = x }
Return: S(t) = γ^0 R_t + γ^1 R_{t+1} + …
K independent trajectories: S^(k), E^(k), f^(k), k = 1..K
First-visit MC: average over { S^(k)( f^(k)(x) ) : k = 1..K }
Every-visit MC: average over { S^(k)(t) : k = 1..K, t ∈ E^(k)(x) }
Claim: Both converge to V^π(·)
From now on S_t = S(t)
[Singh & Sutton ’96]
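A minimal first-visit Monte-Carlo sketch for estimating V^π from complete episodes (an episode is assumed to be a list of (state, reward) pairs; names are illustrative):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """Average the return S(f(x)) from the first visit to each state x over all episodes."""
    totals, visits = defaultdict(float), defaultdict(int)
    for episode in episodes:                       # episode: [(X_0, R_0), (X_1, R_1), ...]
        G, returns = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):  # compute returns backwards
            G = episode[t][1] + gamma * G
            returns[t] = G
        seen = set()
        for t, (x, _) in enumerate(episode):
            if x not in seen:                      # first visit only
                seen.add(x)
                totals[x] += returns[t]
                visits[x] += 1
    return {x: totals[x] / visits[x] for x in totals}
```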
23
Learning to Control with MC
Goal: Learn to behave optimally
Method: Learn Q^π(x,a), to be used in an approximate policy iteration (PI) algorithm
Idea/algorithm:
  Add randomness
    Goal: all actions are sampled eventually infinitely often
    e.g., ε-greedy or exploring starts
  Use the first-visit or the every-visit method to estimate Q^π(x,a)
  Update the policy
    once the values have converged, or
    always, at the states visited
24
Monte-Carlo: Evaluation
Convergence rate: Var(S(0)|X=x)/N
Advantages over DP:
  Learn from interaction with the environment
  No need for full models
  No need to learn about ALL states
  Less harm by Markovian violations (no bootstrapping)
Issue: maintaining sufficient exploration
  exploring starts, soft policies
25
Temporal Difference Methods
Every-visit Monte-Carlo:
  V(X_t) ← V(X_t) + α_t(X_t) ( S_t − V(X_t) )
Bootstrapping:
  S_t = R_t + γ S_{t+1}
  S_t' = R_t + γ V(X_{t+1})
TD(0):
  V(X_t) ← V(X_t) + α_t(X_t) ( S_t' − V(X_t) )
Value iteration:
  V(X_t) ← E[ S_t' | X_t ]
Theorem: Let V_t be the sequence of functions generated by TD(0). Assume that for all x, w.p.1, Σ_t α_t(x) = ∞ and Σ_t α_t²(x) < ∞. Then V_t → V^π w.p.1.
Proof: Stochastic approximation: V_{t+1} = T_t(V_t, V_t), U_{t+1} = T_t(U_t, V^π) → T V^π. [Jaakkola et al. '94, Tsitsiklis '94, SzeLi99]
[Samuel, ’59], [Holland ’75], [Sutton ’88]
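A tabular TD(0) sketch of the update above; the transition format and the constant step size are assumptions for illustration.

```python
def td0_episode(V, transitions, alpha, gamma):
    """Apply V(X_t) <- V(X_t) + alpha * (R_t + gamma * V(X_{t+1}) - V(X_t)) along a trajectory."""
    for x, r, x_next in transitions:           # (X_t, R_t, X_{t+1}) triples
        td_target = r + gamma * V[x_next]      # S_t' = R_t + gamma * V(X_{t+1})
        V[x] += alpha * (td_target - V[x])
    return V
```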
26
TD or MC?
TD advantages:
  Can be fully incremental, i.e., learn before knowing the final outcome
    Less memory
    Less peak computation
  Learn without the final outcome
    From incomplete sequences
MC advantage:
  Less harm by Markovian violations
Convergence rate? Var(S(0)|X=x) decides!
27
Learning to Control with TD
Q-learning [Watkins '90]:
  Q(X_t,A_t) ← Q(X_t,A_t) + α_t(X_t,A_t) { R_t + γ max_a Q(X_{t+1},a) − Q(X_t,A_t) }
Theorem: Converges to Q* [JJS'94, Tsi'94, SzeLi99]
SARSA [Rummery & Niranjan '94]:
  A_t ~ Greedy_ε(Q, X_t)
  Q(X_t,A_t) ← Q(X_t,A_t) + α_t(X_t,A_t) { R_t + γ Q(X_{t+1},A_{t+1}) − Q(X_t,A_t) }
Off-policy (Q-learning) vs. on-policy (SARSA)
Expecti-SARSA
Actor-Critic [Witten '77, Barto, Sutton & Anderson '83, Sutton '84]
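A sketch of both updates side by side (not from the slides); `Q` is assumed to be a dict of dicts, Q[state][action], and the step sizes are constants for illustration.

```python
def q_learning_step(Q, x, a, r, x_next, alpha, gamma):
    """Off-policy update: the target uses max_a' Q(X_{t+1}, a')."""
    target = r + gamma * max(Q[x_next].values())
    Q[x][a] += alpha * (target - Q[x][a])

def sarsa_step(Q, x, a, r, x_next, a_next, alpha, gamma):
    """On-policy update: the target uses the action A_{t+1} actually taken."""
    target = r + gamma * Q[x_next][a_next]
    Q[x][a] += alpha * (target - Q[x][a])
```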
28
Cliffwalking
ε-greedy, ε = 0.1
29
N-step TD Prediction
Idea: Look farther into the future when you do the TD backup (1, 2, 3, …, n steps)
30
N-step TD Prediction
Monte Carlo: S_t = R_t + γ R_{t+1} + … + γ^{T−t} R_T
TD: S_t^(1) = R_t + γ V(X_{t+1})
  Use V to estimate the remaining return
n-step TD:
  2-step return: S_t^(2) = R_t + γ R_{t+1} + γ² V(X_{t+2})
  n-step return: S_t^(n) = R_t + γ R_{t+1} + … + γ^n V(X_{t+n})
31
Learning with n-step Backups
  V(X_t) ← V(X_t) + α_t ( S_t^(n) − V(X_t) )
n: controls how much to bootstrap
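A sketch of the n-step return S_t^(n), assuming the whole episode is available; at the end of the episode the truncated return falls back to the Monte-Carlo return. The backup is then V(X_t) ← V(X_t) + α_t ( S_t^(n) − V(X_t) ).

```python
def n_step_return(rewards, values, t, n, gamma):
    """S_t^(n) = R_t + gamma*R_{t+1} + ... + gamma^n * V(X_{t+n}), truncated at episode end.

    rewards[k] = R_k and values[k] = V(X_k), both indexed by time within the episode.
    """
    T = len(rewards)
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        G += discount * rewards[k]
        discount *= gamma
    if t + n < T:
        G += discount * values[t + n]      # bootstrap with V(X_{t+n})
    return G
```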
32
Random Walk Examples
How does 2-step TD work here? How about 3-step TD?
33
A Larger Example
Task: 19-state random walk
Do you think there is an optimal n? For everything?
34
Averaging N-step Returns
Idea: backup an average of several returns
  e.g. backup half of the 2-step and half of the 4-step return:
    R_t = ½ R_t^(2) + ½ R_t^(4)
a "complex backup" (still one backup)
35
Forward View of TD(λ)
Idea: Average over multiple backups
λ-return: S_t^(λ) = (1−λ) Σ_{n≥0} λ^n S_t^(n+1)
TD(λ): ΔV(X_t) = α_t ( S_t^(λ) − V(X_t) )
Relation to TD(0) and MC:
  λ = 0 ⇒ TD(0)
  λ = 1 ⇒ MC
[Sutton ’88]
36
λ-return on the Random Walk
Same 19-state random walk as before
Why are intermediate values of λ best?
37
Backward View of TD(λ)
δ_t = R_t + γ V(X_{t+1}) − V(X_t)
V(x) ← V(x) + α_t δ_t e(x)
e(x) ← γ λ e(x) + I(x = X_t)
Off-line updates: same as forward-view TD(λ)
e(x): eligibility trace (accumulating trace)
Replacing traces speed up convergence:
  e(x) ← max( γ λ e(x), I(x = X_t) )
[Sutton ’88, Singh & Sutton ’96]
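A tabular backward-view sketch of one TD(λ) step, supporting both trace variants above; it assumes V and e are dicts over the same state set, and the step size is a constant for illustration.

```python
def td_lambda_step(V, e, x, r, x_next, alpha, gamma, lam, replacing=False):
    """One backward-view TD(lambda) step with accumulating or replacing traces."""
    delta = r + gamma * V[x_next] - V[x]                      # delta_t
    e[x] = max(gamma * lam * e[x], 1.0) if replacing else gamma * lam * e[x] + 1.0
    for s in V:                                               # decay other traces, update all states
        if s != x:
            e[s] *= gamma * lam
        V[s] += alpha * delta * e[s]
```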
38
Function Approximation with TD
39
Gradient Descent Methods
θ_t = (θ_t(1), …, θ_t(n))^T    (T denotes transpose)
Assume V_t is a differentiable function of θ:
  V_t(x) = V(x; θ_t)
Assume, for now, training examples of the form: { (X_t, V^π(X_t)) }
40
Performance Measures
Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution P:
  L(θ) = Σ_{x∈X} P(x) ( V^π(x) − V(x;θ) )²
Why P? Why minimize MSE?
Let us assume that P is always the distribution of states at which backups are done.
The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.
41
Gradient Descent
Let L be any function of the parameters θ = (θ(1), θ(2), …, θ(n))^T.
Its gradient at any point in this space is:
  ∇_θ L = ( ∂L/∂θ(1), ∂L/∂θ(2), …, ∂L/∂θ(n) )^T
Iteratively move down the gradient:
  θ_{t+1} = θ_t − α_t (∇_θ L)|_{θ = θ_t}
[Figure: a gradient step in the (θ(1), θ(2)) parameter plane]
42
Gradient Descent in RL
Function to descend on:
  L(θ) = Σ_{x∈X} P(x) ( V^π(x) − V(x;θ) )²
Gradient:
  ∇_θ L(θ) = −2 Σ_{x∈X} P(x) ( V^π(x) − V(x;θ) ) ∇_θ V(x;θ)
Gradient descent procedure:
  θ_{t+1} = θ_t + α_t ( V^π(X_t) − V(X_t;θ_t) ) ∇_θ V(X_t;θ_t)
Bootstrapping with S_t':
  θ_{t+1} = θ_t + α_t ( S_t' − V(X_t;θ_t) ) ∇_θ V(X_t;θ_t)
TD(λ) (forward view):
  θ_{t+1} = θ_t + α_t ( S_t^(λ) − V(X_t;θ_t) ) ∇_θ V(X_t;θ_t)
43
Linear Methods
Linear FAPP: V(x;θ) = θ^T φ(x), so ∇_θ V(x;θ) = φ(x)
Tabular representation: φ(x)_y = I(x = y)
Backward view:
  δ_t = R_t + γ V(X_{t+1}) − V(X_t)
  θ ← θ + α_t δ_t e
  e ← γ λ e + ∇_θ V(X_t;θ)
Theorem [TsiVaR'97]: V_t converges to a V such that ||V − V^π||_{D,2} ≤ ||V^π − Π V^π||_{D,2} / (1−γ).
[Sutton ’84, ’88, Tsitsiklis & Van Roy ’97]
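A minimal sketch of one backward-view step with linear function approximation, assuming numpy arrays for the weights, traces and feature vectors; names are illustrative.

```python
import numpy as np

def linear_td_lambda_step(theta, e, phi_x, r, phi_x_next, alpha, gamma, lam):
    """Backward-view TD(lambda) with linear V(x; theta) = theta . phi(x), so grad_theta V = phi(x)."""
    delta = r + gamma * theta @ phi_x_next - theta @ phi_x   # TD error delta_t
    e[:] = gamma * lam * e + phi_x                           # e <- gamma*lambda*e + phi(X_t)
    theta += alpha * delta * e                               # theta <- theta + alpha*delta*e
    return theta

# illustrative initialisation: n features, zero weights and traces
n = 8
theta, e = np.zeros(n), np.zeros(n)
```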
44
Control with FA
Learning state-action values
Training examples: { ((X_t, A_t), Q*(X_t, A_t) + noise_t) }
The general gradient-descent rule:
  θ_{t+1} = θ_t + α_t ( S_t − Q(X_t, A_t; θ_t) ) ∇_θ Q(X_t, A_t; θ_t)
Gradient-descent SARSA(λ):
  θ_{t+1} = θ_t + α_t δ_t e_t, where
    δ_t = R_t + γ Q(X_{t+1}, A_{t+1}; θ_t) − Q(X_t, A_t; θ_t)
    e_t = γ λ e_{t−1} + ∇_θ Q(X_t, A_t; θ_t)
[Rummery & Niranjan ’94]
45
Mountain-Car Task
[Sutton ’96], [Singh & Sutton ’96]
46
Mountain-Car Results
47
Baird’s Counterexample: Off-policy Updates Can Diverge
[Baird ’95]
48
Baird’s Counterexample Cont.
49
Should We Bootstrap?
50
Batch Reinforcement Learning
51
Batch RL
Goal: Given the trajectory of the behavior policy π_b,
  X_1, A_1, R_1, …, X_t, A_t, R_t, …, X_N,
compute a good policy! ("Batch learning")
Properties:
  Data collection is not influenced
  Emphasis is on the quality of the solution
  Computational complexity plays a secondary role
Performance measures:
  ||V* − V^π||_∞ = sup_x |V*(x) − V^π(x)| = sup_x ( V*(x) − V^π(x) )
  ||V* − V^π||_2 = ( ∫ (V*(x) − V^π(x))² dμ(x) )^{1/2}
52
Solution methods
Build a model
Do not build a model, but find an approximation to Q*:
  using value iteration ⇒ fitted Q-iteration
  using policy iteration ⇒
    policy evaluated by approximate value iteration
    policy evaluated by Bellman-residual minimization (BRM)
    policy evaluated by least-squares temporal difference learning (LSTD) ⇒ LSPI
Policy search
[Bradtke, Barto ’96], [Lagoudakis, Parr ’03], [AnSzeMu ’07]
53
Evaluating a policy: Fitted value iteration
Choose a function space F. Solve, for i = 1, 2, …, M, the LS (regression) problems:
  Q_{i+1} = argmin_{Q∈F} Σ_{t=1}^{T} ( R_t + γ Q_i(X_{t+1}, π(X_{t+1})) − Q(X_t, A_t) )²
Counterexamples?!?!? [Baird '95, Tsitsiklis and Van Roy '96]
When does this work??
Requirement: If M is big enough and the number of samples is big enough, Q_M should be close to Q^π
We have to make some assumptions on F
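A sketch of one such regression step for a linear function space F (an assumption; the feature map `phi`, the data format and the use of an ordinary least-squares solve are illustrative choices, not the slides' prescription):

```python
import numpy as np

def fitted_q_step(phi, data, policy, theta_i, gamma):
    """One LS step: fit theta so that phi(x,a).theta matches the targets
    R_t + gamma * Q_i(X_{t+1}, pi(X_{t+1})), where Q_i(x,a) = phi(x,a).theta_i."""
    X = np.array([phi(x, a) for (x, a, r, x_next) in data])
    y = np.array([r + gamma * phi(x_next, policy(x_next)) @ theta_i
                  for (x, a, r, x_next) in data])
    theta_next, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta_next
```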
54
Least-squares vs. gradient
Linear least squares (ordinary regression):
  y_t = w*^T x_t + ε_t
  (x_t, y_t) jointly distributed r.v.s, iid, E[ε_t | x_t] = 0
  Seeing (x_t, y_t), t = 1, …, T, find out w*
  Loss function: L(w) = E[ (y_1 − w^T x_1)² ]
Least-squares approach:
  w_T = argmin_w Σ_{t=1}^{T} (y_t − w^T x_t)²
Stochastic gradient method:
  w_{t+1} = w_t + α_t (y_t − w_t^T x_t) x_t
Tradeoffs:
  Sample complexity: How good is the estimate?
  Computational complexity: How expensive is the computation?
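A small sketch contrasting the two approaches on synthetic data (the data-generating process, step-size schedule α_t = 1/t, and seed are illustrative assumptions): one batch solve versus one cheap pass of stochastic updates.

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([2.0, -1.0])
X = rng.normal(size=(1000, 2))
y = X @ w_star + 0.1 * rng.normal(size=1000)

# Least-squares approach: w_T = argmin_w sum_t (y_t - w.x_t)^2  (one matrix solve)
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Stochastic gradient method: w_{t+1} = w_t + alpha_t (y_t - w_t.x_t) x_t  (cheap per step)
w_sgd = np.zeros(2)
for t, (x_t, y_t) in enumerate(zip(X, y), start=1):
    alpha_t = 1.0 / t
    w_sgd += alpha_t * (y_t - w_sgd @ x_t) * x_t
```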
55
Fitted value iteration: Analysis
Goal: Bound ||Q_M − Q^π||²_μ in terms of max_m ||ε_m||²_ν, where
  ||ε_m||²_ν = ∫ ε_m²(x,a) ν(dx,da),
  Q_{m+1} = T^π Q_m + ε_m,   ε_{−1} = Q_0 − Q^π
Let U_m = Q_m − Q^π. Then
  U_{m+1} = Q_{m+1} − Q^π
          = T^π Q_m − Q^π + ε_m
          = T^π Q_m − T^π Q^π + ε_m
          = γ P^π U_m + ε_m.
Hence
  U_M = Σ_{m=0}^{M} (γ P^π)^{M−m} ε_{m−1}.
After [AnSzeMu ’07]
56
Analysis/2
U_M = Σ_{m=0}^{M} (γ P^π)^{M−m} ε_{m−1}

μ|U_M|² ≤ (1/(1−γ))² · (1−γ)/(1−γ^{M+1}) · Σ_{m=0}^{M} γ^m μ( ((P^π)^m ε_{M−m−1})² )
           (Jensen applied to operators)
        ≤ C_1 (1/(1−γ))² · (1−γ)/(1−γ^{M+1}) · Σ_{m=0}^{M} γ^m ν|ε_{M−m−1}|²
           (using μ ≤ C_1 ν and ∀ρ: ρ P^π ≤ C_1 ν)
        ≤ C_1 (1/(1−γ))² · (1−γ)/(1−γ^{M+1}) · ( γ^M ν|ε_{−1}|² + Σ_{m=0}^{M} γ^m ε² )
           (Jensen; ε² = max_m ν|ε_m|²)
        = C_1 (1/(1−γ))² ε² + C_1 γ^M ν|ε_{−1}|² / (1−γ^{M+1})

Legend:
  ρ f = ∫ f(x) ρ(dx)
  (P f)(x) = ∫ f(y) P(dy|x)
57
Summary
If the regression errors are all small and the system is noisy (∀ π, ρ: ρ P^π ≤ C_1 ν), then the final error will be small.
How to make the regression errors small?
Regression error decomposition:
  ||Q_{m+1} − T^π Q_m||² ≤ ||Q_{m+1} − Π_F T^π Q_m||²   (estimation error)
                          + ||Π_F T^π Q_m − T^π Q_m||²  (approximation error)
58
Controlling the approximation error
[Figure: a function f in the space F and its image Tf; TF need not lie inside F]
59
Controlling the approximation error
[Figure: F, TF, and the deviation d_{p,μ}(TF, F) of TF from F]
60
Controlling the approximation error
[Figure: F and TF for progressively larger function spaces F]
61
Controlling the approximation error
Assume smoothness: Lip^α(L)
[Figure: the ball B(X; R_max/(1−γ)) and its image T( B(X; R_max/(1−γ)) )]
62
Learning with (lots of) historical data
Data: A long trajectory of some exploration policy
Goal: Efficient algorithm to learn a policy
Idea: Use fitted action-values
Algorithms:
  Bellman residual minimization, FQI [AnSzeMu '07]
  LSPI [Lagoudakis, Parr '03]
Bounds: Oracle inequalities (BRM, FQI and LSPI) ⇒ consistency
63
BRM insight
TD error: Δ_t = R_t + γ Q(X_{t+1}, π(X_{t+1})) − Q(X_t, A_t)
Bellman error: E[ E[Δ_t | X_t, A_t]² ]
What we can compute/estimate: E[ E[Δ_t² | X_t, A_t] ]
They are different! However:
  E[Δ_t | X_t, A_t]² = E[Δ_t² | X_t, A_t] − Var[Δ_t | X_t, A_t]
[AnSzeMu ’07]
64
Loss function
L_{N,π}(Q; h) = (1/N) Σ_{t=1}^{N} w_t { ( R_t + γ Q(X_{t+1}, π(X_{t+1})) − Q(X_t, A_t) )²
                                        − ( R_t + γ Q(X_{t+1}, π(X_{t+1})) − h(X_t, A_t) )² }
w_t = 1/μ(A_t | X_t)
Recall: E[Δ_t | X_t, A_t]² = E[Δ_t² | X_t, A_t] − Var[Δ_t | X_t, A_t]
65
Algorithm (BRM++)
1. Choose π_0, i := 0
2. While (i ≤ K) do:
3.   Let Q_{i+1} = argmin_{Q ∈ F^A} sup_{h ∈ F^A} L_{N,π_i}(Q; h)
4.   Let π_{i+1}(x) = argmax_{a ∈ A} Q_{i+1}(x, a)
5.   i := i + 1
66
Do we need to reweight or throw away data?
NO! WHY?
Intuition from regression:
  m(x) = E[Y|X=x] can be learnt no matter what p(x) is!
  π*(a|x): the same should be possible!
BUT… performance might be poor! ⇒ YES!
  Like in supervised learning when training and test distributions are different
67
Bound
||Q* − Q^{π_K}||_{2,ρ} ≤ (2γ/(1−γ)²) C^{1/2}_{ρ,ν} ( Ẽ(F) + E(F) + S^{1/2}_{N,x} ) + (2γ^K)^{1/2} R_max,

where

S_{N,x} = c_2 ( (V² + 1) ln N + ln c_1 + (1/(1+κ)) ln(b c_2) + x )^{(1+κ)/(2κ)} / (b^{1/κ} N)^{1/2}
68
The concentration coefficients
Lyapunov exponents
Our case: y_t is infinite dimensional, and P_t depends on the policy chosen
  y_{t+1} = P_t y_t
  γ̂_top = limsup_{t→∞} (1/t) log⁺( ||y_t||_∞ )
If the top Lyapunov exponent is ≤ 0, we are good
69
Open question
Abstraction: Let
  f(i_1, …, i_m) = log( ||P_{i_1} P_{i_2} … P_{i_m}|| ),   i_k ∈ {0,1},
  f : {0,1}* → R⁺,   f(x + y) ≤ f(x) + f(y),
  limsup_{m→∞} (1/m) f([x]_m) ≤ β.
True?
  ∀ {y_m}_m, y_m ∈ {0,1}^m:   limsup_{m→∞} (1/m) f(y_m) ≤ β
70
Relation to LSTD
LSTD: Linear function space; bootstrap the "normal equation"
  h*(f) = argmin_{h∈F} ||h − Q_f||²_n
  Q_LSTD = argmin_{f∈F} ||f − h*(f)||²_n
  Q_BRM = argmin_{f∈F} { ||f − Q_f||²_n − ||h*(f) − Q_f||²_n }
  ||Q − Q_f||²_n = ||Q − h*(Q)||²_n + ||h*(Q) − Q_f||²_n
[AnSzeMu ’07]
71
Open issues
Adaptive algorithms to take advantage of regularity when present, to address the "curse of dimensionality":
  Penalized least-squares / aggregation?
  Feature relevance
  Factorization
  Manifold estimation
Abstraction – build automatically
Active learning
Optimal on-line learning for infinite problems
72
References
[Auer et al. '02] P. Auer, N. Cesa-Bianchi and P. Fischer: Finite time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
[AuSzeMu '07] J.-Y. Audibert, R. Munos and Cs. Szepesvári: Tuning bandit algorithms in stochastic environments. ALT, 2007.
[Auer, Jaksch & Ortner '07] P. Auer, T. Jaksch and R. Ortner: Near-optimal regret bounds for reinforcement learning, 2007. Available at http://www.unileoben.ac.at/~infotech/publications/ucrlrevised.pdf
[Singh & Sutton '96] S.P. Singh and R.S. Sutton: Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123–158, 1996.
[Sutton '88] R.S. Sutton: Learning to predict by the method of temporal differences. Machine Learning, 3:9–44, 1988.
[Jaakkola et al. '94] T. Jaakkola, M.I. Jordan, and S.P. Singh: On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201, 1994.
[Tsitsiklis '94] J.N. Tsitsiklis: Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.
[SzeLi99] Cs. Szepesvári and M.L. Littman: A unified analysis of value-function-based reinforcement-learning algorithms. Neural Computation, 11:2017–2059, 1999.
[Watkins '90] C.J.C.H. Watkins: Learning from Delayed Rewards. PhD thesis, 1990.
[Rummery & Niranjan '94] G.A. Rummery and M. Niranjan: On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994.
[Sutton '84] R.S. Sutton: Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA, 1984.
[Tsitsiklis & Van Roy '97] J.N. Tsitsiklis and B. Van Roy: An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690, 1997.
[Sutton '96] R.S. Sutton: Generalization in reinforcement learning: Successful examples using sparse coarse coding. NIPS, 1996.
[Baird '95] L.C. Baird: Residual algorithms: Reinforcement learning with function approximation. ICML, 1995.
[Bradtke, Barto '96] S.J. Bradtke and A.G. Barto: Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.
[Lagoudakis, Parr '03] M. Lagoudakis and R. Parr: Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
[AnSzeMu '07] A. Antos, Cs. Szepesvári and R. Munos: Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning Journal, 2007.