1
Reinforcement Learning: Learning Algorithms
Csaba Szepesvári, University of Alberta
Kioloa, MLSS’08
Slides: http://www.cs.ualberta.ca/~szepesva/MLSS08/
2
Contents
Defining the problem(s)
Learning optimally
Learning a good policy
  Monte-Carlo
  Temporal Difference (bootstrapping)
  Batch – fitted value iteration and relatives
3
The Learning Problem
The MDP is unknown, but the agent can interact with the system
Goals:
  Learn an optimal policy
    Where do the samples come from?
      Samples are generated externally
      The agent interacts with the system to get the samples ("active learning")
    Performance measure: What is the performance of the policy obtained?
  Learn optimally: Minimize regret while interacting with the system
    Performance measure: loss in rewards due to not using the optimal policy from the beginning
    Exploration vs. exploitation
4
Learning from Feedback
A protocol for prediction problems:
  x_t – situation (observed by the agent)
  y_t ∈ Y – value to be predicted
  p_t ∈ Y – predicted value (can depend on all past values ⇒ learning!)
  r_t(x_t, y_t, y) – value of predicting y
  loss of learner: ℓ_t = r_t(x_t, y_t, y_t) − r_t(x_t, y_t, p_t)
Supervised learning: the agent is told y_t and r_t(x_t, y_t, ·)
  Regression: r_t(x_t, y_t, y) = −(y − y_t)², so ℓ_t = (y_t − p_t)²
Full information prediction problem: ∀ y ∈ Y, r_t(x_t, y) is communicated to the agent, but not y_t
Bandit (partial information) problem: only r_t(x_t, p_t) is communicated to the agent
5
Learning Optimally
Explore or exploit? Bandit problems
Simple schemes
Optimism in the face of uncertainty (OFU): UCB
Learning optimally in MDPs with the OFU principle
6
Learning Optimally: Exploration vs. Exploitation
Two treatments
Unknown success probabilities
Goal: find the best treatment while losing few patients
Explore or exploit?
7
Exploration vs. Exploitation: Some Applications
Simple processes: clinical trials, job-shop scheduling (random jobs), what ad to put on a web page
More complex processes (memory): optimizing production, controlling an inventory, optimal investment, poker, …
8
Bernoulli Bandits
Payoff is 0 or 1
Arm 1: R_1(1), R_2(1), R_3(1), R_4(1), …
Arm 2: R_1(2), R_2(2), R_3(2), R_4(2), …
[Figure: example 0/1 payoff sequences observed for the two arms]
9
Some definitions
Payoff is 0 or 1
Arm 1: R_1(1), R_2(1), R_3(1), R_4(1), …
Arm 2: R_1(2), R_2(2), R_3(2), R_4(2), …
Now: t = 9, T_1(t−1) = 4, T_2(t−1) = 4, A_1 = 1, A_2 = 2, …
Regret after T rounds:
  L̂_T := Σ_{t=1}^{T} R_t(k*) − Σ_{t=1}^{T} R_{T_{A_t}(t)}(A_t)
10
The Exploration/Exploitation Dilemma
Action values: Q*(a) = E[R_t(a)]
Suppose you form estimates Q_t(a) ≈ Q*(a)
The greedy action at time t is: A*_t = argmax_a Q_t(a)
Exploitation: when the agent chooses to follow A*_t
Exploration: when the agent chooses to do something else
You can't exploit all the time; you can't explore all the time
You can never stop exploring; but you should always reduce exploring. Maybe.
11
Action-Value Methods
Methods that adapt action-value estimates and nothing else
How to estimate action-values?
Sample average:
  Q_t(a) = ( R_1(a) + … + R_{T_t(a)}(a) ) / T_t(a)
Claim: if T_t(a) → ∞, then lim_{t→∞} Q_t(a) = Q*(a)
Why??
12
ε-Greedy Action Selection
Greedy action selection:
  A_t = A*_t = argmax_a Q_t(a)
ε-Greedy:
  A_t = A*_t with probability 1 − ε, a random action with probability ε
… the simplest way to "balance" exploration and exploitation
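A minimal sketch of the rule above (not from the slides): `Q` is a list of current action-value estimates Q_t(a); the names are illustrative.

```python
import random

def epsilon_greedy(Q, epsilon):
    """Pick argmax_a Q[a] with probability 1-epsilon, a uniformly random arm otherwise."""
    if random.random() < epsilon:
        return random.randrange(len(Q))                 # explore
    return max(range(len(Q)), key=lambda a: Q[a])       # exploit: greedy action A*_t
```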
13
10-Armed Testbed
n = 10 possible actions
Repeat 2000 times:
  Q*(a) ~ N(0, 1)
  Play 1000 rounds, R_t(a) ~ N(Q*(a), 1)
14
ε-Greedy Methods on the 10-Armed Testbed
15
Softmax Action Selection
Problem with ε-greedy: neglects action values
Softmax idea: grade action probabilities by estimated values.
Gibbs, or Boltzmann action selection, or exponential weights:
  P(A_t = a | H_t) = e^{Q_t(a)/τ_t} / Σ_b e^{Q_t(b)/τ_t}
τ_t is the "computational temperature"
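A sketch of the Gibbs/Boltzmann rule above (not from the slides); `tau` stands for the temperature τ_t, and subtracting max(Q) is only a numerical-stability choice.

```python
import math, random

def boltzmann_action(Q, tau):
    """Sample A_t with P(A_t = a) proportional to exp(Q[a] / tau)."""
    m = max(Q)                                          # stabilise the exponentials
    prefs = [math.exp((q - m) / tau) for q in Q]
    z = sum(prefs)
    r, acc = random.random() * z, 0.0
    for a, p in enumerate(prefs):
        acc += p
        if r <= acc:
            return a
    return len(Q) - 1
```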
16
Incremental Implementation
Sample average:
  Q_t(a) = ( R_1(a) + … + R_{T_t(a)}(a) ) / T_t(a)
Incremental computation:
  Q_{t+1}(A_t) = Q_t(A_t) + (1/(t+1)) ( R_{t+1} − Q_t(A_t) )
Common update rule form:
  NewEstimate = OldEstimate + StepSize [Target – OldEstimate]
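A minimal sketch of the incremental sample-average update, assuming per-arm pull counts are kept; variable names are illustrative.

```python
def update_estimate(Q, counts, action, reward):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate), with StepSize = 1/n."""
    counts[action] += 1
    step = 1.0 / counts[action]
    Q[action] += step * (reward - Q[action])
```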
17
UCB: Upper Confidence Bounds
Principle: Optimism in the face of uncertainty
Works when the environment is not adversarial
Assume rewards are in [0,1]. Let (with p > 2)
  A_t = argmax_a { Q_t(a) + sqrt( p log(t) / (2 T_t(a)) ) }
For a stationary environment with iid rewards this algorithm is hard to beat!
Formally: regret in T steps is O(log T)
Improvement: estimate the variance and use it in place of p [AuSzeMu '07]
This principle can be used for achieving small regret in the full RL problem!
[Auer et al. '02]
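A sketch of the index rule above (with p > 2 as on the slide); pulling each unvisited arm once first is the usual convention but an assumption here.

```python
import math

def ucb_action(Q, counts, t, p=2.5):
    """Pick argmax_a Q[a] + sqrt(p * log(t) / (2 * T_t(a)))."""
    for a, n in enumerate(counts):
        if n == 0:
            return a                                    # play every arm once first (assumption)
    scores = [Q[a] + math.sqrt(p * math.log(t) / (2 * counts[a]))
              for a in range(len(Q))]
    return max(range(len(Q)), key=lambda a: scores[a])
```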
18
UCRL2: UCB Applied to RL [Auer, Jaksch & Ortner '07]
Algorithm UCRL2(δ):
Phase initialization:
  Estimate the mean model p_0 using maximum likelihood (counts)
  C := { p : ||p(·|x,a) − p_0(·|x,a)||_1 ≤ c sqrt( |X| log(|A|T/δ) / N(x,a) ) }
  p' := argmax_{p∈C} ρ*(p),  π := π*(p')
  N_0(x,a) := N(x,a), ∀ (x,a) ∈ X × A
Execution:
  Execute π until some (x,a) has been visited at least N_0(x,a) times in this phase
19
UCRL2 Results
Def: Diameter of an MDP M:
  D(M) = max_{x,y} min_π E[ T(x → y; π) ]
Regret bounds:
  Lower bound: E[L_T] = Ω( (D |X| |A| T)^{1/2} )
  Upper bounds:
    w.p. 1 − δ/T,  L_T ≤ O( D |X| (|A| T log(|A|T/δ))^{1/2} )
    w.p. 1 − δ,    L_T ≤ O( D² |X|² |A| log(|A|T/δ) / Δ )
  Δ = performance gap between the best and the second-best policy
20
Learning a Good Policy
Monte-Carlo methods
Temporal Difference methods
  Tabular case
  Function approximation
Batch learning
21
Learning a good policy
Model-based learning
  Learn p, r
  "Solve" the resulting MDP
Model-free learning
  Learn the optimal action-value function and (then) act greedily
  Actor-critic learning
  Policy gradient methods
Hybrid
  Learn a model and mix planning and a model-free method; e.g. Dyna
22
Monte-Carlo Methods
Episodic MDPs!
Goal: Learn V^π(·), where V^π(x) = E_π[ Σ_t γ^t R_t | X_0 = x ]
(X_t, A_t, R_t): trajectory of π
Visits to a state x:
  First visit: f(x) = min { t | X_t = x }
  Every visit: E(x) = { t | X_t = x }
Return: S(t) = γ^0 R_t + γ^1 R_{t+1} + …
K independent trajectories: S^(k), E^(k), f^(k), k = 1..K
First-visit MC: average over { S^(k)( f^(k)(x) ) : k = 1..K }
Every-visit MC: average over { S^(k)(t) : k = 1..K, t ∈ E^(k)(x) }
Claim: Both converge to V^π(·)
From now on S_t = S(t)
[Singh & Sutton ’96]
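A minimal first-visit Monte-Carlo sketch for estimating V^π from complete episodes (an episode is assumed to be a list of (state, reward) pairs; names are illustrative):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """Average the return S(f(x)) from the first visit to each state x over all episodes."""
    totals, visits = defaultdict(float), defaultdict(int)
    for episode in episodes:                       # episode: [(X_0, R_0), (X_1, R_1), ...]
        G, returns = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):  # compute returns backwards
            G = episode[t][1] + gamma * G
            returns[t] = G
        seen = set()
        for t, (x, _) in enumerate(episode):
            if x not in seen:                      # first visit only
                seen.add(x)
                totals[x] += returns[t]
                visits[x] += 1
    return {x: totals[x] / visits[x] for x in totals}
```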
23
Learning to Control with MC
Goal: Learn to behave optimally
Method: Learn Q^π(x,a), to be used in an approximate policy iteration (PI) algorithm
Idea/algorithm:
  Add randomness
    Goal: all actions are sampled eventually infinitely often
    e.g., ε-greedy or exploring starts
  Use the first-visit or the every-visit method to estimate Q^π(x,a)
  Update the policy
    once the values have converged, or
    always, at the states visited
24
Monte-Carlo: Evaluation
Convergence rate: Var(S(0)|X=x)/N
Advantages over DP:
  Learn from interaction with the environment
  No need for full models
  No need to learn about ALL states
  Less harm by Markovian violations (no bootstrapping)
Issue: maintaining sufficient exploration
  exploring starts, soft policies
25
Temporal Difference Methods
Every-visit Monte-Carlo:
  V(X_t) ← V(X_t) + α_t(X_t) ( S_t − V(X_t) )
Bootstrapping:
  S_t = R_t + γ S_{t+1}
  S_t' = R_t + γ V(X_{t+1})
TD(0):
  V(X_t) ← V(X_t) + α_t(X_t) ( S_t' − V(X_t) )
Value iteration:
  V(X_t) ← E[ S_t' | X_t ]
Theorem: Let V_t be the sequence of functions generated by TD(0). Assume that for all x, w.p.1, Σ_t α_t(x) = ∞ and Σ_t α_t²(x) < ∞. Then V_t → V^π w.p.1.
Proof: Stochastic approximation: V_{t+1} = T_t(V_t, V_t), U_{t+1} = T_t(U_t, V^π) → T V^π. [Jaakkola et al. '94, Tsitsiklis '94, SzeLi99]
[Samuel, ’59], [Holland ’75], [Sutton ’88]
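A tabular TD(0) sketch of the update above; the transition format and the constant step size are assumptions for illustration.

```python
def td0_episode(V, transitions, alpha, gamma):
    """Apply V(X_t) <- V(X_t) + alpha * (R_t + gamma * V(X_{t+1}) - V(X_t)) along a trajectory."""
    for x, r, x_next in transitions:           # (X_t, R_t, X_{t+1}) triples
        td_target = r + gamma * V[x_next]      # S_t' = R_t + gamma * V(X_{t+1})
        V[x] += alpha * (td_target - V[x])
    return V
```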
26
TD or MC?
TD advantages:
  Can be fully incremental, i.e., learn before knowing the final outcome
    Less memory
    Less peak computation
  Learn without the final outcome
    From incomplete sequences
MC advantage:
  Less harm by Markovian violations
Convergence rate? Var(S(0)|X=x) decides!
27
Learning to Control with TD
Q-learning [Watkins '90]:
  Q(X_t,A_t) ← Q(X_t,A_t) + α_t(X_t,A_t) { R_t + γ max_a Q(X_{t+1},a) − Q(X_t,A_t) }
Theorem: Converges to Q* [JJS'94, Tsi'94, SzeLi99]
SARSA [Rummery & Niranjan '94]:
  A_t ~ Greedy_ε(Q, X_t)
  Q(X_t,A_t) ← Q(X_t,A_t) + α_t(X_t,A_t) { R_t + γ Q(X_{t+1},A_{t+1}) − Q(X_t,A_t) }
Off-policy (Q-learning) vs. on-policy (SARSA)
Expecti-SARSA
Actor-Critic [Witten '77, Barto, Sutton & Anderson '83, Sutton '84]
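A sketch of both updates side by side (not from the slides); `Q` is assumed to be a dict of dicts, Q[state][action], and the step sizes are constants for illustration.

```python
def q_learning_step(Q, x, a, r, x_next, alpha, gamma):
    """Off-policy update: the target uses max_a' Q(X_{t+1}, a')."""
    target = r + gamma * max(Q[x_next].values())
    Q[x][a] += alpha * (target - Q[x][a])

def sarsa_step(Q, x, a, r, x_next, a_next, alpha, gamma):
    """On-policy update: the target uses the action A_{t+1} actually taken."""
    target = r + gamma * Q[x_next][a_next]
    Q[x][a] += alpha * (target - Q[x][a])
```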
28
Cliffwalking
ε-greedy, ε = 0.1
29
N-step TD Prediction
Idea: Look farther into the future when you do the TD backup (1, 2, 3, …, n steps)
30
N-step TD Prediction
Monte Carlo: S_t = R_t + γ R_{t+1} + … + γ^{T−t} R_T
TD: S_t^(1) = R_t + γ V(X_{t+1})
  Use V to estimate the remaining return
n-step TD:
  2-step return: S_t^(2) = R_t + γ R_{t+1} + γ² V(X_{t+2})
  n-step return: S_t^(n) = R_t + γ R_{t+1} + … + γ^n V(X_{t+n})
31
Learning with n-step Backups
  V(X_t) ← V(X_t) + α_t ( S_t^(n) − V(X_t) )
n: controls how much to bootstrap
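A sketch of the n-step return S_t^(n), assuming the whole episode is available; at the end of the episode the truncated return falls back to the Monte-Carlo return. The backup is then V(X_t) ← V(X_t) + α_t ( S_t^(n) − V(X_t) ).

```python
def n_step_return(rewards, values, t, n, gamma):
    """S_t^(n) = R_t + gamma*R_{t+1} + ... + gamma^n * V(X_{t+n}), truncated at episode end.

    rewards[k] = R_k and values[k] = V(X_k), both indexed by time within the episode.
    """
    T = len(rewards)
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        G += discount * rewards[k]
        discount *= gamma
    if t + n < T:
        G += discount * values[t + n]      # bootstrap with V(X_{t+n})
    return G
```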
32
Random Walk Examples
How does 2-step TD work here? How about 3-step TD?
33
A Larger Example
Task: 19-state random walk
Do you think there is an optimal n? For everything?
34
Averaging N-step Returns
Idea: backup an average of several returns
  e.g. backup half of the 2-step and half of the 4-step return:
    R_t = ½ R_t^(2) + ½ R_t^(4)
a "complex backup" (still one backup)
35
Forward View of TD(λ)
Idea: Average over multiple backups
λ-return: S_t^(λ) = (1−λ) Σ_{n≥0} λ^n S_t^(n+1)
TD(λ): ΔV(X_t) = α_t ( S_t^(λ) − V(X_t) )
Relation to TD(0) and MC:
  λ = 0 ⇒ TD(0)
  λ = 1 ⇒ MC
[Sutton ’88]
36
λ-return on the Random Walk
Same 19-state random walk as before
Why are intermediate values of λ best?
37
Backward View of TD(λ)
δ_t = R_t + γ V(X_{t+1}) − V(X_t)
V(x) ← V(x) + α_t δ_t e(x)
e(x) ← γ λ e(x) + I(x = X_t)
Off-line updates: same as forward-view TD(λ)
e(x): eligibility trace (accumulating trace)
Replacing traces speed up convergence:
  e(x) ← max( γ λ e(x), I(x = X_t) )
[Sutton ’88, Singh & Sutton ’96]
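A tabular backward-view sketch of one TD(λ) step, supporting both trace variants above; it assumes V and e are dicts over the same state set, and the step size is a constant for illustration.

```python
def td_lambda_step(V, e, x, r, x_next, alpha, gamma, lam, replacing=False):
    """One backward-view TD(lambda) step with accumulating or replacing traces."""
    delta = r + gamma * V[x_next] - V[x]                      # delta_t
    e[x] = max(gamma * lam * e[x], 1.0) if replacing else gamma * lam * e[x] + 1.0
    for s in V:                                               # decay other traces, update all states
        if s != x:
            e[s] *= gamma * lam
        V[s] += alpha * delta * e[s]
```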
38
Function Approximation with TD
39
Gradient Descent Methods
θ_t = (θ_t(1), …, θ_t(n))^T    (T denotes transpose)
Assume V_t is a differentiable function of θ:
  V_t(x) = V(x; θ_t)
Assume, for now, training examples of the form: { (X_t, V^π(X_t)) }
40
Performance Measures
Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution P:
  L(θ) = Σ_{x∈X} P(x) ( V^π(x) − V(x;θ) )²
Why P? Why minimize MSE?
Let us assume that P is always the distribution of states at which backups are done.
The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.
41
Gradient Descent
Let L be any function of the parameters θ = (θ(1), θ(2), …, θ(n))^T.
Its gradient at any point in this space is:
  ∇_θ L = ( ∂L/∂θ(1), ∂L/∂θ(2), …, ∂L/∂θ(n) )^T
Iteratively move down the gradient:
  θ_{t+1} = θ_t − α_t (∇_θ L)|_{θ = θ_t}
[Figure: a gradient step in the (θ(1), θ(2)) parameter plane]
42
Gradient Descent in RL
Function to descend on:
  L(θ) = Σ_{x∈X} P(x) ( V^π(x) − V(x;θ) )²
Gradient:
  ∇_θ L(θ) = −2 Σ_{x∈X} P(x) ( V^π(x) − V(x;θ) ) ∇_θ V(x;θ)
Gradient descent procedure:
  θ_{t+1} = θ_t + α_t ( V^π(X_t) − V(X_t;θ_t) ) ∇_θ V(X_t;θ_t)
Bootstrapping with S_t':
  θ_{t+1} = θ_t + α_t ( S_t' − V(X_t;θ_t) ) ∇_θ V(X_t;θ_t)
TD(λ) (forward view):
  θ_{t+1} = θ_t + α_t ( S_t^(λ) − V(X_t;θ_t) ) ∇_θ V(X_t;θ_t)
43
Linear Methods
Linear FAPP: V(x;θ) = θ^T φ(x), so ∇_θ V(x;θ) = φ(x)
Tabular representation: φ(x)_y = I(x = y)
Backward view:
  δ_t = R_t + γ V(X_{t+1}) − V(X_t)
  θ ← θ + α_t δ_t e
  e ← γ λ e + ∇_θ V(X_t;θ)
Theorem [TsiVaR'97]: V_t converges to a V such that ||V − V^π||_{D,2} ≤ ||V^π − Π V^π||_{D,2} / (1−γ).
[Sutton ’84, ’88, Tsitsiklis & Van Roy ’97]
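A minimal sketch of one backward-view step with linear function approximation, assuming numpy arrays for the weights, traces and feature vectors; names are illustrative.

```python
import numpy as np

def linear_td_lambda_step(theta, e, phi_x, r, phi_x_next, alpha, gamma, lam):
    """Backward-view TD(lambda) with linear V(x; theta) = theta . phi(x), so grad_theta V = phi(x)."""
    delta = r + gamma * theta @ phi_x_next - theta @ phi_x   # TD error delta_t
    e[:] = gamma * lam * e + phi_x                           # e <- gamma*lambda*e + phi(X_t)
    theta += alpha * delta * e                               # theta <- theta + alpha*delta*e
    return theta

# illustrative initialisation: n features, zero weights and traces
n = 8
theta, e = np.zeros(n), np.zeros(n)
```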
44
Control with FA
Learning state-action values
Training examples: { ((X_t, A_t), Q*(X_t, A_t) + noise_t) }
The general gradient-descent rule:
  θ_{t+1} = θ_t + α_t ( S_t − Q(X_t, A_t; θ_t) ) ∇_θ Q(X_t, A_t; θ_t)
Gradient-descent SARSA(λ):
  θ_{t+1} = θ_t + α_t δ_t e_t, where
    δ_t = R_t + γ Q(X_{t+1}, A_{t+1}; θ_t) − Q(X_t, A_t; θ_t)
    e_t = γ λ e_{t−1} + ∇_θ Q(X_t, A_t; θ_t)
[Rummery & Niranjan ’94]
45
Mountain-Car Task
[Sutton ’96], [Singh & Sutton ’96]
46
Mountain-Car Results
47
Baird’s Counterexample: Off-policy Updates Can Diverge
[Baird ’95]
48
Baird’s Counterexample Cont.
49
Should We Bootstrap?
50
Batch Reinforcement Learning
51
Batch RL
Goal: Given the trajectory of the behavior policy π_b,
  X_1, A_1, R_1, …, X_t, A_t, R_t, …, X_N,
compute a good policy! ("Batch learning")
Properties:
  Data collection is not influenced
  Emphasis is on the quality of the solution
  Computational complexity plays a secondary role
Performance measures:
  ||V* − V^π||_∞ = sup_x |V*(x) − V^π(x)| = sup_x ( V*(x) − V^π(x) )
  ||V* − V^π||_2 = ( ∫ (V*(x) − V^π(x))² dμ(x) )^{1/2}
52
Solution methods
Build a model
Do not build a model, but find an approximation to Q*:
  using value iteration ⇒ fitted Q-iteration
  using policy iteration ⇒
    policy evaluated by approximate value iteration
    policy evaluated by Bellman-residual minimization (BRM)
    policy evaluated by least-squares temporal difference learning (LSTD) ⇒ LSPI
Policy search
[Bradtke, Barto ’96], [Lagoudakis, Parr ’03], [AnSzeMu ’07]
53
Evaluating a policy: Fitted value iteration
Choose a function space F. Solve, for i = 1, 2, …, M, the LS (regression) problems:
  Q_{i+1} = argmin_{Q∈F} Σ_{t=1}^{T} ( R_t + γ Q_i(X_{t+1}, π(X_{t+1})) − Q(X_t, A_t) )²
Counterexamples?!?!? [Baird '95, Tsitsiklis and Van Roy '96]
When does this work??
Requirement: If M is big enough and the number of samples is big enough, Q_M should be close to Q^π
We have to make some assumptions on F
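A sketch of one such regression step for a linear function space F (an assumption; the feature map `phi`, the data format and the use of an ordinary least-squares solve are illustrative choices, not the slides' prescription):

```python
import numpy as np

def fitted_q_step(phi, data, policy, theta_i, gamma):
    """One LS step: fit theta so that phi(x,a).theta matches the targets
    R_t + gamma * Q_i(X_{t+1}, pi(X_{t+1})), where Q_i(x,a) = phi(x,a).theta_i."""
    X = np.array([phi(x, a) for (x, a, r, x_next) in data])
    y = np.array([r + gamma * phi(x_next, policy(x_next)) @ theta_i
                  for (x, a, r, x_next) in data])
    theta_next, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta_next
```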
54
Least-squares vs. gradient
Linear least squares (ordinary regression):
  y_t = w*^T x_t + ε_t
  (x_t, y_t) jointly distributed r.v.s, iid, E[ε_t | x_t] = 0
  Seeing (x_t, y_t), t = 1, …, T, find out w*
  Loss function: L(w) = E[ (y_1 − w^T x_1)² ]
Least-squares approach:
  w_T = argmin_w Σ_{t=1}^{T} (y_t − w^T x_t)²
Stochastic gradient method:
  w_{t+1} = w_t + α_t (y_t − w_t^T x_t) x_t
Tradeoffs:
  Sample complexity: How good is the estimate?
  Computational complexity: How expensive is the computation?
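A small sketch contrasting the two approaches on synthetic data (the data-generating process, step-size schedule α_t = 1/t, and seed are illustrative assumptions): one batch solve versus one cheap pass of stochastic updates.

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([2.0, -1.0])
X = rng.normal(size=(1000, 2))
y = X @ w_star + 0.1 * rng.normal(size=1000)

# Least-squares approach: w_T = argmin_w sum_t (y_t - w.x_t)^2  (one matrix solve)
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Stochastic gradient method: w_{t+1} = w_t + alpha_t (y_t - w_t.x_t) x_t  (cheap per step)
w_sgd = np.zeros(2)
for t, (x_t, y_t) in enumerate(zip(X, y), start=1):
    alpha_t = 1.0 / t
    w_sgd += alpha_t * (y_t - w_sgd @ x_t) * x_t
```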
55
Fitted value iteration: Analysis
Goal: Bound ||Q_M − Q^π||²_μ in terms of max_m ||ε_m||²_ν, where
  ||ε_m||²_ν = ∫ ε_m²(x,a) ν(dx,da),
  Q_{m+1} = T^π Q_m + ε_m,   ε_{−1} = Q_0 − Q^π
Let U_m = Q_m − Q^π. Then
  U_{m+1} = Q_{m+1} − Q^π
          = T^π Q_m − Q^π + ε_m
          = T^π Q_m − T^π Q^π + ε_m
          = γ P^π U_m + ε_m.
Hence
  U_M = Σ_{m=0}^{M} (γ P^π)^{M−m} ε_{m−1}.
After [AnSzeMu ’07]
56
Analysis/2
U_M = Σ_{m=0}^{M} (γ P^π)^{M−m} ε_{m−1}

μ|U_M|² ≤ (1/(1−γ))² · (1−γ)/(1−γ^{M+1}) · Σ_{m=0}^{M} γ^m μ( ((P^π)^m ε_{M−m−1})² )
           (Jensen applied to operators)
        ≤ C_1 (1/(1−γ))² · (1−γ)/(1−γ^{M+1}) · Σ_{m=0}^{M} γ^m ν|ε_{M−m−1}|²
           (using μ ≤ C_1 ν and ∀ρ: ρ P^π ≤ C_1 ν)
        ≤ C_1 (1/(1−γ))² · (1−γ)/(1−γ^{M+1}) · ( γ^M ν|ε_{−1}|² + Σ_{m=0}^{M} γ^m ε² )
           (Jensen; ε² = max_m ν|ε_m|²)
        = C_1 (1/(1−γ))² ε² + C_1 γ^M ν|ε_{−1}|² / (1−γ^{M+1})

Legend:
  ρ f = ∫ f(x) ρ(dx)
  (P f)(x) = ∫ f(y) P(dy|x)
57
Summary
If the regression errors are all small and the system is noisy (∀ π, ρ: ρ P^π ≤ C_1 ν), then the final error will be small.
How to make the regression errors small?
Regression error decomposition:
  ||Q_{m+1} − T^π Q_m||² ≤ ||Q_{m+1} − Π_F T^π Q_m||²   (estimation error)
                          + ||Π_F T^π Q_m − T^π Q_m||²  (approximation error)
58
Controlling the approximation error
[Figure: a function f in the space F and its image Tf; TF need not lie inside F]
59
Controlling the approximation error
[Figure: F, TF, and the deviation d_{p,μ}(TF, F) of TF from F]
60
Controlling the approximation error
[Figure: F and TF for progressively larger function spaces F]
61
Controlling the approximation error
Assume smoothness: Lip^α(L)
[Figure: the ball B(X; R_max/(1−γ)) and its image T( B(X; R_max/(1−γ)) )]
62
Learning with (lots of) historical data
Data: A long trajectory of some exploration policy
Goal: Efficient algorithm to learn a policy
Idea: Use fitted action-values
Algorithms:
  Bellman residual minimization, FQI [AnSzeMu '07]
  LSPI [Lagoudakis, Parr '03]
Bounds: Oracle inequalities (BRM, FQI and LSPI) ⇒ consistency
63
BRM insight
TD error: Δ_t = R_t + γ Q(X_{t+1}, π(X_{t+1})) − Q(X_t, A_t)
Bellman error: E[ E[Δ_t | X_t, A_t]² ]
What we can compute/estimate: E[ E[Δ_t² | X_t, A_t] ]
They are different! However:
  E[Δ_t | X_t, A_t]² = E[Δ_t² | X_t, A_t] − Var[Δ_t | X_t, A_t]
[AnSzeMu ’07]
64
Loss function
L_{N,π}(Q; h) = (1/N) Σ_{t=1}^{N} w_t { ( R_t + γ Q(X_{t+1}, π(X_{t+1})) − Q(X_t, A_t) )²
                                        − ( R_t + γ Q(X_{t+1}, π(X_{t+1})) − h(X_t, A_t) )² }
w_t = 1/μ(A_t | X_t)
Recall: E[Δ_t | X_t, A_t]² = E[Δ_t² | X_t, A_t] − Var[Δ_t | X_t, A_t]
65
Algorithm (BRM++)
1. Choose π_0, i := 0
2. While (i ≤ K) do:
3.   Let Q_{i+1} = argmin_{Q ∈ F^A} sup_{h ∈ F^A} L_{N,π_i}(Q; h)
4.   Let π_{i+1}(x) = argmax_{a ∈ A} Q_{i+1}(x, a)
5.   i := i + 1
66
Do we need to reweight or throw away data?
NO! WHY?
Intuition from regression:
  m(x) = E[Y|X=x] can be learnt no matter what p(x) is!
  π*(a|x): the same should be possible!
BUT… performance might be poor! ⇒ YES!
  Like in supervised learning when training and test distributions are different
67
Bound
||Q* − Q^{π_K}||_{2,ρ} ≤ (2γ/(1−γ)²) C^{1/2}_{ρ,ν} ( Ẽ(F) + E(F) + S^{1/2}_{N,x} ) + (2γ^K)^{1/2} R_max,

where

S_{N,x} = c_2 ( (V² + 1) ln N + ln c_1 + (1/(1+κ)) ln(b c_2) + x )^{(1+κ)/(2κ)} / (b^{1/κ} N)^{1/2}
68
The concentration coefficients
Lyapunov exponents
Our case: y_t is infinite dimensional, and P_t depends on the policy chosen
  y_{t+1} = P_t y_t
  γ̂_top = limsup_{t→∞} (1/t) log⁺( ||y_t||_∞ )
If the top Lyapunov exponent is ≤ 0, we are good
69
Open question
Abstraction: Let
  f(i_1, …, i_m) = log( ||P_{i_1} P_{i_2} … P_{i_m}|| ),   i_k ∈ {0,1},
  f : {0,1}* → R⁺,   f(x + y) ≤ f(x) + f(y),
  limsup_{m→∞} (1/m) f([x]_m) ≤ β.
True?
  ∀ {y_m}_m, y_m ∈ {0,1}^m:   limsup_{m→∞} (1/m) f(y_m) ≤ β
70
Relation to LSTD
LSTD: Linear function space; bootstrap the "normal equation"
  h*(f) = argmin_{h∈F} ||h − Q_f||²_n
  Q_LSTD = argmin_{f∈F} ||f − h*(f)||²_n
  Q_BRM = argmin_{f∈F} { ||f − Q_f||²_n − ||h*(f) − Q_f||²_n }
  ||Q − Q_f||²_n = ||Q − h*(Q)||²_n + ||h*(Q) − Q_f||²_n
[AnSzeMu ’07]
71
Open issues
Adaptive algorithms to take advantage of regularity when present, to address the "curse of dimensionality":
  Penalized least-squares / aggregation?
  Feature relevance
  Factorization
  Manifold estimation
Abstraction – build automatically
Active learning
Optimal on-line learning for infinite problems
72
References
[Auer et al. '02] P. Auer, N. Cesa-Bianchi and P. Fischer: Finite time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
[AuSzeMu '07] J.-Y. Audibert, R. Munos and Cs. Szepesvári: Tuning bandit algorithms in stochastic environments. ALT, 2007.
[Auer, Jaksch & Ortner '07] P. Auer, T. Jaksch and R. Ortner: Near-optimal regret bounds for reinforcement learning, 2007. Available at http://www.unileoben.ac.at/~infotech/publications/ucrlrevised.pdf
[Singh & Sutton '96] S.P. Singh and R.S. Sutton: Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123–158, 1996.
[Sutton '88] R.S. Sutton: Learning to predict by the method of temporal differences. Machine Learning, 3:9–44, 1988.
[Jaakkola et al. '94] T. Jaakkola, M.I. Jordan, and S.P. Singh: On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201, 1994.
[Tsitsiklis '94] J.N. Tsitsiklis: Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.
[SzeLi99] Cs. Szepesvári and M.L. Littman: A unified analysis of value-function-based reinforcement-learning algorithms. Neural Computation, 11:2017–2059, 1999.
[Watkins '90] C.J.C.H. Watkins: Learning from Delayed Rewards. PhD thesis, 1990.
[Rummery & Niranjan '94] G.A. Rummery and M. Niranjan: On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994.
[Sutton '84] R.S. Sutton: Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA, 1984.
[Tsitsiklis & Van Roy '97] J.N. Tsitsiklis and B. Van Roy: An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690, 1997.
[Sutton '96] R.S. Sutton: Generalization in reinforcement learning: Successful examples using sparse coarse coding. NIPS, 1996.
[Baird '95] L.C. Baird: Residual algorithms: Reinforcement learning with function approximation. ICML, 1995.
[Bradtke, Barto '96] S.J. Bradtke and A.G. Barto: Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.
[Lagoudakis, Parr '03] M. Lagoudakis and R. Parr: Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
[AnSzeMu '07] A. Antos, Cs. Szepesvári and R. Munos: Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning Journal, 2007.