Reinforcement Learning, yet another introduction.
Part 1/3: foundations
Emmanuel Rachelson (ISAE - SUPAERO)
Introduction Modeling Optimizing Learning
Books
Example 1: exiting a spiral
Air speed, plane orientation and angular velocities, pressure measurements, commands’ positions...
Thrust, rudder, yoke.
Vital maneuver.
Example 2: dynamic treatment regimes for HIV patients
Antibody concentrations...
Choice of drugs (or absence of treatment).
Long-term, chronic diseases (HIV, depression, ...).
Example 3: inverted pendulum
x, ẋ, θ, θ̇ (cart position and velocity, pole angle and angular velocity).
Push left or right (or don’t push).
Toy problem representative of many examples (Segway PT, juggling, plane control).
Example 4: queuing systems
Line length, number of open counters, ...
Open or close counters.
Try to get all passengers on the plane in time (at minimal cost)!
Example 5: portfolio management
Economic indicators, prices, . . .
Dispatch investment over financial assets.
Maximize long-term revenue.
Example 6: hydroelectric production
Water level, electricity demand, weather forecast, other sources of energy...
Evaluate the “water value” and decide to use it or not.
Intuition
What do all these systems have in common?
Prediction or decision over the future.
Complex / high-dimensional / non-linear / non-deterministic environments.
What matters is not a single decision but the sequence of decisions.
One can quantify the value of a sequence of decisions.
Sequential agent-environment interaction
Questions
Is there a general theory of sequential decision making?
What hypotheses about the “environment”?
What is a “good” behaviour?
Is knowledge of a model always necessary?
How to balance information acquisition and knowledge exploitation?
1 A practical introduction
2 Modeling the decision problems
3 A few words on solving the model-based case
4 Learning for Prediction and Control
The ingredients
Set T of time steps t.
Set S of possible states s for the system.
Set A of possible actions a of the agent.
Transition dynamics of the system s′← f (?).
Rewards (reinforcement signal) at each time step r(?).
[Diagram: at time t, in state s, the system moves to state s′ = f(?) at time t + 1, producing reward r(?).]
Markov Decision Processes
Sequential decision under probabilistic action uncertainty:
Markov Decision Process (MDP)
5-tuple ⟨S, A, p, r, T⟩:
Markov transition model p(s′|s, a).
Reward model r(s, a).
Set T of decision epochs {0, 1, . . . , H}.
Infinite (or unbounded) horizon: H → ∞.
[Diagram: at time t, in state s, taking action a yields reward r(s, a) and a next state s′ ∼ p(s′|s, a) at time t + 1.]
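With finite S and A, the five ingredients above fit in plain arrays. A minimal sketch in Python/NumPy (the toy numbers are my own, purely illustrative, not from the slides):

```python
import numpy as np

# A tiny 2-state, 2-action MDP stored as arrays (toy numbers):
# p[a, s, s2] = p(s'|s, a)   -- Markov transition model
# r[s, a]     = r(s, a)      -- reward model
p = np.array([
    [[0.9, 0.1],    # action 0 taken in state 0
     [0.5, 0.5]],   # action 0 taken in state 1
    [[0.2, 0.8],    # action 1 taken in state 0
     [0.0, 1.0]],   # action 1 taken in state 1
])
r = np.array([[0.0, 1.0],   # r(s=0, a)
              [0.0, 2.0]])  # r(s=1, a)
gamma = 0.95                # discount factor for the γ-discounted criterion

# Each p(.|s, a) must be a probability distribution over next states.
assert np.allclose(p.sum(axis=2), 1.0)
```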
What is a behaviour?
Policy
A policy is a sequence of decision rules δt: π = {δt}t∈ℕ,
with δt : S^{t+1} × A^t → P(A), h ↦ δt(·|h).
δt(a|h) gives the probability of undertaking action a at time t, given the history h of past states and actions.
Evaluating a sequence of actions
What can I expect in the long term, from this sequence of actions, in my current state?
E.g. s_0 → s_1 → s_2 → s_3 → . . . → s_{n+1}, gathering r_0 + γ r_1 + γ² r_2 + . . . + γⁿ r_n.
Several criteria:
Average reward: V(s) = E[ lim_{H→∞} (1/H) Σ_{t=0}^{H} r_t | s_0 = s ]
Total reward: V(s) = E[ lim_{H→∞} Σ_{t=0}^{H} r_t | s_0 = s ]
γ-discounted reward: V(s) = E[ lim_{H→∞} Σ_{t=0}^{H} γ^t r_t | s_0 = s ]
→ the value of a state under a certain behaviour.
Evaluating a policy
Value function of a policy under a γ-discounted criterion
V^π : S → ℝ, s ↦ V^π(s) = E[ lim_{H→∞} Σ_{t=0}^{H} γ^t r_t | s_0 = s, π ]
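This expectation can be approximated by Monte Carlo simulation: roll the policy out many times, accumulate discounted rewards, and average. A sketch on a toy two-state MDP (the numbers and the fixed policy are my own assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy MDP: p[a, s, s2] = p(s'|s,a), r[s, a] = r(s,a).
p = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.0, 1.0]]])
r = np.array([[0.0, 1.0], [0.0, 2.0]])
gamma = 0.95
policy = np.array([1, 1])   # stationary deterministic policy: action per state

def discounted_return(s, horizon=300):
    """One rollout of the policy; truncating is fine since gamma**horizon ~ 0."""
    g, disc = 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]
        g += disc * r[s, a]
        disc *= gamma
        s = rng.choice(2, p=p[a, s])
    return g

# Monte Carlo estimate of V^pi(0); the exact value solves V = T^pi V.
V0_est = np.mean([discounted_return(0) for _ in range(1000)])
print(V0_est)  # ≈ 38.8 for these numbers
```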
Optimal policies
Optimal policy
π* is said to be optimal iff π* ∈ argmax_π V^π.
A policy is optimal if it dominates any other policy in every state:
π* is optimal ⇔ ∀s ∈ S, ∀π, V^{π*}(s) ≥ V^π(s)
First fundamental result
Fortunately. . .
Optimal policy
For the γ-discounted criterion over an infinite horizon, there always exists at least one optimal stationary, deterministic, Markovian policy.
Markovian: ∀(s_i, a_i), (s′_i, a′_i) ∈ (S×A)^{t−1}, δt(a|s_0, a_0, . . . , s_t) = δt(a|s′_0, a′_0, . . . , s_t): only the current state matters. One writes δt(a|s).
Stationary: ∀(t, t′) ∈ ℕ², δt = δt′. One writes π = δ0.
Deterministic: δt(a|h) = 1 for a single action a, 0 otherwise.
Let’s play with actions
What’s the value of “a then π”?
Q^π(s, a) = E[ Σ_{t=0}^{∞} γ^t r(s_t, a_t) | s_0 = s, a_0 = a, π ]
= r(s, a) + E[ Σ_{t=1}^{∞} γ^t r(s_t, a_t) | s_0 = s, a_0 = a, π ]
= r(s, a) + γ Σ_{s′∈S} p(s′|s, a) E[ Σ_{t=1}^{∞} γ^{t−1} r(s_t, a_t) | s_1 = s′, π ]
= r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V^π(s′)
The best one-step lookahead action can be selected by maximizing Q^π. To improve on a policy π, it is therefore more useful to know Q^π than V^π: just pick the greedy action.
Also, V^π(s) = Q^π(s, π(s)). Let’s substitute that above (next slide).
Computing a policy’s value function
Evaluation equation
V^π is a solution to the linear system:
V^π(s) = r(s, π(s)) + γ Σ_{s′∈S} p(s′|s, π(s)) V^π(s′)
i.e. V^π = r^π + γ P^π V^π = T^π V^π.
Similarly:
Q^π(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) Q^π(s′, π(s′))
i.e. Q^π = r + γ P Q^π = T^π Q^π.
Recall also that V^π(s) = E[ Σ_{t=0}^{∞} γ^t r(s_t, π(s_t)) | s_0 = s ].
Notes:
For continuous state and action spaces, sums Σ become integrals ∫.
For stochastic policies: ∀s ∈ S, V^π(s) = Σ_{a∈A} π(s, a) [ r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V^π(s′) ].
Properties of T^π
T^π V(s) = r(s, π(s)) + γ Σ_{s′∈S} p(s′|s, π(s)) V(s′), i.e. T^π V = r^π + γ P^π V.
Solving the evaluation equation
T^π is linear.
⇒ Solve V^π = T^π V^π and Q^π = T^π Q^π by matrix inversion? With γ < 1, V^π = (I − γ P^π)^{−1} r^π and Q^π = (I − γ P)^{−1} r.
With γ < 1, T^π is a ‖·‖∞-contraction mapping over the Banach space F(S, ℝ) (resp. F(S×A, ℝ)).
⇒ With γ < 1, V^π (resp. Q^π) is the unique solution to the (linear) fixed-point equation V = T^π V (resp. Q = T^π Q).
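With finite S and γ < 1, the matrix inversion above is a direct linear solve. A sketch in Python/NumPy (the toy MDP numbers are my own):

```python
import numpy as np

# Toy MDP: p[a, s, s2] = p(s'|s,a), r[s, a] = r(s,a).
p = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.0, 1.0]]])
r = np.array([[0.0, 1.0], [0.0, 2.0]])
gamma = 0.95
pi = np.array([1, 1])            # deterministic stationary policy

n = len(pi)
P_pi = p[pi, np.arange(n)]       # P_pi[s, s2] = p(s'|s, pi(s))
r_pi = r[np.arange(n), pi]       # r_pi[s]     = r(s, pi(s))
# V^pi = (I - gamma P^pi)^{-1} r^pi, solved without forming the inverse.
V_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
print(V_pi)  # ≈ [38.77, 40.]
```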
Characterizing an optimal policy
Find π* such that π* ∈ argmax_π V^π.
Notation: V^{π*} = V*, Q^{π*} = Q*.
Let’s play with our intuitions:
One has Q*(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V*(s′).
If π* is an optimal policy, then V*(s) = Q*(s, π*(s)).
Optimal greedy policy
Any policy π defined by π(s) ∈ argmax_{a∈A} Q*(s, a) is an optimal policy.
Bellman optimality equation
The key theorem:
Bellman optimality equation
The optimal value function obeys:
V*(s) = max_{a∈A} { r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V*(s′) }, i.e. V* = T* V*,
or in terms of Q-functions:
Q*(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) max_{a′∈A} Q*(s′, a′), i.e. Q* = T* Q*.
Properties of T*
V*(s) = max_π V^π(s)
V*(s) = max_{a∈A} { r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V*(s′) } = T* V*(s)
Solving the optimality equation
T* is non-linear.
T* is a ‖·‖∞-contraction mapping over the Banach space F(S, ℝ) (resp. F(S×A, ℝ)).
⇒ V* (resp. Q*) is the unique solution to the fixed-point equation V = T* V (resp. Q = T* Q).
Let’s summarize
Formalizing the control problem:
Environment (discrete time, non-deterministic, non-linear) ↔ MDP.
Behaviour ↔ control policy π : s ↦ a.
Policy evaluation criterion ↔ γ-discounted criterion.
Goal ↔ maximize the value functions V^π(s), Q^π(s, a).
Evaluation eq. ↔ V^π = T^π V^π, Q^π = T^π Q^π.
Bellman optimality eq. ↔ V* = T* V*, Q* = T* Q*.
Now what?
p and r known → Probabilistic Planning, Stochastic Optimal Control.
p and r unknown → Reinforcement Learning.
1 A practical introduction
2 Modeling the decision problems
3 A few words on solving the model-based case
4 Learning for Prediction and Control
How does one find π∗?
Three “standard” approaches:
Dynamic Programming in value function space → Value Iteration.
Dynamic Programming in policy space → Policy Iteration.
Linear Programming in value function space.
We won’t see each algorithm in detail, nor explain all their variants. The goal of this section is to illustrate three fundamentally different ways of computing an optimal policy, based on Bellman’s optimality equation.
Value Iteration
Key idea:
V*(s) = max_{a∈A} { r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V*(s′) } = T* V*(s)
Value iteration
T* is a contraction mapping,
the value function space is a Banach space.
⇒ The sequence V_{n+1} = T* V_n converges to V*.
π* is the V*-greedy policy.
Value Iteration
Init: V′ ← V0.
repeat
  V ← V′
  for s ∈ S do
    V′(s) ← max_{a∈A} { r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V(s′) }
until ‖V′ − V‖ ≤ ε
return greedy policy w.r.t. V′
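The loop above can be sketched in a few lines of Python/NumPy (the toy MDP numbers are my own, not from the slides):

```python
import numpy as np

# Value Iteration on a toy MDP: repeat V <- T*V until convergence.
p = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.0, 1.0]]])   # p[a, s, s2] = p(s'|s,a)
r = np.array([[0.0, 1.0], [0.0, 2.0]])     # r[s, a]
gamma, eps = 0.95, 1e-8

V = np.zeros(2)
while True:
    Q = r + gamma * (p @ V).T    # Q[s, a] = r(s,a) + gamma * sum_s' p(s'|s,a) V(s')
    V_new = Q.max(axis=1)        # Bellman optimality backup (the T* operator)
    done = np.abs(V_new - V).max() <= eps
    V = V_new
    if done:
        break
pi_star = Q.argmax(axis=1)       # greedy policy w.r.t. the (near-)optimal V
print(pi_star)  # [1 1] for these numbers
```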
Value Iteration
Init: Q′ ← Q0.
repeat
  Q ← Q′
  for (s, a) ∈ S×A do
    Q′(s, a) ← r(s, a) + γ Σ_{s′∈S} p(s′|s, a) max_{a′∈A} Q(s′, a′)
until ‖Q′ − Q‖ ≤ ε
return greedy policy w.r.t. Q′
Illustration - an investment dilemma
A gambler’s bet on a coin flip:
Tails ⇒ he loses his stake.
Heads ⇒ he wins as much as his stake.
Goal: reach 100 pesos!
States: S = {1, . . . , 99}.
Actions: A(s) = {1, 2, . . . , min(s, 100 − s)}.
Rewards: +1 when the gambler reaches 100 pesos.
Transitions: probability of heads = p.
Discount: γ = 1.
V^π → probability of reaching the goal; π* maximizes V^π.
Illustration - an investment dilemma, p = 0.4
Matlab demo.
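The Matlab demo itself is not reproduced in the transcript; a Python sketch of value iteration on the gambler's MDP as specified above (γ = 1 and a single +1 reward at the goal, so V(s) is a success probability):

```python
# Value iteration for the gambler's problem: reach 100 pesos by betting on coin flips.
p_heads, goal = 0.4, 100
V = [0.0] * (goal + 1)
V[goal] = 1.0   # absorbing goal state carries the +1 reward

for _ in range(200):            # in-place sweeps; plenty for convergence here
    for s in range(1, goal):
        # Bellman backup over all admissible stakes a in {1, ..., min(s, 100-s)}
        V[s] = max(p_heads * V[s + a] + (1 - p_heads) * V[s - a]
                   for a in range(1, min(s, goal - s) + 1))

print(round(V[50], 3))  # betting everything at s = 50 wins with probability 0.4
```

Since the coin is subfair (p = 0.4), bold play is optimal from s = 50: a single all-in bet succeeds with probability exactly 0.4.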
Policy Iteration
Key idea:
π* = argmax_π V^π
V^π(s) = r(s, π(s)) + γ Σ_{s′∈S} p(s′|s, π(s)) V^π(s′), i.e. V^π = T^π V^π
Policy iteration
A policy that is Q^π-greedy is not worse than π.
→ iteratively improve and evaluate the policy.
Instead of a path V0, V1, . . . among value functions, let’s search for a path π0, π1, . . . among policies.
Policy Iteration
Policy evaluation: compute V^{πn}.
One-step improvement: derive π_{n+1}.
Policy Iteration
Init: π′ ← π0.
repeat
  π ← π′
  V^π ← solve V = T^π V
  for s ∈ S do
    π′(s) ← argmax_{a∈A} { r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V^π(s′) }
until π′ = π
return π
Policy Iteration
Init: π′ ← π0.
repeat
  π ← π′
  Q^π ← solve Q = T^π Q
  for s ∈ S do
    π′(s) ← argmax_{a∈A} Q^π(s, a)
until π′ = π
return π
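Both variants alternate exact evaluation (a linear solve) with a greedy improvement step, until the policy is stable. A sketch in Python/NumPy (toy MDP numbers are my own):

```python
import numpy as np

# Policy Iteration on a toy MDP.
p = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.0, 1.0]]])   # p[a, s, s2] = p(s'|s,a)
r = np.array([[0.0, 1.0], [0.0, 2.0]])     # r[s, a]
gamma = 0.95
n = r.shape[0]

pi = np.zeros(n, dtype=int)                # initial policy pi0: always action 0
while True:
    # Policy evaluation: solve V = T^pi V, i.e. (I - gamma P^pi) V = r^pi.
    P_pi = p[pi, np.arange(n)]
    r_pi = r[np.arange(n), pi]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    # One-step improvement: greedy policy w.r.t. Q^pi.
    Q = r + gamma * (p @ V).T
    pi_new = Q.argmax(axis=1)
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new
print(pi)  # [1 1] for these numbers
```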
Illustration - an investment dilemma, p = 0.4
Matlab demo.
Linear Programming
Key idea: formulate the optimality equation as a linear problem.
“V* is the smallest value function that dominates the value of every policy.”
∀s ∈ S, V(s) = max_{a∈A} { r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V(s′) }
⇔ min Σ_{s∈S} V(s) s.t. ∀π, V ≥ T^π V
⇔ min Σ_{s∈S} V(s) s.t. ∀(s, a) ∈ S×A, V(s) − γ Σ_{s′∈S} p(s′|s, a) V(s′) ≥ r(s, a)
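The last formulation is directly a linear program, so any generic LP solver applies. A sketch using scipy (assumed installed; toy MDP numbers are my own):

```python
import numpy as np
from scipy.optimize import linprog  # assumes scipy is available

# min sum_s V(s)  s.t.  V(s) - gamma * sum_s' p(s'|s,a) V(s') >= r(s,a), all (s,a).
p = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.0, 1.0]]])
r = np.array([[0.0, 1.0], [0.0, 2.0]])
gamma = 0.95
n_s, n_a = r.shape

c = np.ones(n_s)                       # objective: minimize sum_s V(s)
A_ub, b_ub = [], []                    # linprog convention: A_ub @ V <= b_ub
for s in range(n_s):
    for a in range(n_a):
        row = gamma * p[a, s].copy()
        row[s] -= 1.0                  # -(V(s) - gamma sum p V) <= -r(s,a)
        A_ub.append(row)
        b_ub.append(-r[s, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_s)
V_star = res.x
print(V_star)  # ≈ [38.77, 40.], matching V* from value iteration
```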
In a nutshell
To solve Bellman’s optimality equation:
The sequence V_{n+1} = T* V_n converges to V* → Value Iteration.
The sequence of greedy policies π_{n+1}(s) ∈ argmax_{a∈A} Q^{πn}(s, a) converges to π* → Policy Iteration.
V* is the smallest function s.t. ∀(s, a), V(s) ≥ r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V(s′) → Linear Programming resolution.
1 A practical introduction
2 Modeling the decision problems
3 A few words on solving the model-based case
4 Learning for Prediction and Control
Wait a minute. . .
. . . so far, we’ve characterized and searched for optimal policies, using the supposed properties (p and r) of the environment.
We’ve been using p and r each time! We’re cheating!
Where’s the learning you promised?
We’re coming to it. Let’s put it all in perspective.
Let’s put it all in perspective: RL within ML
A taxonomy of Machine Learning
Supervised Learning: learning from a teacher; information: correct examples; goal: generalize from examples; output: a classifier or regressor; e.g. SVMs, neural networks, trees.
Unsupervised Learning: learning from similarity; information: unlabeled examples; goal: identify structure in the data; output: clusters, self-organized data; e.g. k-means, Kohonen maps, PCA.
Reinforcement Learning: learning by interaction; information: trial and error; goal: reinforce good choices; output: value function, control policy; e.g. TD-learning, Q-learning.
43 / 46
![Page 49: Reinforcement Learning, yet another introduction. Part 1/3 ...emmanuel.rachelson.free.fr/extras/cours/RL/YaIReL1.pdf · Reinforcement Learning, yet another introduction. Part 1/3:](https://reader030.fdocuments.in/reader030/viewer/2022040407/5ea8bf321b65e834d82c8e7b/html5/thumbnails/49.jpg)
Introduction Modeling Optimizing Learning
Let’s put it all in perspective: RL within ML
Different learning tasks
Supervised Learning
learning from a teacher
information: correct examplesgeneralize from examplesclassifier, regressorSVMs, neural networks, trees
Unsupervised Learning
learning from similarity
information: unlabeled examplesidentify structure in the dataclusters, self-organized datak-means, Kohonen maps, PCA
Reinforcement Learning
learning by interaction
information: trial and errorreinforce good choicesvalue function, control policyTD-learning, Q-learning
43 / 46
![Page 50: Reinforcement Learning, yet another introduction. Part 1/3 ...emmanuel.rachelson.free.fr/extras/cours/RL/YaIReL1.pdf · Reinforcement Learning, yet another introduction. Part 1/3:](https://reader030.fdocuments.in/reader030/viewer/2022040407/5ea8bf321b65e834d82c8e7b/html5/thumbnails/50.jpg)
Introduction Modeling Optimizing Learning
Let’s put it all in perspective: RL within ML
What kind of input?
Supervised Learning
learning from a teacherinformation: correct examples
generalize from examplesclassifier, regressorSVMs, neural networks, trees
Unsupervised Learning
learning from similarityinformation: unlabeled examples
identify structure in the dataclusters, self-organized datak-means, Kohonen maps, PCA
Reinforcement Learning
learning by interactioninformation: trial and error
reinforce good choicesvalue function, control policyTD-learning, Q-learning
43 / 46
Let's put it all in perspective: RL within ML

|  | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
| --- | --- | --- | --- |
| Paradigm | learning from a teacher | learning from similarity | learning by interaction |
| Information | correct examples | unlabeled examples | trial and error |
| Goal | generalize from examples | identify structure in the data | reinforce good choices |
| Outputs | classifier, regressor | clusters, self-organized data | value function, control policy |
| Example algorithms | SVMs, neural networks, trees | k-means, Kohonen maps, PCA | TD-learning, Q-learning |
Introduction Modeling Optimizing Learning

Reinforcement Learning

Evaluate and improve a policy based on experience samples.

Experience samples? → tuples (s, a, r, s′): a state, the action taken, the reward received, and the next state.

Two problems in RL:
- Prediction: predict a policy's value.
- Control: control the system.
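The prediction problem can be sketched in a few lines of tabular TD-learning, which estimates a policy's value directly from (s, a, r, s′) samples. The 5-state random-walk chain below is an illustrative toy, not an example from the course:

```python
import random

N_STATES, GAMMA, ALPHA = 5, 0.9, 0.1

def step(s, a):
    """Toy chain dynamics: a = +1 moves right, a = -1 moves left.
    Falling off the right end pays reward 1; off the left end, 0."""
    s2 = s + a
    if s2 >= N_STATES:
        return None, 1.0   # right terminal state
    if s2 < 0:
        return None, 0.0   # left terminal state
    return s2, 0.0

def td0_value(policy, episodes=5000, seed=0):
    """Estimate V^pi from experience samples with the TD(0) update:
    V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    random.seed(seed)
    V = [0.0] * N_STATES
    for _ in range(episodes):
        s = N_STATES // 2          # every episode starts in the middle
        while s is not None:
            s2, r = step(s, policy(s))
            target = r + (GAMMA * V[s2] if s2 is not None else 0.0)
            V[s] += ALPHA * (target - V[s])
            s = s2
    return V

V = td0_value(lambda s: random.choice([-1, 1]))   # uniform random policy
print([round(v, 2) for v in V])   # values grow toward the rewarding right end
```

The same sample stream, fed to a Q-learning-style update instead, would address the control problem.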
Introduction Modeling Optimizing Learning

A little vocabulary

Curse of dimensionality: the number of states, actions or outcomes grows exponentially with the number of dimensions. E.g. a continuous control problem on S = [0; 1]^10, discretized with a step size of 1/10, already has 10^10 states!
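The slide's arithmetic can be made explicit with a one-liner (the helper name is mine, not the course's):

```python
def n_grid_states(dim, step):
    """Number of cells in a uniform discretization of [0; 1]^dim with the given step size."""
    return round(1 / step) ** dim

# The slide's example: [0; 1]^10 with step size 1/10
print(n_grid_states(10, 0.1))   # 10_000_000_000 states
```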
Exploration/exploitation dilemma: where are the good rewards? Exploit whatever good policy has been found so far, or explore unknown transitions hoping for more? How should exploration and exploitation be balanced?
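A standard (if crude) way to balance the two is an ε-greedy rule; the action values below are invented for illustration:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore a uniformly random action with probability epsilon,
    otherwise exploit the current greedy action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))      # explore
    return max(q_values, key=q_values.get)        # exploit

random.seed(0)
q_values = {"left": 0.2, "right": 0.7}            # hypothetical action values
picks = [epsilon_greedy(q_values) for _ in range(1000)]
print(picks.count("right") / 1000)                # mostly the greedy action
```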
Model-based vs. model-free RL: also called indirect vs. direct RL.
Indirect: {(s, a, r, s′)} → (p, r) → V^π or π*
Direct: {(s, a, r, s′)} → V^π or π*
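The first arrow of the indirect route, {(s, a, r, s′)} → (p, r), can be sketched as empirical frequency/mean estimation (a toy sketch; the sample values are invented):

```python
from collections import defaultdict

def fit_model(samples):
    """Estimate p(s'|s, a) by empirical frequencies and r(s, a) by empirical means,
    the model-fitting step of indirect (model-based) RL."""
    counts = defaultdict(lambda: defaultdict(int))
    rewards = defaultdict(list)
    for s, a, r, s2 in samples:
        counts[(s, a)][s2] += 1
        rewards[(s, a)].append(r)
    p = {sa: {s2: n / sum(succ.values()) for s2, n in succ.items()}
         for sa, succ in counts.items()}
    r = {sa: sum(rs) / len(rs) for sa, rs in rewards.items()}
    return p, r

samples = [(0, "go", 0.0, 1), (0, "go", 0.0, 1), (0, "go", 1.0, 2)]
p, r = fit_model(samples)
print(p[(0, "go")])   # empirical probabilities: 2/3 to state 1, 1/3 to state 2
print(r[(0, "go")])   # mean observed reward
```

A planning step (e.g. value iteration on the fitted model) would then produce V^π or π*; direct methods skip the model entirely.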
Interactive vs. non-interactive algorithms.
Non-interactive: a fixed batch D = {(s_i, a_i, r_i, s′_i)}, i ∈ [1; N] → no exploration/exploitation dilemma; batch learning.
Interactive episodic: trajectories (s_0, a_0, r_0, s_1, …, s_N, a_N, r_N, s_{N+1}) → interactive "with reset"; Monte-Carlo-like methods; is s_0 known?
Interactive non-episodic: one (s, a, r, s′) at each time step → the most general case!
On-policy vs. off-policy algorithms: can we evaluate or improve a policy π while applying a different policy π′?
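The distinction is concrete in the one-step updates of SARSA (on-policy) and Q-learning (off-policy): only the bootstrap target differs. The Q-values and sample below are invented for illustration:

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap on the action a2 the behaviour policy actually took."""
    Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap on the greedy action, whatever the behaviour policy did."""
    Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])

# Same sample (s=0, a="R", r=1, s'=1), but the behaviour policy then explored with "L"
Q_init = {0: {"L": 0.0, "R": 0.0}, 1: {"L": 0.5, "R": 2.0}}
Q_sarsa = {s: dict(a) for s, a in Q_init.items()}
Q_qlearn = {s: dict(a) for s, a in Q_init.items()}
sarsa_update(Q_sarsa, 0, "R", 1.0, 1, "L")
q_learning_update(Q_qlearn, 0, "R", 1.0, 1)
print(Q_sarsa[0]["R"], Q_qlearn[0]["R"])   # Q-learning's max-target gives the larger update
```

Because Q-learning bootstraps on the greedy action rather than the one actually taken, it can learn about the greedy policy π while following an exploratory policy π′.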
Introduction Modeling Optimizing Learning

Next classes

1. Predict a policy's value.
   1. Model-based prediction
   2. Monte-Carlo methods
   3. Temporal differences
   4. Unifying MC and TD: TD(λ)
2. Control the system.
   1. Actor-Critic architectures
   2. Online problems, the exploration vs. exploitation dilemma
   3. Offline problems, focusing on the critic alone
   4. An overview of control learning