Reinforcement Learning
CS 5522: Artificial Intelligence II
Instructor: Alan Ritter
Ohio State University
[These slides were adapted from CS188 Intro to AI at UC Berkeley. All materials available at http://ai.berkeley.edu.]
Reinforcement Learning
▪ Basic idea:
  ▪ Receive feedback in the form of rewards
  ▪ Agent’s utility is defined by the reward function
  ▪ Must (learn to) act so as to maximize expected rewards
  ▪ All learning is based on observed samples of outcomes!
[Diagram: the agent-environment loop. The agent sends actions a to the environment; the environment returns a state s and a reward r.]
Example: Learning to Walk
Initial | A Learning Trial | After Learning [1K Trials]
[Videos: AIBO WALK – initial, training, finished]
[Kohl and Stone, ICRA 2004]
Example: Toddler Robot
[Video: TODDLER – 40s]
[Tedrake, Zhang and Seung, 2005]
The Crawler!
[Demo: Crawler Bot (L10D1)] [You, in Project 3]
Video of Demo Crawler Bot
Reinforcement Learning
▪ Still assume a Markov decision process (MDP):
  ▪ A set of states s ∈ S
  ▪ A set of actions (per state) A
  ▪ A model T(s,a,s’)
  ▪ A reward function R(s,a,s’)
▪ Still looking for a policy π(s)
▪ New twist: don’t know T or R
  ▪ I.e., we don’t know which states are good or what the actions do
  ▪ Must actually try actions and states out to learn
Offline (MDPs) vs. Online (RL)
Offline Solution | Online Learning
Model-Based Learning
▪ Model-Based Idea:
  ▪ Learn an approximate model based on experiences
  ▪ Solve for values as if the learned model were correct
▪ Step 1: Learn empirical MDP model
  ▪ Count outcomes s’ for each s, a
  ▪ Normalize to give an estimate of T̂(s,a,s’)
  ▪ Discover each R̂(s,a,s’) when we experience (s, a, s’)
▪ Step 2: Solve the learned MDP
  ▪ For example, use value iteration, as before (a sketch of both steps follows below)
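A minimal Python sketch of both steps, not from the slides: the tuple-keyed dicts for T̂ and R̂, the transition format, and the function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def learn_model(transitions):
    """Step 1: estimate T-hat and R-hat from observed (s, a, s', r) tuples."""
    counts = defaultdict(Counter)      # counts[(s, a)][s'] = number of times seen
    R = {}                             # rewards discovered as we experience (s, a, s')
    for s, a, s2, r in transitions:
        counts[(s, a)][s2] += 1
        R[(s, a, s2)] = r
    T = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        for s2, n in outcomes.items():
            T[(s, a, s2)] = n / total  # normalize counts into probabilities
    return T, R

def value_iteration(T, R, states, actions, gamma=1.0, iters=100):
    """Step 2: solve the learned MDP as if the estimated model were correct."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max((sum(T.get((s, a, s2), 0.0) *
                         (R.get((s, a, s2), 0.0) + gamma * V[s2])
                         for s2 in states)
                     for a in actions(s)),
                    default=0.0)       # terminal states have no actions
             for s in states}
    return V
```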
Example: Model-Based Learning
Input Policy π (Assume: γ = 1), on the grid A (top), B C D (middle row), E (bottom)

Observed Episodes (Training):
Episode 1: B, east, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 2: B, east, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 3: E, north, C, -1 / C, east, A, -1 / A, exit, x, -10
Episode 4: E, north, C, -1 / C, east, D, -1 / D, exit, x, +10

Learned Model:
T(s,a,s’): T(B, east, C) = 1.00, T(C, east, D) = 0.75, T(C, east, A) = 0.25, …
R(s,a,s’): R(B, east, C) = -1, R(C, east, D) = -1, R(D, exit, x) = +10, …
(C, east was tried four times and reached D on three of them, hence T(C, east, D) = 0.75.)
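Feeding these four episodes through the learn_model sketch above reproduces the slide's estimates (representing states and actions as plain strings is an assumption):

```python
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
]
T, R = learn_model(t for ep in episodes for t in ep)  # flatten the episodes
print(T[("B", "east", "C")])  # 1.0  : B, east reached C on every try
print(T[("C", "east", "D")])  # 0.75 : three of the four tries reached D
print(T[("C", "east", "A")])  # 0.25
print(R[("D", "exit", "x")])  # 10
```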
Example: Expected Age
Goal: Compute expected age of cse5522 students

Known P(A):
$E[A] = \sum_a P(a) \cdot a$

Without P(A), instead collect samples [a1, a2, … aN]

Unknown P(A): “Model Based”
$\hat{P}(a) = \frac{\text{num}(a)}{N}, \qquad E[A] \approx \sum_a \hat{P}(a) \cdot a$
Why does this work? Because eventually you learn the right model.

Unknown P(A): “Model Free”
$E[A] \approx \frac{1}{N} \sum_i a_i$
Why does this work? Because samples appear with the right frequencies.
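Both estimators in code; a sketch in which the sample ages are invented for illustration:

```python
import random
from collections import Counter

# Hypothetical ages drawn from the unknown distribution P(A).
samples = [random.choice([20, 21, 22, 23, 25, 30]) for _ in range(1000)]
N = len(samples)

# "Model based": first build P-hat from counts, then take the expectation under it.
p_hat = {a: n / N for a, n in Counter(samples).items()}
model_based = sum(p * a for a, p in p_hat.items())

# "Model free": average the samples directly, never forming P-hat.
model_free = sum(samples) / N

print(model_based, model_free)
```

For this particular goal the two answers coincide exactly, since $\sum_a \frac{\text{num}(a)}{N} \cdot a = \frac{1}{N} \sum_i a_i$; the model-based route pays off when the learned P̂(A) gets reused for other questions.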
Model-Free Learning
Passive Reinforcement Learning
▪ Simplified task: policy evaluation
  ▪ Input: a fixed policy π(s)
  ▪ You don’t know the transitions T(s,a,s’)
  ▪ You don’t know the rewards R(s,a,s’)
  ▪ Goal: learn the state values
▪ In this case:
  ▪ Learner is “along for the ride”
  ▪ No choice about what actions to take
  ▪ Just execute the policy and learn from experience
  ▪ This is NOT offline planning! You actually take actions in the world.
![Page 49: Reinforcement Learningaritter.github.io/courses/5522_slides/rl1.pdf · 2021. 8. 9. · Reinforcement Learning Basic idea: Receive feedback in the form of rewards Agent’s utility](https://reader036.fdocuments.in/reader036/viewer/2022071606/6143e3936cc38f259c25d2cb/html5/thumbnails/49.jpg)
Direct Evaluation
▪ Goal: Compute values for each state under π
▪ Idea: Average together observed sample values
  ▪ Act according to π
  ▪ Every time you visit a state, write down what the sum of discounted rewards turned out to be
  ▪ Average those samples
▪ This is called direct evaluation (a sketch follows below)
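A sketch of direct evaluation in Python, using the same assumed episode format as the model-based sketch earlier:

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """Average the observed discounted returns for every state visited under pi."""
    returns = defaultdict(list)
    for ep in episodes:
        G = 0.0
        # Walk backwards so G is the discounted sum of rewards from each state on.
        for s, a, s2, r in reversed(ep):
            G = r + gamma * G
            returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

On the four episodes of the example below this returns A = -10, B = +8, C = +4, D = +10, E = -2, matching the slide's output values.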
Example: Direct Evaluation
Input Policy π (Assume: γ = 1), on the grid A (top), B C D (middle row), E (bottom)

Observed Episodes (Training):
Episode 1: B, east, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 2: B, east, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 3: E, north, C, -1 / C, east, A, -1 / A, exit, x, -10
Episode 4: E, north, C, -1 / C, east, D, -1 / D, exit, x, +10

Output Values: A = -10, B = +8, C = +4, D = +10, E = -2
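Where those numbers come from, written out (γ = 1, so each sample is just the sum of rewards observed from that state onward):

```latex
\begin{align*}
V(B) &= \tfrac{1}{2}(8 + 8) = +8
  && \text{episodes 1 and 2: } -1 - 1 + 10 = 8 \\
V(C) &= \tfrac{1}{4}(9 + 9 - 11 + 9) = +4
  && \text{four visits: } -1 + 10,\ -1 + 10,\ -1 - 10,\ -1 + 10 \\
V(E) &= \tfrac{1}{2}(-12 + 8) = -2
  && \text{episode 3: } -1 - 1 - 10; \quad \text{episode 4: } -1 - 1 + 10 \\
V(A) &= -10, \quad V(D) = +10
  && \text{every sample from } A \text{ is } -10 \text{; every sample from } D \text{ is } +10
\end{align*}
```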
Problems with Direct Evaluation
▪ What’s good about direct evaluation?
  ▪ It’s easy to understand
  ▪ It doesn’t require any knowledge of T, R
  ▪ It eventually computes the correct average values, using just sample transitions
▪ What’s bad about it?
  ▪ It wastes information about state connections
  ▪ Each state must be learned separately
  ▪ So, it takes a long time to learn
Output Values: A = -10, B = +8, C = +4, D = +10, E = -2
If B and E both go to C under this policy, how can their values be different?
Why Not Use Policy Evaluation?
▪ Simplified Bellman updates calculate V for a fixed policy:
  ▪ Each round, replace V with a one-step-look-ahead layer over V
  $V_0^\pi(s) = 0$
  $V_{k+1}^\pi(s) \leftarrow \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V_k^\pi(s') \right]$
▪ This approach fully exploited the connections between the states
  ▪ Unfortunately, we need T and R to do it! (see the sketch below)
▪ Key question: how can we do this update to V without knowing T and R?
  ▪ In other words, how do we take a weighted average without knowing the weights?
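For contrast, exact policy evaluation when the model is known; a sketch reusing the dict encoding from the earlier sketches (π as a state-to-action dict is an assumption):

```python
def policy_evaluation(T, R, states, pi, gamma=1.0, iters=100):
    """Repeatedly replace V with a one-step look-ahead -- needs T and R."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: sum(T.get((s, pi.get(s), s2), 0.0) *
                    (R.get((s, pi.get(s), s2), 0.0) + gamma * V[s2])
                    for s2 in states)
             for s in states}
    return V
```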
Sample-Based Policy Evaluation?
▪ We want to improve our estimate of V by computing these averages:
  $V_{k+1}^\pi(s) \leftarrow \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V_k^\pi(s') \right]$
▪ Idea: Take samples of outcomes s’ (by doing the action!) and average
  $\text{sample}_1 = R(s, \pi(s), s_1') + \gamma V_k^\pi(s_1')$
  $\text{sample}_2 = R(s, \pi(s), s_2') + \gamma V_k^\pi(s_2')$
  $\quad\vdots$
  $\text{sample}_n = R(s, \pi(s), s_n') + \gamma V_k^\pi(s_n')$
  $V_{k+1}^\pi(s) \leftarrow \frac{1}{n} \sum_i \text{sample}_i$
Almost! But we can’t rewind time to get sample after sample from state s.
Temporal Difference Learning
▪ Big idea: learn from every experience!
  ▪ Update V(s) each time we experience a transition (s, a, s’, r)
  ▪ Likely outcomes s’ will contribute updates more often
▪ Temporal difference learning of values
  ▪ Policy still fixed, still doing evaluation!
  ▪ Move values toward value of whatever successor occurs: running average
Sample of V(s): $\text{sample} = R(s, \pi(s), s') + \gamma V^\pi(s')$
Update to V(s): $V^\pi(s) \leftarrow (1 - \alpha) V^\pi(s) + \alpha \cdot \text{sample}$
Same update: $V^\pi(s) \leftarrow V^\pi(s) + \alpha \left( \text{sample} - V^\pi(s) \right)$
(A code sketch follows below.)
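One TD update in code; a sketch with the same assumed dict-of-values encoding as the earlier sketches:

```python
def td_update(V, transition, gamma=1.0, alpha=0.5):
    """Nudge V(s) toward the value of the successor that actually occurred."""
    s, a, s2, r = transition
    sample = r + gamma * V.get(s2, 0.0)                   # value of what happened
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample   # running average
```

With V = {'C': 0.0, 'D': 8.0} and α = 1/2, the transition ('B', 'east', 'C', -2) sets V(B) to (1/2)·0 + (1/2)·(-2 + 0) = -1, exactly the first step of the example further below.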
Exponential Moving Average
▪ Exponential moving average
  ▪ The running interpolation update: $\bar{x}_n = (1 - \alpha) \cdot \bar{x}_{n-1} + \alpha \cdot x_n$
  ▪ Makes recent samples more important:
    $\bar{x}_n = \frac{x_n + (1 - \alpha) \, x_{n-1} + (1 - \alpha)^2 \, x_{n-2} + \cdots}{1 + (1 - \alpha) + (1 - \alpha)^2 + \cdots}$
  ▪ Forgets about the past (distant past values were wrong anyway)
▪ Decreasing learning rate (alpha) can give converging averages (see the check below)
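A tiny numerical check of that last bullet (made-up samples): a fixed α keeps chasing recent samples, while the decreasing schedule α_n = 1/n turns the interpolation into the exact running mean.

```python
def running_estimate(samples, alpha_of_n):
    """Fold samples into the running interpolation update."""
    x_bar = 0.0
    for n, x in enumerate(samples, start=1):
        a = alpha_of_n(n)
        x_bar = (1 - a) * x_bar + a * x
    return x_bar

data = [4.0, 8.0, 6.0, 2.0]
print(running_estimate(data, lambda n: 0.5))      # 3.75 : weighted toward recent samples
print(running_estimate(data, lambda n: 1.0 / n))  # 5.0  : exact mean of the samples
```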
Example: Temporal Difference Learning
Assume: γ = 1, α = 1/2
States: A (top), B C D (middle row), E (bottom)

Initial values: A = 0, B = 0, C = 0, D = 8, E = 0

Observed transition: B, east, C, -2
  V(B) ← (1/2)·0 + (1/2)·(-2 + 1·0) = -1
  Values now: A = 0, B = -1, C = 0, D = 8, E = 0

Observed transition: C, east, D, -2
  V(C) ← (1/2)·0 + (1/2)·(-2 + 1·8) = 3
  Values now: A = 0, B = -1, C = 3, D = 8, E = 0
Problems with TD Value Learning
▪ TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages
▪ However, if we want to turn values into a (new) policy, we’re sunk:
  $\pi(s) = \arg\max_a Q(s, a)$
  $Q(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V(s') \right]$
▪ Idea: learn Q-values, not values
  ▪ Makes action selection model-free too!
Active Reinforcement Learning
▪ Full reinforcement learning: optimal policies (like value iteration)
  ▪ You don’t know the transitions T(s,a,s’)
  ▪ You don’t know the rewards R(s,a,s’)
  ▪ You choose the actions now
  ▪ Goal: learn the optimal policy / values
▪ In this case:
  ▪ Learner makes choices!
  ▪ Fundamental tradeoff: exploration vs. exploitation
  ▪ This is NOT offline planning! You actually take actions in the world and find out what happens…
Detour: Q-Value Iteration
▪ Value iteration: find successive (depth-limited) values
  ▪ Start with V0(s) = 0, which we know is right
  ▪ Given Vk, calculate the depth k+1 values for all states:
    $V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]$
▪ But Q-values are more useful, so compute them instead (a sketch follows below)
  ▪ Start with Q0(s,a) = 0, which we know is right
  ▪ Given Qk, calculate the depth k+1 q-values for all q-states:
    $Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]$
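A sketch under the same assumed dict-based model encoding used earlier. Note this is still model-based; the point of the next slide is to drop T and R:

```python
def q_value_iteration(T, R, states, actions, gamma=1.0, iters=100):
    """Depth-limited Q-values via repeated one-step backups over a known model."""
    Q = {(s, a): 0.0 for s in states for a in actions(s)}
    for _ in range(iters):
        Q = {(s, a): sum(T.get((s, a, s2), 0.0) *
                         (R.get((s, a, s2), 0.0) +
                          gamma * max((Q[(s2, a2)] for a2 in actions(s2)),
                                      default=0.0))
                         for s2 in states)
             for s in states for a in actions(s)}
    return Q
```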
Q-Learning
▪ Q-Learning: sample-based Q-value iteration
  $Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]$
▪ Learn Q(s,a) values as you go
  ▪ Receive a sample (s,a,s’,r)
  ▪ Consider your old estimate: $Q(s, a)$
  ▪ Consider your new sample estimate: $\text{sample} = R(s, a, s') + \gamma \max_{a'} Q(s', a')$
  ▪ Incorporate the new estimate into a running average:
    $Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \cdot \text{sample}$
[Demo: Q-learning – gridworld (L10D2)] [Demo: Q-learning – crawler (L10D3)]
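The update as a code sketch (the dict Q and the actions helper are assumptions carried over from the earlier sketches):

```python
def q_learning_update(Q, transition, actions, gamma=1.0, alpha=0.5):
    """One Q-learning update from a single observed sample (s, a, s', r)."""
    s, a, s2, r = transition
    # New sample estimate: reward plus discounted value of the best next action.
    sample = r + gamma * max((Q.get((s2, a2), 0.0) for a2 in actions(s2)),
                             default=0.0)
    # Fold it into the running average for Q(s, a).
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
```

Note the max over next actions: the update bootstraps off the best available Q-value no matter which action is actually taken next, which is what the off-policy property below refers to.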
Video of Demo Q-Learning -- Gridworld
Video of Demo Q-Learning -- Crawler
Q-Learning Properties
▪ Amazing result: Q-learning converges to optimal policy -- even if you’re acting suboptimally!
▪ This is called off-policy learning
▪ Caveats:
  ▪ You have to explore enough
  ▪ You have to eventually make the learning rate small enough
  ▪ … but not decrease it too quickly
  ▪ Basically, in the limit, it doesn’t matter how you select actions (!)