Introduction to Reinforcement Learning
CS 285: Deep Reinforcement Learning, Decision Making, and Control
Sergey Levine
Class Notes
1. Homework 1 is due next Monday!
2. Remember to start forming final project groups
• Final project proposal due Sep 25
• Final project ideas document coming soon!
Today’s Lecture
1. Definition of a Markov decision process
2. Definition of reinforcement learning problem
3. Anatomy of an RL algorithm
4. Brief overview of RL algorithm types
• Goals:
  • Understand definitions & notation
  • Understand the underlying reinforcement learning objective
  • Get a summary of possible algorithms
Definitions
Terminology & notation
The running example: given an observation, the possible actions are
1. run away
2. ignore
3. pet
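For reference, the notation introduced on this slide (a standard reconstruction of the symbols shown over the pictures):

$$s_t:\ \text{state} \qquad o_t:\ \text{observation} \qquad a_t:\ \text{action}$$

$$\pi_\theta(a_t \mid o_t):\ \text{policy} \qquad \pi_\theta(a_t \mid s_t):\ \text{policy (fully observed)}$$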
Imitation Learning
training data → supervised learning
Images: Bojarski et al. ‘16, NVIDIA
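The imitation learning setup shown here treats driving as a supervised learning problem: collect observation–action pairs from a human driver, then fit the policy by maximum likelihood. As a hedged one-line summary (behavioral cloning, in the course's notation; $\mathcal{D}_{\text{demo}}$, the demonstration dataset, is notation assumed here):

$$\theta^\star = \arg\max_\theta\ \mathbb{E}_{(o_t, a_t) \sim \mathcal{D}_{\text{demo}}}\left[\log \pi_\theta(a_t \mid o_t)\right]$$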
Reward functions
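The reward function tells us which states and actions are better. Formally (standard definition):

$$r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$$

For the driving example: high reward for driving quickly while staying on the road, low reward for a collision.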
Definitions
Andrey Markov
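This slide defines the Markov chain; reconstructing its equations from the standard definition:

$$\mathcal{M} = \{\mathcal{S}, \mathcal{T}\}$$

where $\mathcal{S}$ is the state space and $\mathcal{T}$ is the transition operator, $p(s_{t+1} \mid s_t)$. The Markov property is what makes it a Markov chain: the next state is independent of the past given the current state,

$$p(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_1) = p(s_{t+1} \mid s_t)$$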
Definitions
Andrey Markov
Richard Bellman
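These slides add actions and rewards to get the Markov decision process, and then observations to get the partially observed MDP; reconstructing the standard definitions:

$$\text{MDP:}\quad \mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{T}, r\}, \qquad p(s_{t+1} \mid s_t, a_t), \qquad r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$$

$$\text{POMDP:}\quad \mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{E}, r\}, \qquad \mathcal{E}:\ \text{emission probability } p(o_t \mid s_t)$$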
The goal of reinforcement learning
(we’ll come back to partially observed later)
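Reconstructing the objective from these slides: the policy and the dynamics together induce a distribution over trajectories $\tau = (s_1, a_1, \ldots, s_T, a_T)$,

$$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

and the goal of reinforcement learning is to find the policy parameters that maximize expected total reward:

$$\theta^\star = \arg\max_\theta\ \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$$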
Finite horizon case: state-action marginal
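By linearity of expectation, the finite-horizon objective can be written one time step at a time, as expectations under the state-action marginal $p_\theta(s_t, a_t)$:

$$\theta^\star = \arg\max_\theta\ \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\left[r(s_t, a_t)\right]$$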
Infinite horizon case: stationary distribution
stationary = the same before and after transition
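Writing $\mu = p_\theta(s, a)$ for the stationary distribution, "the same before and after transition" means

$$\mu = \mathcal{T}\mu$$

so $\mu$ is an eigenvector of the transition operator $\mathcal{T}$ with eigenvalue 1 (one exists under mild ergodicity conditions). As $T \to \infty$, the averaged objective is dominated by this stationary distribution: $\mathbb{E}_{(s,a) \sim \mu}[r(s,a)]$.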
Expectations and stochastic systems
In RL, we almost always care about expectations, in both the infinite horizon and finite horizon cases. This matters because the reward itself may be discontinuous (e.g., +1 for staying on the road, −1 for falling off), yet its expectation under a stochastic policy is smooth in the policy parameters, which is what makes gradient-based optimization possible.
Algorithms
The anatomy of a reinforcement learning algorithm
generate samples (i.e. run the policy) → fit a model / estimate the return → improve the policy
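A runnable toy rendering of this three-part loop, with every piece invented for illustration (none of it is course code): a one-step "environment" whose reward peaks at action = 3.0.

```python
import random

def generate_samples(policy, n=20):
    """Generate samples (i.e. run the policy): draw actions, observe rewards."""
    samples = []
    for _ in range(n):
        a = policy["mean"] + random.gauss(0.0, 1.0)
        r = -(a - 3.0) ** 2  # reward function, unknown to the agent
        samples.append((a, r))
    return samples

def estimate_return(samples):
    """Fit a model / estimate the return: here, just the average reward."""
    return sum(r for _, r in samples) / len(samples)

def improve_policy(policy, samples):
    """Improve the policy: shift the mean toward better-than-average actions."""
    baseline = estimate_return(samples)
    good = [a for a, r in samples if r > baseline]
    if good:
        policy["mean"] += 0.5 * (sum(good) / len(good) - policy["mean"])
    return policy

policy = {"mean": 0.0}
for _ in range(100):
    policy = improve_policy(policy, generate_samples(policy))
print(policy["mean"])  # should end up near 3.0
```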
A simple example
generate samples (i.e. run the policy) → fit a model / estimate the return → improve the policy
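One plausible instantiation of this example (a hedged reading of the slide, not a verbatim reconstruction): fit a dynamics model to the sampled transitions by least squares,

$$f \leftarrow \arg\min_f \sum_i \left\|f(s_i, a_i) - s'_i\right\|^2$$

then improve the policy by backpropagating $\sum_t r(s_t, a_t)$ through $f$, with $a_t = \pi_\theta(s_t)$ and $s_{t+1} = f(s_t, a_t)$.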
Another example: RL by backprop
generate samples (i.e. run the policy) → fit a model / estimate the return → improve the policy
Which parts are expensive?
generate samples (i.e. run the policy) → fit a model / estimate the return → improve the policy
Generating samples: on a real robot/car/power grid/whatever, 1x real time, until we invent time travel; in the MuJoCo simulator, up to 10,000x real time.
Fitting a model / estimating the return and improving the policy: trivial and fast for some methods, expensive for others.
Review
generate samples (i.e. run the policy) → fit a model / estimate return → improve the policy
• Definitions
  • Markov chain
  • Markov decision process
• RL objective
  • Expected reward
  • How to evaluate expected reward?
• Structure of RL algorithms
  • Sample generation
  • Fitting a model / estimating the return
  • Policy improvement
Break
How do we deal with all these expectations?
what if we knew this part? (i.e., the expected reward-to-go from a given state and action)
Definition: Q-function
Definition: value function
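Reconstructing the two definitions from this slide:

$$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi}\left[r(s_{t'}, a_{t'}) \mid s_t, a_t\right]$$

the total expected reward from taking $a_t$ in $s_t$ and then following $\pi$; and

$$V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)}\left[Q^\pi(s_t, a_t)\right]$$

the total expected reward from $s_t$ under $\pi$, i.e., the Q-function averaged over actions.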
Using Q-functions and value functions
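The two ideas on this slide, in brief:
• Idea 1: if we have a policy $\pi$ and we know $Q^\pi$, we can improve the policy by setting $\pi'(a \mid s) = 1$ for $a = \arg\max_a Q^\pi(s, a)$ (and 0 otherwise); the new policy is at least as good as the old one.
• Idea 2: compute a gradient to increase the probability of good actions: if $Q^\pi(s, a) > V^\pi(s)$, then $a$ is better than average, so modify $\pi$ to make it more likely.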
The anatomy of a reinforcement learning algorithm
generate samples (i.e. run the policy) → fit a model / estimate the return → improve the policy
(the "fit a model / estimate the return" step often uses Q-functions or value functions)
Types of RL algorithms
• Policy gradients: directly differentiate the above objective
• Value-based: estimate the value function or Q-function of the optimal policy (no explicit policy)
• Actor-critic: estimate the value function or Q-function of the current policy, use it to improve the policy
• Model-based RL: estimate the transition model, and then…
  • Use it for planning (no explicit policy)
  • Use it to improve a policy
  • Something else
Model-based RL algorithms
generate samples (i.e. run the policy) → fit a model / estimate the return → improve the policy
Model-based RL algorithms
Three options for the “improve the policy” step:
1. Just use the model to plan (no policy)
  • Trajectory optimization/optimal control (primarily in continuous spaces) – essentially backpropagation to optimize over actions
  • Discrete planning in discrete action spaces – e.g., Monte Carlo tree search
2. Backpropagate gradients into the policy
  • Requires some tricks to make it work
3. Use the model to learn a value function (a sketch follows this list)
  • Dynamic programming
  • Generate simulated experience for a model-free learner
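A hedged sketch of option 3's dynamic-programming route: tabular value iteration on a tiny invented MDP (3 states, 2 actions; the transition and reward tables are made up for illustration).

```python
import numpy as np

# P[a, s, s'] = transition probabilities, R[s, a] = rewards (both invented)
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],  # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])  # state 2 pays off
gamma = 0.9

V = np.zeros(3)
for _ in range(200):
    # Bellman optimality backup: Q[s,a] = R[s,a] + gamma * sum_s' P[a,s,s'] V[s']
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V = Q.max(axis=1)

print(V, Q.argmax(axis=1))  # optimal values and the greedy policy
```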
Value function based algorithms
generate samples (i.e. run the policy) → fit a model / estimate the return → improve the policy
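In this family, the "fit a model / estimate the return" box fits $V(s)$ or $Q(s, a)$, and the "improve the policy" box just sets $\pi(s) = \arg\max_a Q(s, a)$. A minimal sketch of the model-free tabular version (Q-learning on an invented chain environment; not course code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented environment: a 5-state chain; action 1 moves right, action 0 left,
# with a small random "slip". Reaching state 4 gives reward 1 and resets.
def step(s, a):
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    if rng.random() < 0.1:
        s2 = int(rng.integers(5))  # slip to a random state
    return s2, (1.0 if s2 == 4 else 0.0)

Q = np.zeros((5, 2))
alpha, gamma, eps = 0.1, 0.9, 0.1
s = 0
for _ in range(20000):
    # epsilon-greedy action selection from the current Q estimate
    a = int(rng.integers(2)) if rng.random() < eps else int(Q[s].argmax())
    s2, r = step(s, a)
    # fixed-point (Bellman) update, not gradient descent on a fixed objective
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    s = 0 if s2 == 4 else s2  # episodic reset at the goal

print(Q.argmax(axis=1))  # greedy policy: should mostly prefer action 1
```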
Direct policy gradients
generate samples (i.e. run the policy) → fit a model / estimate the return → improve the policy
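Here "estimate the return" amounts to summing the sampled rewards, and "improve the policy" is a gradient step on the RL objective, $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$. A minimal REINFORCE-style sketch on an invented two-armed bandit (the estimator itself is covered in detail later in the course):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented two-armed bandit: arm 1 pays ~1.0, arm 0 pays ~0.0.
theta = np.zeros(2)  # parameters of a softmax policy over the two arms
for _ in range(3000):
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    a = int(rng.choice(2, p=probs))              # run the policy
    r = rng.normal(1.0 if a == 1 else 0.0, 0.1)  # observed reward
    grad_logp = -probs.copy()
    grad_logp[a] += 1.0                          # grad of log softmax prob
    theta += 0.05 * r * grad_logp                # step along r * grad log pi(a)
print(probs)  # probability mass should concentrate on arm 1
```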
Actor-critic: value functions + policy gradients
generate samples (i.e. run the policy) → fit a model / estimate the return → improve the policy
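Actor-critic combines the two previous families: fit the value function or Q-function of the current policy (the critic), then take a policy gradient step that weights each action by how much better it is than average, e.g. via the advantage:

$$\nabla_\theta J(\theta) \approx \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A^\pi(s, a)\right], \qquad A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$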
Tradeoffs
Why so many RL algorithms?
• Different tradeoffs
  • Sample efficiency
  • Stability & ease of use
• Different assumptions
  • Stochastic or deterministic?
  • Continuous or discrete?
  • Episodic or infinite horizon?
• Different things are easy or hard in different settings
  • Easier to represent the policy?
  • Easier to represent the model?
generate samples (i.e. run the policy) → fit a model / estimate return → improve the policy
Comparison: sample efficiency
• Sample efficiency = how many samples do we need to get a good policy?
• Most important question: is the algorithm off-policy?
  • Off-policy: able to improve the policy without generating new samples from that policy
  • On-policy: each time the policy is changed, even a little bit, we need to generate new samples
generate samples (i.e. run the policy) → fit a model / estimate return → improve the policy (for on-policy policy gradients, the improvement may be just one gradient step before new samples are needed)
Comparison: sample efficiency
The sample-efficiency spectrum, from more efficient (fewer samples) to less efficient (more samples):
model-based shallow RL → model-based deep RL → off-policy Q-function learning → actor-critic style methods → on-policy policy gradient algorithms → evolutionary or gradient-free algorithms
Off-policy methods sit at the efficient end of this spectrum, on-policy methods toward the less efficient end.
Why would we use a less efficient algorithm? Wall clock time is not the same as efficiency!
Comparison: stability and ease of use
Why is any of this even a question???
• Does it converge?
• And if it converges, to what?
• And does it converge every time?
• Supervised learning: almost always gradient descent
• Reinforcement learning: often not gradient descent
  • Q-learning: fixed point iteration
  • Model-based RL: the model is not optimized for expected reward
  • Policy gradient: is gradient descent, but also often the least efficient!
Comparison: stability and ease of use
• Value function fitting
  • At best, minimizes error of fit (“Bellman error”), which is not the same as expected reward
  • At worst, doesn’t optimize anything: many popular deep RL value fitting algorithms are not guaranteed to converge to anything in the nonlinear case
• Model-based RL
  • The model minimizes error of fit, and this will converge
  • But there is no guarantee that a better model means a better policy
• Policy gradient
  • The only one that actually performs gradient descent (ascent) on the true objective
Comparison: assumptions
• Common assumption #1: full observability
  • Generally assumed by value function fitting methods
  • Can be mitigated by adding recurrence
• Common assumption #2: episodic learning
  • Often assumed by pure policy gradient methods
  • Assumed by some model-based RL methods
• Common assumption #3: continuity or smoothness
  • Assumed by some continuous value function learning methods
  • Often assumed by some model-based RL methods
Examples of specific algorithms
• Value function fitting methods
  • Q-learning, DQN
  • Temporal difference learning
  • Fitted value iteration
• Policy gradient methods
  • REINFORCE
  • Natural policy gradient
  • Trust region policy optimization
• Actor-critic algorithms
  • Asynchronous advantage actor-critic (A3C)
  • Soft actor-critic (SAC)
• Model-based RL algorithms
  • Dyna
  • Guided policy search
We’ll learn about most of these in the next few weeks!
Example 1: Atari games with Q-functions
• Playing Atari with deep reinforcement learning, Mnih et al. ‘13
• Q-learning with convolutional neural networks
Example 2: robots and model-based RL
• End-to-end training of deep visuomotor policies, Levine*, Finn* ’16
• Guided policy search (model-based RL) for image-based robotic manipulation
Example 3: walking with policy gradients
• High-dimensional continuous control with generalized advantage estimation, Schulman et al. ‘16
• Trust region policy optimization with value function approximation
Example 4: robotic grasping with Q-functions
• QT-Opt, Kalashnikov et al. ‘18
• Q-learning from images for real-world robotic grasping