![Page 1: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/1.jpg)
Supervised Learning of Behaviors
CS 285: Deep Reinforcement Learning, Decision Making, and Control
Sergey Levine
![Page 2: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/2.jpg)
Class Notes
1. Homework 1 is out this evening
2. Remember to start forming final project groups
• Final project assignment document is now out!
• Proposal due Sep 25
![Page 3: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/3.jpg)
Today’s Lecture
1. Definition of sequential decision problems
2. Imitation learning: supervised learning for decision making
a. Does direct imitation work?
b. How can we make it work more often?
3. A little bit of theory
4. Case studies of recent work in (deep) imitation learning
• Goals:
• Understand definitions & notation
• Understand basic imitation learning algorithms
• Understand tools for theoretical analysis
![Page 4: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/4.jpg)
1. run away
2. ignore
3. pet
Terminology & notation
![Page 5: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/5.jpg)
![Page 6: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/6.jpg)
Aside: notation
Richard Bellman Lev Pontryagin
управление (Russian: “control”)
![Page 7: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/7.jpg)
Images: Bojarski et al. ’16, NVIDIA
training data
supervised learning
Imitation Learning
behavioral cloning
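A minimal sketch of behavioral cloning as plain supervised learning, assuming a linear policy fitted by least squares. The names `behavioral_cloning` and `policy`, and the synthetic "expert" data, are hypothetical and only for illustration; the systems cited on these slides train deep networks on camera images.

```python
import numpy as np

def behavioral_cloning(observations, expert_actions):
    """Fit a linear policy a = [o, 1] @ W on expert (observation, action) pairs."""
    X = np.hstack([observations, np.ones((len(observations), 1))])  # add bias column
    W, *_ = np.linalg.lstsq(X, expert_actions, rcond=None)
    return W

def policy(W, observation):
    """Action predicted by the cloned policy for a single observation."""
    return np.append(observation, 1.0) @ W

# Hypothetical training data: an "expert" whose action is linear in the observation.
rng = np.random.default_rng(0)
obs = rng.normal(size=(200, 3))
acts = obs @ np.array([[0.5], [-1.0], [2.0]]) + 0.1

W = behavioral_cloning(obs, acts)
print(policy(W, np.array([1.0, 0.0, 0.0])))
```

The training set contains only states the expert visited, which is what the following slides probe.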
![Page 8: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/8.jpg)
Does it work? No!
![Page 9: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/9.jpg)
Does it work? Yes!
Video: Bojarski et al. ’16, NVIDIA
![Page 10: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/10.jpg)
Why did that work?
Bojarski et al. ’16, NVIDIA
![Page 11: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/11.jpg)
Can we make it work more often?
cost
stability (more on this later)
![Page 12: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/12.jpg)
Can we make it work more often?
![Page 13: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/13.jpg)
Can we make it work more often?
DAgger: Dataset Aggregation
Ross et al. ’11
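A rough sketch of the DAgger loop from Ross et al.: train on expert data, run the learned policy, have the expert label the states the policy visited, aggregate, and retrain. Everything concrete here is an assumption for illustration: the 1-D dynamics `s' = s + a`, the `expert_action` function standing in for a human labeler, and the linear policy class.

```python
import numpy as np

def expert_action(s):
    # Hypothetical expert: push the state toward zero.
    return -np.tanh(s)

def rollout(policy, s0, horizon):
    """Return the states visited when `policy` controls s' = s + a."""
    states, s = [], s0
    for _ in range(horizon):
        states.append(s)
        s = s + policy(s)
    return np.array(states)

def fit_linear_policy(states, actions):
    """Least-squares fit of a = w * s + b."""
    X = np.stack([states, np.ones_like(states)], axis=1)
    (w, b), *_ = np.linalg.lstsq(X, actions, rcond=None)
    return lambda s: w * s + b

def dagger(n_iters=5, horizon=20, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initial dataset from expert demonstrations.
    states = rollout(expert_action, rng.uniform(-3, 3), horizon)
    actions = expert_action(states)
    policy = fit_linear_policy(states, actions)
    for _ in range(n_iters):
        # Step 2: run the current policy to collect the states it visits.
        visited = rollout(policy, rng.uniform(-3, 3), horizon)
        # Step 3: ask the expert to label those states.
        labels = expert_action(visited)
        # Step 4: aggregate the data and retrain.
        states = np.concatenate([states, visited])
        actions = np.concatenate([actions, labels])
        policy = fit_linear_policy(states, actions)
    return policy, states, actions
```

The key difference from behavioral cloning is step 2: the states come from the learner's own distribution, while the labels still come from the expert.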
![Page 14: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/14.jpg)
DAgger Example
Ross et al. ’11
![Page 15: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/15.jpg)
What’s the problem?
Ross et al. ’11
![Page 16: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/16.jpg)
Can we make it work without more data?
• DAgger addresses the problem of distributional “drift”
• What if our model is so good that it doesn’t drift?
• Need to mimic expert behavior very accurately
• But don’t overfit!
![Page 17: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/17.jpg)
Why might we fail to fit the expert?
1. Non-Markovian behavior
2. Multimodal behavior
behavior depends only on current observation
If we see the same thing twice, we do the same thing twice, regardless of what happened before
Often very unnatural for human demonstrators
behavior depends on all past observations
![Page 18: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/18.jpg)
How can we use the whole history?
variable number of frames, too many weights
![Page 19: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/19.jpg)
How can we use the whole history?
[Diagram: an RNN state is carried from one time step to the next; the weights are shared across time steps]
Typically, LSTM cells work better here
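A small sketch of a history-dependent policy with shared weights. It deliberately uses a plain tanh recurrent cell rather than an LSTM to stay short; the parameter names and sizes are hypothetical.

```python
import numpy as np

def init_params(obs_dim, hidden_dim, act_dim, rng):
    scale = 0.5
    return {
        "W_in": scale * rng.normal(size=(obs_dim, hidden_dim)),
        "W_rec": scale * rng.normal(size=(hidden_dim, hidden_dim)),
        "W_out": scale * rng.normal(size=(hidden_dim, act_dim)),
    }

def recurrent_policy(params, observations):
    """Return one action per time step; the same weights are reused at every step."""
    h = np.zeros(params["W_rec"].shape[0])  # RNN state, summarizes the history so far
    actions = []
    for o in observations:
        h = np.tanh(o @ params["W_in"] + h @ params["W_rec"])
        actions.append(h @ params["W_out"])
    return np.array(actions)

rng = np.random.default_rng(0)
params = init_params(obs_dim=3, hidden_dim=8, act_dim=2, rng=rng)
same_obs = np.ones(3)
actions_a = recurrent_policy(params, np.stack([np.zeros(3), same_obs]))
actions_b = recurrent_policy(params, np.stack([2 * np.ones(3), same_obs]))
```

Both sequences end with the same observation, but the final actions differ because the RNN state carries what came before, and the number of weights does not grow with the length of the history.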
![Page 20: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/20.jpg)
Aside: why might this work poorly?
“Causal confusion”; see de Haan et al., “Causal Confusion in Imitation Learning”
Question 1: Does including history exacerbate causal confusion?
Question 2: Can DAgger mitigate causal confusion?
![Page 21: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/21.jpg)
Why might we fail to fit the expert?
1. Non-Markovian behavior
2. Multimodal behavior
   1. Output mixture of Gaussians
   2. Latent variable models
   3. Autoregressive discretization
![Page 22: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/22.jpg)
Why might we fail to fit the expert?
1. Output mixture of Gaussians
2. Latent variable models
3. Autoregressive discretization
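A hedged sketch of the first option: the training loss for a mixture-of-Gaussians output over a scalar action. In practice the mixture weights, means, and standard deviations would be produced by the network for each observation; here they are passed in directly, and the function name `mog_nll` is hypothetical.

```python
import numpy as np

def mog_nll(action, logits, means, log_stds):
    """Negative log-likelihood of a scalar action under a Gaussian mixture."""
    log_weights = logits - np.logaddexp.reduce(logits)  # log-softmax over components
    log_components = -0.5 * (
        (action - means) ** 2 / np.exp(2 * log_stds)
        + 2 * log_stds
        + np.log(2 * np.pi)
    )
    # log-sum-exp over components gives the mixture log-density
    return -np.logaddexp.reduce(log_weights + log_components)

# Two equally weighted modes, e.g. "steer left" and "steer right".
logits = np.array([0.0, 0.0])
means = np.array([-1.0, 1.0])
log_stds = np.log(np.array([0.1, 0.1]))
print(mog_nll(-1.0, logits, means, log_stds), mog_nll(0.0, logits, means, log_stds))
```

With two modes the model assigns high likelihood to either expert choice and low likelihood to their average, which a single Gaussian cannot do.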
![Page 23: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/23.jpg)
Why might we fail to fit the expert?
1. Output mixture of Gaussians
2. Latent variable models
3. Autoregressive discretization
Look up some of these:
• Conditional variational autoencoder
• Normalizing flow/realNVP
• Stein variational gradient descent
![Page 24: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/24.jpg)
Why might we fail to fit the expert?
1. Output mixture of Gaussians
2. Latent variable models
3. Autoregressive discretization
[Diagram: the network outputs a (discretized) distribution over dimension 1 only; the dim 1 value is drawn by discrete sampling and fed back in, then the dim 2 value is drawn by discrete sampling]
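A toy sketch of autoregressive discretization: each action dimension is split into bins and sampled one at a time, with each dimension conditioned on the values already drawn. The `toy_logits` function is a hypothetical stand-in for a network that would also condition on the observation.

```python
import numpy as np

BIN_CENTERS = np.linspace(-1.0, 1.0, 5)  # discretization of each action dimension

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def toy_logits(previous_bins):
    """Logits over bins for the next dimension, given bins chosen so far."""
    if len(previous_bins) == 0:
        return np.array([5.0, 0.0, 0.0, 0.0, 5.0])  # dimension 1 is bimodal
    logits = np.zeros(len(BIN_CENTERS))
    logits[previous_bins[-1]] = 50.0  # later dimensions follow the previous one
    return logits

def sample_action(logits_fn, n_dims, rng):
    bins = []
    for _ in range(n_dims):
        probs = softmax(logits_fn(bins))                          # distribution over one dimension only
        bins.append(int(rng.choice(len(BIN_CENTERS), p=probs)))   # discrete sampling
    return BIN_CENTERS[bins]

rng = np.random.default_rng(0)
samples = np.array([sample_action(toy_logits, 2, rng) for _ in range(20)])
```

Each step only needs a categorical distribution over one dimension, so the number of outputs grows linearly with the action dimension instead of exponentially.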
![Page 25: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/25.jpg)
Imitation learning: recap
• Often (but not always) insufficient by itself
• Distribution mismatch problem
• Sometimes works well
• Hacks (e.g. left/right images)
• Samples from a stable trajectory distribution
• Add more on-policy data, e.g. using DAgger
• Better models that fit more accurately
training data
supervised learning
![Page 26: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/26.jpg)
Break
![Page 27: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/27.jpg)
Case study 1: trail following as classification
![Page 28: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/28.jpg)
![Page 29: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/29.jpg)
Imitation learning: what’s the problem?
• Humans need to provide data, which is typically finite
• Deep learning works best when data is plentiful
• Humans are not good at providing some kinds of actions
• Humans can learn autonomously; can our machines do the same?
• Unlimited data from own experience
• Continuous self-improvement
![Page 30: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/30.jpg)
1. run away
2. ignore
3. pet
Terminology & notation
![Page 31: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/31.jpg)
Aside: notation
Richard Bellman Lev Pontryagin
![Page 32: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/32.jpg)
A cost function for imitation?
training data
supervised learning
Ross et al. ’11
![Page 33: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/33.jpg)
Some analysis
How bad is it?
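The equations on this slide did not survive extraction. The following is a reconstruction of the standard worst-case ("tightrope") argument, not a copy of the slide. It assumes a 0/1 imitation cost, a per-step mistake probability of at most ε on training states, and no recovery after the first mistake.

```latex
% Assumptions (reconstructed, not copied from the slide):
%   c(s, a) = 0 if a = \pi^\star(s), and 1 otherwise
%   \pi_\theta(a \neq \pi^\star(s) \mid s) \le \epsilon  for all s \in \mathcal{D}_{\text{train}}
%   after the first mistake the learner leaves the training states and pays cost 1 per step
\begin{aligned}
\mathbb{E}\Big[\sum_{t=1}^{T} c(s_t, a_t)\Big]
  &\le \epsilon T + (1-\epsilon)\Big(\epsilon (T-1) + (1-\epsilon)\big(\epsilon (T-2) + \cdots\big)\Big) \\
  &\le \sum_{k=0}^{T-1} \epsilon\,(T-k) \;=\; \epsilon\,\frac{T(T+1)}{2} \;=\; O(\epsilon T^{2})
\end{aligned}
```

There are T terms, each of order εT, so the expected number of mistakes grows quadratically with the horizon.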
![Page 34: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/34.jpg)
More general analysis
For more analysis, see Ross et al. “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning”
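A sketch of the more general argument, again reconstructed under stated assumptions rather than taken from the slide: the learner errs with probability at most ε on states drawn from the training distribution, and the per-step cost is at most 1. The symbol p_mistake is a label for whatever state distribution results after at least one mistake.

```latex
% Assumption: \pi_\theta(a \neq \pi^\star(s) \mid s) \le \epsilon  for s \sim p_{\text{train}}(s)
% c_t(s_t) denotes the expected cost of the learner's action in s_t, with 0 \le c_t \le 1
\begin{aligned}
p_\theta(s_t) &= (1-\epsilon)^t\, p_{\text{train}}(s_t) + \big(1-(1-\epsilon)^t\big)\, p_{\text{mistake}}(s_t) \\
\sum_{s_t} \big|p_\theta(s_t) - p_{\text{train}}(s_t)\big|
  &= \big(1-(1-\epsilon)^t\big) \sum_{s_t} \big|p_{\text{mistake}}(s_t) - p_{\text{train}}(s_t)\big|
  \;\le\; 2\big(1-(1-\epsilon)^t\big) \;\le\; 2\epsilon t \\
\sum_{t=1}^{T} \mathbb{E}_{p_\theta(s_t)}\big[c_t\big]
  &\le \sum_{t=1}^{T} \Big( \mathbb{E}_{p_{\text{train}}(s_t)}\big[c_t\big]
      + \sum_{s_t} \big|p_\theta(s_t) - p_{\text{train}}(s_t)\big| \Big)
  \;\le\; \sum_{t=1}^{T} \big(\epsilon + 2\epsilon t\big) \;=\; O(\epsilon T^{2})
\end{aligned}
```

The step to 2εt uses (1-ε)^t ≥ 1 - εt for ε in [0, 1].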
![Page 35: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/35.jpg)
Another imitation idea
![Page 36: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/36.jpg)
Goal-conditioned behavioral cloning
![Page 37: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/37.jpg)
![Page 38: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/38.jpg)
1. Collect data
2. Train goal-conditioned policy
![Page 39: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/39.jpg)
3. Reach goals
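One way the first two steps can be sketched, assuming the common choice of relabeling each collected trajectory with its own final state as the goal. The slides do not specify the relabeling scheme, so this is an assumption, and `relabel_with_final_state` is a hypothetical helper.

```python
import numpy as np

def relabel_with_final_state(trajectories):
    """Turn (states, actions) trajectories into supervised (state, goal) -> action pairs.

    Each trajectory has T + 1 states and T actions; the last state is used as the goal.
    """
    inputs, targets = [], []
    for states, actions in trajectories:
        goal = states[-1]
        for s, a in zip(states[:-1], actions):
            inputs.append(np.concatenate([s, goal]))  # policy input is (state, goal)
            targets.append(a)
    return np.array(inputs), np.array(targets)

# Hypothetical collected data: random actions in a 2-D point world with s' = s + a.
rng = np.random.default_rng(0)
actions = rng.normal(size=(3, 2))
states = np.vstack([np.zeros(2), np.cumsum(actions, axis=0)])
X, y = relabel_with_final_state([(states, actions)])
```

Any supervised learner can then be fit on `X` and `y`; at test time the goal slot is filled with the state the policy should reach.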
![Page 40: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/40.jpg)
1. run away
2. ignore
3. pet
Terminology & notation
![Page 41: Supervised Learning of Behaviorsrail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-2.pdfSupervised Learning of Behaviors CS 285: Deep Reinforcement Learning, Decision Making, and](https://reader030.fdocuments.in/reader030/viewer/2022040611/5ed3df0a73d3d24575700d2a/html5/thumbnails/41.jpg)
Cost/reward functions in theory and practice