Generative Adversarial Imitation Learning
Transcript of slides: cs236.stanford.edu/assets/slides/cs236_lecture14.pdf
Generative Adversarial
Imitation Learning
Stefano Ermon
Joint work with Jayesh Gupta, Jonathan Ho, Yunzhu Li,
Hongyu Ren, and Jiaming Song
Reinforcement Learning
• Goal: Learn policies
• Input: high-dimensional, raw observations → action
(Figure: gridworld with rewards +5, +1, 0)
Reinforcement Learning
• MDP: Model for (stochastic) sequential decision making problems
• States S
• Actions A
• Cost function (immediate): C: S × A → ℝ
• Transition Probabilities: P(s’|s,a)
• Policy: mapping from states to actions, e.g., (S0 → a1, S1 → a0, S2 → a0)
• Reinforcement learning: minimize total (expected, discounted) cost
minimize E[ Σ_{t=0}^{T−1} c(s_t) ]
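The objective above can be sketched numerically; a minimal illustration (the discount factor γ and the per-step costs below are made up for the example):

```python
def total_discounted_cost(costs, gamma=1.0):
    # sum_{t=0}^{T-1} gamma^t * c(s_t): RL minimizes the expectation of this over trajectories
    return sum((gamma ** t) * c for t, c in enumerate(costs))

# Hypothetical 3-step trajectory with per-step costs c(s_0)=1, c(s_1)=0, c(s_2)=5
total = total_discounted_cost([1.0, 0.0, 5.0], gamma=0.9)
```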
Reinforcement Learning
Optimal
policy π
Reinforcement
Learning (RL)
Cost Function
c(s,a)
Environment
(MDP)
• States S
• Actions A
• Transitions: P(s’|s, a)
Cost
Policy: mapping from
states to actions
E.g., (S0->a1,
S1->a0,
S2->a0)
C: S × A → ℝ
RL needs
cost signal
Imitation
Input: expert behavior generated by πE
Goal: learn cost function (reward) or policy
(Ng and Russell, 2000; Abbeel and Ng, 2004; Syed and Schapire, 2007; Ratliff et al., 2006; Ziebart et al., 2008; Kolter et al., 2008; Finn et al., 2016, etc.)
Behavioral Cloning
• Small errors compound over time (cascading
errors)
• Decisions are purposeful (require planning)
(State, Action), (State, Action), …, (State, Action) → Supervised learning (regression) → Policy
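Behavioral cloning reduces imitation to exactly this supervised step. A minimal sketch, assuming a toy linear expert and least-squares regression (all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(200, 3))          # observed expert states
true_w = np.array([0.5, -1.0, 2.0])         # hypothetical expert's (unknown) linear rule
actions = states @ true_w                   # expert actions paired with those states

# Behavioral cloning: fit the policy to (state, action) pairs by regression
w_hat, *_ = np.linalg.lstsq(states, actions, rcond=None)

def policy(s):
    return s @ w_hat
```

With noise-free data the regression recovers the expert's rule, but nothing in this step penalizes the compounding errors that appear once the learned policy's own mistakes take it off the expert's state distribution.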
Inverse RL
• An approach to imitation
• Learns a cost c such that the expert policy is optimal w.r.t. c (achieves lower expected cost than the alternatives)
Problem setup
Optimal
policy π
Reinforcement
Learning (RL)
Environment
(MDP)
Cost Function
c(s)
Expert’s Trajectories
s0, s1, s2, …
Cost Function
c(s)
Inverse Reinforcement
Learning (IRL)
Expert has small cost; everything else has high cost
(Ziebart et al., 2010; Rust, 1987)
Problem setup
Optimal
policy π
Reinforcement
Learning (RL)
Environment
(MDP)
Cost Function
c(s)
Expert’s Trajectories
s0, s1, s2, …
Cost Function
c(s)
Inverse Reinforcement
Learning (IRL)
?
Convex cost regularizer
≈
(similar w.r.t. ψ)
Combining RL ∘ IRL
Optimal
policy π
Reinforcement
Learning (RL)
Expert’s Trajectories
s0, s1, s2, …
ψ-regularized Inverse
Reinforcement
Learning (IRL)
≈
(similar w.r.t. ψ)
ρ_π = occupancy measure = distribution of state-action pairs encountered when navigating the environment with policy π
ρ_{πE} = expert's occupancy measure
Theorem: ψ-regularized inverse reinforcement learning,
implicitly, seeks a policy whose occupancy measure is close to
the expert’s, as measured by ψ* (convex conjugate of ψ)
Takeaway
Theorem: ψ-regularized inverse reinforcement learning, implicitly, seeks a policy whose occupancy measure is close to the expert’s, as measured by ψ*
• Typical IRL definition: finding a cost function c such that the expert policy is uniquely optimal w.r.t. c
• Alternative view: IRL as a procedure that tries to induce a policy that matches the expert’s occupancy measure (generative model)
Special cases
• If ψ(c) = constant, then RL ∘ IRL exactly matches the expert's occupancy measure
– Not a useful algorithm: in practice, we only have sampled trajectories
• Overfitting: Too much flexibility in choosing
the cost function (and the policy)
(Figure: the set of all cost functions; ψ(c) = constant over all of them)
Towards Apprenticeship learning
• Solution: use features f_{s,a}
• Cost c(s, a) = θ · f_{s,a}
(Figure: within the set of all cost functions, only the subset linear in the features is allowed: ψ(c) = 0 there, ψ(c) = ∞ elsewhere)
Apprenticeship learning
• For that choice of ψ, the RL ∘ IRL_ψ framework gives apprenticeship learning
• Apprenticeship learning: find π performing
better than πE over costs linear in the
features
– Abbeel and Ng (2004)
– Syed and Schapire (2007)
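With costs linear in features, comparing policies only requires their discounted feature expectations, since E[Σ_t γ^t c(s_t, a_t)] = θ · μ(π). A sketch with made-up features:

```python
import numpy as np

def feature_expectations(trajectories, feat, gamma=0.99):
    # mu(pi) = E[ sum_t gamma^t f(s_t, a_t) ], estimated by averaging over rollouts
    return np.mean([sum((gamma ** t) * feat(s, a) for t, (s, a) in enumerate(traj))
                    for traj in trajectories], axis=0)

# Hypothetical one-hot feature on the action
feat = lambda s, a: np.array([1.0, 0.0]) if a == "a0" else np.array([0.0, 1.0])
mu_E = feature_expectations([[("s0", "a0"), ("s1", "a1")]], feat, gamma=0.5)

# For any linear cost c = theta . f, the cost gap between pi and pi_E is
# theta . (mu(pi) - mu_E), so matching feature expectations matches performance
# on every cost in the class.
```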
Apprenticeship learning
• Given: a class of cost functions C
• Goal: find π performing better than πE over a
class of costs
Approximated using
demonstrations
Issues with Apprenticeship learning
• Need to craft features very carefully
– unless the true expert cost function (assuming it
exists) lies in C, there is no guarantee that AL
will recover the expert policy
• RL ∘ IRL_ψ(π_E) is "encoding" the expert behavior as a cost function in C
– it might not be possible to decode it back if C is too simple
(Figure: IRL maps the expert policy to a cost in C; RL maps it back, possibly to a different policy)
Generative Adversarial Imitation
Learning
• Solution: use a more expressive class of cost
functions
(Figure: the set of all cost functions, strictly larger than the subset linear in features)
Generative Adversarial Imitation
Learning
• ψ* = optimal negative log-loss of the binary
classification problem of distinguishing between
state-action pairs of π and πE
D
Policy π
Expert Policy πE
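That binary classification problem is ordinary logistic regression on state-action pairs; a sketch of the log-loss whose optimal value plays the role of ψ* (sign conventions vary across papers):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_logloss(logits_policy, logits_expert):
    # D labels expert pairs 1 and policy pairs 0; this is the loss D minimizes
    d_pi = sigmoid(np.asarray(logits_policy, dtype=float))
    d_e = sigmoid(np.asarray(logits_expert, dtype=float))
    return -(np.mean(np.log(1.0 - d_pi)) + np.mean(np.log(d_e)))
```

An uninformative discriminator (logit 0 everywhere) incurs loss 2·log 2; the better D separates the two occupancy measures, the lower its loss, and the worse the policy's imitation.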
Generative Adversarial Networks
Figure from Goodfellow et al., 2014
GAIL
Sample from expert → differentiable function D → D tries to output 0
Sample from model → differentiable function D → D tries to output 1
Generator G: differentiable policy + black-box simulator (environment)
Ho and Ermon, Generative Adversarial Imitation Learning
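The alternation between D and the generator can be shown end-to-end on a deliberately tiny problem. The sketch below uses a one-state, two-action "bandit" with exact gradients, which is an illustration only; real GAIL uses a neural discriminator, sampled trajectories, and a trust-region policy step:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rho_E = np.array([0.2, 0.8])   # expert's action distribution (made up)
theta = np.zeros(2)            # generator: softmax policy logits
w = np.zeros(2)                # discriminator logit per action

for _ in range(3000):
    pi = softmax(theta)
    d = 1.0 / (1.0 + np.exp(-w))            # D(a) = P(label "expert" | a)
    # Discriminator ascent on: E_expert[log d] + E_policy[log(1 - d)]
    w += 0.1 * (rho_E * (1.0 - d) - pi * d)
    # Policy-gradient step with reward log D(a) (exact, since the "MDP" is one step)
    reward = np.log(1.0 / (1.0 + np.exp(-w)))
    theta += 0.1 * pi * (reward - pi @ reward)

pi = softmax(theta)   # drifts toward the expert's action distribution
```

At the fixed point π = ρ_E the discriminator is reduced to outputting 0.5 everywhere and the policy gradient vanishes, which is the occupancy-measure-matching equilibrium.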
How to optimize the objective
• Previous Apprenticeship learning work:
– Full dynamics model
– Small environment
– Repeated RL
• We propose: gradient descent over policy
parameters (and discriminator)
J. Ho, J. K. Gupta, and S. Ermon. Model-free imitation learning with policy optimization.
ICML 2016.
Properties
• Inherits pros of policy gradient
– Convergence to local minima
– Can be model free
• Inherits cons of policy gradient
– High variance
– Small steps required
• Solution: trust region policy optimization
Results
Input: driving demonstrations (TORCS)
Output: policy (from raw visual inputs)
Li et al., 2017. InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations
![Page 26: Generative Adversarial Imitation Learningcs236.stanford.edu/assets/slides/cs236_lecture14.pdf · 2019-12-06 · •Solution: trust region policy optimization. Results. Results Input:](https://reader034.fdocuments.in/reader034/viewer/2022042919/5f62089585e8ca7d785a16dd/html5/thumbnails/26.jpg)
Experimental results
![Page 27: Generative Adversarial Imitation Learningcs236.stanford.edu/assets/slides/cs236_lecture14.pdf · 2019-12-06 · •Solution: trust region policy optimization. Results. Results Input:](https://reader034.fdocuments.in/reader034/viewer/2022042919/5f62089585e8ca7d785a16dd/html5/thumbnails/27.jpg)
Latent structure in demonstrations
Human model: Environment + Policy → Observed behavior
Latent variables Z
Semantically meaningful latent structure?
![Page 28: Generative Adversarial Imitation Learningcs236.stanford.edu/assets/slides/cs236_lecture14.pdf · 2019-12-06 · •Solution: trust region policy optimization. Results. Results Input:](https://reader034.fdocuments.in/reader034/viewer/2022042919/5f62089585e8ca7d785a16dd/html5/thumbnails/28.jpg)
InfoGAIL
Infer latent structure from observed data
Environment + Policy → Observed behavior
Latent variables Z
Maximize mutual information
(Hou et al.)
![Page 29: Generative Adversarial Imitation Learningcs236.stanford.edu/assets/slides/cs236_lecture14.pdf · 2019-12-06 · •Solution: trust region policy optimization. Results. Results Input:](https://reader034.fdocuments.in/reader034/viewer/2022042919/5f62089585e8ca7d785a16dd/html5/thumbnails/29.jpg)
InfoGAIL
Latent code c, latent variables Z + Environment → Policy → Observed behavior (s, a)
Maximize mutual information between c and (s, a)
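The mutual information term is intractable directly, so InfoGAIL maximizes a variational lower bound using an auxiliary posterior Q(c | s, a); a sketch of that bound (the log-probabilities fed in here are illustrative):

```python
import numpy as np

def mi_lower_bound(log_q, p_c):
    # I(c; s, a) >= E[log Q(c | s, a)] + H(c)   (variational lower bound)
    entropy = -np.sum(p_c * np.log(p_c))         # H(c) under the latent-code prior
    return np.mean(log_q) + entropy

# Uniform prior over 2 codes; a Q that has learned nothing gives log Q = log 0.5
baseline = mi_lower_bound(np.log([0.5, 0.5]), np.array([0.5, 0.5]))
```

A confident, correct Q drives the bound up, so the policy is rewarded for making the latent code recoverable from its behavior.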
![Page 30: Generative Adversarial Imitation Learningcs236.stanford.edu/assets/slides/cs236_lecture14.pdf · 2019-12-06 · •Solution: trust region policy optimization. Results. Results Input:](https://reader034.fdocuments.in/reader034/viewer/2022042919/5f62089585e8ca7d785a16dd/html5/thumbnails/30.jpg)
Synthetic Experiment
(Figure panels: Demonstrations, GAIL, InfoGAIL)
![Page 31: Generative Adversarial Imitation Learningcs236.stanford.edu/assets/slides/cs236_lecture14.pdf · 2019-12-06 · •Solution: trust region policy optimization. Results. Results Input:](https://reader034.fdocuments.in/reader034/viewer/2022042919/5f62089585e8ca7d785a16dd/html5/thumbnails/31.jpg)
InfoGAIL
Environment + Policy model → Trajectories
Latent variables Z
Pass left (z = 0); Pass right (z = 1)
Li et al., 2017. InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations
![Page 32: Generative Adversarial Imitation Learningcs236.stanford.edu/assets/slides/cs236_lecture14.pdf · 2019-12-06 · •Solution: trust region policy optimization. Results. Results Input:](https://reader034.fdocuments.in/reader034/viewer/2022042919/5f62089585e8ca7d785a16dd/html5/thumbnails/32.jpg)
InfoGAIL
Environment + Policy model → Trajectories
Latent variables Z
Turn inside (z = 0); Turn outside (z = 1)
Li et al., 2017. InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations
![Page 33: Generative Adversarial Imitation Learningcs236.stanford.edu/assets/slides/cs236_lecture14.pdf · 2019-12-06 · •Solution: trust region policy optimization. Results. Results Input:](https://reader034.fdocuments.in/reader034/viewer/2022042919/5f62089585e8ca7d785a16dd/html5/thumbnails/33.jpg)
Multi-agent environments
What are the goals of these 4 agents?
![Page 34: Generative Adversarial Imitation Learningcs236.stanford.edu/assets/slides/cs236_lecture14.pdf · 2019-12-06 · •Solution: trust region policy optimization. Results. Results Input:](https://reader034.fdocuments.in/reader034/viewer/2022042919/5f62089585e8ca7d785a16dd/html5/thumbnails/34.jpg)
Problem setup
Cost functions c1(s, a1), …, cN(s, aN) → MA Reinforcement Learning (MARL) → Optimal policies π1, …, πK
Environment (Markov Game)
Example two-player cost matrix:
      R       L
R    0,0    10,10
L   10,10    0,0
![Page 35: Generative Adversarial Imitation Learningcs236.stanford.edu/assets/slides/cs236_lecture14.pdf · 2019-12-06 · •Solution: trust region policy optimization. Results. Results Input:](https://reader034.fdocuments.in/reader034/viewer/2022042919/5f62089585e8ca7d785a16dd/html5/thumbnails/35.jpg)
Problem setup
Cost functions c1(s, a1), …, cN(s, aN) → MA Reinforcement Learning (MARL) → Optimal policies π
Environment (Markov Game)
Expert's trajectories (s0, a0^1, …, a0^N), (s1, a1^1, …, a1^N), … → Multi-Agent Inverse Reinforcement Learning (MAIRL) → Cost functions c1(s, a1), …, cN(s, aN)
≈ (similar w.r.t. ψ)
![Page 36: Generative Adversarial Imitation Learningcs236.stanford.edu/assets/slides/cs236_lecture14.pdf · 2019-12-06 · •Solution: trust region policy optimization. Results. Results Input:](https://reader034.fdocuments.in/reader034/viewer/2022042919/5f62089585e8ca7d785a16dd/html5/thumbnails/36.jpg)
MAGAIL
Sample from expert: (s, a1, a2, …, aN) → differentiable functions D1, D2, …, DN → each Di tries to output 0
Sample from model: (s, a1, a2, …, aN) → differentiable functions D1, D2, …, DN → each Di tries to output 1
Generator G: policies for Agent 1, …, Agent N + black-box simulator
Song, Ren, Sadigh, Ermon, Multi-Agent Generative Adversarial Imitation Learning
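The diagram above amounts to one logistic discriminator per agent over the shared state and that agent's action; a structural sketch (the dimensions and the linear form of each Di are assumptions for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class AgentDiscriminator:
    # D_i scores (s, a_i); expert samples get pushed toward one label, model samples the other
    def __init__(self, dim, rng):
        self.w = rng.normal(scale=0.1, size=dim)

    def __call__(self, s, a_i):
        return sigmoid(np.concatenate([s, a_i]) @ self.w)

rng = np.random.default_rng(0)
n_agents, s_dim, a_dim = 3, 4, 2
Ds = [AgentDiscriminator(s_dim + a_dim, rng) for _ in range(n_agents)]

s = rng.normal(size=s_dim)                        # shared state
joint_action = [rng.normal(size=a_dim) for _ in range(n_agents)]
scores = [D(s, a_i) for D, a_i in zip(Ds, joint_action)]
```

Each agent's policy is then trained against its own Di, exactly as in single-agent GAIL.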
![Page 37: Generative Adversarial Imitation Learningcs236.stanford.edu/assets/slides/cs236_lecture14.pdf · 2019-12-06 · •Solution: trust region policy optimization. Results. Results Input:](https://reader034.fdocuments.in/reader034/viewer/2022042919/5f62089585e8ca7d785a16dd/html5/thumbnails/37.jpg)
Environments
Demonstrations MAGAIL
![Page 38: Generative Adversarial Imitation Learningcs236.stanford.edu/assets/slides/cs236_lecture14.pdf · 2019-12-06 · •Solution: trust region policy optimization. Results. Results Input:](https://reader034.fdocuments.in/reader034/viewer/2022042919/5f62089585e8ca7d785a16dd/html5/thumbnails/38.jpg)
Environments
Demonstrations
MAGAIL
![Page 39: Generative Adversarial Imitation Learningcs236.stanford.edu/assets/slides/cs236_lecture14.pdf · 2019-12-06 · •Solution: trust region policy optimization. Results. Results Input:](https://reader034.fdocuments.in/reader034/viewer/2022042919/5f62089585e8ca7d785a16dd/html5/thumbnails/39.jpg)
Suboptimal demos
Expert vs. MAGAIL
lighter plank + bumps on ground
![Page 40: Generative Adversarial Imitation Learningcs236.stanford.edu/assets/slides/cs236_lecture14.pdf · 2019-12-06 · •Solution: trust region policy optimization. Results. Results Input:](https://reader034.fdocuments.in/reader034/viewer/2022042919/5f62089585e8ca7d785a16dd/html5/thumbnails/40.jpg)
Conclusions
• IRL is a dual of an occupancy measure
matching problem (generative modeling)
• Might need flexible cost functions
– GAN style approach
• Policy gradient approach
– Scales to high dimensional settings
• Towards unsupervised learning of latent
structure from demonstrations