Generative Adversarial Imitation Learning
cs236.stanford.edu/assets/slides/cs236_lecture14.pdf

Transcript of slides: Generative Adversarial Imitation Learning (CS236, Lecture 14)

Page 1:

Generative Adversarial Imitation Learning

Stefano Ermon

Joint work with Jayesh Gupta, Jonathan Ho, Yunzhu Li, Hongyu Ren, and Jiaming Song

Page 2:

Reinforcement Learning

• Goal: Learn policies
• High-dimensional, raw observations

[Figure: agent receives raw observations from the environment and outputs actions]

Page 3:

Reinforcement Learning

[Figure: gridworld with cell values +5, +1, 0]

• MDP: Model for (stochastic) sequential decision making problems

• States S

• Actions A

• Cost function (immediate): C: S × A → ℝ

• Transition Probabilities: P(s’|s,a)

• Policy: mapping from states to actions, e.g., (S0 → a1, S1 → a0, S2 → a0)

• Reinforcement learning: minimize total (expected, discounted) cost

minimize  E[ Σ_{t=0}^{T−1} c(s_t) ]

Page 4:

Reinforcement Learning

[Diagram: a cost function c(s,a) (e.g., +5, +1, 0) and the Environment (MDP: states S, actions A, transitions P(s'|s,a)) feed into Reinforcement Learning (RL), which outputs an optimal policy π]

• Policy: mapping from states to actions, e.g., (S0 → a1, S1 → a0, S2 → a0)
• Cost function C: S × A → ℝ
• RL needs a cost signal

Page 5:

Imitation

Input: expert behavior generated by πE

Goal: learn a cost function (reward) or policy

(Ng and Russell, 2000), (Abbeel and Ng, 2004; Syed and Schapire, 2007), (Ratliff et al., 2006), (Ziebart et al., 2008), (Kolter et al., 2008), (Finn et al., 2016), etc.

Page 6:

Behavioral Cloning

• Small errors compound over time (cascading errors)
• Decisions are purposeful (require planning)

[Diagram: (State, Action), (State, Action), (State, Action) → Supervised Learning (regression) → Policy]
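As a concrete illustration of behavioral cloning as supervised regression, here is a minimal sketch on synthetic data. The linear "expert", the generated data, and the least-squares fit are illustrative assumptions, not the setup used in the slides' experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic expert demonstrations: (state, action) pairs.
# Here the "expert" is a hidden linear controller plus noise, for illustration.
state_dim, action_dim, n_samples = 4, 2, 5000
W_expert = rng.normal(size=(state_dim, action_dim))
states = rng.normal(size=(n_samples, state_dim))
actions = states @ W_expert + 0.05 * rng.normal(size=(n_samples, action_dim))

# Behavioral cloning: fit a policy by regression from states to actions.
W_bc, *_ = np.linalg.lstsq(states, actions, rcond=None)

def policy(s):
    """Cloned policy: maps a state to an action."""
    return s @ W_bc

# At test time, small prediction errors push the agent into states unlike the
# demonstrations, and those errors compound over time (cascading errors).
print("training MSE:", float(np.mean((states @ W_bc - actions) ** 2)))
```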

Page 7:

Inverse RL

• An approach to imitation

• Learns a cost c such that the expert policy is optimal, i.e., incurs no more cost than any other policy
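One standard way to write this condition (the slide's own equation did not survive extraction, so this is a reconstruction consistent with the later slides' "expert has small cost, everything else has high cost"):

```latex
\mathbb{E}_{\pi_E}\!\left[\sum_{t} c(s_t, a_t)\right]
\;\le\;
\min_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t} c(s_t, a_t)\right]
```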

Page 8:

Problem setup

[Diagram: Expert's trajectories s0, s1, s2, … → Inverse Reinforcement Learning (IRL) → Cost function c(s); Cost function c(s) + Environment (MDP) → Reinforcement Learning (RL) → Optimal policy π]

The expert has small cost; everything else has high cost. (Ziebart et al., 2010; Rust, 1987)

Page 9:

Problem setup

[Diagram: Expert's trajectories s0, s1, s2, … → Inverse Reinforcement Learning (IRL) → Cost function c(s); Cost function c(s) + Environment (MDP) → Reinforcement Learning (RL) → Optimal policy π (similar w.r.t. ψ?)]

The cost is now restricted by a convex cost regularizer ψ.
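Written out (following Ho and Ermon, 2016; the slide's equations did not survive extraction), the ψ-regularized IRL step and the RL step are:

```latex
\mathrm{IRL}_\psi(\pi_E) \;=\; \arg\max_{c}\; -\psi(c)
  \;+\; \Big(\min_{\pi}\; -H(\pi) + \mathbb{E}_{\pi}[c(s,a)]\Big)
  \;-\; \mathbb{E}_{\pi_E}[c(s,a)],
\qquad
\mathrm{RL}(c) \;=\; \arg\min_{\pi}\; -H(\pi) + \mathbb{E}_{\pi}[c(s,a)]
```

where H(π) denotes the policy's causal entropy.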

Page 10:

Combining RL ∘ IRL

[Diagram: Expert's trajectories s0, s1, s2, … → ψ-regularized Inverse Reinforcement Learning (IRL) → Reinforcement Learning (RL) → Optimal policy π (similar w.r.t. ψ)]

ρπ = occupancy measure = distribution of state-action pairs encountered when navigating the environment with the policy

ρπE = expert's occupancy measure

Theorem: ψ-regularized inverse reinforcement learning, implicitly, seeks a policy whose occupancy measure is close to the expert's, as measured by ψ* (the convex conjugate of ψ).
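In symbols (as stated in Ho and Ermon, 2016):

```latex
\mathrm{RL} \circ \mathrm{IRL}_\psi(\pi_E)
  \;=\; \arg\min_{\pi}\; -H(\pi) \;+\; \psi^{*}\!\left(\rho_\pi - \rho_{\pi_E}\right)
```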

Page 11:

Takeaway

Theorem: ψ-regularized inverse reinforcement learning, implicitly, seeks a policy whose occupancy measure is close to the expert’s, as measured by ψ*

• Typical IRL definition: finding a cost function c such that the expert policy is uniquely optimal w.r.t. c

• Alternative view: IRL as a procedure that tries to induce a policy that matches the expert’s occupancy measure (generative model)

Page 12:

Special cases

• If ψ(c) = constant, then RL ∘ IRLψ exactly matches the expert's occupancy measure (ρπ = ρπE)
  – Not a useful algorithm. In practice, we only have sampled trajectories

• Overfitting: Too much flexibility in choosing the cost function (and the policy)

[Figure: the set of all cost functions; ψ(c) = constant places no restriction on it]

Page 13:

Towards Apprenticeship learning

• Solution: use features f_{s,a}
• Cost c(s,a) = θ · f_{s,a}

[Figure: within the set of all cost functions, only the "simple" cost functions that are linear in the features are allowed: ψ(c) = 0 on that set, ψ(c) = ∞ elsewhere]

Page 14:

Apprenticeship learning

• For that choice of ψ, the RL ∘ IRLψ framework gives apprenticeship learning

• Apprenticeship learning: find π performing better than πE over costs linear in the features
  – Abbeel and Ng (2004)
  – Syed and Schapire (2007)

Page 15:

Apprenticeship learning

• Given a class of cost functions C

• Goal: find π performing better than πE over every cost in the class

  (the expert's expected cost is approximated using demonstrations)
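For costs linear in features, c(s,a) = θ · f_{s,a}, comparing a policy against the expert reduces to comparing feature expectations, which can be estimated from trajectories. A minimal sketch, with an invented feature map and synthetic trajectories standing in for real demonstrations:

```python
import numpy as np

feat_dim = 8

def features(state, action):
    """Illustrative feature map f_{s,a}; a real task would use domain features."""
    rng_local = np.random.default_rng(hash((state, action)) % (2**32))
    return rng_local.normal(size=feat_dim)

def feature_expectations(trajectories):
    """Estimate mu(pi) = E[ sum_t f_{s_t, a_t} ] by averaging over trajectories."""
    mu = np.zeros(feat_dim)
    for traj in trajectories:
        for s, a in traj:
            mu += features(s, a)
    return mu / len(trajectories)

# Synthetic expert and learner trajectories: lists of (state, action) pairs.
expert_trajs = [[(t, t % 2) for t in range(20)] for _ in range(50)]
learner_trajs = [[(t, (t + 1) % 2) for t in range(20)] for _ in range(50)]

mu_E = feature_expectations(expert_trajs)
mu_pi = feature_expectations(learner_trajs)

# For any cost c(s,a) = theta . f_{s,a} with ||theta|| <= 1, the worst-case
# cost gap E_pi[c] - E_piE[c] is bounded by the distance between feature expectations.
print("worst-case cost gap bound:", np.linalg.norm(mu_pi - mu_E))
```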

Page 16:

Issues with Apprenticeship learning

• Need to craft features very carefully
  – unless the true expert cost function (assuming it exists) lies in C, there is no guarantee that AL will recover the expert policy

• RL ∘ IRLψ(πE) is "encoding" the expert behavior as a cost function in C
  – it might not be possible to decode it back if C is too simple

[Figure: within the set of all cost functions, IRL encodes πE as a cost in C, and RL decodes that cost back into a policy]

Page 17:

Generative Adversarial Imitation Learning

• Solution: use a more expressive class of cost functions

[Figure: the expressive class covers far more of the set of all cost functions than the costs linear in features]

Page 18:

Generative Adversarial Imitation Learning

• ψ* = optimal negative log-loss of the binary classification problem of distinguishing between state-action pairs of π and πE

[Diagram: a discriminator D receives state-action pairs from the policy π and from the expert policy πE]
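With this choice of ψ, the induced saddle-point problem is the GAIL objective of Ho and Ermon (2016), restated here since the slide's formula is missing from the transcript:

```latex
\min_{\pi}\;\max_{D \in (0,1)^{S \times A}}\;
  \mathbb{E}_{\pi}\!\left[\log D(s,a)\right]
  \;+\; \mathbb{E}_{\pi_E}\!\left[\log\!\left(1 - D(s,a)\right)\right]
  \;-\; \lambda H(\pi)
```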

Page 19:

Generative Adversarial Networks

Figure from Goodfellow et al., 2014

Page 20:

GAIL

[Diagram: a sample from the expert goes to a differentiable function D, which tries to output 0; a sample from the model (generator G = differentiable policy + black-box simulator/environment) goes to D, which tries to output 1]

Ho and Ermon, Generative Adversarial Imitation Learning
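A minimal sketch of the resulting training loop, alternating a discriminator update with a policy update. Everything here is illustrative: synthetic "rollouts" replace a real simulator, a simple REINFORCE-style step replaces the trust-region update used in practice, and log D(s,a) serves as the surrogate cost.

```python
# Illustrative GAIL-style loop on synthetic data (not the authors' code).
import torch
import torch.nn as nn

torch.manual_seed(0)
obs_dim, act_dim = 4, 2

policy_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, act_dim))
discriminator = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.Tanh(), nn.Linear(32, 1))
opt_pi = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

# Stand-in for expert state-action pairs (demonstrations).
expert_sa = torch.randn(256, obs_dim + act_dim)

for it in range(100):
    # --- "roll out" the current policy (random states, sampled discrete actions) ---
    states = torch.randn(256, obs_dim)
    logits = policy_net(states)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()
    actions_onehot = nn.functional.one_hot(actions, act_dim).float()
    policy_sa = torch.cat([states, actions_onehot], dim=1)

    # --- discriminator step: expert -> label 0, model -> label 1 ---
    d_loss = bce(discriminator(expert_sa), torch.zeros(256, 1)) + \
             bce(discriminator(policy_sa.detach()), torch.ones(256, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- policy step: use log D(s, a) as a surrogate cost (REINFORCE-style) ---
    with torch.no_grad():
        cost = nn.functional.logsigmoid(discriminator(policy_sa)).squeeze(1)
    pi_loss = (dist.log_prob(actions) * cost).mean()
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()
```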

Page 21:

How to optimize the objective

• Previous apprenticeship learning work:
  – Full dynamics model
  – Small environment
  – Repeated RL

• We propose: gradient descent over policy parameters (and discriminator)

J. Ho, J. K. Gupta, and S. Ermon. Model-free imitation learning with policy optimization. ICML 2016.

Page 22:

Properties

• Inherits pros of policy gradient

– Convergence to local minima

– Can be model free

• Inherits cons of policy gradient

– High variance

– Small steps required

Page 23:

Properties

• Inherits pros of policy gradient

– Convergence to local minima

– Can be model free

• Inherits cons of policy gradient

– High variance

– Small steps required

• Solution: trust region policy optimization
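For reference, the trust-region step (Schulman et al., 2015) constrains each policy update to stay close to the previous policy; the slide does not spell out the formula, so this is the standard statement:

```latex
\max_{\theta}\;
  \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\!\left[
    \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A^{\pi_{\theta_{\text{old}}}}(s,a)
  \right]
\quad \text{s.t.} \quad
  \mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\right) \right] \le \delta
```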

Page 24:

Results

Page 25:

Results

Input: driving demonstrations (Torcs)

Output policy:

Li et al., 2017. InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations

From raw visual inputs

Page 26:

Experimental results

Page 27:

Latent structure in demonstrations

[Diagram: a human model with latent variables Z drives a policy that interacts with the environment to produce observed behavior]

Semantically meaningful latent structure?

Page 28:

InfoGAIL

[Diagram: latent structure → observed data; infer the structure back from the data. Latent variables Z drive a policy interacting with the environment to produce observed behavior.]

Maximize mutual information

Hou et al.

Page 29:

InfoGAIL

[Diagram: latent variables Z / latent code c → policy → environment → observed behavior (s,a). Maximize the mutual information between the latent code and the observed behavior.]
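Roughly, InfoGAIL (Li et al., 2017) adds to the GAIL objective a variational lower bound L_I(π, Q) ≤ I(c; τ) on the mutual information between the latent code c and the generated trajectory τ, where Q(c|τ) is an auxiliary posterior over the code. Restated from the paper, since the slide's own formula is not in the transcript:

```latex
\min_{\pi,\,Q}\;\max_{D}\;
  \mathbb{E}_{\pi}\!\left[\log D(s,a)\right]
  + \mathbb{E}_{\pi_E}\!\left[\log\!\left(1 - D(s,a)\right)\right]
  - \lambda_1 L_I(\pi, Q)
  - \lambda_2 H(\pi)
```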

Page 30:

Synthetic Experiment

[Figure panels: Demonstrations | GAIL | InfoGAIL]

Page 31:

InfoGAIL

[Diagram: latent variables Z → policy → environment → trajectories. Panels: Pass left (z=0), Pass right (z=1)]

Li et al., 2017. InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations

Page 32:

InfoGAIL

[Diagram: latent variables Z → policy → environment → trajectories. Panels: Turn inside (z=0), Turn outside (z=1)]

Li et al., 2017. InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations

Page 33:

Multi-agent environments

What are the goals of these 4 agents?

Page 34:

Problem setup

[Diagram: Cost functions c1(s,a1), …, cN(s,aN) + Environment (Markov Game) → Multi-Agent Reinforcement Learning (MARL) → Optimal policies π1, …, πK]

Example 2×2 game:

        R       L
  R    0,0    10,10
  L   10,10    0,0

Page 35:

Problem setup

[Diagram: Experts' trajectories (s0, a0_1, …, a0_N), (s1, a1_1, …, a1_N), … → Multi-Agent Inverse Reinforcement Learning (MAIRL) → Cost functions c1(s,a1), …, cN(s,aN); those cost functions + Environment (Markov Game) → Multi-Agent Reinforcement Learning (MARL) → Optimal policies π (similar w.r.t. ψ)]

Page 36:

MAGAIL

[Diagram: for each agent i = 1, …, N there is a differentiable function (discriminator) D_i. A sample from the experts (s, a1, a2, …, aN) is fed to D_1, …, D_N, and each D_i tries to output 0; a sample from the model (s, a1, a2, …, aN), produced by the per-agent policies run through a black-box simulator (generator G), is fed to D_1, …, D_N, and each D_i tries to output 1.]

Song, Ren, Sadigh, Ermon, Multi-Agent Generative Adversarial Imitation Learning

Page 37:

Environments

[Figure panels: Demonstrations | MAGAIL]

Page 38:

Environments

[Figure panels: Demonstrations | MAGAIL]

Page 39:

Suboptimal demos

[Figure panels: Expert | MAGAIL]

(lighter plank + bumps on the ground)

Page 40:

Conclusions

• IRL is a dual of an occupancy measure matching problem (generative modeling)

• Might need flexible cost functions
  – GAN-style approach

• Policy gradient approach
  – Scales to high-dimensional settings

• Towards unsupervised learning of latent structure from demonstrations