CSC2621 Imitation Learning for Robotics
Florian Shkurti
Week 10: Inverse Reinforcement Learning (Part II)

Page 1:

CSC2621 Imitation Learning for Robotics

Florian Shkurti

Week 10: Inverse Reinforcement Learning (Part II)

Page 2:

Today’s agenda

• Guided Cost Learning by Finn, Levine, Abbeel

• Inverse KKT by Englert, Vien, Toussaint

• Bayesian Inverse RL by Ramachandran and Amir

• Max Margin Planning by Ratliff, Zinkevich, and Bagnell

Page 3:

Today’s agenda

• Guided Cost Learning by Finn, Levine, Abbeel

• Inverse KKT by Englert, Vien, Toussaint

• Bayesian Inverse RL by Ramachandran and Amir

• Max Margin Planning by Ratliff, Zinkevich, and Bagnell

Page 4:

Recall: Maximum Entropy IRL [Ziebart et al. 2008]

Assumption: Trajectories (states and action sequences) here are discrete

Page 5:

Recall: Maximum Entropy IRL [Ziebart et al. 2008]

Linear Reward Function

Page 6:

Recall: Maximum Entropy IRL [Ziebart et al. 2008]

Linear Reward Function

Page 7:

Recall: Maximum Entropy IRL [Ziebart et al. 2008]

Linear Reward Function

Assumption: known and deterministic dynamics

Page 8:

Recall: Maximum Entropy IRL [Ziebart et al. 2008]

Linear Reward Function

Assumption: known and deterministic dynamics

Log-likelihood of observed dataset D of trajectories

Page 9:

Recall: Maximum Entropy IRL [Ziebart et al. 2008]

Linear Reward Function

Assumption: known and deterministic dynamics

Log-likelihood of observed dataset D of trajectories

Page 10:

Recall: Maximum Entropy IRL [Ziebart et al. 2008]

Linear Reward Function

Assumption: known and deterministic dynamics

Log-likelihood of observed dataset D of trajectories

Serious problem: we need to compute Z(θ) every time we compute the gradient

Hand-Engineered Features
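For reference, a reconstruction of the standard MaxEnt IRL quantities these slides refer to (notation assumed: f_τ = Σ_t f(s_t) are the feature counts of trajectory τ):

p(τ | θ) = exp(θ^T f_τ) / Z(θ),    Z(θ) = Σ_τ exp(θ^T f_τ)

L(θ) = Σ_{τ in D} θ^T f_τ − |D| log Z(θ)

∇_θ L(θ) = Σ_{τ in D} f_τ − |D| E_{τ ~ p(· | θ)}[ f_τ ]

The second term of the gradient is the expected feature count under the current model, which is why Z(θ) (equivalently, the state visitation distribution) has to be recomputed at every gradient step.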

Page 11:

Guided Cost Learning [Finn, Levine, Abbeel et al. 2016]

Nonlinear Reward Function

Learned Features

True and stochastic dynamics (unknown)

Log-likelihood of observed dataset D of trajectories

Page 12:

Guided Cost Learning [Finn, Levine, Abbeel et al. 2016]

Nonlinear Reward Function

Learned Features

Log-likelihood of observed dataset D of trajectories

The serious problem remains

True and stochastic dynamics (unknown)
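For reference, a reconstruction of the corresponding quantities in the nonlinear case (notation assumed: c_θ is the learned cost, e.g. a neural network over learned features):

p(τ | θ) = exp(−c_θ(τ)) / Z(θ),    Z(θ) = ∫ exp(−c_θ(τ)) dτ

L(θ) = Σ_{τ in D} [ −c_θ(τ) ] − |D| log Z(θ)

The partition function is now an integral over continuous trajectories under unknown, stochastic dynamics, so it cannot be computed exactly; it can only be estimated from samples.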

Page 13:

Approximating the gradient of the log-likelihood

Nonlinear Reward Function

Learned Features

How do you approximate this expectation?

Page 14:

Approximating the gradient of the log-likelihood

Nonlinear Reward Function

Learned Features

How do you approximate this expectation?

Idea #1: sample from the model distribution p(τ | θ) (can you do this?)

Page 15:

Approximating the gradient of the log-likelihood

Nonlinear Reward Function

Learned Features

How do you approximate this expectation?

Idea #1: sample from the model distribution p(τ | θ) (we don’t know the dynamics)

Page 16:

Approximating the gradient of the log-likelihood

Nonlinear Reward Function

Learned Features

How do you approximate this expectation?

Idea #1: sample from the model distribution p(τ | θ) (we don’t know the dynamics)

Idea #2: sample from an easier distribution q that approximates p(τ | θ)

Importance Sampling: see Relative Entropy Inverse RL by Boularias, Kober, Peters

Page 17:

Importance Sampling

How to estimate properties/statistics of one distribution (p) given samples from another distribution (q)

Weights = likelihood ratio,

i.e. how to reweigh samples to obtain statistics of p from samples of q

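Written out, the identity behind this estimator is:

E_{x ~ p}[ f(x) ] = ∫ f(x) p(x) dx = ∫ f(x) (p(x) / q(x)) q(x) dx = E_{x ~ q}[ w(x) f(x) ],    w(x) = p(x) / q(x)

≈ (1/N) Σ_{i=1..N} w(x_i) f(x_i),    x_i ~ q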

Page 18:

Importance Sampling: Pitfalls and Drawbacks

What can go wrong?


Problem #1: If q(x) = 0 but f(x)p(x) > 0 for x in a set of non-zero measure, then there is estimation bias.

Page 19:

Importance Sampling: Pitfalls and Drawbacks

What can go wrong?


Problem #1: If q(x) = 0 but f(x)p(x) > 0 for x in a set of non-zero measure, then there is estimation bias.

Problem #2: The weights measure the mismatch between q(x) and p(x). If the mismatch is large, some weights will dominate; if x lives in high dimensions, a single weight may dominate.

Page 20:

Importance Sampling: Pitfalls and Drawbacks

What can go wrong?


Problem #1: If q(x) = 0 but f(x)p(x) > 0 for x in a set of non-zero measure, then there is estimation bias.

Problem #2: The weights measure the mismatch between q(x) and p(x). If the mismatch is large, some weights will dominate; if x lives in high dimensions, a single weight may dominate.

Problem #3: The variance of the estimator is high when the mismatch between q(x) and f(x)p(x) is large.

Page 21:

Importance Sampling: Pitfalls and Drawbacks

What can go wrong?


Problem #1: If q(x) = 0 but f(x)p(x) > 0 for x in a set of non-zero measure, then there is estimation bias.

Problem #2: The weights measure the mismatch between q(x) and p(x). If the mismatch is large, some weights will dominate; if x lives in high dimensions, a single weight may dominate.

Problem #3: The variance of the estimator is high when the mismatch between q(x) and f(x)p(x) is large.

For more info see:

#1, #3: Monte Carlo theory, methods, and examples, Art Owen, chapter 9

#2: Bayesian reasoning and machine learning, David Barber, chapter 27.6 on importance sampling
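A minimal numerical sketch of Problem #2 (weight degeneracy), assuming p and q are unit-variance Gaussians that differ only by a small mean shift; all numbers are made up for illustration. The effective sample size of the importance weights collapses as the dimension grows:

import numpy as np

def effective_sample_size(log_w):
    # ESS = (sum of weights)^2 / (sum of squared weights), from log-weights for stability
    log_w = log_w - log_w.max()
    w = np.exp(log_w)
    return w.sum() ** 2 / (w ** 2).sum()

rng = np.random.default_rng(0)
N = 1000
for d in [1, 10, 100]:
    x = rng.normal(0.0, 1.0, size=(N, d))           # samples from q = N(0, I)
    log_p = -0.5 * ((x - 0.5) ** 2).sum(axis=1)     # log N(0.5, I), up to a constant
    log_q = -0.5 * (x ** 2).sum(axis=1)             # log N(0, I), up to a constant
    print(d, effective_sample_size(log_p - log_q))  # ESS shrinks rapidly with dimension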

Page 22:

Importance Sampling

What is the best approximating distribution q?


Best approximation
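For estimating μ = E_p[ f(x) ], the variance-minimizing proposal (see Owen, ch. 9) is

q*(x) ∝ |f(x)| p(x)

and for non-negative f this gives a zero-variance estimator, since then w(x) f(x) = p(x) f(x) / q*(x) = μ for every sample. q* is not usable directly (its normalizer is exactly the quantity we want), but it tells us what a good proposal should look like.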

Page 23:

Importance Sampling

How does this connect back to partition function estimation?


Page 24:

Importance Sampling

How does this connect back to partition function estimation?


Best approximating distribution

Page 25:

Importance Sampling

How does this connect back to partition function estimation?


Best approximating distribution

The cost function estimate changes at each gradient step, therefore the best approximating distribution should change as well
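Concretely, with p(τ | θ) ∝ exp(−c_θ(τ)) the partition function is an integral that importance sampling can estimate:

Z(θ) = ∫ exp(−c_θ(τ)) dτ = E_{τ ~ q}[ exp(−c_θ(τ)) / q(τ) ] ≈ (1/N) Σ_i exp(−c_θ(τ_i)) / q(τ_i),    τ_i ~ q

By the zero-variance argument above, the best proposal is q*(τ) ∝ exp(−c_θ(τ)), i.e. the current model distribution itself; since c_θ changes at every gradient step, the best q changes with it.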

Page 26:

Approximating the gradient of the log-likelihood

Nonlinear Reward Function

Learned Features

How do you approximate this expectation?

Idea #1: sample from the model distribution p(τ | θ) (we don’t know the dynamics)

Idea #2: sample from an easier distribution q that approximates p(τ | θ)

Importance Sampling: see Relative Entropy Inverse RL by Boularias, Kober, Peters

Page 27:

Approximating the gradient of the log-likelihood

Nonlinear Reward Function

Learned Features

How do you approximate this expectation?

Idea #1: sample from the model distribution p(τ | θ) (we don’t know the dynamics)

Idea #2: sample from an easier distribution q that approximates p(τ | θ)

Importance Sampling: see Relative Entropy Inverse RL by Boularias, Kober, Peters

Previous papers used a fixed proposal q

Page 28:

Approximating the gradient of the log-likelihood

Nonlinear Reward Function

Learned Features

How do you approximate this expectation?

Idea #1: sample from the model distribution p(τ | θ) (we don’t know the dynamics)

Idea #2: sample from an easier distribution q that approximates p(τ | θ)

Importance Sampling: see Relative Entropy Inverse RL by Boularias, Kober, Peters

Previous papers used a fixed proposal q

Adaptive Importance Sampling: see Guided Cost Learning by Finn, Levine, Abbeel

This paper uses an adaptive proposal q

Page 29:

How do you select q?

How do you adapt it as the cost c changes?

Guided Cost Learning

Page 30:

How do you select q?

How do you adapt it as the cost c changes?

Guided Cost Learning: the punchline

Given a fixed cost function c, the distribution of trajectories that Guided Policy Search computes is close to p(τ) ∝ exp(−c(τ)), i.e. it is good for importance sampling of the partition function Z

Page 31:

Recall: Finite-Horizon LQR

for n = 1…N    // n is the number of steps left

The optimal control for time t = N − n is a linear feedback on the state, with a quadratic cost-to-go, where the states are predicted forward in time according to the linear dynamics.
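A standard reconstruction of the recursion the slide summarizes (assumed notation: cost Σ_t x_t^T Q x_t + u_t^T R u_t, dynamics x_{t+1} = A x_t + B u_t, P_0 the terminal cost matrix):

K_n = −(R + B^T P_{n−1} B)^{−1} B^T P_{n−1} A

P_n = Q + K_n^T R K_n + (A + B K_n)^T P_{n−1} (A + B K_n)

u_t = K_n x_t,    cost-to-go V_n(x_t) = x_t^T P_n x_t,    with t = N − n.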

Page 32:

Recall: LQG = LQR with stochastic dynamics

Assume the same quadratic cost and linear dynamics, now with additive zero-mean Gaussian noise: the Linear Quadratic Gaussian (LQG) setting.

Then the form of the optimal policy is the same as in LQR, applied to the estimate of the state.

No need to change the algorithm, as long as you observe the state at each step (closed-loop policy).
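In symbols (a standard reconstruction, not extracted from the slide):

x_{t+1} = A x_t + B u_t + w_t,    w_t ~ N(0, Σ_w)    (zero-mean Gaussian noise)

u_t = K_n x̂_t    (the same gains K_n as in LQR, applied to the estimate x̂_t of the state)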

Page 33:

Deterministic Nonlinear Cost & Deterministic Nonlinear Dynamics

Arbitrary differentiable functions c, f

iLQR: iteratively approximate the solution by solving linearized versions of the problem via LQR
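One iLQR iteration, sketched (standard form, assuming a nominal trajectory (x̄_t, ū_t)):

1. Linearize the dynamics around the nominal: δx_{t+1} ≈ A_t δx_t + B_t δu_t, with A_t = ∂f/∂x and B_t = ∂f/∂u evaluated at (x̄_t, ū_t).
2. Quadraticize the cost c around the nominal.
3. Solve the resulting time-varying LQR problem for the corrections δu_t.
4. Update the nominal trajectory (typically with a line search) and repeat until convergence.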

Page 34:

Deterministic Nonlinear Cost & Stochastic Nonlinear Dynamics

Arbitrary differentiable functions c, f

iLQG: iteratively approximate the solution by solving linearized versions of the problem via LQG

Page 35:

Recall from Guided Policy Search

Learn linear Gaussian dynamics

Page 36:

Recall from Guided Policy Search

Learn linear Gaussian dynamics

Page 37:

Recall from Guided Policy Search

Learn linear Gaussian dynamics

Linear Gaussian dynamics and controller

Page 38:

Recall from Guided Policy Search

Learn linear Gaussian dynamics

Linear Gaussian dynamics and controller

Run controller on the robot

Collect trajectories

Page 39:

Recall from Guided Policy Search

Learn linear Gaussian dynamics

Given a fixed cost function c, the linear Gaussian controllers that GPS computes induce a distribution of trajectories close to p(τ) ∝ exp(−c(τ)), i.e. good for importance sampling of the partition function Z
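In symbols (an approximate paraphrase of this claim, with assumed notation): the time-varying linear Gaussian controller p(u_t | x_t) = N(K_t x_t + k_t, C_t), together with the fitted linear Gaussian dynamics, induces

q(τ) = p(x_1) Π_t p(x_{t+1} | x_t, u_t) p(u_t | x_t) ≈ exp(−c(τ)) / Z

which is exactly the kind of adaptive proposal distribution the importance sampler needs.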

Page 40:

Guided Cost Learning [rough sketch]

Collect demonstration trajectories D

Initialize cost parameters θ

Repeat: do forward optimization using Guided Policy Search for the current cost c_θ, compute the induced linear Gaussian distribution of trajectories q, and importance sample trajectories from q to update θ.
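The gradient estimate used inside this loop, written out (a reconstruction in the notation above): with samples τ_1, …, τ_M ~ q and self-normalized weights w_i ∝ exp(−c_θ(τ_i)) / q(τ_i),

∇_θ [negative log-likelihood] ≈ (1/|D|) Σ_{τ in D} ∇_θ c_θ(τ) − Σ_i w_i ∇_θ c_θ(τ_i)

so each gradient step pushes the cost of the demonstrations down and the cost of the sampled trajectories up.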

Page 41:

Regularization of learned cost functions

Page 42:

Page 43:

Today’s agenda

• Guided Cost Learning by Finn, Levine, Abbeel

• Inverse KKT by Englert, Vien, Toussaint

• Bayesian Inverse RL by Ramachandran and Amir

• Max Margin Planning by Ratliff, Zinkevich, and Bagnell

Page 44:

Setting up trajectory optimization problems [e.g. for manipulation]

Non-learned features of the current state, e.g. distance to object

Features and their weights are time-dependent

Page 45:

Non-learned features of the current state, e.g. distance to object

Features and their weights are time-dependent

Constraints such as “stay away from an obstacle”,

or “acceleration should be bounded”

Setting up trajectory optimization problems [e.g. for manipulation]

Page 46:

Non-learned features of the current state, e.g. distance to object

Features and their weights are time-dependent

Constraints such as “stay away from an obstacle”,

or “acceleration should be bounded”

Constraints such as “always touch the door handle”

Setting up trajectory optimization problems [e.g. for manipulation]
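A reconstruction of the optimization problem these slides set up (assumed notation: x = x_{1:T} is the trajectory, φ_t the hand-designed features, w_t their time-dependent weights):

min_x   Σ_t w_t^T φ_t(x_t)

s.t.    g(x) ≤ 0    (e.g. stay away from obstacles, keep accelerations bounded)

        h(x) = 0    (e.g. always touch the door handle)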

Page 47:

Solving constrained optimization problems

Lagrangian function for this problem:

Page 48:

KKT conditions for trajectory optimization

Lagrangian function for this problem:

One of the necessary conditions for optimal motion

Page 49:

KKT conditions for trajectory optimization

Lagrangian function for this problem:

What are the conditions on

the feature weights to ensure

optimality of demonstrated motion?

One of the necessary conditions for optimal motion

Page 50:

Inverse KKT conditions: optimality of cost

Lagrangian function for this problem:

What are the conditions on

the feature weights to ensure

optimality of demonstrated motion?

Minimize

One of the necessary conditions for optimal motion

Page 51:

Lagrangian function for this problem:

What are the conditions on

the feature weights to ensure

optimality of demonstrated motion?

Minimize

One of the necessary conditions for optimal motion

Inverse KKT conditions: optimality of cost

Page 52:

Lagrangian function for this problem:

What are the conditions on

the feature weights to ensure

optimality of demonstrated motion?

Minimize

subject to

One of the necessary conditions for optimal motion

Inverse KKT conditions: optimality of cost

Page 53:

Lagrangian function for this problem:

What are the conditions on

the feature weights to ensure

optimality of demonstrated motion?

Minimize

subject to

One of the necessary conditions for optimal motion

Inverse KKT conditions: optimality of cost

Page 54:

Lagrangian function for this problem:

What are the conditions on

the feature weights to ensure

optimality of demonstrated motion?

Minimize

subject to

Quadratic program

Efficient solvers exist (CPLEX, CVXGEN, Gurobi)

One of the necessary conditions for optimal motion

Inverse KKT conditions: optimality of cost
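A reconstruction of the resulting problem, following the Inverse KKT idea (assumed notation, with x* the demonstrated trajectory):

L(x, λ, w) = Σ_t w_t^T φ_t(x_t) + λ^T g(x)

Stationarity at the demonstration requires ∇_x L(x*, λ, w) = 0, so the feature weights are found by

min_{w, λ}   || ∇_x L(x*, λ, w) ||^2
s.t.         λ ≥ 0, plus a normalization on w to exclude the trivial solution w = 0

Because ∇_x L is linear in (w, λ), the squared norm is quadratic in them and the constraints are linear: a quadratic program.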

Page 55:

Page 56:

Features

Page 57:

Feature weights over time

Page 58:

Today’s agenda

• Guided Cost Learning by Finn, Levine, Abbeel

• Inverse KKT by Englert, Vien, Toussaint

• Bayesian Inverse RL by Ramachandran and Amir

• Max Margin Planning by Ratliff, Zinkevich, and Bagnell

Page 59:

Bayesian updates of deterministic rewards

Demonstration trajectory

Reward parameters

How does our belief in the reward

change after a demonstration?

Page 60:

Bayesian updates of deterministic rewards

Demonstration trajectory

Reward parameters

How does our belief in the reward change after a demonstration?

In this paper it is assumed that the demonstrator picks actions with probability proportional to the exponentiated optimal Q-values.
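Written out (a standard statement of the Bayesian IRL model, assumed here; α is a confidence parameter):

P(a | s, R) ∝ exp( α Q*(s, a; R) )

P(R | τ) ∝ P(τ | R) P(R) = P(R) Π_t P(a_t | s_t, R)

so a demonstration reweights the prior over rewards according to how close to optimal its actions are under each candidate R.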

Page 61:

MCMC sampling of the posterior

Next candidate reward vector

picked randomly from current one

If the optimal policy has changed

then do policy iteration starting from

the old policy

The paper has results on mixing

times for the MCMC walk
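A self-contained toy sketch of this PolicyWalk-style loop (illustrative only, not the authors' code; the small random MDP, the uniform prior, the Gaussian proposal, and all names are assumptions made for the example):

import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, alpha = 6, 3, 0.9, 5.0
P = rng.dirichlet(np.ones(S), size=(S, A))           # P[s, a] = distribution over next states

def q_values(r, policy):
    # Evaluate a fixed policy: V = (I - gamma * P_pi)^(-1) r, then Q(s,a) = r(s) + gamma * P V
    P_pi = P[np.arange(S), policy]
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r)
    return r[:, None] + gamma * P @ V

def solve_mdp(r, policy=None):
    # Policy iteration, optionally warm-started from a previous policy
    policy = np.zeros(S, dtype=int) if policy is None else policy.copy()
    while True:
        Q = q_values(r, policy)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return Q, policy
        policy = new_policy

def log_likelihood(Q, demos):
    # BIRL likelihood: P(a | s, R) proportional to exp(alpha * Q*(s, a; R))
    logp = alpha * Q - np.log(np.exp(alpha * Q).sum(axis=1, keepdims=True))
    return sum(logp[s, a] for s, a in demos)

# Synthetic demonstrations from a hidden "true" reward, just to exercise the sketch
r_true = rng.uniform(-1, 1, S)
_, pi_true = solve_mdp(r_true)
demos = [(s, pi_true[s]) for s in range(S) for _ in range(20)]

# PolicyWalk-style MCMC: perturb one reward entry, re-solve the MDP only when needed
r = rng.uniform(-1, 1, S)
Q, pi = solve_mdp(r)
samples = []
for _ in range(5000):
    r_new = r.copy()
    i = rng.integers(S)
    r_new[i] += rng.normal(0, 0.1)
    if abs(r_new[i]) > 1:                            # outside the uniform prior: reject
        samples.append(r.copy())
        continue
    Q_new = q_values(r_new, pi)                      # current policy evaluated under new reward
    if (Q_new.argmax(axis=1) != pi).any():           # optimal policy may have changed:
        Q_new, pi_new = solve_mdp(r_new, policy=pi)  # policy iteration from the old policy
    else:
        pi_new = pi
    if np.log(rng.uniform()) < log_likelihood(Q_new, demos) - log_likelihood(Q, demos):
        r, Q, pi = r_new, Q_new, pi_new              # Metropolis accept (uniform prior cancels)
    samples.append(r.copy())

r_mean = np.mean(samples[1000:], axis=0)
print("posterior-mean reward:", np.round(r_mean, 2))
print("its optimal policy matches the expert:", np.array_equal(solve_mdp(r_mean)[1], pi_true))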

Page 62:

Interesting result

Page 63:

Today’s agenda

• Guided Cost Learning by Finn, Levine, Abbeel

• Inverse KKT by Englert, Vien, Toussaint

• Bayesian Inverse RL by Ramachandran and Amir

• Max Margin Planning by Ratliff, Zinkevich, and Bagnell

Page 64:

Max Margin Planning [Ratliff, Zinkevich, and Bagnell]

Assumptions:

- Linear rewards with respect to handcrafted features

- Discrete states and actions

Main idea: (reward weights should be such that) demonstrated trajectories collect more reward than any other trajectory, by a large margin

Page 65:

Max Margin Planning [Ratliff, Zinkevich, and Bagnell]

Assumptions:

- Linear rewards with respect to handcrafted features

- Discrete states and actions

Main idea: (reward weights should be such that) demonstrated trajectories collect more reward than any other trajectory, by a large margin

How can we formulate

this mathematically?

Page 66:

Detour: solving MDPs via linear programming

Page 67:

Detour: solving MDPs via linear programming

Discounted state action counts / occupancy measure

Optimal policy

Page 68:

Detour: solving MDPs via linear programming

Primal LP

Dual LP
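A reconstruction of the two programs the slide refers to (standard LP formulation of a discounted MDP; p_0 is the initial state distribution):

Primal LP:   min_v   Σ_s p_0(s) v(s)
             s.t.    v(s) ≥ r(s, a) + γ Σ_{s'} P(s' | s, a) v(s')    for all s, a

Dual LP:     max_μ   Σ_{s,a} μ(s, a) r(s, a)
             s.t.    Σ_a μ(s, a) = p_0(s) + γ Σ_{s',a'} P(s | s', a') μ(s', a')    for all s
                     μ(s, a) ≥ 0

The dual variables μ(s, a) are the discounted state-action visitation counts (occupancy measure), and an optimal policy can be read off as π(a | s) = μ(s, a) / Σ_{a'} μ(s, a').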

Page 69:

Max Margin Planning [Ratliff, Zinkevich, and Bagnell]

Page 70:

Max Margin Planning [Ratliff, Zinkevich, and Bagnell]

Is searching over visitation

frequencies the same as searching

over policies?

Page 71:

Max Margin Planning [Ratliff, Zinkevich, and Bagnell]

Feature frequencies

for i-th demonstrated trajectory

for any trajectory or state action pair
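In symbols (assumed notation): if F_i is the feature matrix of the i-th training MDP, with one column of features per state-action pair, and μ is an occupancy measure, then F_i μ are expected feature counts, w^T F_i μ_i is the (linear) reward collected by the i-th demonstration, and w^T F_i μ is the reward collected by any other policy's visitation frequencies.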

Page 72:

Max Margin Planning [Ratliff, Zinkevich, and Bagnell]

Impose large margin that is

dependent on state action pairs

Page 73:

Max Margin Planning [Ratliff, Zinkevich, and Bagnell]

Page 74:

Max Margin Planning [Ratliff, Zinkevich, and Bagnell]

Page 75:

Max Margin Planning [Ratliff, Zinkevich, and Bagnell]

Don’t allow too much slack

Page 76:

Max Margin Planning [Ratliff, Zinkevich, and Bagnell]

Is this a proper formulation of a quadratic program? NO

Page 77:

Max Margin Planning [Ratliff, Zinkevich, and Bagnell]

But it optimizes a linear objective with linear constraints,

so we can use duality in linear programming
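Putting the pieces together, the resulting problem has the form (a reconstruction in the notation above; l_i is the margin term, e.g. a loss counting visits to state-action pairs the expert avoided, and ζ_i is the slack):

min_{w, ζ ≥ 0}   (1/2) ||w||^2 + (C/m) Σ_i ζ_i
s.t.             w^T F_i μ_i + ζ_i ≥ max_{μ in G_i} ( w^T F_i μ + l_i^T μ )    for all i

The inner maximization ranges over the set G_i of all valid occupancy measures (whose vertices are the deterministic policies, exponentially many), which is why this is not yet a proper QP; replacing the inner max by its LP dual, i.e. by a value function v_i with Bellman-style constraints, turns it into one.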

Page 78:

Max Margin Planning [Ratliff, Zinkevich, and Bagnell]

Page 79:

Max Margin Planning [Ratliff, Zinkevich, and Bagnell]

reward

Page 80:

Max Margin Planning [Ratliff, Zinkevich, and Bagnell]

Is this sufficient to make v_i the optimal value function?

Page 81:

Results