Inverse Reinforcement Learning - Artificial Intelligencecbfinn/_files/bootcamp_inverserl.pdf ·...

Inverse Reinforcement Learning

Chelsea Finn

Deep RL Bootcamp

Where does the reward come from?

Computer Games: reward (Mnih et al. ’15)

Real World Scenarios: robotics, dialog, autonomous driving

what is the reward? often use a proxy

*frequently easier to provide expert data*

Approach: infer reward function from roll-outs of expert policy

Why infer the reward?

Behavioral Cloning/Direct Imitation: mimic actions of expert
- but no reasoning about outcomes or dynamics
- the expert might have different degrees of freedom

Can we reason about what the expert is trying to achieve?

Inverse Optimal Control / Inverse Reinforcement Learning (IOC/IRL): infer reward function from demonstrations

(Kalman ’64, Ng & Russell ’00)

given:
- state & action space
- roll-outs from π*
- dynamics model (sometimes)

goal:
- recover reward function
- then use reward to get policy

Challenges:
- underdefined problem
- difficult to evaluate a learned reward
- demonstrations may not be precisely optimal

Maximum Entropy Inverse RL (Ziebart et al. ’08)

handle ambiguity using probabilistic model of behavior

(energy-based model for behavior)

Notation:

MaxEnt formulation:
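The notation and formulation on this slide were rendered as images and are missing from the transcript. A sketch of the standard MaxEnt IRL model, following Ziebart et al. ’08, might look as follows. The symbols (trajectory τ, reward parameters ψ, demonstration set D) are assumptions chosen here, and the form shown holds for deterministic dynamics.

```latex
% Assumed notation (not recoverable from the transcript):
%   \tau = (s_1, a_1, \dots, s_T, a_T)   trajectory
%   r_\psi(s, a)                         reward with parameters \psi
%   \mathcal{D} = \{\tau_i\}             demonstrations from \pi^*
\begin{align*}
R_\psi(\tau) &= \sum_{t} r_\psi(s_t, a_t) \\
p(\tau) &= \frac{1}{Z} \exp\big(R_\psi(\tau)\big), \qquad
Z = \int \exp\big(R_\psi(\tau)\big)\, d\tau \\
\max_\psi \; & \sum_{\tau \in \mathcal{D}} \log p(\tau)
\end{align*}
```

Under this model, trajectories with equal reward are equally likely, and higher-reward trajectories are exponentially more likely.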

Maximum Entropy IRL Optimization (blackboard)
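The blackboard derivation is not part of the transcript. A sketch of the usual gradient computation is given here, assuming the model p(τ) = exp(R_ψ(τ))/Z with Z = ∫ exp(R_ψ(τ)) dτ and an averaged log-likelihood objective L(ψ). These symbols are chosen for illustration and are not taken from the slides.

```latex
\begin{align*}
\mathcal{L}(\psi) &= \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \log p(\tau)
  = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} R_\psi(\tau) - \log Z \\
\nabla_\psi \log Z &= \frac{1}{Z} \int \exp\big(R_\psi(\tau)\big)\, \nabla_\psi R_\psi(\tau)\, d\tau
  = \mathbb{E}_{\tau \sim p}\big[\nabla_\psi R_\psi(\tau)\big] \\
\nabla_\psi \mathcal{L}(\psi) &= \mathbb{E}_{\tau \sim \mathcal{D}}\big[\nabla_\psi R_\psi(\tau)\big]
  - \mathbb{E}_{\tau \sim p}\big[\nabla_\psi R_\psi(\tau)\big]
\end{align*}
```

The second expectation is taken under the current model distribution, which is the term that requires either solving the MDP or sampling.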

Maximum Entropy Inverse RL (Ziebart et al. ’08)

Algorithm:
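The algorithm steps on this slide were not captured in the transcript. A minimal sketch of tabular MaxEnt IRL is shown here under stated assumptions: a hypothetical 4-state chain MDP with known deterministic dynamics, a fixed horizon, and one reward parameter per state. The function names and the toy problem are illustrative and not from the slides. Each iteration solves for the soft-optimal policy, computes expected state visitations, and takes one gradient step on the reward.

```python
import math

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def step(state, action, n_states):
    """Deterministic chain dynamics: action 0 moves left, 1 moves right (clipped)."""
    delta = 1 if action == 1 else -1
    return min(max(state + delta, 0), n_states - 1)

def soft_policy(reward, n_states, horizon):
    """Backward pass: finite-horizon soft value iteration, returns pi[t][s][a]."""
    value = list(reward)  # soft value at the final time step
    policy = []
    for _ in range(horizon - 1):
        q = [[reward[s] + value[step(s, a, n_states)] for a in (0, 1)]
             for s in range(n_states)]
        value = [logsumexp(q[s]) for s in range(n_states)]
        policy.append([[math.exp(q[s][a] - value[s]) for a in (0, 1)]
                       for s in range(n_states)])
    policy.reverse()  # the list was built from the last decision backwards
    return policy

def expected_visitation(policy, n_states, start):
    """Forward pass: expected number of visits to each state under the policy."""
    dist = [0.0] * n_states
    dist[start] = 1.0
    counts = list(dist)
    for pi_t in policy:
        nxt = [0.0] * n_states
        for s in range(n_states):
            for a in (0, 1):
                nxt[step(s, a, n_states)] += dist[s] * pi_t[s][a]
        dist = nxt
        counts = [c + d for c, d in zip(counts, dist)]
    return counts

def maxent_irl(demos, n_states, horizon, start=0, lr=0.05, iters=300):
    """Gradient ascent on the demo log-likelihood: demo counts minus expected counts."""
    demo_counts = [0.0] * n_states
    for traj in demos:
        for s in traj:
            demo_counts[s] += 1.0 / len(demos)
    reward = [0.0] * n_states
    for _ in range(iters):
        mu = expected_visitation(soft_policy(reward, n_states, horizon), n_states, start)
        reward = [r + lr * (d - m) for r, d, m in zip(reward, demo_counts, mu)]
    final_policy = soft_policy(reward, n_states, horizon)
    return reward, expected_visitation(final_policy, n_states, start)

# Hypothetical expert demonstration: always move right and stay at the last state.
demos = [[0, 1, 2, 3, 3]]
reward, mu = maxent_irl(demos, n_states=4, horizon=5)
```

Both passes rely on the dynamics being known, and the MDP is solved once per reward update.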

How can we: (1) handle unknown dynamics? (2) avoid solving the MDP in the inner loop?

sample to estimate Z

sample adaptively to estimate Z [by constructing a policy]
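One way to read "sample to estimate Z" is as an importance-sampling estimate, Z ≈ mean of exp(R(τ_j)) / q(τ_j) over trajectories τ_j drawn from a sampling distribution q. The sketch below assumes a hypothetical toy problem with five discrete "trajectories" and made-up rewards; the function name and values are illustrative only. It compares a uniform sampler with one proportional to exp(R), for which every importance weight equals Z.

```python
import math
import random

def estimate_partition(reward, sampler_probs, n_samples, rng):
    """Importance-sampling estimate: Z ~= mean_j exp(R(tau_j)) / q(tau_j), tau_j ~ q."""
    trajectories = rng.choices(range(len(reward)), weights=sampler_probs, k=n_samples)
    weights = [math.exp(reward[t]) / sampler_probs[t] for t in trajectories]
    return sum(weights) / n_samples

# Hypothetical toy problem: five "trajectories" with made-up returns.
reward = [0.0, 1.0, 2.0, -1.0, 0.5]
true_z = sum(math.exp(r) for r in reward)

rng = random.Random(0)
uniform_q = [1.0 / len(reward)] * len(reward)
adapted_q = [math.exp(r) / true_z for r in reward]  # sampler proportional to exp(R)

z_uniform = estimate_partition(reward, uniform_q, 1000, rng)  # noisy estimate
z_adapted = estimate_partition(reward, adapted_q, 1000, rng)  # every weight equals Z
```

The closer the sampler is to the distribution proportional to exp(R), the lower the variance of the estimate, which is one motivation for adapting the sampler by constructing a policy.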

guided cost learning algorithm (Finn et al. ICML ’16)

- generate policy samples from π
- update reward r using samples & demos
- update policy π w.r.t. reward (take 1 policy opt. step)

Update reward in inner loop of policy optimization

generator, discriminator (Ho & Ermon, NIPS ’16)

Aside: Generative Adversarial Networks (Goodfellow et al. ’14)

Similar to inverse RL, GANs learn an objective for generative modeling.

Isola et al. ’17, Arjovsky et al. ’17, Zhu et al. ’17

Inverse RL          GANs
trajectory τ        sample x
policy π ~ q(τ)     generator G
reward r            discriminator D

(Finn*, Christiano*, et al. ’16)

Connection to Generative Adversarial Networks (Goodfellow et al. ’14)

Inverse RL          GANs
trajectory τ        sample x
policy π ~ q(τ)     generator G
reward r            discriminator D

data distribution p

Reward/discriminator optimization:
GCL:
GAIL: some classifier
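The GCL discriminator equation did not survive extraction. As a hedged sketch, the form given in Finn*, Christiano*, et al. ’16 is D(τ) = p(τ) / (p(τ) + q(τ)) with p(τ) = exp(R(τ)) / Z, where q is the policy's trajectory density. The function and argument names below are illustrative assumptions, and trajectories are represented only by their return and log-density.

```python
import math

def gcl_discriminator(traj_return, log_z, log_q):
    """D(tau) = p(tau) / (p(tau) + q(tau)) with p(tau) = exp(R(tau)) / Z.

    Algebraically this is a sigmoid of the logit R(tau) - log Z - log q(tau).
    """
    logit = traj_return - log_z - log_q
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)  # numerically stable branch for negative logits
    return e / (1.0 + e)

def discriminator_loss(d_demo, d_sample):
    """Standard GAN discriminator loss: demos labeled real, policy samples fake."""
    return -math.log(d_demo) - math.log(1.0 - d_sample)

# When the policy density matches exp(R)/Z, the discriminator cannot tell them apart.
d_matched = gcl_discriminator(traj_return=2.0, log_z=1.0, log_q=1.0)
d_other = gcl_discriminator(traj_return=2.0, log_z=1.0, log_q=-0.5)
loss_at_chance = discriminator_loss(d_matched, d_matched)
```

In this form the only learned quantities are the reward and the estimate of log Z, whereas a generic classifier, as in GAIL, has no such structure.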

Policy/generator optimization:

*entropy-regularized RL*

Unknown dynamics: train generator/policy with RL

Baram et al. ICML ’17: use learned dynamics model to backprop through discriminator
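The policy/generator objective itself is missing from the transcript. As a hedged worked step, assume the GCL discriminator form D(τ) = p̃(τ) / (p̃(τ) + q(τ)) with p̃(τ) = exp(R_ψ(τ)) / Z (Finn*, Christiano*, et al. ’16; the symbols R_ψ, Z, and p̃ are chosen here). The usual generator loss then reduces to an entropy-regularized RL objective:

```latex
\begin{align*}
\mathcal{L}_{\text{generator}}(q)
 &= \mathbb{E}_{\tau \sim q}\big[\log(1 - D(\tau)) - \log D(\tau)\big] \\
 &= \mathbb{E}_{\tau \sim q}\big[\log q(\tau) - \log \tilde p(\tau)\big] \\
 &= \log Z - \mathbb{E}_{\tau \sim q}\big[R_\psi(\tau)\big] - \mathcal{H}(q)
\end{align*}
```

Since log Z does not depend on q, minimizing this loss maximizes expected reward plus the entropy of the policy's trajectory distribution.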

Guided Cost Learning Experiments

Real-world Tasks
- dish placement: state includes goal plate pose
- pouring almonds: state includes unsupervised visual features [Finn et al. ’16]

Comparisons
- Relative Entropy IRL (Boularias et al. ’11)
- Path Integral IRL (Kalakrishnan et al. ’13)

Dish placement, demos

Dish placement, standard reward

Dish placement, RelEnt IRL

• video of dish baseline method

Dish placement, GCL policy

• video of dish our method - samples & reoptimizing

Pouring, demos

• video of pouring demos

Pouring, RelEnt IRL

• video of pouring baseline method

Pouring, GCL policy

• video of pouring our method - samples

Generative Adversarial Imitation Learning Experiments (Ho & Ermon NIPS ’16)

- demonstrations from TRPO-optimized policy
- use TRPO as a policy optimizer

Conclusion: IRL requires fewer demonstrations than behavioral cloning

Generative Adversarial Imitation Learning Experiments (Merel et al. ’17)

learned behaviors from human motion capture: walking, falling & getting up

GCL & GAIL: Pros & Cons

(with an efficient policy optimizer)

Inverse Reinforcement Learning Review

Acquiring a reward function is important (and challenging!)

Goal of Inverse RL: infer reward function underlying expert demonstrations

Evaluating the partition function:
- initial approaches solve the MDP in the inner loop and/or assume known dynamics
- with unknown dynamics, estimate Z using samples

Connection to generative adversarial networks:
- sampling-based MaxEnt IRL is a GAN with a special form of discriminator and uses RL to optimize the generator

Suggested Reading on Inverse RL

Classic Papers:
- Abbeel & Ng ICML ’04. Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement learning.
- Ziebart et al. AAAI ’08. Maximum Entropy Inverse Reinforcement Learning. Introduction to probabilistic method for inverse reinforcement learning.

Modern Papers:
- Wulfmeier et al. arXiv ’16. Deep Maximum Entropy Inverse Reinforcement Learning. MaxEnt inverse RL using deep reward functions.
- Finn et al. ICML ’16. Guided Cost Learning. Sampling-based method for MaxEnt IRL that handles unknown dynamics and deep reward functions.
- Ho & Ermon NIPS ’16. Generative Adversarial Imitation Learning. Inverse RL method using generative adversarial networks.
