Class notes - BAIR

66
1. Homework 5 due Tuesday, November 13 th 11:59pm Class notes

Transcript of Class notes - BAIR

Page 1: Class notes - BAIR

1. Homework 5 due Tuesday, November 13th 11:59pm

Class notes

Page 2: Class notes - BAIR

Real-World Robot Learning:Safety and Flexibility

CS294-112: Deep Reinforcement Learning

Gregory Kahn

Page 3: Class notes - BAIR

Safety Flexibility

Why should you care?

Page 4: Class notes - BAIR

Topics• Safety

• Flexibility

Outline

Algorithms• Imitation learning

• Model-free

• Model-based

Safety Flexibility Imitation learning Model-free Model-based

2 * 3 = 6 papers we’ll cover

By no means the best / only papers on these topics

Page 5: Class notes - BAIR

Safety Flexibility Imitation learning Model-free Model-based

Page 6: Class notes - BAIR

Learn control policy that maps observations to controls

Observation

Control

Policy

Goal

Safety Flexibility Imitation learning Model-free Model-based

Page 7: Class notes - BAIR

Human expert

● Able to generate good trajectories using an expert policy

- cost function

- optimization

- full state information

only during training

Trajectory optimization

Assumption

Safety Flexibility Imitation learning Model-free Model-based

Page 8: Class notes - BAIR

● Problem: training and test distributions differ

Gather expert trajectories

Supervised learning

Training trajectory

Policy reaches states not in training set! [Ross et al 2010]

Learned policy trajectory

Trajectoryoptimization

Supervised Learning

Safety Flexibility Imitation learning Model-free Model-based

Page 9: Class notes - BAIR

[Ross et al 2011]

● Problem: training and test distributions differ● Solution: execute policy during training

Gather expert trajectories

Supervised learning

Dataset Aggregation (DAgger)

Safety Flexibility Imitation learning Model-free Model-based

Page 10: Class notes - BAIR

● DAgger mixes the actions

Safety during training

Safety Flexibility Imitation learning Model-free Model-based

Page 11: Class notes - BAIR

● DAgger mixes the actions ● PLATO mixes the objectives

cost J → avoids high cost

Policy Learning using Adaptive Trajectory Optimization (PLATO)

Safety Flexibility Imitation learning Model-free Model-based

Page 12: Class notes - BAIR

approach sampling policy

safe similar training and

test distributions

PLATO

supervised learning

DAgger

Algorithm comparisons

Safety Flexibility Imitation learning Model-free Model-based

Page 13: Class notes - BAIR

Canyon Forest

Experiments: final neural network policies

Safety Flexibility Imitation learning Model-free Model-based

Page 14: Class notes - BAIR

Canyon Forest

Experiments: metrics

Safety Flexibility Imitation learning Model-free Model-based

Page 15: Class notes - BAIR

CanyonForest CanyonForest

Experiments: metrics

Safety Flexibility Imitation learning Model-free Model-based

Page 16: Class notes - BAIR

Safety Flexibility Imitation learning Model-free Model-based

Page 17: Class notes - BAIR

Goal

NOT SAFE

Safety Flexibility Imitation learning Model-free Model-based

Page 18: Class notes - BAIR

Shielding

Like learning in a transformed MDP

Pre-emptive shielding

Shield can be used at test time

Post-posed shielding

Safety Flexibility Imitation learning Model-free Model-based

Page 19: Class notes - BAIR

How to shield: linear temporal logic

● Encode safety with temporal logic

● Assumption: Known approximate/conservative transition dynamics

Safety Flexibility Imitation learning Model-free Model-based

Page 20: Class notes - BAIR

Experiments

Safety criteria- Don’t crash

Safety Flexibility Imitation learning Model-free Model-based

Page 21: Class notes - BAIR

Experiments

Safety criteria- Don’t run out of oxygen- If enough oxygen,

don’t surface w/o divers

Safety Flexibility Imitation learning Model-free Model-based

Page 22: Class notes - BAIR

Safety Flexibility Imitation learning Model-free Model-based

Page 23: Class notes - BAIR

How to do reinforcement learningwithout destroying the robot during training using only onboard images

unknown environment

Goal

Safety Flexibility Imitation learning Model-free Model-based

Page 24: Class notes - BAIR

unknown environment

learn a collision prediction model

command velocities

raw image

neural network

Approach

Safety Flexibility Imitation learning Model-free Model-based

Page 26: Class notes - BAIR

Train uncertainty-aware collision prediction model

Gather trajectories using MPC controller

Data

Deep neural network with uncertainty estimates from bootstrapping and dropout

Encourage safe, low-speed collisions by reasoning about

the model’s uncertainty

Robot increases speed as model becomes more

confident

May experience collisions

Form speed-dependent, uncertainty-awarecollision cost .

Model-based RL using collision prediction model

Safety Flexibility Imitation learning Model-free Model-based

Page 27: Class notes - BAIR

high speed predict collision large uncertainty

large cost

Collision cost

Safety Flexibility Imitation learning Model-free Model-based

Page 28: Class notes - BAIR

Bootstrapping

Data

D1 D2 D3

Resample with replacement

TrainTrainTrain

M1 M3M2

Training time Test time

Input

M1 M2 M3

Estimating neural network output uncertainty

Safety Flexibility Imitation learning Model-free Model-based

Page 29: Class notes - BAIR

Dropout

Data

Model ModelModel Model ModelModel

Input

Training time Test time

Estimating neural network output uncertainty

Safety Flexibility Imitation learning Model-free Model-based

Page 30: Class notes - BAIR

Not accounting for uncertainty(higher-speed collisions)

Preliminary real-world experiments

Safety Flexibility Imitation learning Model-free Model-based

Page 31: Class notes - BAIR
Page 32: Class notes - BAIR

accounting for uncertainty(lower-speed collisions)

Preliminary real-world experiments

Safety Flexibility Imitation learning Model-free Model-based

Page 33: Class notes - BAIR
Page 34: Class notes - BAIR

successful flight past obstacle

Preliminary real-world experiments

Safety Flexibility Imitation learning Model-free Model-based

Page 35: Class notes - BAIR
Page 36: Class notes - BAIR

• Tradeoff between safety and exploration

• Safety guarantees require expert oversight or known environment + dynamics

• Uncertainty can play a key role

Safety takeaways

Safety Flexibility Imitation learning Model-free Model-based

Page 37: Class notes - BAIR

Safety Flexibility Imitation learning Model-free Model-based

Page 38: Class notes - BAIR

Goal

User-specified command

Safety Flexibility Imitation learning Model-free Model-based

Page 39: Class notes - BAIR

Approach

Option A: Input command Option B: Branch using command

+ empirically better- only works for discrete commands

Safety Flexibility Imitation learning Model-free Model-based

Page 40: Class notes - BAIR

Important details

• Data augmentation• Contrast

• Brightness

• Tone

• Gaussian blur

• Salt-and-pepper noise

• Region dropout

• Adding noise to expert

Approach

Safety Flexibility Imitation learning Model-free Model-based

Page 41: Class notes - BAIR
Page 42: Class notes - BAIR
Page 43: Class notes - BAIR
Page 44: Class notes - BAIR

[slides adapted from Tuomas Haarnoja]

Safety Flexibility Imitation learning Model-free Model-based

Page 45: Class notes - BAIR

Avoidance skill

Reaching skill

Task 1: Reach

Task 2: Avoid

Reaching while avoiding skill

Space of trajectories

Goal

Safety Flexibility Imitation learning Model-free Model-based

Page 46: Class notes - BAIR

Task 1+2: Reach and avoid

Task 1: Reach Task 2: Avoid

Reusability!Related to divergence between and

Avoidance skill

Reaching skillReaching while avoiding skill

Space of trajectories

Policy Composition

Safety Flexibility Imitation learning Model-free Model-based

Page 47: Class notes - BAIR

Task 1 Task 2 Task 1 + 2

Page 48: Class notes - BAIR

Avoidance policyStacking policy

Page 49: Class notes - BAIR

Avoidance policyStacking policy Combined policy

Page 50: Class notes - BAIR

Safety Flexibility Imitation learning Model-free Model-based

Page 51: Class notes - BAIR

Standard Reinforcement Learning

Data

Policy

Trai

nTe

st

Data

Policy

Data

Policy

Data inefficientExpert in the loop

Inflexible

Page 52: Class notes - BAIR

CAPs Approach

CAPs

DataTrai

nTe

st

Event CuesDetector

Data efficientDetector in the loop

Flexible

Page 53: Class notes - BAIR

Detect Predict Control

Safety Flexibility Imitation learning Model-free Model-based

Page 54: Class notes - BAIR

Detect Predict Control

Event Cues

Detector

Safety Flexibility Imitation learning Model-free Model-based

Page 56: Class notes - BAIR

Detect Predict Control

Safety Flexibility Imitation learning Model-free Model-based

Page 57: Class notes - BAIR

Detect Predict Control

Safety Flexibility Imitation learning Model-free Model-based

Page 58: Class notes - BAIR

8x 8x

8x8x

8x 8x

Safety Flexibility Imitation learning Model-free Model-based

Page 59: Class notes - BAIR

8x

Safety Flexibility Imitation learning Model-free Model-based

Page 60: Class notes - BAIR

Drive in right lane

6x 6x

Drive in either lane

Drive at 7m/sAvoid collisions

Safety Flexibility Imitation learning Model-free Model-based

Page 61: Class notes - BAIR

6x

CAPs

Page 62: Class notes - BAIR

Safety Flexibility Imitation learning Model-free Model-based

Page 63: Class notes - BAIR

Safety Flexibility Imitation learning Model-free Model-based

Page 64: Class notes - BAIR

CAPs DQL

Collision Avoidance

Safety Flexibility Imitation learning Model-free Model-based

Page 65: Class notes - BAIR

Heading

Avoid collisionsFollow goal headingMove towards doors

Page 66: Class notes - BAIR

• Carefully construct how your policy / model deals with goals

• Model-free methods require extra care to reuse

• Model-based methods are flexible by construction

Flexibility takeaways

Safety Flexibility Imitation learning Model-free Model-based