Using Inaccurate Models in Reinforcement Learning


Transcript of Using Inaccurate Models in Reinforcement Learning

Page 1: Using Inaccurate Models in Reinforcement Learning

Using Inaccurate Models in Reinforcement Learning

Pieter Abbeel, Morgan Quigley and Andrew Y. Ng

Stanford University

Page 2: Using Inaccurate Models in Reinforcement Learning

Overview

Reinforcement learning in high-dimensional continuous state spaces.
Model-based RL: difficult to build an accurate model.
Model-free RL: often requires large numbers of real-life trials.
We present a hybrid algorithm, which requires only an approximate model and a small number of real-life trials.
The resulting policy is (locally) near-optimal.
Experiments on a flight simulator and a real RC car.

Page 3: Using Inaccurate Models in Reinforcement Learning

Reinforcement learning formalism

Markov Decision Process (MDP): M = (S, A, T, H, s0, R).
S = ℝ^n (continuous state space).
Time-varying, deterministic dynamics: T = { f_t : S × A → S, t = 0, …, H }.
Goal: find a policy π : S → A that maximizes U(π) = E[ Σ_{t=0}^{H} R(s_t) | π ].
Focus: the task of trajectory following.
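
To make the formalism concrete, here is a minimal Python sketch (ours, not the authors' code) of a finite-horizon, deterministic, time-varying MDP rollout that evaluates U(π); the function names and the use of NumPy are our own choices.

```python
# Minimal sketch of the deterministic finite-horizon setting above:
# roll the policy forward through the time-varying dynamics and sum rewards.
import numpy as np

def utility(policy, dynamics, reward, s0, H):
    """U(pi) = sum_{t=0}^{H} R(s_t) under deterministic dynamics.

    policy:   callable (t, s) -> action
    dynamics: list of callables f_t(s, a) -> next state
    reward:   callable R(s) -> float
    s0:       initial state, H: horizon
    """
    s = np.asarray(s0, dtype=float)
    total = 0.0
    for t in range(H + 1):
        total += reward(s)                        # collect R(s_t)
        if t < H:
            s = dynamics[t](s, policy(t, s))      # s_{t+1} = f_t(s_t, pi(s_t))
    return total
```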

Page 4: Using Inaccurate Models in Reinforcement Learning

Motivating Example

A student driver learning to make a 90-degree right turn: only a few trials are needed, and no accurate model is available.

The student driver has access to: a real-life trial and a crude model.

Result: a good policy gradient estimate.

Page 5: Using Inaccurate Models in Reinforcement Learning

Algorithm Idea

Input to algorithm: approximate model.
Start by computing the optimal policy according to the model.
Figure: real-life trajectory vs. target trajectory.
The policy is optimal according to the model, so no improvement is possible based on the model.

Page 6: Using Inaccurate Models in Reinforcement Learning

Algorithm Idea (2)

Update the model such that it becomes exact for the current policy.


Page 8: Using Inaccurate Models in Reinforcement Learning

Algorithm Idea (2)

The updated model perfectly predicts the state sequence obtained under the current policy.

We can use the updated model to find an improved policy.
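
One natural way to realize this update, sketched below under our own assumptions (the slide does not spell out the construction), is to add a time-indexed correction equal to the observed prediction error, so that the updated model reproduces the recorded state sequence exactly at the visited state-action pairs.

```python
# Sketch (our reconstruction): make the approximate model exact along the
# trajectory observed under the current policy by adding, at each time step,
# the difference between the real next state and the model's prediction.
import numpy as np

def make_model_exact(approx_dynamics, real_states, real_actions):
    """approx_dynamics: list of f_t(s, a) -> next state (the crude model)
    real_states:  [s_0, ..., s_H] recorded during the real-life trial
    real_actions: [a_0, ..., a_{H-1}] applied during that trial"""
    updated = []
    for t in range(len(real_actions)):
        f_t = approx_dynamics[t]
        # Prediction error of the crude model at the visited (s_t, a_t).
        bias_t = np.asarray(real_states[t + 1]) - f_t(real_states[t], real_actions[t])
        # Bind f_t and bias_t as defaults so each closure keeps its own values.
        updated.append(lambda s, a, f=f_t, b=bias_t: f(s, a) + b)
    return updated
```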

Page 9: Using Inaccurate Models in Reinforcement Learning

Algorithm

1. Find the (locally) optimal policy π for the model.
2. Execute the current policy π and record the state trajectory.
3. Update the model such that the new model is exact for the current policy π.
4. Use the new model to compute the policy gradient and update the policy: θ := θ + α ∇θU(θ), with the gradient computed in the new model.
5. Go back to Step 2.

Notes:
The step-size parameter α is determined by a line search.
Instead of the policy gradient, any algorithm that provides a local policy improvement direction can be used. In our experiments we used differential dynamic programming.
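
The loop below is a compact paraphrase of the five steps (all names are placeholders; for brevity it uses a finite-difference policy gradient on the corrected model and a crude line search, whereas the experiments in the talk use DDP as the improvement step).

```python
# Paraphrase of steps 1-5 above; not the authors' implementation.
import numpy as np

def hybrid_rl(theta, model, run_real_trial, utility_in_model, make_model_exact,
              n_iters=20, fd_eps=1e-4, step_sizes=(1.0, 0.3, 0.1, 0.03, 0.01)):
    """theta should already be (locally) optimal for the initial model (step 1)."""
    theta = np.asarray(theta, dtype=float)
    for _ in range(n_iters):
        # Step 2: execute the current policy in real life, record the trajectory.
        states, actions = run_real_trial(theta)
        # Step 3: update the model so it is exact for the current policy.
        model = make_model_exact(model, states, actions)
        # Step 4: policy gradient computed entirely inside the updated model
        # (finite differences here, for simplicity).
        u0 = utility_in_model(theta, model)
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            d = np.zeros_like(theta)
            d[i] = fd_eps
            grad[i] = (utility_in_model(theta + d, model) - u0) / fd_eps
        # Line search over the step size alpha, evaluated in the updated model.
        candidates = [theta] + [theta + a * grad for a in step_sizes]
        theta = max(candidates, key=lambda th: utility_in_model(th, model))
        # Step 5: repeat from step 2.
    return theta
```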

Page 10: Using Inaccurate Models in Reinforcement Learning

Performance Guarantees: Intuition

Exact policy gradient: derivatives of the true transition function, evaluated along the true trajectory.
Model-based policy gradient: derivatives of the approximate transition function, evaluated along the wrong (model-predicted) trajectory.
Our algorithm eliminates one of the two sources of error: after the model update, the derivatives are evaluated along the correct trajectory.
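
For reference, a hedged reconstruction of the chain-rule expression behind this intuition (our notation; the slide's original formulas did not survive transcription). With a_t = π_θ(s_t) and s_{t+1} = f_t(s_t, a_t), the exact gradient is:

```latex
% Our reconstruction (notation ours), chain rule through deterministic dynamics.
\begin{align*}
\nabla_\theta U(\theta) &= \sum_{t=0}^{H} \frac{\partial R}{\partial s}(s_t)\,\frac{d s_t}{d\theta},\\
\frac{d s_{t+1}}{d\theta} &=
\left(\frac{\partial f_t}{\partial s} + \frac{\partial f_t}{\partial a}\,\frac{\partial \pi_\theta}{\partial s}\right)\bigg|_{(s_t,\,a_t)} \frac{d s_t}{d\theta}
\;+\; \frac{\partial f_t}{\partial a}\bigg|_{(s_t,\,a_t)}\,\frac{\partial \pi_\theta}{\partial \theta}(s_t).
\end{align*}
```

In this expression the model-based gradient makes two substitutions: the approximate transition function for f_t (error in the Jacobians themselves) and the model-predicted states for the real s_t (error in where the Jacobians are evaluated). The model update removes the second substitution.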

Page 11: Using Inaccurate Models in Reinforcement Learning

Performance Guarantees

Let the local policy improvement algorithm be policy gradient.
Guarantee (informal): assuming a deterministic setting and a reasonably accurate model, the algorithm returns a policy that is (locally) near-optimal; the bound involves a constant K.

Notes:
These assumptions are insufficient to give the same performance guarantees for model-based RL.
The constant K depends only on the dimensionality of the state, the action, and the policy parameterization (θ), the horizon H, and an upper bound on the first and second derivatives of the transition model, the policy, and the reward function.

Page 12: Using Inaccurate Models in Reinforcement Learning

Experiments

We use differential dynamic programming (DDP) to find control policies in the model.
Two systems: flight simulator and RC car.
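
For context, this is the standard backward recursion used by DDP in its common Gauss-Newton (iLQR) form, written for a running cost l(s, a) to be minimized; subscripts on l and f denote partial derivatives, and V'_s, V'_ss are the value-function gradient and Hessian at the next time step. This is the textbook recursion, not a statement about the authors' exact implementation.

```latex
% Standard DDP/iLQR backward pass (cost-minimization convention); primes denote t+1.
\begin{align*}
Q_s &= l_s + f_s^{\top} V'_s, \qquad Q_a = l_a + f_a^{\top} V'_s,\\
Q_{ss} &= l_{ss} + f_s^{\top} V'_{ss} f_s, \qquad Q_{aa} = l_{aa} + f_a^{\top} V'_{ss} f_a, \qquad Q_{as} = l_{as} + f_a^{\top} V'_{ss} f_s,\\
k &= -Q_{aa}^{-1} Q_a, \qquad K = -Q_{aa}^{-1} Q_{as},\\
V_s &= Q_s - Q_{as}^{\top} Q_{aa}^{-1} Q_a, \qquad V_{ss} = Q_{ss} - Q_{as}^{\top} Q_{aa}^{-1} Q_{as}.
\end{align*}
```

The resulting time-varying controller is a_t = ā_t + k_t + K_t (s_t − s̄_t), applied in a forward pass along the nominal trajectory (s̄_t, ā_t).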

Page 13: Using Inaccurate Models in Reinforcement Learning

Flight Simulator Setup

The flight simulator model has 43 parameters (mass, inertia, drag coefficients, lift coefficients, etc.).

We generated "approximate models" by randomly perturbing the parameters.

All four standard fixed-wing control actions are used: throttle, ailerons, elevators, and rudder.

Our reward function quadratically penalizes deviation from the desired trajectory.
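
As an illustration of this setup (the parameter names, perturbation scale, and weighting matrix below are our own placeholders, not values from the paper):

```python
# Illustration only: generate an "approximate model" by randomly perturbing
# the simulator parameters, and define a quadratic trajectory-tracking reward.
import numpy as np

def perturb_parameters(true_params, rel_noise=0.2, rng=None):
    """Multiplicatively perturb each model parameter (mass, inertia, drag,
    lift, ...) to obtain a deliberately inaccurate model."""
    rng = np.random.default_rng() if rng is None else rng
    return {name: value * (1.0 + rel_noise * rng.standard_normal())
            for name, value in true_params.items()}

def tracking_reward(s, s_desired, Q=None):
    """Reward that quadratically penalizes deviation from the desired state."""
    err = np.asarray(s) - np.asarray(s_desired)
    Q = np.eye(err.size) if Q is None else Q
    return -float(err @ Q @ err)
```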

Page 14: Using Inaccurate Models in Reinforcement Learning

Flight Simulator Movie

Page 15: Using Inaccurate Models in Reinforcement Learning

Flight Simulator Results

Plot legend: desired trajectory, model-based controller, our algorithm.

76% utility improvement over the model-based approach.

Page 16: Using Inaccurate Models in Reinforcement Learning

RC Car Setup

Control actions: throttle and steering.
Low-speed dynamics model with state variables: position, velocity, heading, heading rate.
Model estimated from 30 minutes of data.
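
A hypothetical model of this form, just to make the state and action variables concrete (the coefficient names, structure, and time step are illustrative; the paper's model was estimated from the 30 minutes of driving data):

```python
# Hypothetical low-speed car model with the state variables listed above
# (position, velocity, heading, heading rate); illustrative only.
import numpy as np

def car_step(state, action, coeffs, dt=0.05):
    """state  = [x, y, v, psi, psi_dot]
    action = [throttle, steering]
    coeffs = dict of identified coefficients (names are placeholders)."""
    x, y, v, psi, psi_dot = state
    throttle, steering = action
    v_next = v + dt * (coeffs["c_throttle"] * throttle - coeffs["c_drag"] * v)
    psi_dot_next = psi_dot + dt * coeffs["c_steer"] * (steering * v - psi_dot)
    psi_next = psi + dt * psi_dot
    x_next = x + dt * v * np.cos(psi)
    y_next = y + dt * v * np.sin(psi)
    return np.array([x_next, y_next, v_next, psi_next, psi_dot_next])
```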

Page 17: Using Inaccurate Models in Reinforcement Learning

RC Car: Open-Loop Turn

Page 18: Using Inaccurate Models in Reinforcement Learning

RC Car: Circle

Page 19: Using Inaccurate Models in Reinforcement Learning

RC Car: Figure-8 Maneuver

Page 20: Using Inaccurate Models in Reinforcement Learning

Related Work

Iterative Learning Control: Uchiyama (1978), Longman et al. (1992), Moore (1993), Horowitz (1993), Bien et al. (1991), Owens et al. (1995), Chen et al. (1997), …
Successful robot control with a limited number of trials: Atkeson and Schaal (1997), Morimoto and Doya (2001).
Robust control theory: Zhou et al. (1995), Dullerud and Paganini (2000), …
Bagnell et al. (2001), Morimoto and Atkeson (2002), …

Page 21: Using Inaccurate Models in Reinforcement Learning

Conclusion

We presented an algorithm that uses a crude model and a small number of real-life trials to find a policy that works well in real life.

Our theoretical results show that, assuming a deterministic setting and a reasonable model, our algorithm returns a policy that is (locally) near-optimal.

Our experiments show that our algorithm can significantly improve on purely model-based RL by using only a small number of real-life trials, even when the true system is not deterministic.


Page 23: Using Inaccurate Models in Reinforcement Learning

Motivating Example

A student driver learning to make a 90-degree right turn: only a few trials are needed, and no accurate model is available.

Key aspects:
Real-life trial: shows whether the turn came out wide or short.
Crude model: turning the steering wheel more to the right results in a sharper turn; turning the steering wheel more to the left results in a wider turn.

Result: a good policy gradient estimate.