Transcript of "Distributed Meta Optimization of Reinforcement Learning Agents"
Greg Heinrich, Iuri Frosio - GTC San Jose, March 2019

Page 1

Distributed Meta Optimization of Reinforcement Learning Agents

Greg Heinrich, Iuri Frosio - GTC San Jose, March 2019

Page 2

AGENDA

Contents

Introduction to Reinforcement Learning

Introduction to Metaoptimization (on distributed systems) / MagLev

Metaoptimization and Reinforcement Learning (on distributed systems)

HyperTrick

Results

Conclusion

Page 3

GPU-Based A3C for Deep Reinforcement Learning (RL)

keywords: GPU, A3C, RL

M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, J. Kautz, Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR 2017 (available at https://openreview.net/forum?id=r1VGvBcxl&noteId=r1VGvBcxl). Open source implementation: https://github.com/NVlabs/GA3C.

Page 4

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Learning to accomplish a task

Image from www.33rdsquare.com

Page 5

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Definitions

[Diagram: an RL agent interacts with the environment, observing the state St and reward Rt and responding with the action at = π(St)]

✓ Environment

✓ Agent

✓ Observable state St

✓ Reward Rt

✓ Action at

✓ Policy at = π(St)
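
To make these definitions concrete, the loop can be written in a few lines of Python; this is an illustrative sketch assuming the classic OpenAI Gym API and a random placeholder policy, not code from the talk:

    import gym

    # Illustrative sketch (not from the talk): one episode of agent-environment interaction.
    env = gym.make("Pong-v0")           # an Atari environment, as used later in the results

    state = env.reset()                 # initial observable state S_0
    done = False
    episode_return = 0.0

    while not done:
        # a_t = pi(S_t); a random action stands in for the learned policy here
        action = env.action_space.sample()
        # the environment returns the next state S_{t+1}, the reward R_t, and a termination flag
        state, reward, done, info = env.step(action)
        episode_return += reward

    print("episode return:", episode_return)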

Page 6

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Definitions

[Diagram: the same loop, with a deep neural network as the RL agent; the deep RL agent observes St, Rt and outputs at = π(St)]

Page 7

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Definitions

[Diagram: the agent collects rewards R0 R1 R2 R3 R4 along a rollout and uses them to compute a policy update Δπ(∙)]

Page 8

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Objective: maximize expected discounted rewards

Value of a state

The role of 𝛾: short or far-sighted agents

0 < 𝛾 < 1, usually 0.99
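
The value equation itself did not survive extraction; a standard reconstruction consistent with the definitions above (discounted return and value of a state, with discount factor 𝛾) is:

    % Discounted return collected from time t onward
    G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k}

    % Value of a state: expected discounted return when following the policy \pi from S_t = s
    V^{\pi}(s) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k} \,\middle|\, S_t = s \right]

With 𝛾 close to 1 (e.g. 0.99) rewards far in the future still matter (far-sighted agent); with a smaller 𝛾 they are discounted away quickly (short-sighted agent).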

Page 9

GPU-Based A3C for Deep Reinforcement Learning (RL)

keywords: GPU, A3C, RL

M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, J. Kautz, Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR 2017 (available at https://openreview.net/forum?id=r1VGvBcxl&noteId=r1VGvBcxl). Open source implementation: https://github.com/NVlabs/GA3C.

Page 10

GPU-BASED A3C FOR DEEP REINFORCEMENT LEARNING

Asynchronous Advantage Actor-Critic (Mnih et al., arXiv:1602.01783v2, 2016)

[Diagram: 16 agents (Agent 1 ... Agent 16), each interacting with its own copy of the environment (St, Rt; at = π(St); rewards R0 ... R4), asynchronously send policy updates Δπ(∙) to a master model, which returns the updated policy π'(∙)]

Page 11

M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, J. Kautz, Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU, ICLR 2017 (available at https://openreview.net/forum?id=r1VGvBcxl&noteId=r1VGvBcxl). Open source implementation: https://github.com/NVlabs/GA3C.

GPU-Based A3C for Deep Reinforcement Learning (RL)

keywords: GPU, A3C, RL

Page 12

MAPPING DEEP PROBLEMS TO A GPU

REGRESSION, CLASSIFICATION, ...: batches of data and labels (pear, pear, ...; fig, fig, ...; strawberry, ...; empty, ...) keep the GPU at 100% utilization / occupancy.

REINFORCEMENT LEARNING: the loop of status, reward, and action - how do we map it to the GPU?

Page 13

A3C

[Diagram: A3C - Agents 1 ... 16 interact with their environments (St, Rt; at = π(St)) and collect rewards R0 ... R4 over a rollout of length t_max before contacting the master model]

Page 14

A3C

[Diagram: A3C - the same architecture; each agent keeps playing and collecting rewards R0 ... R4]

Page 15

A3C

[Diagram: A3C - after t_max steps each agent computes a policy update Δπ(∙) from its rewards R0 ... R4]

Page 16

A3C

[Diagram: A3C - the updates Δπ(∙) are applied asynchronously to the master model, which sends the updated policy π'(∙) back to the agents]

Page 17

GA3C (INFERENCE)

[Diagram: GA3C inference - Agents 1 ... N push their states St to a prediction queue; predictor threads batch the queued states {St}, run one forward pass of the master model on the GPU, and return the actions {at} to the agents]

Page 18

GA3C (TRAINING)

[Diagram: GA3C training - Agents 1 ... N push experiences St, Rt (with rewards R0 ... R4) to a training queue; trainer threads batch the experiences {St, Rt} and update the master model with Δπ(∙)]

Page 19

GA3C

[Diagram: GA3C - the complete system: a single master model on the GPU serves both the inference path (prediction queue + predictors returning {at}) and the training path (training queue + trainers applying Δπ(∙))]
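
The decoupling sketched above can be illustrated with plain Python queues; this is a simplified sketch of the producer/consumer pattern (the names model.predict and model.train_on_batch are hypothetical stand-ins), not the actual GA3C implementation, which is available at https://github.com/NVlabs/GA3C:

    import queue

    BATCH_SIZE = 32
    prediction_queue = queue.Queue()   # agents push (agent_id, state)
    training_queue = queue.Queue()     # agents push (states, rewards) experiences

    def predictor(model, agent_pipes):
        # Batch queued states, run a single forward pass on the GPU, send actions back to agents.
        while True:
            batch = [prediction_queue.get()]                      # block for at least one state
            while len(batch) < BATCH_SIZE and not prediction_queue.empty():
                batch.append(prediction_queue.get())
            agent_ids, states = zip(*batch)
            actions = model.predict(states)                       # one batched inference call
            for agent_id, action in zip(agent_ids, actions):
                agent_pipes[agent_id].put(action)                 # return a_t to the right agent

    def trainer(model):
        # Batch queued experiences and apply one update (delta pi) to the master model.
        while True:
            batch = [training_queue.get()]
            while len(batch) < BATCH_SIZE and not training_queue.empty():
                batch.append(training_queue.get())
            states, rewards = zip(*batch)
            model.train_on_batch(states, rewards)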

Page 20

CPU & GPU UTILIZATION IN GA3C

For larger DNNs GA3C is bandwidth limited and does not scale to multiple GPUs!

[Diagram: utilization over time - the CPU handles environment simulation, the GPU handles inference / training]

Page 21

Role of t_max

t_max = 4 [play to the end - Monte Carlo]

No variance (collected rewards are real)

High bias (we played only once)

One update every t_max frames

[Diagram: a 4-step rollout with example rewards 1, 4, -2, -6]

Page 22

Role of t_max

t_max = 2

High variance (noisy value network)

Low bias (unbiased net, many agents)

More updates per second

[Diagram: a 2-step rollout with rewards 1 and 4; the value network estimates the return from the cut point as approximately 2.5]

t_max affects bias, variance, and computational cost (number of updates per second, batch size)
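
The role of t_max can also be written as a small return computation; the sketch below is my own illustration (not code from the talk) of how rewards collected over a rollout of length t_max are combined with a value-network bootstrap at the cut point:

    def n_step_returns(rewards, bootstrap_value, gamma=0.99):
        # rewards:         [R_0, ..., R_{t_max-1}] collected by the agent over one rollout
        # bootstrap_value: value-network estimate V(S_{t_max}) of the state where the rollout
        #                  was cut (use 0.0 if the episode actually terminated)
        returns = []
        running = bootstrap_value
        for r in reversed(rewards):
            running = r + gamma * running
            returns.append(running)
        return list(reversed(returns))

    # Slide example: t_max = 2, rewards 1 and 4, value network estimates ~2.5 at the cut point.
    print(n_step_returns([1.0, 4.0], bootstrap_value=2.5))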

Page 23

Other parameters and stability

Hyperparameter search in 2015: the search for the optimal learning rate

https://arxiv.org/pdf/1602.01783.pdf

Page 24

GA3C on distributed systems

● RL is unstable; metaoptimization helps search for optimal hyperparameters

● E.g. the learning rate may affect the stability and speed of convergence

● GA3C does not scale to multiple GPUs (bandwidth limited), but we can run parallel instances of GA3C on a distributed system

● The discount factor 𝛾 affects the final aim (short- or far-sighted agent)

● The t_max factor affects the computational cost and stability* of GA3C

* See G. Heinrich, I. Frosio, Metaoptimization on a Distributed System for Deep Reinforcement Learning, https://arxiv.org/abs/1902.02725.

Page 25

AGENDA

Contents

Introduction to Reinforcement Learning

Introduction to Metaoptimization (on distributed systems) / MagLev

Metaoptimization and Reinforcement Learning (on distributed systems)

HyperTrick

Results

Conclusion

Page 26

META OPTIMIZATION

Hyperparameters of a GA3C agent - it is as easy as flying a Concorde.

• Topology parameters
  ▪ Number of layers and their width
  ▪ Choice of activations

• Training parameters
  ▪ Learning rate
  ▪ Reward decay rate (𝛄)
  ▪ Back-propagation window size (t_max)
  ▪ Choice of optimizer
  ▪ Number of training episodes

• Data parameters
  ▪ Environment model

Exhaustive search is intractable.

[Image: Concorde cockpit. Source: Christian Kath]

Page 27

META OPTIMIZATION

How does a standard optimization algorithm fare? Example: Tree of Parzen Estimators

• Two parameters, one metric to minimize.

• Optimization trade-offs:
  ▪ Exploitation vs. exploration.
  ▪ Wall time vs. resource efficiency.

• Optimization packages start with a random search.

• Tens of experiments are needed before historical records can be leveraged.

Warm starts are needed to cut down complexity over time.

Page 28

META OPTIMIZATION

The Need for Diversity: Metric Variance

• Non-determinism makes individual experiments inconclusive.

• A change can only be considered an improvement if it works under a variety of conditions.

Meta optimization should be part of data scientists' daily routine.

Page 29

META OPTIMIZATION

The Complexity of Evaluating Models: Complex Pipelines

• Evaluation cannot be reduced to a single Python function or Docker container.

Meta optimization must be independent of task scheduling.

Page 30

META OPTIMIZATION

Project MagLev: Machine Learning Platform - Architecture

• Scalable platform for traceable machine learning workflows.

• Self-documented experiments.

• Services can be used in isolation, or combined for maximum traceability.

Page 31

META OPTIMIZATION

MagLev Experiment Tracking: Knowledge Base

• Experiment data is fully connected.

• Objects are searched through their relationships with others.

No information silo.

[Diagram: linked experiment-tracking records, e.g.:
  Model         id 6a-e7-59: job_id ae-45-4a, param_set_id 45-4a-26, dataset_id ac-73-fc, creator bob, created 2018-09-28
  Workflow      id fb-d5-7a: desc my_v0.1.1, spec ..., project DormRoom, creator bob, created 2018-09-28
  Job           id ae-45-4a: scm SHA:3487c, workflow_id fb-d5-7a, creator bob, created 2018-09-28
  ExperimentSet id 03-ad-24: config yml, project DormRoom, creator bob, created 2018-09-28
  ParamSet      id 45-4a-26: exp_id 78-3f-58
  Metric        id 69-bd-4c: name xentropy, value 0.01, param_set_id 45-4a-26, job_id ae-45-4a, model_id 6a-e7-59, dataset_id ac-73-fc, creator bob, created 2018-09-28
  Dataset       id ac-73-fc: vdisk zzz://..., project DormRoom, creator bob, created 2018-09-28]

Page 32

META OPTIMIZATION

Typical Setup: Main SDK Features

• All common parameter types.

• Early-termination methods.

• Standard + custom parameter-picking methods.

Page 33

AGENDA

Contents

Introduction to Reinforcement Learning

Introduction to Metaoptimization (on distributed systems) / MagLev

Metaoptimization and Reinforcement Learning (on distributed systems)

HyperTrick

Results

Conclusion

Page 34

META OPTIMIZATION + GA3C

Hyperparameters and Preview of Results

• Learning rate: log-uniform distribution over the [1e-5, 1e-2] interval.

• t_max: quantized (q=1) log-uniform distribution over the [2, 100] interval.

• 𝜸: one of {0.9, 0.95, 0.99, 0.995, 0.999, 0.9995, 0.9999}
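
As an illustration of these three search distributions, here is how one might sample them with plain NumPy (this is not the MagLev SDK, whose exact API is not shown in the transcript):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_learning_rate():
        # log-uniform over [1e-5, 1e-2]
        return float(10.0 ** rng.uniform(-5.0, -2.0))

    def sample_t_max():
        # quantized (q=1) log-uniform over [2, 100]
        return int(round(np.exp(rng.uniform(np.log(2.0), np.log(100.0)))))

    def sample_gamma():
        # categorical choice over the listed values
        return float(rng.choice([0.9, 0.95, 0.99, 0.995, 0.999, 0.9995, 0.9999]))

    print({"lr": sample_learning_rate(), "t_max": sample_t_max(), "gamma": sample_gamma()})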

Page 35

AGENDA

Contents

Introduction to Reinforcement Learning

Introduction to Metaoptimization (on distributed systems) / MagLev

Metaoptimization and Reinforcement Learning (on distributed systems)

HyperTrick

Results

Conclusion

Page 36

HYPERTRICK

Early Termination Without Compromise

Successive Halving (SH) [1]

Terminate ½ of the workers every N·2^P units of work (P is a phase index).

Requires synchronization between workers.

Assumes relative performance over time is constant.

[1] https://arxiv.org/abs/1502.07943
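
A small worked example of the schedule (my own illustration, not from the slide): with 16 workers, halving the pool at every phase boundary while doubling the per-worker budget gives the following phases.

    # Toy illustration of Successive Halving: halve the worker pool every N * 2**P units of work.
    workers = 16
    N = 1                    # units of work per worker in phase 0 (arbitrary for this example)
    P = 0
    while workers > 1:
        budget = N * 2 ** P  # units of work granted to each surviving worker in this phase
        print(f"phase {P}: {workers} workers run {budget} unit(s) each, then the bottom half stops")
        workers //= 2
        P += 1
    # phase 0: 16 workers x 1 unit, phase 1: 8 x 2, phase 2: 4 x 4, phase 3: 2 x 8, 1 winner left.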

Page 37

HYPERTRICK

Early Termination Without Compromise

Successive Halving (SH) [1]

Terminate ½ of the workers every N·2^P units of work (P is a phase index).

Requires synchronization between workers.

Assumes relative performance over time is constant.

HyperBand [2]

Run several instances of SH in parallel with different values of N.

Does not assume constant relative performance, but still requires synchronization.

[1] https://arxiv.org/abs/1502.07943 [2] http://arxiv.org/abs/1603.06560

Page 38

HYPERTRICK

Early Termination Without Compromise

Successive Halving (SH) [1]

Terminate ½ of the workers every N·2^P units of work (P is a phase index).

Requires synchronization between workers.

Assumes relative performance over time is constant.

HyperBand [2]

Run several instances of SH in parallel with different values of N.

Does not assume constant relative performance, but still requires synchronization.

HyperTrick [3]

Parameters:
- N: total number of workers.
- r: eviction rate per phase P.

Worker:

    while maglev.should_continue():
        work()
        maglev.report_metrics()

MagLev:
- Lets the earliest N(1-√r)(1-r^P) workers run.
- Others are terminated if in the bottom √r quantile.

Expected number of workers at the end of phase P is N(1-r)^P.

[1] https://arxiv.org/abs/1502.07943 [2] http://arxiv.org/abs/1603.06560 [3] https://arxiv.org/abs/1902.02725
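
A minimal sketch of the HyperTrick control flow: the worker loop is the one shown on the slide, while the scheduler-side check is a hypothetical stand-in for the MagLev service; the exact eviction rule (including the N(1-√r)(1-r^P) bound) is detailed in the paper, https://arxiv.org/abs/1902.02725:

    import math

    # Worker side: exactly the loop shown on the slide.
    def worker(maglev, work):
        while maglev.should_continue():   # the scheduler decides whether this worker keeps running
            work()                        # one unit of work (e.g., a chunk of GA3C training)
            maglev.report_metrics()       # report the current score, used for eviction decisions

    # Scheduler side: hypothetical stand-in for the MagLev service.
    def should_continue(all_scores, my_score, r):
        # Keep a worker unless its reported score falls in the bottom sqrt(r) quantile.
        threshold_rank = math.sqrt(r) * len(all_scores)
        rank = sorted(all_scores).index(my_score)     # 0 = worst score (ties handled naively)
        return rank >= threshold_rank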

Page 39

HYPERTRICK vs. SUCCESSIVE HALVING

16 workers on 6 nodes running up to 4 iterations

[Diagram: scheduling timelines for Successive Halving and HyperTrick]

Successive Halving: must support preemption. HyperTrick: no context switches, shorter wall time.

Page 40

AGENDA

Contents

Introduction to Reinforcement Learning

Introduction to Metaoptimization (on distributed systems) / MagLev

Metaoptimization and Reinforcement Learning (on distributed systems)

HyperTrick

Results

Conclusion

Page 41

PONG: Videos of Trained Agents

[Videos: 𝛄=0.9 (short-sighted) vs. 𝛄=0.995 (far-sighted)]

Page 42

HYPERTRICK: Terminate Underperformers

[Plots: training curves for Boxing, Centipede, Pacman, and Pong; unpainted area = saved resources]

Page 43

HYPERTRICK: Comparison Against HyperBand

[Plots: cluster occupancy (Pong), timelines (Pong), best score vs. wall time (Pong)]

Page 44

Experimental Comparison of HyperTrick vs. HyperBand

META OPTIMIZATION

Page 45

META OPTIMIZATION + GA3C

Conclusion

Take Away

RL is unstable => meta optimization is useful.

HyperTrick, a new algorithm for metaoptimization.

GA3C + HyperTrick + MagLev is effective.

Our paper: https://arxiv.org/abs/1902.02725

GA3C: https://github.com/NVlabs/GA3C

MagLev info: Yehia Khoja ([email protected])

Related Talks

S9649  Wed 9:00am   NVIDIA's AI Infrastructure for Self-Driving Cars  Clement Farabet
S9613  Wed 10:00am  Deep Active Learning                              Adam Lesnikowski
S9911  Wed 2:00pm   Determinism In Deep Learning                      Duncan Riach
S9630  Thu 2:00pm   Scaling Up DL for Autonomous Driving              Jose Alvarez
S9987  Thu 9:00am   MagLev: production-grade AI platform...           Divya Vavili