Exploration (Part 1) -...
Transcript of Exploration (Part 1) -...
![Page 1: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/1.jpg)
Distributed RLRichard Liaw
![Page 2: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/2.jpg)
Common Computational Patterns for RL
Batch Optimization
Simulation
Simulation
Simulation
OptimizationOptimization
How can we better utilize our computational resourcesto accelerate RL progress?
Original
![Page 3: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/3.jpg)
History of large scale distributed RL
2013
DQN
Playing Atari with Deep Reinforcement Learning
(Mnih 2013)
GORILA
Massively Parallel Methods for Deep
Reinforcement Learning(Nair 2015)
2015
A3C
Asynchronous Methods for Deep Reinforcement
Learning(Mnih 2016)
2016
Ape-X
Distributed Prioritized Experience Replay
(Horgan 2018)
2018
IMPALA
IMPALA: Scalable Distributed Deep-RL with
Importance Weighted Actor-Learner Architectures
(Espeholt 2018)
2018
R2D3
Making Efficient Use of Demonstrations to Solve
Hard Exploration Problems
(Le Paine 2019)
2019
![Page 4: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/4.jpg)
2013/2015: DQN
for i in range(T): s, a, s_1, r = evaluate() replay.store((s, a, s_1, r))
minibatch = replay.sample() q_network.update(mini_batch)
if should_update_target(): q_network.sync_with(target_net)
![Page 5: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/5.jpg)
2015: General Reinforcement Learning Architecture (GORILA)
![Page 6: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/6.jpg)
GORILA Performance
![Page 7: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/7.jpg)
History of large scale distributed RL
2013
DQN
Playing Atari with Deep Reinforcement Learning
(Mnih 2013)
GORILA
Massively Parallel Methods for Deep
Reinforcement Learning(Nair 2015)
2015
A3C
Asynchronous Methods for Deep Reinforcement
Learning(Mnih 2016)
2016
Ape-X
Distributed Prioritized Experience Replay
(Horgan 2018)
2018
IMPALA
IMPALA: Scalable Distributed Deep-RL with
Importance Weighted Actor-Learner Architectures
(Espeholt 2018)
2018
R2D3
Making Efficient Use of Demonstrations to Solve
Hard Exploration Problems
(Le Paine 2019)
2019
![Page 8: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/8.jpg)
2016: Asynchronous Advantage Actor Critic (A3C)
Sends gradients back
# Each worker:
while True: sync_weights_from_master()
for i in range(5): collect sample from env
grad = compute_grad(samples) async_send_grad_to_master()
Each has different exploration -> more diverse samples!
![Page 9: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/9.jpg)
A3C Performance
Changes to GORILA:
1. Faster updates2. Removes the
replay buffer3. Moves to
Actor-Critic (from Q learning)
![Page 10: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/10.jpg)
Importance Weighted Actor-Learner Architectures (IMPALA)
Motivated by progress in distributed deep learning!
![Page 11: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/11.jpg)
How to correct for Policy Lag? Importance Sampling!
Given an actor-critic model:
1. Apply importance-sampling to policy gradient
2. Apply importance sampling to critic update
![Page 12: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/12.jpg)
Ape-X/R2D2 (2018)
Scaling Off-Policy learning...
Ape-X:1. Distributed
DQN/DDPG/R2D2
2. Reintroduces replay
3. Distributed Prioritization: Unlike Prioritized DQN, initial priorities are not set to “max TD”
![Page 13: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/13.jpg)
Ape-X Performance
![Page 14: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/14.jpg)
With Demonstrations: R2D3 (2019)
![Page 15: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/15.jpg)
Other interesting distributed architectures
![Page 17: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/17.jpg)
AlphaZero
Each model trained on 64 GPUs and 19 parameter servers!
![Page 18: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/18.jpg)
Evolution Strategies
![Page 19: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/19.jpg)
Beyond RL: Population-based Training
![Page 20: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/20.jpg)
Benefits of PBT
https://deepmind.com/blog/article/population-based-training-neural-networks
![Page 21: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/21.jpg)
http://rllib.io
RLlib: Abstractions for Distributed Reinforcement Learning (ICML'18)
21
Eric Liang*, Richard Liaw*, Philipp Moritz, Robert Nishihara, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, Ion Stoica
![Page 22: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/22.jpg)
http://rllib.io 22
Fig. courtesy OpenAI
Fig. courtesy NVidia Inc.
RL research scales with compute
![Page 23: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/23.jpg)
http://rllib.io
How do we leverage this hardware?
23
scalable abstractions for RL?
(a) Supervised Learning (b) Reinforcement Learning
![Page 24: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/24.jpg)
http://rllib.io
Systems for RL today• Many implementations (16000+ repos on GitHub!)
– how general are they (and do they scale)?PPO: multiprocessing, MPI AlphaZero: custom systems
Evolution Strategies: Redis IMPALA: Distributed TensorFlow
A3C: shared memory, multiprocessing, TF
• Huge variety of algorithms and distributed systems used to implement, but little reuse of components
24
![Page 25: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/25.jpg)
http://rllib.io
Challenges to reuse
1. Wide range of physical execution strategies for one "algorithm"
25
single-node cluster
GPU
CPUsynchronous
asynchronous
send gradients
send experiences
MPImultiprocessing
param-server
![Page 26: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/26.jpg)
http://rllib.io
Challenges to reuse
2. Tight coupling with deep learning frameworks
26
Different parallelism paradigms:– Distributed TensorFlow vs TensorFlow + MPI?
![Page 27: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/27.jpg)
http://rllib.io
Challenges to reuse
3. Large variety of algorithms with different structures
27
![Page 28: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/28.jpg)
http://rllib.io
We need abstractions for RL
Good abstractions decompose RL algorithms into reusable components.
Goals:– Code reuse across deep learning frameworks– Scalable execution of algorithms– Easily compare and reproduce algorithms
28
![Page 29: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/29.jpg)
http://rllib.io
Structure of RL computations
29
Agent Environment
action (ai+1)
Policy:state → action state (si)(observation)
reward (ri)
![Page 30: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/30.jpg)
http://rllib.io
Structure of RL computations
30
Agent Environment
action (ai+1)
state (si)(observation)
reward (ri)
Policy evaluation
(state → action)
Policy improvement
(e.g., SGD)
trajectory X: s0, (s1, r1), …, (sn, rn)
policy
![Page 31: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/31.jpg)
http://rllib.io
Many RL loop decompositions
31
Actor-Learner
Actor-Learner
Actor-Learner
ParamServer Learner
Replay
Async DQN (Mnih et al; 2016) Ape-X DQN (Horgan et al; 2018)
Actor
Actor
Actor
X <- rollout()dθ <- grad(L, X)sync(dθ)
θ <- sync()rollout()
X <- replay()apply(grad(L, X))
![Page 32: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/32.jpg)
http://rllib.io
Replay
Common components
32
Actor-Learner
Actor-Learner
Actor-Learner
ParamServer
Actor
Actor
Actor
Learner
Replay
Async DQN (Mnih et al; 2016) Ape-X DQN (Horgan et al; 2018)
ActorActor
ActorActor
ActorActor
Policy πθ(ot)
Trajectory postprocessor ρθ(X)
Loss L(θ,X)
![Page 33: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/33.jpg)
http://rllib.io
Replay
Common components
33
Actor-Learner
Actor-Learner
Actor-Learner
ParamServer
Actor
Actor
Actor
Learner
Replay
Async DQN (Mnih et al; 2016) Ape-X DQN (Horgan et al; 2018)
ActorActor
ActorActor
ActorActor
Policy πθ(ot)
Trajectory postprocessor ρθ(X)
Loss L(θ,X)
![Page 34: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/34.jpg)
http://rllib.io
Structural differences
34
Async DQN (Mnih et al; 2016)● Asynchronous optimization● Replicated workers● Single machine
Ape-X DQN (Horgan et al; 2018)● Central learner● Data queues between components● Large replay buffers● Scales to clusters
...and this is just one family!
➝ No existing system can effectively meet all the varied demands of RL workloads.
+ Population-Based Training (Jaderberg et al; 2017)
● Nested parallel computations● Control decisions based on
intermediate results
![Page 35: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/35.jpg)
http://rllib.io
Requirements for a new systemGoal: Capture a broad range of RL workloads with high performance and substantial code reuse1. Support stateful computations
- e.g., simulators, neural nets, replay buffers- big data frameworks, e.g., Spark, are typically stateless
2. Support asynchrony- difficult to express in MPI, esp. nested parallelism
3. Allow easy composition of (distributed) components
35
![Page 36: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/36.jpg)
http://rllib.io
Ray System Substrate
36
Hierarchical Task Model
• RLlib builds on Ray to provide higher-level RL abstractions• Hierarchical parallel task model with stateful workers
– flexible enough to capture a broad range of RL workloads (vs specialized sys.)
single-node cluster
GPU
CPUsynchronous
asynchronous
send gradients
send experiences
MPImultiprocessing
param-server
![Page 37: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/37.jpg)
http://rllib.io
Ray Cluster
Hierarchical Parallel Task Model1. Create Python class instances in the cluster (stateful workers)2. Schedule short-running tasks onto workers
– Challenge: High performance: 1e6+ tasks/s, ~200us task overhead
37
Top-level worker(Python process)
Sub-worker (process)
Sub-worker
Sub-worker
"collect experiences"
Sub-sub workerprocesses
"do model-based rollouts"
"allreduce your
gradients"
exchange weight shards through Ray object store
"run K steps of training"
![Page 38: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/38.jpg)
http://rllib.io
Unifying system enables RL Abstractions
38
Policy Optimizer Abstraction
SyncSamples AsyncSamplesAsyncGradientsSyncReplay MultiGPU ...
Policy Graph Abstraction{πθ, ρθ, L(θ,X)}
{Q-func, n-step, Q-loss}
{LSTM, adv. calc, PG loss}
Examples:
Hierarchical Task Model
single-node cluster
GPU
CPUsynchronous
asynchronous
send gradients
send experiences
![Page 39: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/39.jpg)
http://rllib.io
RLlib Abstractions in Action
39
Policy Optimizers
SyncSamples AsyncSamplesAsyncGradientsSyncReplay MultiGPU ...
{Q-func, n-step, Q-loss}
{LSTM, adv. calc, PG loss}
DQN (2015)
Async DQN (2016)
Ape-X (2018)
Policy Gradient (2000)
+actor-critic loss, GAE
A2C (2016)
PPO (GPU-optimized)PPO (2017)+clipped obj.
IMPALA (2018)+V-trace
A3C (2016)
PolicyGraphs
![Page 40: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/40.jpg)
http://rllib.io
RLlib Reference Algorithms• High-throughput architectures
– Distributed Prioritized Experience Replay (Ape-X)– Importance Weighted Actor-Learner Architecture (IMPALA)
• Gradient-based– Advantage Actor-Critic (A2C, A3C)– Deep Deterministic Policy Gradients (DDPG)– Deep Q Networks (DQN, Rainbow)– Policy Gradients– Proximal Policy Optimization (PPO)
• Derivative-free– Augmented Random Search (ARS)– Evolution Strategies
Community Contributions
![Page 41: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/41.jpg)
http://rllib.io
Scale your algorithms with RLlib
41
• Beyond a "collection of algorithms",• RLlib's abstractions let you easily implement and scale new
algorithms (multi-agent, novel losses, architectures, etc)
![Page 42: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/42.jpg)
http://rllib.io
Code example: training PPO
![Page 43: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/43.jpg)
http://rllib.io
Code example: hyperparam tuning
![Page 44: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/44.jpg)
http://rllib.io
Code example: hyperparam tuning
![Page 45: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/45.jpg)
http://rllib.io
RLlib is open source and available at http://rllib.ioThanks!
45
Summary: Ray and RLlib addresses challenges in providing scalable abstractions for reinforcement learning.
![Page 46: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/46.jpg)
http://rllib.io
Ray distributed execution engine
46
• Ray provides Task parallel and Actor APIs built on dynamic task graphs
• These APIs are used to build distributed applications, libraries and systems
Ray execution modelDynamic Task Graphs
Applications...Numerical computation
Third-party simulators
Ray programming modelTask Parallelism Actors
![Page 47: Exploration (Part 1) - rail.eecs.berkeley.edurail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-17.pdf · Today’s Lecture 1. What is exploration? Why is it a problem? 2. Multi-armed](https://reader030.fdocuments.in/reader030/viewer/2022040707/5e088bd5d7b98c607d465887/html5/thumbnails/47.jpg)
http://rllib.io
Ray distributed scheduler
47
• Faster than Python multi-processing on a single node
• Competitive with MPI in many workloads
WorkerDriver WorkerWorker WorkerWorker
Object Store Object Store Object Store
Local Scheduler Local Scheduler Local Scheduler
Global Scheduler
Global Scheduler
Global SchedulerGlobal Scheduler