
Accelerating Online Reinforcement Learning with Offline Datasets

Ashvin Nair, Murtaza Dalal, Abhishek Gupta, Sergey Levine (UC Berkeley)

How can we learn difficult tasks with little online fine-tuning by using prior datasets?

[Setup figure: a prior dataset D = {(s, a, s', r)_j} of off-policy data, expert demos, or prior runs of RL is used in two phases: (1) offline learning, where the policy π_θ(a|s) and the critic Q_φ are trained on D; (2) online fine-tuning, where π_θ acts in the environment dynamics p(s'|s, a) and the new transitions (s, a, s', r) are added to the buffer for further π_θ and Q_φ updates.]
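A minimal, runnable sketch of this two-phase protocol, with placeholder random data and a stand-in update function (the actual π_θ / Q_φ update is the AWAC step described later; all names here are illustrative, not from the paper's code):

    import numpy as np

    rng = np.random.default_rng(0)
    obs_dim, act_dim = 4, 2

    # Prior dataset D = {(s, a, s', r)_j}: off-policy data, expert demos,
    # or prior runs of RL. Random placeholders stand in for real data here.
    n = 1000
    D = {
        "s":  rng.normal(size=(n, obs_dim)),
        "a":  rng.normal(size=(n, act_dim)),
        "s2": rng.normal(size=(n, obs_dim)),
        "r":  rng.normal(size=(n, 1)),
    }

    def sample_batch(buffer, batch_size=256):
        idx = rng.integers(len(buffer["s"]), size=batch_size)
        return {k: v[idx] for k, v in buffer.items()}

    def actor_critic_update(params, batch):
        # Stand-in for the pi_theta / Q_phi update; the AWAC version of this
        # step is sketched later in the document.
        return params

    params = {}

    # 1. Offline learning: train on the fixed dataset D.
    for _ in range(1000):
        params = actor_critic_update(params, sample_batch(D))

    # 2. Online fine-tuning: act with pi_theta, append the new transitions
    #    (s, a, s', r) to the same buffer, and keep applying the same update.
    s = rng.normal(size=obs_dim)
    for _ in range(200):
        a = rng.normal(size=act_dim)                    # placeholder for a ~ pi_theta(.|s)
        s2, r = rng.normal(size=obs_dim), rng.normal()  # placeholder for s' ~ p(.|s, a) and r(s, a)
        for key, val in zip(("s", "a", "s2", "r"), (s, a, s2, [r])):
            D[key] = np.concatenate([D[key], np.asarray(val)[None]], axis=0)
        params = actor_critic_update(params, sample_batch(D))
        s = s2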

Challenges: Fine-tuning with Existing Methods

1. On-policy fine-tuning methods exhibit slow online improvement.

[Figure: average return during online fine-tuning.]

[30] Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning. Peng et al., 2019.
[35] Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations. Rajeswaran et al., 2017.

2. Standard actor-critic methods do not take advantage of offline training, even if the policy is pretrained with behavioral cloning.

[Figure: average return during online fine-tuning.]

[11] Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. Haarnoja et al., 2019.
[43] Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards. Vecerik et al., 2017.

Actor-critic methods can be stabilized for offline training by incorporating a policy constraint in the policy improvement step:

\pi_{k+1} = \arg\max_{\pi \in \Pi} \; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ Q^{\pi_k}(s, a) \right] \quad \text{s.t.} \quad D\!\left( \pi(\cdot \mid s) \,\|\, \pi_\beta(\cdot \mid s) \right) \le \epsilon

The objective maximizes estimated returns while the constraint keeps the policy close to the data distribution; \pi_\beta is a behavior model of the prior data, trained by supervised learning.
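For illustration only, here is a minimal PyTorch sketch of a constrained update of this general kind, not the exact update of any of the cited methods: a behavior model pi_beta (a fixed-variance Gaussian here) is first fit to the dataset actions by supervised learning, and the actor then maximizes the estimated Q-value with a KL penalty toward pi_beta. The network shapes and the penalty weight are hypothetical placeholders.

    import torch
    import torch.nn as nn
    import torch.distributions as dist

    obs_dim, act_dim, batch = 4, 2, 256
    s = torch.randn(batch, obs_dim)   # states sampled from the prior dataset
    a = torch.randn(batch, act_dim)   # actions from the prior dataset

    # Behavior model pi_beta: trained by supervised learning (behavioral cloning).
    beta_mean = nn.Linear(obs_dim, act_dim)
    bc_loss = ((beta_mean(s) - a) ** 2).mean()   # would be minimized before the actor step

    # Actor pi_theta and critic Q_phi (placeholder networks).
    actor_mean = nn.Linear(obs_dim, act_dim)
    log_std = nn.Parameter(torch.zeros(act_dim))
    critic = nn.Linear(obs_dim + act_dim, 1)

    pi = dist.Normal(actor_mean(s), log_std.exp())
    pi_beta = dist.Normal(beta_mean(s).detach(), torch.ones(act_dim))

    # Policy improvement: maximize estimated returns E_{a~pi}[Q(s, a)]
    # while staying close to the data distribution via a KL penalty.
    new_a = pi.rsample()
    q_term = critic(torch.cat([s, new_a], dim=-1)).mean()
    kl_term = dist.kl_divergence(pi, pi_beta).sum(-1).mean()
    alpha = 1.0                                   # penalty weight standing in for the constraint
    actor_loss = -(q_term - alpha * kl_term)

The extra moving part in this sketch is the explicit behavior model pi_beta: the constraint can only be evaluated after fitting it to the prior data, which is the difficulty the next point describes.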

3. Existing policy constraint methods (BEAR [21], ABM [38], BRAC [46]) rely on behavior models of the prior data, which are difficult to train online.

[Figure: average return during offline training and online fine-tuning.]

[21] Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. Kumar et al., 2019.
[38] Keep Doing What Worked: Behavior Modelling Priors for Offline Reinforcement Learning. Siegel et al., 2019.
[46] Behavior Regularized Offline Reinforcement Learning. Wu et al., 2019.

Advantage Weighted Actor Critic (AWAC)

AWAC trains well offline, fine-tunes quickly online, and does not need to estimate a behavior model.

Policy improvement step:

\pi_{k+1} = \arg\max_{\pi \in \Pi} \; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ A^{\pi_k}(s, a) \right] \quad \text{s.t.} \quad D\!\left( \pi(\cdot \mid s) \,\|\, \pi_\beta(\cdot \mid s) \right) \le \epsilon

Enforcing the constraint in closed form and projecting the resulting policy onto the parametric class yields the weighted maximum-likelihood update

\theta_{k+1} = \arg\max_{\theta} \; \mathbb{E}_{s, a \sim \beta}\!\left[ \log \pi_\theta(a \mid s) \, \frac{1}{Z(s)} \exp\!\left( \frac{1}{\lambda} A^{\pi_k}(s, a) \right) \right]

In this way, AWAC incorporates a KL constraint into the actor-critic framework implicitly.
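A minimal PyTorch sketch of this weighted maximum-likelihood actor step, under simplifying assumptions: the advantage baseline v_baseline is a stand-in for E_{a'~pi}[Q_phi(s, a')], Z(s) is dropped, and the exponentiated weights are clipped for numerical stability. The names (actor_mean, q_phi, v_baseline, lam) and the clamp value are illustrative, not taken from the paper or its code.

    import torch
    import torch.nn as nn
    import torch.distributions as dist

    obs_dim, act_dim, batch, lam = 4, 2, 256, 1.0   # lam plays the role of the temperature lambda
    s = torch.randn(batch, obs_dim)                  # states from the buffer (offline data + online rollouts)
    a = torch.randn(batch, act_dim)                  # actions that were actually taken (a ~ beta)

    actor_mean = nn.Linear(obs_dim, act_dim)         # placeholder Gaussian policy pi_theta
    log_std = nn.Parameter(torch.zeros(act_dim))
    q_phi = nn.Linear(obs_dim + act_dim, 1)          # placeholder critic Q_phi
    v_baseline = nn.Linear(obs_dim, 1)               # stand-in for E_{a'~pi}[Q_phi(s, a')]

    # Advantage of the buffer actions under the current critic:
    # A(s, a) = Q_phi(s, a) - E_{a'~pi}[Q_phi(s, a')].
    with torch.no_grad():
        adv = q_phi(torch.cat([s, a], dim=-1)) - v_baseline(s)
        weights = torch.exp(adv / lam).clamp(max=100.0)   # exp(A / lambda), clipped; Z(s) dropped here

    # Weighted maximum likelihood on buffer actions: a purely supervised update,
    # so no behavior model pi_beta is ever estimated.
    pi = dist.Normal(actor_mean(s), log_std.exp())
    log_prob = pi.log_prob(a).sum(-1, keepdim=True)
    actor_loss = -(log_prob * weights).mean()
    actor_loss.backward()

In the full algorithm this actor step alternates with a standard temporal-difference update of Q_phi on the same replay buffer, both during offline training and during online fine-tuning.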

Dexterous Manipulation Tasks

Our algorithm can be used to solve difficult dexterous manipulation tasks: it solves door opening in under 1 hour of online interaction.
