Guiding Policies with John D (JD) Co-Reyes,...
![Page 1: Guiding Policies with John D (JD) Co-Reyes, …metalearning.ml/2018/slides/meta_learning_2018_CoReyes.pdfGuiding Policies with Language via Meta-Learning John D (JD) Co-Reyes, Abhishek](https://reader033.fdocuments.in/reader033/viewer/2022060501/5f1b0e7598dd4832d956682d/html5/thumbnails/1.jpg)
Guiding Policies with Language via Meta-Learning
John D (JD) Co-Reyes, Abhishek Gupta, Suvansh Sanjeev, Nick Altieri,
John DeNero, Pieter Abbeel, Sergey Levine
Ideal Robot
Learning new tasks quickly
● Want diverse range of skills
● Cost of supervision can be high
● Want to learn new things with as little supervision as possible
Meta-RL
Prior Experience Fast Learning of New Tasks
Leverage prior experience to quickly learn new tasks
Meta-Training Meta-Testing
Problem with reward design
Hard to provide
Hard to design
Hard to learn from
More natural way to provide supervision
Human feedback
Replace reward with human feedback
Human-in-the-loop supervision
RL Algorithm
Deep TAMER (Warnell et al.)
Preferences (Christiano et al.)
Why are current methods insufficient?
Scalar Feedback → RL Algorithm: 1 bit
● Very few bits of information per intervention
● Significant human effort
Language Feedback → RL Algorithm: 230 bits
● More bits of information per intervention
● Less human effort
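One rough way to read the 1 bit versus 230 bits comparison is as an upper bound on how much information a single piece of feedback can carry. The slide does not say how its 230-bit figure was obtained, so the sketch below is only illustrative: the uniform-distribution bound, the function name `max_bits`, and the vocabulary size and sentence length are all assumptions, not the authors' calculation.

```python
import math


def max_bits(num_choices: int, length: int = 1) -> float:
    """Upper bound, in bits, on a message of `length` symbols,
    each drawn from `num_choices` equally likely options."""
    return length * math.log2(num_choices)


# Binary scalar feedback (good / bad): one symbol with two options.
scalar_bits = max_bits(2)

# Hypothetical language correction: 10 words from a 1024-word vocabulary.
language_bits = max_bits(1024, length=10)

print(scalar_bits, language_bits)
```

Under these assumed numbers a single sentence can carry far more information than a single scalar signal, which is the qualitative point of the slide.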
Language Corrections
Problem Setting
Agent provided with ambiguous/incomplete instruction
Quickly incorporate language corrections in the loop
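The interaction described in this problem setting can be sketched as a simple loop: the agent attempts the task, a human gives a language correction, and the agent tries again with the accumulated history. This is a minimal sketch under stated assumptions, not the authors' implementation. The names `correction_loop`, `toy_policy`, `toy_human`, `toy_is_solved`, and the `SUBGOALS` list are hypothetical stand-ins for a learned policy, a human supervisor, and an environment.

```python
from typing import Callable, List, Tuple

Trajectory = List[str]
History = List[Tuple[Trajectory, str]]  # (trajectory, correction) pairs


def correction_loop(
    instruction: str,
    policy: Callable[[str, History], Trajectory],
    give_correction: Callable[[Trajectory], str],
    is_solved: Callable[[Trajectory], bool],
    max_corrections: int = 5,
) -> Tuple[Trajectory, int, bool]:
    """Attempt the task, collecting one correction per failed attempt."""
    history: History = []
    trajectory = policy(instruction, history)
    while not is_solved(trajectory) and len(history) < max_corrections:
        correction = give_correction(trajectory)
        history.append((trajectory, correction))
        # The next attempt conditions on all previous attempts and corrections.
        trajectory = policy(instruction, history)
    return trajectory, len(history), is_solved(trajectory)


# Hypothetical toy task: three subgoals that must all be carried out in order.
SUBGOALS = ["enter the blue room", "pick up the blue triangle", "go to the green goal"]


def toy_policy(instruction: str, history: History) -> Trajectory:
    # Toy stand-in: carries out every correction received so far, in order.
    return [correction for _, correction in history]


def toy_human(trajectory: Trajectory) -> str:
    # Toy stand-in: names the first subgoal the trajectory has not achieved.
    for goal in SUBGOALS:
        if goal not in trajectory:
            return goal
    return SUBGOALS[-1]


def toy_is_solved(trajectory: Trajectory) -> bool:
    return trajectory == SUBGOALS


trajectory, num_corrections, solved = correction_loop(
    "Move blue triangle to green goal.", toy_policy, toy_human, toy_is_solved
)
print(solved, num_corrections)
```

In the toy run the initial instruction alone is not enough to finish the task, and each correction supplies one missing piece, mirroring the "ambiguous/incomplete instruction" setting on the slide.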
Language Guided Policy Model
Three modules: corrections, policy, and instruction
Model improves based on previous trajectories and corrections.
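To make the three-module structure concrete, here is a minimal sketch of how an instruction module, a corrections module, and a policy module might fit together. The slide names only the modules, so everything else is an assumption made for illustration: the hashed bag-of-words encoder `embed_text`, the mean pooling over (trajectory, correction) pairs, the linear action scoring, the embedding size `DIM`, and the representation of a trajectory as a list of strings. A real model would use learned neural encoders.

```python
import zlib
from typing import List, Sequence, Tuple

DIM = 8  # hypothetical embedding size


def embed_text(text: str, dim: int = DIM) -> List[float]:
    """Hashed bag-of-words stand-in for a learned language encoder."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    return vec


def instruction_module(instruction: str) -> List[float]:
    """Encode the (possibly ambiguous) instruction."""
    return embed_text(instruction)


def correction_module(history: Sequence[Tuple[Sequence[str], str]]) -> List[float]:
    """Encode each (trajectory, correction) pair, then mean-pool the pairs."""
    pooled = [0.0] * (2 * DIM)
    if not history:
        return pooled
    for trajectory, correction in history:
        pair = embed_text(" ".join(trajectory)) + embed_text(correction)
        pooled = [p + x for p, x in zip(pooled, pair)]
    return [p / len(history) for p in pooled]


def policy_module(
    state: Sequence[float],
    instr_vec: Sequence[float],
    corr_vec: Sequence[float],
    weights: Sequence[Sequence[float]],
) -> int:
    """Return the index of the action whose weight row scores highest."""
    features = list(state) + list(instr_vec) + list(corr_vec)
    scores = [sum(w * f for w, f in zip(row, features)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)


# Usage with fixed, hypothetical weights for 4 actions and a 2-dimensional state.
state = [0.0, 1.0]
num_features = len(state) + DIM + 2 * DIM
weights = [[float(((a + 1) * (j + 1)) % 5 - 2) for j in range(num_features)] for a in range(4)]

instr_vec = instruction_module("Move blue triangle to green goal.")
corr_vec = correction_module([(["moved up", "moved left"], "Enter the blue room.")])
action = policy_module(state, instr_vec, corr_vec, weights)
print(action)
```

The design point the sketch tries to capture is that the policy conditions on both the instruction and a summary of all previous attempts and corrections, so each new correction changes the input rather than the weights.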
Algorithm Overview
Meta-Training Meta-Testing
Meta-Training
Experimental Setup
Multi-room domain
Instruction: Move green triangle to yellow goal.
Instruction: Move red square to yellow goal.
Experimental Setup
Block pushing domain
Instruction: Move red block above magenta block.
Instruction: Move cyan block left of blue block.
Quick Learning of New Tasks
Instruction: Move blue triangle to green goal.
Correction 1: Enter the blue room.
Correction 2: Enter the red room.
Correction 3: Exit the blue room.
Correction 4: Pick up the blue triangle.
Solved
Quick Learning of New Tasks
Instruction: Move cyan block below magenta block.
Correction 1: Touch cyan block.
Correction 2: Move closer to magenta block.
Correction 3: Move a lot up.
Correction 4: Move a little up.
Solved
Quantitative Evaluation
Success Rates on New Tasks
Much quicker learning than using rewards
RL – PPO with reward used to train expert
GPL – Ours
Summary
• Avoid demos/reward functions by using human-in-the-loop supervision
• Language provides more information per intervention
• Ground language in multi-task setup; learn new tasks quickly with corrections
Thank you
Abhishek Gupta Suvansh Sanjeev Nick Altieri
John DeNero Pieter Abbeel Sergey Levine