Introduction into Imitation Learning


Transcript of Introduction into Imitation Learning

Page 1: Introduction into Imitation Learning

Introduction into Imitation Learning

Artem Sokolov

Institute for Computational Linguistics, Heidelberg University

8 October 2018

1. Admin
2. Literature
3. What is Imitation Learning?
4. Applications of Imitation Learning
5. Problems with Reinforcement Learning
6. Problems with Supervised Learning
7. What does Imitation Learning add?

Page 2: Introduction into Imitation Learning

Admin

Artem Sokolov | 8 October 2018 1 / 48

Page 3: Introduction into Imitation Learning

Before we start

- lots of reading

- knowledge of these is helpful:

  - foundations of probability theory
  - foundations of statistical machine learning
  - reinforcement learning
  - structured prediction
  - neural networks

- assessment: presentations or projects

- projects: programming (if you want to really learn)

  - projects emphasize open applied research problems

Artem Sokolov | 8 October 2018 2 / 48

Page 4: Introduction into Imitation Learning

Deliverables

This is a block module, so you have two options:

1. presentation

  - presentation (end of the 1st or in the 2nd week)
  - write-up detailing the presented article, during the semester
  - compare with rival approaches
  - describe and defend extensions, if any
  - a small implementation is encouraged but not required

2. project

  - briefly describe the task, research question and your plan (2nd week)
  - no slides necessary (unless you prefer)
  - implementation at your own pace during the semester
  - report detailing the approach and experiments
  - bi-weekly updates over Skype or email

Artem Sokolov | 8 October 2018 3 / 48


Page 6: Introduction into Imitation Learning

Deliverables

- presentation:

  - larger up-front effort
  - mostly free for the rest of the semester

- project:

  - lesser effort now
  - more time to think and more experience gained

- both projects and presentations are open-ended

  - no fixed upper bar on what can be done
  - possible that some things may not work
  - you may be asked to add/change things as you advance

- there is a list of tentative projects and presentation topics

- but you are encouraged to come up with your own ideas

Artem Sokolov | 8 October 2018 4 / 48


Page 10: Introduction into Imitation Learning

Schedule

- 1st week, Oct. 8-12

  - 10:15-11:45 morning session
  - 13:15-14:45 afternoon session
  - first article presentations
  - no morning session on Friday - Erstifrühstück

- by Sat, Oct. 13, 23:59

  - inform via email whether you will do a presentation or a project

Artem Sokolov | 8 October 2018 5 / 48

Page 11: Introduction into Imitation Learning

Schedule

- 2nd week, Oct. 15-19

  - article presentations and project Q&A
  - consultations on the articles/projects

- rest of the semester

  - work on projects and write-ups
  - updates via email/Skype

Artem Sokolov | 8 October 2018 6 / 48

Page 12: Introduction into Imitation Learning

Presentation

- a detailed explanation of a state-of-the-art method

- or its algorithmic (or theoretical or heuristic) extension

Setup:

- 1 person

- task:

  - pick an article from the list or propose your own
  - present the approach, including the glossed-over parts
  - try to formulate possible extensions
  - answer additional questions

- timeline

  - before Oct. 13: decide on the article
  - Oct. 15-18: time to discuss/ask questions
  - Oct. 15-19: in-class presentation (except 3 easier articles that, if selected, should be presented this week)
  - end of semester: final write-up due

    - try to formulate and defend possible extensions
    - include analysis of extensions, if any
    - comparison to other methods and limitations

Artem Sokolov | 8 October 2018 7 / 48


Page 16: Introduction into Imitation Learning

Project

- an implementation of a state-of-the-art algorithm for an NLP task

- preferably MT

Setup:

- 1-3 persons (expectations scale accordingly)

- task:

  - pick a project from the list or propose your own
  - implement it
  - formulate and try out possible extensions
  - compare to reasonable baselines

- timeline

  - before Oct. 13: decide on the project
  - Oct. 15-19: time to discuss/ask questions
  - Oct. 15-20: in-class project discussion (no slides necessary)
  - semester: bi-weekly updates if needed
  - end of semester: final report due

    - include all the experimental results and a link to the code
    - structure it like a conference paper (setting, prior work, why it matters, your approach, results, their analysis, limitations, future directions)

Artem Sokolov | 8 October 2018 8 / 48


Page 20: Introduction into Imitation Learning

Evaluation

- presentation

  - 70% - quality of the presentation and extensions
  - 30% - quality of the final write-up

- project

  - 70% - quality of the implementation and results
  - 30% - quality of the final report

- more points for novelty and your ideas

- the effort is what mainly counts

Artem Sokolov | 8 October 2018 9 / 48


Page 23: Introduction into Imitation Learning

After the course

After this course you should be able to:

- recognize tasks solvable with imitation learning

- map NLP and structured prediction problems to imitation learning

- understand the deficiencies of some straightforward approaches to IL

- solve problems with IL

Artem Sokolov | 8 October 2018 10 / 48

Page 24: Introduction into Imitation Learning

Literature

Artem Sokolov | 8 October 2018 11 / 48

Page 25: Introduction into Imitation Learning

Literature

- reinforcement learning

  - Sutton and Barto, "Reinforcement Learning", 2nd edition, 2018. http://incompleteideas.net/book/the-book-2nd.html
  - Szepesvári, "Algorithms for Reinforcement Learning", Morgan & Claypool, 2010. https://sites.ualberta.ca/~szepesva/RLBook.html

- imitation learning

  - Attia and Dayan, "Global overview of Imitation Learning", 2018. https://arxiv.org/abs/1801.06503
  - Daumé III, "A Course in Machine Learning", Chapter 18. http://www.ciml.info/dl/v0_99/ciml-v0_99-ch18.pdf
  - Osa et al., "An Algorithmic Perspective on Imitation Learning" (only the intro), 201?. https://takaosa.github.io/paper/TFRo_summary.pdf

Artem Sokolov | 8 October 2018 12 / 48


Page 27: Introduction into Imitation Learning

Literature

- some good tutorials:

  - Hal Daumé III, "From Structured Prediction to Inverse Reinforcement Learning", ACL'10. users.umiacs.umd.edu/~hal/SPIRL/10-07-acl-spirl.pdf
  - Hal Daumé III and John Langford, "Learning to Search for Joint Prediction", ICML/NAACL'16. http://hunch.net/~l2s
  - Yisong Yue and Hoang M. Le, "Imitation Learning", ICML'18. sites.google.com/view/icml2018-imitation-learning
  - Johannes Heidecke, "Inverse Reinforcement Learning", 2018. thinkingwires.com/posts/2018-02-13-irl-tutorial-1.html

- neural networks

  - MT: Graham Neubig, "Neural Machine Translation and Sequence-to-Sequence Models: A Tutorial", 2017. https://arxiv.org/abs/1703.01619
  - NLP: Yoav Goldberg, "A Primer on Neural Network Models for Natural Language Processing", 2015. https://arxiv.org/abs/1510.00726

Artem Sokolov | 8 October 2018 13 / 48


Page 29: Introduction into Imitation Learning

What is Imitation Learning?

Artem Sokolov | 8 October 2018 14 / 48

Page 30: Introduction into Imitation Learning

Definition

The purpose of imitation learning is to efficiently learn a desired behavior by imitating an expert's behavior.

Sounds familiar...

Artem Sokolov | 8 October 2018 15 / 48


Page 32: Introduction into Imitation Learning

Definition

The purpose of imitation learning is to efficiently learn a desired behavior by imitating an expert's behavior.

Sounds familiar...

- is it supervised learning?

Artem Sokolov | 8 October 2018 15 / 48

Page 33: Introduction into Imitation Learning

Definition

The purpose of imitation learning is to efficiently learn a desired behavior by imitating an expert's behavior.

Sounds familiar...

- is it reinforcement learning?

Artem Sokolov | 8 October 2018 15 / 48

Page 34: Introduction into Imitation Learning

Definition

The purpose of imitation learning is to efficiently learn a desired behavior by imitating an expert's behavior.

Imitation learning is a fusion between supervised learning and reinforcement learning:

Artem Sokolov | 8 October 2018 16 / 48

Page 35: Introduction into Imitation Learning

Supervision Strength

In the order of goal specification:

- reinforcement learning (weak: no explicit goals, but intermediate rewards)

- imitation learning (stronger: no explicit goals, but some examples of how to reach them)

- supervised learning (full: explicit goals even for intermediate steps)

- unsupervised learning (none: no goals, no rewards)

Given these different kinds of supervision, it makes sense to develop dedicated methods.

Artem Sokolov | 8 October 2018 17 / 48


Page 40: Introduction into Imitation Learning

Applications of Imitation Learning

Artem Sokolov | 8 October 2018 18 / 48

Page 41: Introduction into Imitation Learning

Applications

- reinforcement learning – impressive but relatively few successes in unrestricted environments

  - board/computer games, robotics
  - except: bandit learning really shines, e.g. in ad placement and recommendation

- imitation learning – borrows concepts from reinforcement learning while not throwing away supervised learning

  - also games (many RL successes are due to imitation learning components)
  - navigation
  - self-driving cars

- supervised learning – the overwhelming majority of all successes of machine learning

  - vision, music, NLP, ...
  - all kinds of predictive data analysis

- unsupervised learning – there is progress, but it has yet to catch up

Artem Sokolov | 8 October 2018 19 / 48


Page 43: Introduction into Imitation Learning

Successful applications

- self-driving vehicles

- games and bots

- website optimization

- structured prediction and NLP

Artem Sokolov | 8 October 2018 20 / 48

Page 44: Introduction into Imitation Learning

ALVINN 1988

[Pomerleau’88]

Artem Sokolov | 8 October 2018 21 / 48

Page 45: Introduction into Imitation Learning

Super Tux 2009

[Ross & Bagnell’10]

Artem Sokolov | 8 October 2018 22 / 48

Page 46: Introduction into Imitation Learning

Mario Bros 2009

[Ross et al.’11]

Artem Sokolov | 8 October 2018 23 / 48

Page 47: Introduction into Imitation Learning

Lip movement transfer 2018

[Yisong et al.’18]

Artem Sokolov | 8 October 2018 24 / 48

Page 48: Introduction into Imitation Learning

Drone obstacle avoidance

Artem Sokolov | 8 October 2018 25 / 48

Page 49: Introduction into Imitation Learning

Why Imitation Learning?

Several ways to argue why we need IL:

- supervision strength

- avoiding some of the hurdles of reinforcement learning

- relaxing assumptions of supervised learning

- sample and time efficiency

- deployment safety

Artem Sokolov | 8 October 2018 26 / 48

Page 50: Introduction into Imitation Learning

Problems with Reinforcement Learning

Artem Sokolov | 8 October 2018 27 / 48


Page 52: Introduction into Imitation Learning

Why not just use RL?

[Irpan’18]

Artem Sokolov | 8 October 2018 28 / 48


Page 54: Introduction into Imitation Learning

Why not just use RL?

[Ng & Russell’00]

[T]he entire field of reinforcement learning is founded on the presupposition that the reward function ... is the most succinct, robust, and transferable definition of the task.

- reward specification: the reward function is often hard to specify exactly

  - unless we set the rules of the environment (games!)

- even good reward functions can be gamed

- RL is often sample-inefficient

- reproducibility

- often cannot allow unrestricted exploration

Artem Sokolov | 8 October 2018 29 / 48

Page 55: Introduction into Imitation Learning

Example: Misspecification of RL rewards

- goal: finish the race

- rewards are also given for collecting powerups (to finish the race faster)

- farming the powerups gives more points than finishing the race!

[Irpan’18]

Artem Sokolov | 8 October 2018 30 / 48

Page 56: Introduction into Imitation Learning

Example: Gaming good RL rewards

Amazon/HDU bandit MT learning competition:

- task: adapt to a new domain only using BLEU scores on own translations

- reward: BLEU (sentence level)

- evaluation: BLEU (corpus level)

- different winners depending on what you look at

Artem Sokolov | 8 October 2018 31 / 48

Page 57: Introduction into Imitation Learning

Example: Gaming good RL rewards

Amazon/HDU bandit MT learning competition:

- task: adapt to a new domain only using BLEU scores on own translations

- reward: BLEU (sentence level)

- evaluation: BLEU (corpus level)

[Plot: corpus-BLEU over training checkpoints for SMT-static, SMT-EL-CV-ADAM, SMT-SZO-CV-ADAM, BNMT-static, WNMT-EL, BNMT-EL, UMD-dom-adapt]

- different winners depending on what you look at

Artem Sokolov | 8 October 2018 31 / 48

Page 58: Introduction into Imitation Learning

Example: Gaming good RL rewards

Amazon/HDU bandit MT learning competition:

- task: adapt to a new domain only using BLEU scores on own translations

- reward: BLEU (sentence level)

- evaluation: BLEU (corpus level)

[Plot: average sentence-BLEU over training checkpoints for the same systems]

- different winners depending on what you look at

Artem Sokolov | 8 October 2018 31 / 48
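The divergence between the two rankings is purely an aggregation effect: the bandit reward averages per-sentence scores, while the evaluation pools n-gram statistics over the whole corpus. Below is a minimal sketch of this effect, using a toy unigram precision and two made-up systems rather than real BLEU or the actual competition data:

```python
# Toy illustration of why optimizing the average sentence-level score can lose
# under corpus-level evaluation. "Score" here is plain unigram precision
# (no brevity penalty, no 4-gram geometric mean); only the aggregation matters.

def avg_sentence_score(counts):
    """Mean of per-sentence precisions -- the shape of the bandit reward."""
    return sum(m / t for m, t in counts) / len(counts)

def corpus_score(counts):
    """Precision over pooled counts -- how corpus-level BLEU aggregates statistics."""
    matched = sum(m for m, _ in counts)
    total = sum(t for _, t in counts)
    return matched / total

# (matched words, hypothesis length) per sentence for two hypothetical systems
system_a = [(1, 1), (10, 100)]   # perfect on a one-word sentence, weak on a long one
system_b = [(0, 1), (30, 100)]   # misses the short sentence, stronger on the long one

print(avg_sentence_score(system_a), avg_sentence_score(system_b))  # 0.55 vs 0.15: A "wins"
print(corpus_score(system_a), corpus_score(system_b))              # ~0.11 vs ~0.30: B wins
```

A system that maximizes the average sentence-level score can thus still lose once the scores are pooled at the corpus level, which is one way a "good" reward gets gamed.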

Page 59: Introduction into Imitation Learning

Example: Good rewards can be gamed

Abstractive summarization with the ROUGE score as reward

[Paulus, 2017]

Button was denied his 100th race for McLaren after an ERS preventedhim from making it to the start-line. It capped a miserable weekend forthe Briton. Button has out-qualified. Finished ahead of Nico Rosberg atBahrain. Lewis Hamilton has. In 11 races. . The race. To lead 2,000laps. . In. . . And.

Artem Sokolov | 8 October 2018 32 / 48

Page 60: Introduction into Imitation Learning

Example: RL is sample-inefficient

- Amazon/HDU bandit MT learning competition:

- 1M sentences to learn how to translate in a new domain

- humans would learn it from a few dozen examples

Artem Sokolov | 8 October 2018 33 / 48

Page 61: Introduction into Imitation Learning

Example: RL is sample-inefficient

- Atari games

- 83 hours of play for RL

- humans learn it in minutes

Artem Sokolov | 8 October 2018 34 / 48

Page 62: Introduction into Imitation Learning

Value Alignment Problem

[Wiener’60]

If we use, to achieve our purposes, a mechanical agency with whose operation we cannot efficiently interfere once we have started it, because the action is so fast and irrevocable that we have not the data to intervene before the action is complete, then we had better be quite sure that the purpose put into the machine is the purpose which we really desire...

Artem Sokolov | 8 October 2018 35 / 48

Page 63: Introduction into Imitation Learning

Value Alignment Problem

[Samuel’60]

A machine is not a genie, it does not work by magic, [...] The “intentions” which the machine seems to manifest are the intentions of the human programmer, as specified in advance, ... To believe otherwise is either to believe in magic [...]

An apparent exception to these conclusions might be claimed for projected machines of the so-called “neural net” type [...] Since the internal connections would be unknown, the precise behavior of the nets would be unpredictable and, therefore, potentially dangerous.

Artem Sokolov | 8 October 2018 35 / 48


Page 65: Introduction into Imitation Learning

Reproducibility of RL

[Andreas’17]

Deep RL is popular because it’s the only area in ML where it’s socially acceptable to train on the test set.

- an exaggeration, but partially true

- raises related concerns for conventional supervised learning

Artem Sokolov | 8 October 2018 36 / 48

Page 66: Introduction into Imitation Learning

Problems with Supervised Learning

Artem Sokolov | 8 October 2018 37 / 48

Page 67: Introduction into Imitation Learning

Why not just use SL?

By conventional ML we mean supervised batch learning.

- relies on hard-to-guarantee i.i.d. assumptions

- can go wrong when used blindly on non-i.i.d. data

- slow inference for structured SL

- uncertainty guarantees

Artem Sokolov | 8 October 2018 38 / 48

Page 68: Introduction into Imitation Learning

i.i.d. assumption

Convenient paradigm, but

- training data often will not contain important cases

- big data: no one checks the validity of i.i.d.

- in multi-step decision processes the data distribution depends on the agent (sketched below)

Artem Sokolov | 8 October 2018 39 / 48
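To make the last point concrete, here is a minimal sketch (a hypothetical lane-keeping toy with made-up numbers) of how the agent's own actions shift the data distribution: a behavior-cloned policy is accurate on the states it was trained on, but a single slip can carry it into states the expert never visited, where its errors compound.

```python
# Toy illustration of the non-i.i.d. problem in sequential decisions: the states
# the learner encounters depend on its own earlier actions. A cloned policy that
# is only accurate near the expert's state distribution drifts once it leaves it.
import random

random.seed(0)

def expert_action(offset):
    # The expert always steers back toward the lane center (offset 0).
    return -1 if offset > 0 else (1 if offset < 0 else 0)

def cloned_action(offset, eps=0.1, train_region=1):
    # Behavior-cloned policy: mimics the expert on states seen during training
    # (|offset| <= train_region, with a small error rate), guesses elsewhere.
    if abs(offset) <= train_region and random.random() > eps:
        return expert_action(offset)
    return random.choice([-1, 0, 1])

def worst_offset(policy, steps=1000):
    offset, worst = 0, 0
    for _ in range(steps):
        offset += policy(offset)
        worst = max(worst, abs(offset))
    return worst

print("expert:", worst_offset(expert_action))   # stays at 0
print("clone: ", worst_offset(cloned_action))   # typically drifts well beyond the training region
```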

Page 69: Introduction into Imitation Learning

Example: non-i.i.d and missing data

Artem Sokolov | 8 October 2018 40 / 48

Page 70: Introduction into Imitation Learning

Example: non-i.i.d and missing data

Artem Sokolov | 8 October 2018 41 / 48

Page 71: Introduction into Imitation Learning

i.i.d. assumption

Convenient paradigm, and despite its invalidity:

- supervised learning often works for IL!

- it is very simple

- it should always be tried first

Artem Sokolov | 8 October 2018 42 / 48

Page 72: Introduction into Imitation Learning

What does Imitation Learning add?

Artem Sokolov | 8 October 2018 43 / 48

Page 73: Introduction into Imitation Learning

Recall

Imitation learning is a fusion between supervised learning and reinforcement learning:

Question: given this relation, what can be said about ways to solve IL?

Artem Sokolov | 8 October 2018 44 / 48


Page 75: Introduction into Imitation Learning

Imitation Learning

The purpose of imitation learning is to efficiently learn a desired behavior by imitating an expert's behavior.

Can be achieved via:

- directly replicating the expert's behavior – ‘behavioral cloning’ (reduction to SP)

- learning the hidden objectives of the expert's behavior – ‘inverse RL’ (reduction to RL)

- letting the teacher interfere to correct bad behavior – ‘data aggregation’ (better reduction to SP + online learning)

- observing the teacher solve an unfinished task – ‘policy aggregation’ (better reduction to SP + online learning)

Artem Sokolov | 8 October 2018 45 / 48
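For the third option, here is a minimal sketch of the data-aggregation loop (in the spirit of DAgger), with env, expert, and fit as hypothetical placeholders rather than a specific library API: the learner rolls out its own policy, the interactive teacher labels the states that were actually visited, and the growing dataset is refit with ordinary supervised learning.

```python
# Sketch of 'data aggregation': reduce IL to supervised learning on a dataset
# that is grown online from the learner's own state distribution.
# env.reset()/env.step(), expert(state) and fit(dataset) are placeholders.

def data_aggregation(env, expert, fit, n_iterations=10, horizon=50):
    dataset = []                      # (state, expert_action) pairs
    policy = expert                   # first rollout follows the expert
    for _ in range(n_iterations):
        state = env.reset()
        for _ in range(horizon):
            action = policy(state)                  # the learner picks where to go ...
            dataset.append((state, expert(state)))  # ... the teacher says what to do there
            state = env.step(action)
        policy = fit(dataset)         # supervised step on all visited states
    return policy
```

Behavioral cloning corresponds to stopping after the first iteration (training only on expert rollouts); the later iterations are what expose the learner to, and correct it on, its own mistakes.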


Page 81: Introduction into Imitation Learning

Key differences to RL

- not using rewards (at least not relying on them directly)

- use of examples or querying an interactive teacher

Artem Sokolov | 8 October 2018 46 / 48

Page 82: Introduction into Imitation Learning

Key differences to conventional ML

- structural constraints

- distribution shift

- cost of gathering examples

Artem Sokolov | 8 October 2018 47 / 48

Page 83: Introduction into Imitation Learning

Afternoon lecture

- presentations

- projects

Artem Sokolov | 8 October 2018 48 / 48