Transcript of “Introduction to Real-Life Reinforcement Learning” (…mlittman/rl3/rl2/fs04-intro.pdf)

Page 1

Introduction to Real-Life Reinforcement Learning

Michael L. Littman, Rutgers University

Department of Computer Science

Brief History

Idea for symposium came out of a discussion I had with Satinder Singh @ ICML 2003 (DC).

Both were starting new labs. Wanted to highlight an important challenge in RL.

Felt we could help create some momentum by bringing together like-minded researchers.

Attendees (Part I)

• ABRAMSON, MYRIAM
• BAGNELL, JAMES
• BENTIVEGNA, DARRIN
• BLANK, DOUGLAS
• BOOKER, LASHON
• DIUK, CARLOS
• FAGG, ANDREW
• FIDELMAN, PEGGY
• FOX, DIETER
• GORDON, GEOFFREY
• GREENWALD, LLOYD
• GROUNDS, MATTHEW
• JONG, NICHOLAS
• LANE, TERRAN
• LANGFORD, JOHN
• LEROUX, DAVE
• LITTMAN, MICHAEL
• MCGLOHON, MARY
• MCGOVERN, AMY
• MEEDEN, LISA

Attendees (More)

• MIIKKULAINEN, RISTO
• MUSLINER, DAVID
• PETERS, JAN
• PINEAU, JOELLE
• PROPER, SCOTT

Page 2

Definitions

What is “reinforcement learning”?

• Decision making driven to maximize a measurable performance objective.

What is “real life”?

• “Measured” experience. Data doesn’t come from a model with known or pre-defined properties/assumptions.

Multiple Lives

• Real-life learning (us): use real data, possibly small (even toy) problems

• Life-sized learning (Kaelbling): large state spaces, possibly artificial problems

• Life-long learning (Thrun): same learning system, different problems (somewhat orthogonal)

Find The Ball

Learn:

• which way to turn

• to minimize steps

• to see goal (ball)

• from camera input

• given experience.

The RL Problem

Input: <s_1, a_1, s_2, r_1>, <s_2, a_2, s_3, r_2>, …, s_t

Output: a_t’s to maximize the discounted sum of the r_i’s.

[Example transition: <camera image, right, camera image, +1>]
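
To make this input/output convention concrete, here is a minimal Python sketch of the loop that produces such an experience stream; the environment interface (reset/step) and the episode structure are illustrative assumptions, not something the slides specify.

```python
def collect_experience(env, policy, max_steps=100):
    """Generate the tuples <s_t, a_t, s_{t+1}, r_t> that form the RL input."""
    experience, rewards = [], []
    s = env.reset()                    # s_1
    for _ in range(max_steps):
        a = policy(s)                  # a_t, chosen by the current policy
        s_next, r, done = env.step(a)  # s_{t+1} and r_t, measured not modeled
        experience.append((s, a, s_next, r))
        rewards.append(r)
        s = s_next
        if done:
            break
    return experience, rewards

def discounted_return(rewards, gamma=0.9):
    """The objective: the discounted sum of the r_i's."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))
```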

Page 3

Problem Formalization: MDP

Most popular formalization: Markov decision process

Assume:

• States/sensations, actions discrete.

• Transitions, rewards stationary and Markov.

• Transition function: Pr(s’|s,a) = T(s,a,s’).

• Reward function: E[r|s,a] = R(s,a).

Then:

• Optimal policy π*(s) = argmax_a Q*(s,a)

• where Q*(s,a) = R(s,a) + γ Σ_s' T(s,a,s') max_a' Q*(s',a')

Find the Ball: MDP Version

• Actions: rotate left/right

• States: orientation

• Reward: +1 for facing ball, 0 otherwise
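
As a worked example of this formalization, here is a small value-iteration sketch that solves a toy version of this MDP using the Q* equation above; the eight discrete orientations, the ball placed at orientation 0, and the deterministic rotations are assumptions made purely for illustration.

```python
# Toy "Find the Ball" MDP: 8 discrete orientations (an assumed discretization),
# ball visible only from orientation 0, actions rotate one step left or right.
N_STATES = 8
ACTIONS = (-1, +1)          # rotate left / rotate right
GAMMA = 0.9                 # discount factor (gamma)

def T(s, a):                # deterministic transition, so Pr(s'|s,a) is 0 or 1
    return (s + a) % N_STATES

def R(s, a):                # +1 for facing ball, 0 otherwise
    return 1.0 if s == 0 else 0.0

# Value iteration: apply Q(s,a) <- R(s,a) + gamma * max_a' Q(T(s,a), a')
# repeatedly until (numerically) at the fixed point Q*.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
for _ in range(100):
    Q = {(s, a): R(s, a) + GAMMA * max(Q[T(s, a), b] for b in ACTIONS)
         for s in range(N_STATES) for a in ACTIONS}

# Optimal policy: pi*(s) = argmax_a Q*(s,a).
pi = {s: max(ACTIONS, key=lambda a: Q[s, a]) for s in range(N_STATES)}
print(pi)   # which way to turn from each orientation
```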

It Can Be Done: Q-learning

Since optimal Q function is sufficient, use experience to estimate it (Watkins & Dayan 92)

Given <s, a, s', r>:
Q(s,a) ← Q(s,a) + α_t (r + γ max_a' Q(s',a') − Q(s,a))

If:

• all (s,a) pairs updated infinitely often

• Pr(s’|s,a) = T(s,a,s’), E[r|s,a] = R(s,a)

• Σ_t α_t = ∞, Σ_t α_t² < ∞

Then: Q(s,a) → Q*(s,a)
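
A minimal tabular sketch of this update rule on the same toy MDP as in the previous example; the ε-greedy exploration and the 1/n learning-rate schedule are standard choices aimed at the conditions above, not something the slide prescribes.

```python
import random

N_STATES, ACTIONS, GAMMA = 8, (-1, +1), 0.9

def step(s, a):              # same assumed toy dynamics and reward as above
    return (s + a) % N_STATES, (1.0 if s == 0 else 0.0)

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
n = {sa: 0 for sa in Q}      # per-pair visit counts for the step-size schedule

s = random.randrange(N_STATES)
for _ in range(20000):
    # epsilon-greedy action choice keeps every (s,a) pair visited in the limit
    if random.random() < 0.1:
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda b: Q[s, b])
    s2, r = step(s, a)
    n[s, a] += 1
    alpha = 1.0 / n[s, a]    # alpha_t = 1/t: sum alpha_t = inf, sum alpha_t^2 < inf
    # Q(s,a) <- Q(s,a) + alpha_t * (r + gamma * max_a' Q(s',a') - Q(s,a))
    Q[s, a] += alpha * (r + GAMMA * max(Q[s2, b] for b in ACTIONS) - Q[s, a])
    s = s2
```

Run long enough, the learned table approaches the Q* that the value-iteration sketch above computes from the known model.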

Real-Life Reinforcement Learning

Emphasize learning with real* data.

Q-learning good, but might not be right here…

Mismatches to “Find the Ball” MDP:

• Efficient exploration: data is expensive

• Rich sensors: never see the same thing twice

• Aliasing: different states can look similar

• Non-stationarity: details change over time

* Or, if simulated, from simulators developed outside the AI community

Page 4

RL2: A Spectrum

[Spectrum diagram: unmodified physical world, controlled physical world, electronic-only world, pure math world, detailed simulation, lab-created simulation; labels mark which settings count as RL2 (real-life RL), which as plain RL, and a gray zone between the two.]

Unmodified Physical World

weight loss (BodyMedia)

helicopter (Bagnell)

Controlled Physical World

Mahadevan and Connell, 1990

Electronic-only World

Recovery from corrupted network interface configuration.

Java/Windows XP: Minimize time to repair.

Littman, Ravi, Fenson, Howard, 2004

After 95 failure episodes

Learning to sort fast (Littman & Lagoudakis)

Page 5

Pure Math World

backgammon (Tesauro)

Detailed Simulation

• Independently developed

elevator control (Crites, Barto)

RARS video game

RoboCup Simulator

Lab-created Simulation

Car on the Hill

Taxi World

The Plan

Talks, Panels

Talk slot: 30 minutes, shoot for 25 minutes to leave time for switchover, questions, etc.

Try plugging in during a break.

Panel slot: 5 minutes per panelist (slides optional), will use the discussion time

Page 6

Friday, October 22nd, AM

9:00 Michael Littman, Introduction to Real-Life Reinforcement Learning

9:30 Darrin Bentivegna, Learning From Observation and Practice Using Primitives

10:00 Jan Peters, Learning Motor Primitives with Reinforcement Learning

10:30 break

11:00 Dave LeRoux, Instance-Based Reinforcement Learning on the Sony Aibo Robot

11:30 Bill Smart, Applying Reinforcement Learning to Real Robots: Problems and Possible Solutions

12:00 HUMAN-LEVEL AI PANEL, Roy

12:30 lunch break

Friday, October 22nd, PM

2:00 Andy Fagg, Learning Dexterous Manipulation Skills Using the Control Basis

2:30 Dan Stronger, Simultaneous Calibration of Action and Sensor Models on a Mobile Robot

3:00 Dieter Fox, Reinforcement Learning for Sensing Strategies

3:30 break

4:00 Roberto Santiago, What is Real Life? Using Simulation to Mature Reinforcement Learning

4:30 OTHER MODELS PANEL, Diuk, Greenwald, Lane

5:00 Gerry Tesauro, RL-Based Online Resource Allocation in Multi-Workload Computing Systems

5:30 session ends

Saturday, October 23rd, AM

9:00 Drew Bagnell, Practical Policy Search

9:30 John Moody, Learning to Trade via Direct Reinforcement

10:00 Risto Miikkulainen, Learning Robust Control and Complex Behavior Through Neuroevolution

10:30 break

11:00 Michael Littman, Real-Life Multiagent Reinforcement Learning

11:30 MULTIAGENT PANEL, Stone, Riedmiller, Moody, Bowling

12:00 HIERARCHY/STRUCTURED REPRESENTATIONS PANEL, Tadepalli, McGovern, Jong, Grounds

12:30 lunch break

[Joint with Artificial Multi-Agent Learning]

Saturday, October 23rd, PM

2:00 Lisa Meeden, Self-Motivated, Task-Independent Reinforcement Learning for Robots

2:30 Marge Skubic and David Noelle, A Biologically Inspired Adaptive Working Memory for Robots

3:00 COGNITIVE ROBOTICS PANEL, Blank, Noelle, Booksbaum

3:30 break

4:00 Peggy Fidelman, Learning Ball Acquisition and Fast Quadrupedal Locomotion on a Physical Robot

4:30 John Langford, Real World Reinforcement Learning Theory

5:00 OTHER TOPICS PANEL, Abramson, Proper, Pineau

5:30 session ends

[Joint with Cognitive Robotics]

Page 7

Sunday, October 24th, AM

9:00 Satinder Singh, RL for Human Level AI

9:30 Geoff Gordon, Learning Valid Predictive Representations

10:00 Yasutake Takahashi, Abstraction of State/Action based on State Value Function

10:30 break

11:00 Martin Riedmiller/Stephan Timmer, RL for Technical Process Control

11:30 Matthew Taylor, Speeding Up Reinforcement Learning with Behavior Transfer

12:00 Discussion: Wrap Up, Future Plans

12:30 symposium ends

Plenary

Saturday (tomorrow) night

6pm-7:30pm Plenary

Each symposium gets a 10-minute slot

Ours: Video. I need today’s speakers to join me for lunch and also immediately after the session today.

Darrin’s Summary

• extract features

• domain knowledge

• function approximators

• bootstrap learning/behavior transfer

• improve current skill

• learn skill initially using other methods

• start with low-level skills

What Next?

• Collect successes to point to
– Contribute to newly created page:

http://neuromancer.eecs.umich.edu/cgi-bin/twiki/view/Main/SuccessesOfRL

– We’re already succeeding (ideas are spreading)

– rejoice: control theorists are scared of us

• Sources of information
– This workshop web site:

http://www.cs.rutgers.edu/~mlittman/rl3/rl2/

– Will include pointers to slides, papers

– Can include twiki links or a pointer from RL repository.

– Michael requesting slides / URLs / videos (up front).

– Newly created Myth Page: http://neuromancer.eecs.umich.edu/cgi-bin/twiki/view/Main/MythsofRL

Page 8

Other Activities

• Possible Publication Activities
– special issue of a journal (JMLR? JAIR?)

– edited book

– other workshops

– guidebook for newbies

– textbook?

• Benchmarks
– Upcoming NIPS workshop on benchmarks

– We need to push for including real-life examples

– greater set of domains, make an effort to widen applications

Future Challenges

• How can we better talk about the inherent problem difficulty? Problem classes?

• Can we clarify the distinction between control theory and AI problems?

• Stress making sequential decisions (outside robotics as well).

• What about structure? Can we say more?

• Need to encourage a fresh perspective.

• Help convey how to see problems as RL problems.