
CSE 190: Reinforcement Learning: An Introduction

Chapter 9: Planning and Learning

Acknowledgment: A good number of these slides are cribbed from Rich Sutton


Chapter 9: Planning and Learning

Objectives of this chapter:

• Use of environment models to speed learning

• Integration of planning and learning methods


The Original Idea

Sutton, 1990



Models

• Model: anything the agent can use to predict how the environment will respond to its actions

• Distribution model: description of all possibilities and their probabilities, e.g., $P^a_{ss'}$ and $R^a_{ss'}$ for all $s$, $s'$, and $a \in \mathcal{A}(s)$

• Sample model: produces sample experiences, e.g., a simulation model

• Both types of models can be used to produce simulated experience (the two are contrasted in the sketch below)

• Often sample models are much easier to come by

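A minimal sketch contrasting the two kinds of model, using a made-up two-state example (the states, probabilities, and rewards are purely illustrative):

```python
import random

# Distribution model: enumerates every possible outcome of (state, action) with its probability.
# distribution_model[(state, action)] = list of (probability, next_state, reward)
distribution_model = {
    ("s0", "go"): [(0.8, "s0", 0.0), (0.2, "s1", 1.0)],   # made-up numbers
    ("s1", "go"): [(1.0, "s0", 0.0)],
}

# Sample model: only produces one sampled experience per query (here built on top of the
# distribution model, but a simulator could play this role directly).
def sample_model(state, action):
    outcomes = distribution_model[(state, action)]
    probs = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probs)[0]
    return next_state, reward

# Either kind can generate simulated experience for planning.
print(sample_model("s0", "go"))
```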


Planning

• Planning: any computational process that uses a model to create or improve a policy

• Planning in AI:
  • state-space planning
  • plan-space planning (e.g., partial-order planner)

• We take the following (unusual) view:
  • all state-space planning methods involve computing value functions, either explicitly or implicitly
  • they all apply backups to simulated experience


Planning Cont.

Random-Sample One-Step Tabular Q-Planning

• Classical DP methods are state-space planning methods

• Heuristic search methods are state-space planningmethods

• A planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning (a minimal sketch follows)
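A minimal sketch of random-sample one-step tabular Q-planning: pick a state-action pair at random, ask a sample model for an outcome, and apply a one-step Q-learning backup. The toy dynamics, reward, gamma, and alpha are illustrative assumptions, not from the slide.

```python
import random
from collections import defaultdict

# Made-up sample model for illustration: sample_model(s, a) -> (next_state, reward).
def sample_model(state, action):
    next_state = (state + action) % 10            # toy dynamics (assumption)
    reward = 1.0 if next_state == 0 else 0.0      # toy reward (assumption)
    return next_state, reward

states = range(10)
actions = (-1, +1)
gamma, alpha = 0.95, 0.1                          # illustrative parameters
Q = defaultdict(float)                            # Q[(state, action)]

def q_planning_step():
    """One iteration: random (s, a), query the sample model, one-step Q-learning backup."""
    s = random.choice(states)
    a = random.choice(actions)
    s_next, r = sample_model(s, a)
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

for _ in range(10_000):
    q_planning_step()
```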


Learning, Planning, and Acting

• Two uses of real experience:
  • model learning: to improve the model
  • direct RL: to directly improve the value function and policy

• Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.

• In most of the reinforcement learning community these are simply called model-based vs. model-free RL.


Direct vs. Indirect RL

• Indirect methods (model-based):
  • make fuller use of experience: get a better policy with fewer environment interactions

• Direct methods (model-free):
  • simpler
  • not affected by bad models

But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel.


The Dyna Architecture (Sutton 1990)


The Dyna-Q Algorithm

(Dyna-Q pseudocode figure, annotated with its three components: direct RL, model learning, and planning; a sketch follows.)
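A minimal sketch of tabular Dyna-Q, assuming a deterministic environment (as the book's tabular version does). The step(state, action) interface, the epsilon-greedy exploration constant, and the other parameters are illustrative assumptions.

```python
import random
from collections import defaultdict

gamma, alpha, epsilon, n_planning = 0.95, 0.1, 0.1, 5   # illustrative settings
Q = defaultdict(float)                                   # Q[(state, action)]
model = {}                                               # model[(s, a)] = (reward, next_state)

def epsilon_greedy(state, actions):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def dyna_q_step(step, state, actions):
    """One real interaction followed by n_planning simulated backups.
    `step(state, action) -> (next_state, reward)` is the assumed environment interface."""
    a = epsilon_greedy(state, actions)
    s_next, r = step(state, a)
    # (a) direct RL: one-step Q-learning on the real transition
    Q[(state, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(state, a)])
    # (b) model learning: remember the last observed outcome (deterministic-world assumption)
    model[(state, a)] = (r, s_next)
    # (c) planning: replay n_planning previously seen (s, a) pairs from the model
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps_next, b)] for b in actions) - Q[(ps, pa)])
    return s_next
```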


Dyna-Q on a Simple Maze

Rewards are 0 until the goal is reached, when the reward is 1.


Dyna-Q Snapshots: Midway in 2nd Episode


When the Model is Wrong: Blocking Maze

The changed environment is harder.

Dyna-AC = Dyna Actor-Critic

Dyna-Q+ will be explained momentarily.


Shortcut Maze

The changed environment is easier.


What is Dyna-Q+?

• Uses an “exploration bonus”:

• Keeps track of the time since each state-action pair was tried for real

• An extra reward is added to simulated transitions, based on how long ago their state-action pair was last tried: the longer unvisited, the more reward for visiting

• The agent actually “plans” how to visit long-unvisited states (a sketch of the bonus follows)
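A minimal sketch of the bonus inside a planning backup. The $r + \kappa\sqrt{\tau}$ form of the bonus is the one used in the book; the value of kappa, the bookkeeping variables, and the deterministic model are illustrative assumptions.

```python
import math
import random
from collections import defaultdict

gamma, alpha, kappa = 0.95, 0.1, 1e-3         # kappa: exploration-bonus weight (illustrative)
Q = defaultdict(float)
model = {}                                     # model[(s, a)] = (reward, next_state), learned from real experience
last_tried = defaultdict(int)                  # time step at which (s, a) was last tried for real
t = 0                                          # current time step, advanced by the acting loop

def dyna_q_plus_planning_backup(actions):
    """One Dyna-Q+ planning backup: the replayed reward gets a bonus kappa * sqrt(tau),
    where tau is the time since the replayed pair was last tried in the real environment."""
    (s, a), (r, s_next) = random.choice(list(model.items()))
    tau = t - last_tried[(s, a)]
    r_bonus = r + kappa * math.sqrt(tau)       # the longer unvisited, the larger the bonus
    target = r_bonus + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```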


Prioritized Sweeping

• Which states or state-action pairs should be generated during planning?

• Work backwards from states whose values have just changed:
  • Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change
  • When a new backup occurs, insert predecessors according to their priorities
  • Always perform backups from the first in the queue (a sketch follows)

• Moore and Atkeson, 1993; Peng and Williams, 1993
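A minimal sketch of the queue-driven planning loop, assuming a deterministic learned model as in the Dyna-Q sketch above; the threshold theta, the parameters, and the predecessor bookkeeping are illustrative assumptions.

```python
import heapq
import itertools
import random
from collections import defaultdict

gamma, alpha, theta, n_planning = 0.95, 0.1, 1e-4, 5     # illustrative settings
Q = defaultdict(float)
model = {}                                                # model[(s, a)] = (reward, next_state)
predecessors = defaultdict(set)                           # predecessors[s'] = {(s, a) observed to lead to s'}
pqueue, counter = [], itertools.count()                   # max-priority queue via negated priorities

def priority(s, a, actions):
    """How much Q(s, a) would change if backed up now (size of the one-step error)."""
    r, s_next = model[(s, a)]
    return abs(r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])

def push_if_significant(s, a, actions):
    p = priority(s, a, actions)
    if p > theta:
        heapq.heappush(pqueue, (-p, next(counter), (s, a)))

def planning_sweep(actions):
    """Up to n_planning backups, always taken from the head of the priority queue;
    after each backup, the predecessors of the updated state are (re)inserted."""
    for _ in range(n_planning):
        if not pqueue:
            break
        _, _, (s, a) = heapq.heappop(pqueue)
        r, s_next = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        for (ps, pa) in predecessors[s]:
            push_if_significant(ps, pa, actions)
```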


Prioritized Sweeping


Prioritized Sweeping

This means: for i = 1 to N do: if not(empty(PQueue)) then …


Prioritized Sweeping vs. Dyna-Q

Both use N = 5 backups per environmental interaction.


Rod Maneuvering (Moore and Atkeson, 1993)


Some dimensions of combining planning and learning

• These have been some ideas of how to combine planning (using models) and learning

• Next we consider some components of this process:
  • Full backups (considering everything that might happen) vs. sample backups (considering a single trajectory)
  • On-policy (computing the value of this policy) vs. off-policy (computing the value of the optimal policy)
  • State-action (Q) values vs. state (V) values
  • Trajectory sampling vs. heuristic search


Ways to combine learning and planning: Full and Sample (One-Step) Backups

• Dimensions of backups:
  • Full (dynamic programming) vs. sample (temporal difference)
  • On-policy vs. off-policy
  • State values vs. state-action values

• (2 × 2 × 2 = 8, but one combination doesn’t seem to be reasonable…)


Are full backups always preferable?

• They are better estimators, but they require a lot of computation. Let’s do a comparison in the following context: learning Q*, table look-up, and a complete distribution model (P and R).

• The full backup and the sample backup are given below.
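For reference, the two backups in the notation used earlier, i.e., the standard one-step backups for learning $Q^*$ with a distribution model and with a sample model:

Full backup:
$$Q(s,a) \leftarrow \sum_{s'} P^{a}_{ss'} \Big[ R^{a}_{ss'} + \gamma \max_{a'} Q(s',a') \Big]$$

Sample backup (with $s'$ and $r$ generated by the sample model):
$$Q(s,a) \leftarrow Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big]$$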


Are full backups always preferable? (cont.)

• The full backup and the sample backup are as given above.

• Which is better depends mostly on the branching factor, b, to next states.


Full vs. Sample Backups

b successor states, equally likely; initial error = 1; assume all next states’ values are correct.

Bottom line: sample backups can be a big win in computation!
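A toy simulation of this setup, as a sketch under simplifying assumptions (unit-Gaussian successor values, no rewards or discounting, and sample backups taken as plain sample averages):

```python
import random

def rms_error_of_sample_backup(b, t, trials=5000, seed=0):
    """RMS error of the backed-up value after t sample backups (a plain sample average),
    when the true backed-up value is the mean of b equally likely successor values
    (unit Gaussian, assumed known exactly)."""
    rng = random.Random(seed)
    total_sq_err = 0.0
    for _ in range(trials):
        successors = [rng.gauss(0.0, 1.0) for _ in range(b)]
        true_value = sum(successors) / b                 # what a full backup (b computations) returns
        estimate = sum(rng.choice(successors) for _ in range(t)) / t
        total_sq_err += (estimate - true_value) ** 2
    return (total_sq_err / trials) ** 0.5

b = 100
for t in (1, 10, 25, 100):
    print(f"b={b}, t={t} sample backups: RMS error ~ {rms_error_of_sample_backup(b, t):.3f}")
```

Under these assumptions the residual error after t sample backups falls roughly as sqrt((b-1)/(b·t)), so a small number of sample backups recovers most of what a full backup (b value computations) would buy.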


Trajectory Sampling

• Trajectory sampling: perform backups along simulated trajectories (consider other possibilities: why are they bad?)

• This samples from the on-policy distribution

• Advantages when function approximation is used (Chapter 8: on-policy is the only one provably convergent)

• Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored

(Figure: initial states; states reachable under optimal control; irrelevant states)


Trajectory Sampling Experiment: Is on-policy always better?

• one-step full tabular backups
• uniform: cycled through all state-action pairs
• on-policy: backed up along simulated trajectories
• 200 randomly generated undiscounted episodic tasks
• 2 actions for each state, each with b equally likely next states
• 0.1 probability of transition to the terminal state
• expected reward on each transition selected from a mean-0, variance-1 Gaussian (a sketch of the setup follows)
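A sketch of the two backup orderings being compared; this is an illustrative reconstruction, not the original experiment code (the task generator, the epsilon value for the simulated trajectories, and the restart-on-termination behavior are assumptions):

```python
import random
from collections import defaultdict

def make_random_task(num_states=1000, b=3, seed=0):
    """Illustrative task generator in the spirit of the experiment: 2 actions per state,
    b equally likely next states per (state, action), expected rewards ~ N(0, 1).
    Termination (prob 0.1 per transition) is handled by the trajectory loop below."""
    rng = random.Random(seed)
    succ, rew = {}, {}
    for s in range(num_states):
        for a in (0, 1):
            succ[(s, a)] = [rng.randrange(num_states) for _ in range(b)]
            rew[(s, a)] = [rng.gauss(0.0, 1.0) for _ in range(b)]
    return succ, rew

def uniform_order(num_states, n_backups):
    """Uniform sweep: cycle through all state-action pairs in a fixed order."""
    pairs = [(s, a) for s in range(num_states) for a in (0, 1)]
    for i in range(n_backups):
        yield pairs[i % len(pairs)]

def on_policy_order(succ, Q, start_state, n_backups, epsilon=0.1, p_term=0.1, rng=random):
    """On-policy ordering: back up the pairs encountered along simulated epsilon-greedy trajectories."""
    s = start_state
    for _ in range(n_backups):
        a = rng.choice((0, 1)) if rng.random() < epsilon else max((0, 1), key=lambda x: Q[(s, x)])
        yield (s, a)
        s = start_state if rng.random() < p_term else rng.choice(succ[(s, a)])

# Usage sketch: with Q = defaultdict(float), apply one-step full backups to the pairs each
# generator yields and compare how quickly the start state's value improves.
```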


Trajectory Sampling Experiment

• Not conclusive:
  • Applied to a particular, randomly generated problem

• Short term: big win for on-policy

• Long term: uniform sampling is better

• But, for large state spaces, on-policy is probably best, because it focuses the search.


Heuristic Search

• Used for action selection, not for changing a value function (= heuristic evaluation function)

• Backed-up values are computed, but typically discarded

• Extension of the idea of a greedy policy — only deeper

• Also suggests ways to select states to back up: smart focusing (a sketch of depth-limited lookahead follows)
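A minimal sketch of the idea: greedy action selection made deeper by expanding a model a few steps ahead, backing values up to the root, choosing the best action, and then discarding the backed-up values. The model(state, action) -> [(prob, next_state, reward), ...] interface and the heuristic value function v are assumptions, not from the slides.

```python
def lookahead_value(model, v, state, actions, depth, gamma=0.95):
    """Depth-limited expected backup: expand the distribution model `depth` steps ahead,
    evaluate the leaves with the heuristic value function v, and back values up to `state`."""
    if depth == 0:
        return v(state)
    best = float("-inf")
    for a in actions:
        total = 0.0
        for prob, next_state, reward in model(state, a):   # assumed model interface
            total += prob * (reward + gamma * lookahead_value(model, v, next_state, actions, depth - 1, gamma))
        best = max(best, total)
    return best

def heuristic_search_action(model, v, state, actions, depth=3, gamma=0.95):
    """Select the action with the largest depth-limited backed-up value; the backed-up
    values themselves are typically discarded rather than stored."""
    def action_value(a):
        return sum(prob * (reward + gamma * lookahead_value(model, v, s2, actions, depth - 1, gamma))
                   for prob, s2, reward in model(state, a))
    return max(actions, key=action_value)
```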


Summary

• Emphasized the close relationship between planning and learning

• Important distinction between distribution models and sample models

• Looked at some ways to integrate planning and learning
  • synergy among planning, acting, model learning

• Distribution of backups: focus of the computation
  • trajectory sampling: backups along trajectories
  • prioritized sweeping
  • heuristic search

• Size of backups: full vs. sample; deep vs. shallow

END