CSE 190: Reinforcement Learning: An Introduction
Chapter 9: Planning and Learning
Acknowledgment: A good number of these slides are cribbed from Rich Sutton.
Chapter 9: Planning and Learning
Objectives of this chapter:
• Use of environment models to speed learning
• Integration of planning and learning methods
The Original Idea
Sutton, 1990
Models
• Model: anything the agent can use to predict how the environment will respond to its actions
• Distribution model: description of all possibilities and their probabilities
  • e.g., $P^a_{ss'}$ and $R^a_{ss'}$ for all $s$, $s'$, and $a \in \mathcal{A}(s)$
• Sample model: produces sample experiences
  • e.g., a simulation model
• Both types of models can be used to produce simulated experience
• Often sample models are much easier to come by
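A minimal sketch (not from the slides) of the two kinds of model for a tiny tabular MDP; the names and the toy transition table are purely illustrative:

```python
import random

# Distribution model: for each (state, action), every possible outcome and its
# probability, i.e. the P^a_{ss'} and R^a_{ss'} entries (hypothetical values).
distribution_model = {
    ("s0", "right"): [  # (probability, next_state, reward)
        (0.8, "s1", 0.0),
        (0.2, "s0", 0.0),
    ],
}

def sample_model(state, action):
    """Sample model: returns a single sampled (next_state, reward) pair.

    Here it is built on top of the distribution model, but a simulator that
    can only be stepped forward would serve equally well."""
    outcomes = distribution_model[(state, action)]
    probs = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probs, k=1)[0]
    return next_state, reward

if __name__ == "__main__":
    print(sample_model("s0", "right"))  # one piece of simulated experience
```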
Planning
• Planning: any computational process that uses a model to create or improve a policy
• Planning in AI:
  • state-space planning
  • plan-space planning (e.g., partial-order planner)
• We take the following (unusual) view:
  • all state-space planning methods involve computing value functions, either explicitly or implicitly
  • they all apply backups to simulated experience
Planning Cont.
• Classical DP methods are state-space planning methods
• Heuristic search methods are state-space planning methods
• A planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning (sketched below)
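A minimal Python sketch of random-sample one-step tabular Q-planning, assuming a sample_model(s, a) function and explicit state/action lists are supplied (all names are illustrative):

```python
import random
from collections import defaultdict

def q_planning(sample_model, states, actions, n_steps=10_000,
               alpha=0.1, gamma=0.95):
    """Random-sample one-step tabular Q-planning over simulated experience."""
    Q = defaultdict(float)  # Q[(s, a)], initialized to 0
    for _ in range(n_steps):
        # 1. Select a state and an action at random.
        s = random.choice(states)
        a = random.choice(actions)
        # 2. Ask the sample model for a next state and reward.
        s_next, r = sample_model(s, a)
        # 3. Apply one-step tabular Q-learning to the simulated transition.
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```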
Learning, Planning, and Acting
• Two uses of real experience:
  • model learning: to improve the model
  • direct RL: to directly improve the value function and policy
• Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.
• In most of the reinforcement learning community these are simply called model-based vs. model-free RL.
Direct vs. Indirect RL
• Indirect methods (model-based):
  • make fuller use of experience: get a better policy with fewer environment interactions
• Direct methods (model-free):
  • simpler
  • not affected by bad models
• But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel.
The Dyna Architecture (Sutton 1990)
The Dyna-Q Algorithm
(Algorithm box; its steps are annotated as model learning, planning, and direct RL. A sketch follows below.)
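A minimal sketch of tabular Dyna-Q for a deterministic environment (the setting used in the maze examples); env_step, start_state, and actions are assumed to be supplied, and all names are illustrative:

```python
import random
from collections import defaultdict

def dyna_q(env_step, start_state, actions, n_episodes=50, n_planning=5,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: direct RL + model learning + planning.

    env_step(s, a) -> (reward, next_state, done); a deterministic
    environment (as in the maze examples) is assumed."""
    Q = defaultdict(float)
    model = {}                       # (s, a) -> (reward, next_state)
    for _ in range(n_episodes):
        s, done = start_state, False
        while not done:
            # epsilon-greedy action selection in the real environment
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])
            r, s_next, done = env_step(s, a)
            # direct RL: one-step Q-learning on the real transition
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            # model learning: deterministic, so just remember the last outcome
            model[(s, a)] = (r, s_next)
            # planning: n updates using simulated experience from the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                best = max(Q[(ps_next, a2)] for a2 in actions)
                Q[(ps, pa)] += alpha * (pr + gamma * best - Q[(ps, pa)])
            s = s_next
    return Q
```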
Dyna-Q on a Simple Maze
rewards = 0 until the goal is reached, when the reward = 1
Dyna-Q Snapshots: Midway in the 2nd Episode
When the Model is Wrong: Blocking Maze
The changed environment is harder.
Dyna-AC = Dyna with an actor-critic learner
Dyna-Q+ will be explained momentarily.
Shortcut Maze
The changed environment is easier.
What is Dyna-Q+?
• Uses an “exploration bonus”:
  • Keeps track of the time since each state-action pair was tried for real
  • An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting (see the sketch below)
  • The agent actually “plans” how to visit long-unvisited states
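A minimal sketch of one common form of this bonus, the $\kappa\sqrt{\tau}$ bonus described by Sutton and Barto, applied to the model's reward during planning backups (names and the default kappa are illustrative):

```python
import math

def planning_reward(model_reward, tau, kappa=1e-3):
    """Reward used in Dyna-Q+ planning backups.

    tau: number of time steps since this (state, action) pair was last
    tried in the real environment; kappa: small bonus weight."""
    return model_reward + kappa * math.sqrt(tau)
```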
Prioritized Sweeping
• Which states or state-action pairs should be generated during planning?
• Work backwards from states whose values have just changed:
  • Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change
  • When a new backup occurs, insert predecessors according to their priorities
  • Always perform backups from the first pair in the queue (a sketch follows below)
• Moore and Atkeson, 1993; Peng and Williams, 1993
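A minimal sketch of the prioritized-sweeping planning loop using a max-priority queue (Python's heapq is a min-heap, so priorities are negated). A deterministic model, a predecessors(s) function, and a defaultdict Q are assumed; all names are illustrative:

```python
import heapq

def prioritized_sweeping_planning(Q, model, predecessors, actions, pqueue,
                                  n_updates=5, alpha=0.5, gamma=0.95,
                                  theta=1e-4):
    """Perform up to n_updates backups, highest-priority pairs first.

    Q: defaultdict(float) of action values; model[(s, a)] = (reward, next_state)
    for a deterministic model; predecessors(s) yields (s_prev, a_prev) pairs
    predicted to lead to s; pqueue: list used as a heap of (-priority, (s, a))
    entries. States/actions are assumed comparable (e.g. ints or strings)
    so heapq can break priority ties."""
    for _ in range(n_updates):
        if not pqueue:
            break
        _, (s, a) = heapq.heappop(pqueue)
        r, s_next = model[(s, a)]
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        # Insert predecessors whose values would change by more than theta.
        for s_prev, a_prev in predecessors(s):
            r_prev, _ = model[(s_prev, a_prev)]
            best_s = max(Q[(s, a2)] for a2 in actions)
            priority = abs(r_prev + gamma * best_s - Q[(s_prev, a_prev)])
            if priority > theta:
                heapq.heappush(pqueue, (-priority, (s_prev, a_prev)))
    return pqueue
```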
Prioritized Sweeping
Prioritized Sweeping
This means: for i = 1 to N do: if not(empty(PQueue)) then …
Prioritized Sweeping vs. Dyna-Q
Both use N = 5 backups per environmental interaction
Rod Maneuvering (Moore and Atkeson, 1993)
Some dimensions of combining P&L
• These have been some ideas of how to combine planning (using models) and learning
• Next we consider some components of this process:
  • Full backups (considering everything that might happen) vs. sample backups (considering a single trajectory)
  • On-policy (computing the value of this policy) vs. off-policy (computing the value of the optimal policy)
  • State-action (Q) values vs. state (V) values
  • Trajectory sampling vs. heuristic search
Ways to combine learning and planning: Full and Sample (One-Step) Backups
• Dimensions of backups:
  • Full (dynamic programming) vs. sample (temporal difference)
  • On-policy vs. off-policy
  • State values vs. state-action values
• (2 × 2 × 2 = 8, but one doesn’t seem to be reasonable…)
Are full backups always preferable?
• They are better estimators, but they require lots of computation. Let’s do a comparison in the following context: learning $Q^*$, table look-up, and a complete distribution model ($P$ and $R$).
The full backup is:
$$Q(s,a) \leftarrow \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q(s',a') \right]$$
The sample backup (for a sampled next state $s'$) is:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ R^a_{ss'} + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$
Are full backups always preferable? (cont.)
• Which backup is better depends mostly on the branching factor, $b$: the number of possible next states.
Full vs. Sample Backups
b successor states, equally likely; initial error = 1; assume all next states’ values are correct
Bottom line: sample backups can be a big win in computation!
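A rough simulation of this setup (not from the slides; assumptions: b equally likely successors whose values are already correct, initial error of 1, sample averaging). It shows how the RMS error shrinks with successive sample backups, where one full backup costs roughly b times as much computation and removes the error entirely:

```python
import random

def backup_error_experiment(b, n_trials=1000, seed=0):
    """RMS error in Q(s, a) after t = 1 .. 2b sample backups."""
    rng = random.Random(seed)
    sq_errors = [0.0] * (2 * b)
    for _ in range(n_trials):
        # backed-up values of the b successors (assumed already correct)
        true_values = [rng.gauss(0.0, 1.0) for _ in range(b)]
        target = sum(true_values) / b      # what one full backup computes
        estimate = target + 1.0            # initial error of 1
        for t in range(1, 2 * b + 1):
            sample = rng.choice(true_values)
            estimate += (sample - estimate) / t   # incremental sample average
            sq_errors[t - 1] += (estimate - target) ** 2
    return [(e / n_trials) ** 0.5 for e in sq_errors]

if __name__ == "__main__":
    for b in (2, 10, 100):
        rms = backup_error_experiment(b)
        # After only b sample backups (the cost of one full backup),
        # the remaining error is already far below the initial error of 1.
        print(b, round(rms[b - 1], 3))
```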
Trajectory Sampling
• Trajectory sampling: perform backups along simulated trajectories (consider other possibilities: why are they bad?)
• This samples from the on-policy distribution
• Advantages when function approximation is used (Chapter 8: on-policy is the only one provably convergent)
• Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored:
(Diagram: initial states; states reachable under optimal control; irrelevant states)
Trajectory Sampling Experiment: Is on-policy always better?
• one-step full tabular backups
• uniform: cycled through all state-action pairs
• on-policy: backed up along simulated trajectories
• 200 randomly generated undiscounted episodic tasks
• 2 actions for each state, each with b equally likely next states
• 0.1 probability of transition to the terminal state
• expected reward on each transition selected from a mean-0, variance-1 Gaussian
(A sketch of the setup follows below.)
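A rough sketch of one such randomly generated task and the two backup orderings (uniform vs. on-policy). The greedy-policy evaluation used to produce the plotted curves is omitted, and all names and defaults are illustrative:

```python
import random

def make_random_task(n_states=1000, b=3, seed=0):
    """One randomly generated undiscounted episodic task, as on the slide:
    2 actions per state, each with b equally likely next states, a 0.1
    probability of transition to the terminal state, and expected rewards
    drawn from a mean-0, variance-1 Gaussian."""
    rng = random.Random(seed)
    succ, rew = {}, {}
    for s in range(n_states):
        for a in (0, 1):
            succ[(s, a)] = [rng.randrange(n_states) for _ in range(b)]
            rew[(s, a)] = rng.gauss(0.0, 1.0)
    return succ, rew

def full_backup(Q, succ, rew, s, a):
    """One-step full tabular backup under the known dynamics
    (0.1 of the probability mass goes to the terminal state, worth 0)."""
    next_p = 0.9 / len(succ[(s, a)])
    future = sum(next_p * max(Q[(s2, 0)], Q[(s2, 1)]) for s2 in succ[(s, a)])
    Q[(s, a)] = rew[(s, a)] + future

def uniform_sweep(succ, rew, n_states, n_backups):
    """Uniform ordering: cycle through all state-action pairs."""
    Q = {(s, a): 0.0 for s in range(n_states) for a in (0, 1)}
    pairs = list(Q)
    for i in range(n_backups):
        full_backup(Q, succ, rew, *pairs[i % len(pairs)])
    return Q

def on_policy_sweep(succ, rew, n_states, n_backups, epsilon=0.1, seed=1):
    """On-policy ordering: back up the pairs visited along simulated
    epsilon-greedy trajectories, restarting episodes from state 0."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(n_states) for a in (0, 1)}
    s, backups = 0, 0
    while backups < n_backups:
        if rng.random() < epsilon:
            a = rng.choice((0, 1))
        else:
            a = max((0, 1), key=lambda a2: Q[(s, a2)])
        full_backup(Q, succ, rew, s, a)
        backups += 1
        s = 0 if rng.random() < 0.1 else rng.choice(succ[(s, a)])
    return Q
```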
Trajectory Sampling Experiment
• Not conclusive:
  • Applied to particular, randomly generated problems
  • Short term: big win for on-policy
  • Long term: uniform sampling is better
• But, for large state spaces, on-policy is probably best, because it focuses the search.
Heuristic Search
• Used for action selection, not for changing a value function (= heuristic evaluation function)
• Backed-up values are computed, but typically discarded
• Extension of the idea of a greedy policy, only deeper (see the sketch below)
• Also suggests ways to select states to back up: smart focusing
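A minimal sketch of depth-limited lookahead for action selection over a deterministic model, where the backed-up values are used only to pick the current action and are then discarded (all names are illustrative):

```python
def lookahead_value(model, heuristic_value, actions, s, depth, gamma=0.95):
    """Backed-up value of state s from a depth-limited search."""
    if depth == 0:
        return heuristic_value(s)       # leaf: evaluate with the heuristic
    best = float("-inf")
    for a in actions:
        r, s_next = model(s, a)         # deterministic model assumed
        best = max(best, r + gamma * lookahead_value(
            model, heuristic_value, actions, s_next, depth - 1, gamma))
    return best

def select_action(model, heuristic_value, actions, s, depth=3, gamma=0.95):
    """Greedy action w.r.t. the backed-up values; the values themselves
    are computed, used once, and discarded."""
    def q(a):
        r, s_next = model(s, a)
        return r + gamma * lookahead_value(
            model, heuristic_value, actions, s_next, depth - 1, gamma)
    return max(actions, key=q)
```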
Summary
• Emphasized close relationship between planning and learning
• Important distinction between distribution models and sample models
• Looked at some ways to integrate planning and learning
  • synergy among planning, acting, model learning
• Distribution of backups: focus of the computation
  • trajectory sampling: backup along trajectories
  • prioritized sweeping
  • heuristic search
• Size of backups: full vs. sample; deep vs. shallow