CSE 190: Reinforcement Learning: An Introduction
Chapter 9: Planning and Learning
Acknowledgment: A good number of these slides are cribbed from Rich Sutton.
Chapter 9: Planning and Learning
Objectives of this chapter:
• Use of environment models to speed learning
• Integration of planning and learning methods
The Original Idea
Sutton, 1990
Models
• Model: anything the agent can use to predict how the environment will respond to its actions
• Distribution model: description of all possibilities and their probabilities
  • e.g., $P^a_{ss'}$ and $R^a_{ss'}$ for all $s$, $s'$, and $a \in \mathcal{A}(s)$
• Sample model: produces sample experiences
  • e.g., a simulation model
• Both types of models can be used to produce simulated experience
• Often sample models are much easier to come by
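A minimal sketch (not from the slides) of the two kinds of model for a tiny tabular MDP; the names and the toy transition table are purely illustrative:

```python
import random

# Distribution model: for each (state, action), every possible outcome and its
# probability, i.e. the P^a_{ss'} and R^a_{ss'} entries (hypothetical values).
distribution_model = {
    ("s0", "right"): [  # (probability, next_state, reward)
        (0.8, "s1", 0.0),
        (0.2, "s0", 0.0),
    ],
}

def sample_model(state, action):
    """Sample model: returns a single sampled (next_state, reward) pair.

    Here it is built on top of the distribution model, but a simulator that
    can only be stepped forward would serve equally well."""
    outcomes = distribution_model[(state, action)]
    probs = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probs, k=1)[0]
    return next_state, reward

if __name__ == "__main__":
    print(sample_model("s0", "right"))  # one piece of simulated experience
```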
Planning
• Planning: any computational process that uses a model to create or improve a policy
• Planning in AI:
  • state-space planning
  • plan-space planning (e.g., partial-order planner)
• We take the following (unusual) view:
  • all state-space planning methods involve computing value functions, either explicitly or implicitly
  • they all apply backups to simulated experience
Planning Cont.
• Classical DP methods are state-space planning methods
• Heuristic search methods are state-space planning methods
• A planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning (sketched below)
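A minimal Python sketch of random-sample one-step tabular Q-planning, assuming a sample_model(s, a) function and explicit state/action lists are supplied (all names are illustrative):

```python
import random
from collections import defaultdict

def q_planning(sample_model, states, actions, n_steps=10_000,
               alpha=0.1, gamma=0.95):
    """Random-sample one-step tabular Q-planning over simulated experience."""
    Q = defaultdict(float)  # Q[(s, a)], initialized to 0
    for _ in range(n_steps):
        # 1. Select a state and an action at random.
        s = random.choice(states)
        a = random.choice(actions)
        # 2. Ask the sample model for a next state and reward.
        s_next, r = sample_model(s, a)
        # 3. Apply one-step tabular Q-learning to the simulated transition.
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```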
Learning, Planning, and Acting
• Two uses of real experience:
  • model learning: to improve the model
  • direct RL: to directly improve the value function and policy
• Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.
• In most of the reinforcement learning community these are simply called model-based vs. model-free RL.
Direct vs. Indirect RL
• Indirect methods (model-based):
  • make fuller use of experience: get a better policy with fewer environment interactions
• Direct methods (model-free):
  • simpler
  • not affected by bad models
• But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel.
The Dyna Architecture (Sutton 1990)
The Dyna-Q Algorithm
(Algorithm box; its steps are annotated as model learning, planning, and direct RL. A sketch follows below.)
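A minimal sketch of tabular Dyna-Q for a deterministic environment (the setting used in the maze examples); env_step, start_state, and actions are assumed to be supplied, and all names are illustrative:

```python
import random
from collections import defaultdict

def dyna_q(env_step, start_state, actions, n_episodes=50, n_planning=5,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: direct RL + model learning + planning.

    env_step(s, a) -> (reward, next_state, done); a deterministic
    environment (as in the maze examples) is assumed."""
    Q = defaultdict(float)
    model = {}                       # (s, a) -> (reward, next_state)
    for _ in range(n_episodes):
        s, done = start_state, False
        while not done:
            # epsilon-greedy action selection in the real environment
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])
            r, s_next, done = env_step(s, a)
            # direct RL: one-step Q-learning on the real transition
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            # model learning: deterministic, so just remember the last outcome
            model[(s, a)] = (r, s_next)
            # planning: n updates using simulated experience from the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                best = max(Q[(ps_next, a2)] for a2 in actions)
                Q[(ps, pa)] += alpha * (pr + gamma * best - Q[(ps, pa)])
            s = s_next
    return Q
```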
Dyna-Q on a Simple Maze
rewards = 0 until the goal is reached, when the reward = 1
Dyna-Q Snapshots: Midway in the 2nd Episode
When the Model is Wrong: Blocking Maze
The changed environment is harder.
Dyna-AC = Dyna with an actor-critic learner
Dyna-Q+ will be explained momentarily.
Shortcut Maze
The changed environment is easier.
What is Dyna-Q+?
• Uses an “exploration bonus”:
  • Keeps track of the time since each state-action pair was tried for real
  • An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting (see the sketch below)
  • The agent actually “plans” how to visit long-unvisited states
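A minimal sketch of one common form of this bonus, the $\kappa\sqrt{\tau}$ bonus described by Sutton and Barto, applied to the model's reward during planning backups (names and the default kappa are illustrative):

```python
import math

def planning_reward(model_reward, tau, kappa=1e-3):
    """Reward used in Dyna-Q+ planning backups.

    tau: number of time steps since this (state, action) pair was last
    tried in the real environment; kappa: small bonus weight."""
    return model_reward + kappa * math.sqrt(tau)
```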
Prioritized Sweeping
• Which states or state-action pairs should be generated during planning?
• Work backwards from states whose values have just changed:
  • Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change
  • When a new backup occurs, insert predecessors according to their priorities
  • Always perform backups from the first pair in the queue (a sketch follows below)
• Moore and Atkeson, 1993; Peng and Williams, 1993
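A minimal sketch of the prioritized-sweeping planning loop using a max-priority queue (Python's heapq is a min-heap, so priorities are negated). A deterministic model, a predecessors(s) function, and a defaultdict Q are assumed; all names are illustrative:

```python
import heapq

def prioritized_sweeping_planning(Q, model, predecessors, actions, pqueue,
                                  n_updates=5, alpha=0.5, gamma=0.95,
                                  theta=1e-4):
    """Perform up to n_updates backups, highest-priority pairs first.

    Q: defaultdict(float) of action values; model[(s, a)] = (reward, next_state)
    for a deterministic model; predecessors(s) yields (s_prev, a_prev) pairs
    predicted to lead to s; pqueue: list used as a heap of (-priority, (s, a))
    entries. States/actions are assumed comparable (e.g. ints or strings)
    so heapq can break priority ties."""
    for _ in range(n_updates):
        if not pqueue:
            break
        _, (s, a) = heapq.heappop(pqueue)
        r, s_next = model[(s, a)]
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        # Insert predecessors whose values would change by more than theta.
        for s_prev, a_prev in predecessors(s):
            r_prev, _ = model[(s_prev, a_prev)]
            best_s = max(Q[(s, a2)] for a2 in actions)
            priority = abs(r_prev + gamma * best_s - Q[(s_prev, a_prev)])
            if priority > theta:
                heapq.heappush(pqueue, (-priority, (s_prev, a_prev)))
    return pqueue
```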
Prioritized Sweeping
Prioritized Sweeping
This means: for i = 1 to N do: if not(empty(PQueue)) then …
Prioritized Sweeping vs. Dyna-Q
Both use N = 5 backups per environmental interaction
Rod Maneuvering (Moore and Atkeson, 1993)
Some dimensions of combining P&L
• These have been some ideas of how to combine planning (using models) and learning
• Next we consider some components of this process:
  • Full backups (considering everything that might happen) vs. sample backups (considering a single trajectory)
  • On-policy (computing the value of this policy) vs. off-policy (computing the value of the optimal policy)
  • State-action (Q) values vs. state (V) values
  • Trajectory sampling vs. heuristic search
Ways to combine learning and planning: Full and Sample (One-Step) Backups
• Dimensions of backups:
  • Full (dynamic programming) vs. sample (temporal difference)
  • On-policy vs. off-policy
  • State values vs. state-action values
• (2 × 2 × 2 = 8, but one doesn’t seem to be reasonable…)
Are full backups always preferable?
• They are better estimators, but they require lots of computation. Let’s do a comparison in the following context: learning $Q^*$, table look-up, and a complete distribution model ($P$ and $R$).
The full backup is:
$$Q(s,a) \leftarrow \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q(s',a') \right]$$
The sample backup (for a sampled next state $s'$) is:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ R^a_{ss'} + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$
Are full backups always preferable? (cont.)
• Which backup is better depends mostly on the branching factor, $b$: the number of possible next states.
Full vs. Sample Backups
b successor states, equally likely; initial error = 1; assume all next states’ values are correct
Bottom line: sample backups can be a big win in computation!
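A rough simulation of this setup (not from the slides; assumptions: b equally likely successors whose values are already correct, initial error of 1, sample averaging). It shows how the RMS error shrinks with successive sample backups, where one full backup costs roughly b times as much computation and removes the error entirely:

```python
import random

def backup_error_experiment(b, n_trials=1000, seed=0):
    """RMS error in Q(s, a) after t = 1 .. 2b sample backups."""
    rng = random.Random(seed)
    sq_errors = [0.0] * (2 * b)
    for _ in range(n_trials):
        # backed-up values of the b successors (assumed already correct)
        true_values = [rng.gauss(0.0, 1.0) for _ in range(b)]
        target = sum(true_values) / b      # what one full backup computes
        estimate = target + 1.0            # initial error of 1
        for t in range(1, 2 * b + 1):
            sample = rng.choice(true_values)
            estimate += (sample - estimate) / t   # incremental sample average
            sq_errors[t - 1] += (estimate - target) ** 2
    return [(e / n_trials) ** 0.5 for e in sq_errors]

if __name__ == "__main__":
    for b in (2, 10, 100):
        rms = backup_error_experiment(b)
        # After only b sample backups (the cost of one full backup),
        # the remaining error is already far below the initial error of 1.
        print(b, round(rms[b - 1], 3))
```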
Trajectory Sampling
• Trajectory sampling: perform backups along simulated trajectories (consider other possibilities: why are they bad?)
• This samples from the on-policy distribution
• Advantages when function approximation is used (Chapter 8: on-policy is the only one provably convergent)
• Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored:
(Diagram: initial states; states reachable under optimal control; irrelevant states)
Trajectory Sampling Experiment: Is on-policy always better?
• one-step full tabular backups
• uniform: cycled through all state-action pairs
• on-policy: backed up along simulated trajectories
• 200 randomly generated undiscounted episodic tasks
• 2 actions for each state, each with b equally likely next states
• 0.1 probability of transition to the terminal state
• expected reward on each transition selected from a mean-0, variance-1 Gaussian
(A sketch of the setup follows below.)
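A rough sketch of one such randomly generated task and the two backup orderings (uniform vs. on-policy). The greedy-policy evaluation used to produce the plotted curves is omitted, and all names and defaults are illustrative:

```python
import random

def make_random_task(n_states=1000, b=3, seed=0):
    """One randomly generated undiscounted episodic task, as on the slide:
    2 actions per state, each with b equally likely next states, a 0.1
    probability of transition to the terminal state, and expected rewards
    drawn from a mean-0, variance-1 Gaussian."""
    rng = random.Random(seed)
    succ, rew = {}, {}
    for s in range(n_states):
        for a in (0, 1):
            succ[(s, a)] = [rng.randrange(n_states) for _ in range(b)]
            rew[(s, a)] = rng.gauss(0.0, 1.0)
    return succ, rew

def full_backup(Q, succ, rew, s, a):
    """One-step full tabular backup under the known dynamics
    (0.1 of the probability mass goes to the terminal state, worth 0)."""
    next_p = 0.9 / len(succ[(s, a)])
    future = sum(next_p * max(Q[(s2, 0)], Q[(s2, 1)]) for s2 in succ[(s, a)])
    Q[(s, a)] = rew[(s, a)] + future

def uniform_sweep(succ, rew, n_states, n_backups):
    """Uniform ordering: cycle through all state-action pairs."""
    Q = {(s, a): 0.0 for s in range(n_states) for a in (0, 1)}
    pairs = list(Q)
    for i in range(n_backups):
        full_backup(Q, succ, rew, *pairs[i % len(pairs)])
    return Q

def on_policy_sweep(succ, rew, n_states, n_backups, epsilon=0.1, seed=1):
    """On-policy ordering: back up the pairs visited along simulated
    epsilon-greedy trajectories, restarting episodes from state 0."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(n_states) for a in (0, 1)}
    s, backups = 0, 0
    while backups < n_backups:
        if rng.random() < epsilon:
            a = rng.choice((0, 1))
        else:
            a = max((0, 1), key=lambda a2: Q[(s, a2)])
        full_backup(Q, succ, rew, s, a)
        backups += 1
        s = 0 if rng.random() < 0.1 else rng.choice(succ[(s, a)])
    return Q
```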
Trajectory Sampling Experiment
• Not conclusive:
  • Applied to particular, randomly generated problems
  • Short term: big win for on-policy
  • Long term: uniform sampling is better
• But, for large state spaces, on-policy is probably best, because it focuses the search.
Heuristic Search
• Used for action selection, not for changing a value function (= heuristic evaluation function)
• Backed-up values are computed, but typically discarded
• Extension of the idea of a greedy policy, only deeper (see the sketch below)
• Also suggests ways to select states to back up: smart focusing
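A minimal sketch of depth-limited lookahead for action selection over a deterministic model, where the backed-up values are used only to pick the current action and are then discarded (all names are illustrative):

```python
def lookahead_value(model, heuristic_value, actions, s, depth, gamma=0.95):
    """Backed-up value of state s from a depth-limited search."""
    if depth == 0:
        return heuristic_value(s)       # leaf: evaluate with the heuristic
    best = float("-inf")
    for a in actions:
        r, s_next = model(s, a)         # deterministic model assumed
        best = max(best, r + gamma * lookahead_value(
            model, heuristic_value, actions, s_next, depth - 1, gamma))
    return best

def select_action(model, heuristic_value, actions, s, depth=3, gamma=0.95):
    """Greedy action w.r.t. the backed-up values; the values themselves
    are computed, used once, and discarded."""
    def q(a):
        r, s_next = model(s, a)
        return r + gamma * lookahead_value(
            model, heuristic_value, actions, s_next, depth - 1, gamma)
    return max(actions, key=q)
```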
Summary
• Emphasized close relationship between planning and learning
• Important distinction between distribution models and sample models
• Looked at some ways to integrate planning and learning
  • synergy among planning, acting, model learning
• Distribution of backups: focus of the computation
  • trajectory sampling: backup along trajectories
  • prioritized sweeping
  • heuristic search
• Size of backups: full vs. sample; deep vs. shallow