Reinforcement Learning: A Tutorial
Rich Sutton, AT&T Labs -- Research
www.research.att.com/~sutton
Outline
• Core RL Theory
  – Value functions
  – Markov decision processes
  – Temporal-Difference Learning
  – Policy Iteration
• The Space of RL Methods
  – Function Approximation
  – Traces
  – Kinds of Backups
• Lessons for Artificial Intelligence
• Concluding Remarks
RL is Learning from Interaction
[Diagram: Agent and Environment interact in a loop; the agent takes actions, the environment returns state and reward]

• complete agent
• temporally situated
• continual learning & planning
• object is to affect environment
• environment stochastic & uncertain
Ball-Balancing Demo
• RL applied to a real physical system
• Illustrates online learning
• An easy problem...but learns as you watch!
• Courtesy of GTE Labs: Hamid Benbrahim, Judy Franklin, John Doleac, Oliver Selfridge
• Same learning algorithm as early pole-balancer
Markov Decision Processes (MDPs)
In RL, the environment is modeled as an MDP, defined by:

S – set of states of the environment
A(s) – set of actions possible in state s ∈ S
P(s,s',a) – probability of transition from s to s' given a
R(s,s',a) – expected reward on transition from s to s' given a
γ – discount rate for delayed reward

discrete time, t = 0, 1, 2, . . .

. . . s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → r_{t+3}, s_{t+3}, a_{t+3} → . . .
The Objective is to Maximize Long-term Total Discounted Reward

Find a policy π : s ∈ S → a ∈ A(s) (could be stochastic)
that maximizes the value (expected future reward) of each s:

V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + . . . | s_t = s, π }

and each s, a pair:

Q^π(s,a) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + . . . | s_t = s, a_t = a, π }

These are called value functions - cf. evaluation functions
Optimal Value Functions and Policies
There exist optimal value functions:

V*(s) = max_π V^π(s)        Q*(s,a) = max_π Q^π(s,a)

And corresponding optimal policies:

π*(s) = argmax_a Q*(s,a)

π* is the greedy policy with respect to Q*
1-Step Tabular Q-Learning
(Watkins, 1989)
On each state transition s_t, a_t → r_{t+1}, s_{t+1}, update a table entry:

Q(s_t,a_t) ← Q(s_t,a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1},a) − Q(s_t,a_t) ]

(the bracketed quantity is the TD Error)

lim_{t→∞} Q(s,a) → Q*(s,a)   and   lim_{t→∞} π_t → π*

Assumes finite MDP.
Optimal behavior found without a model of the environment!
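A minimal sketch of this update loop in Python. The environment interface (`reset()`, `step()`, a discrete `actions` list), the ε-greedy exploration, and the hyperparameter values are illustrative assumptions, not part of the slide:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """1-step tabular Q-learning sketch (after Watkins, 1989).

    Assumes env exposes reset() -> state, step(a) -> (next_state, reward, done),
    and a discrete action list env.actions; this interface is hypothetical.
    """
    Q = defaultdict(float)  # each table entry Q(s, a) starts at 0

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection (exploration; see later slide)
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a_: Q[(s, a_)])

            s_next, r, done = env.step(a)

            # TD error: r_{t+1} + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)
            best_next = 0.0 if done else max(Q[(s_next, a_)] for a_ in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

            s = s_next
    return Q
```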
Example of Value Functions
4x4 Grid Example
[Figure: 4x4 gridworld with terminal corner states (T). Left column: V_k for the equiprobable random policy at k = 0, 1, 2, 3, 10, ∞ (values fall from all 0.0 through −1.0, then −1.7/−2.0, −2.4/−2.9/−3.0, . . . , down to −14/−18/−20/−22 at k = ∞). Right column: the greedy policy w.r.t. each V_k, which is the random policy at k = 0 and becomes the optimal policy after a few iterations.]

V_{k+1}(s) = E{ r + V_k(s') }
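A small sketch of that sweep for a gridworld of this kind. The 4x4 layout, terminal corner states, reward of −1 per step, equiprobable random policy, and the undiscounted setting are assumptions read off the figure above:

```python
import numpy as np

def evaluate_random_policy(k_max=10):
    """Iterative policy evaluation for a 4x4 gridworld: terminal corners,
    reward -1 per step, equiprobable random policy, no discounting
    (assumptions matching the figure above)."""
    V = np.zeros((4, 4))
    terminals = {(0, 0), (3, 3)}
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]       # up, down, left, right

    for _ in range(k_max):
        V_new = np.zeros_like(V)
        for i in range(4):
            for j in range(4):
                if (i, j) in terminals:
                    continue
                expected = 0.0
                for di, dj in moves:                  # each action with prob 1/4
                    ni, nj = i + di, j + dj
                    if not (0 <= ni < 4 and 0 <= nj < 4):
                        ni, nj = i, j                 # bumping a wall leaves the state unchanged
                    expected += 0.25 * (-1.0 + V[ni, nj])
                V_new[i, j] = expected
        V = V_new
    return V

print(np.round(evaluate_random_policy(k_max=3), 1))  # should reproduce the k = 3 values, to rounding
```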
Policy Iteration
Summary of RL Theory: Generalized Policy Iteration

[Diagram: the policy and the value function (V, Q) drive each other: policy evaluation (value learning) updates the value function, and policy improvement (greedification) updates the policy, converging to π* and V*, Q*]
Outline
• Core RL Theory
  – Value functions
  – Markov decision processes
  – Temporal-Difference Learning
  – Policy Iteration
• The Space of RL Methods
  – Function Approximation
  – Traces
  – Kinds of Backups
• Lessons for Artificial Intelligence
• Concluding Remarks
Sarsa Backup
On transition: s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1}

Backup! The value of this pair is updated from the value of the next pair:

Q(s_t,a_t) ← Q(s_t,a_t) + α [ r_{t+1} + γ Q(s_{t+1},a_{t+1}) − Q(s_t,a_t) ]

(the bracketed quantity is the TD Error)

Rummery & Niranjan, Holland, Sutton
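For contrast with the Q-learning sketch earlier, a minimal on-policy Sarsa loop: the backup bootstraps from the action actually taken next rather than from the max. The environment interface and hyperparameters are the same illustrative assumptions as before:

```python
import random
from collections import defaultdict

def sarsa(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """On-policy Sarsa sketch: the target uses Q(s', a') for the a' actually chosen."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a_: Q[(s, a_)])

    for _ in range(episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # step size times TD error
            s, a = s_next, a_next
    return Q
```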
Dimensions of RL algorithms
1. Action Selection and Exploration
2. Function Approximation
3. Eligibility Trace
4. On-line Experience vs Simulated Experience
5. Kind of Backups
. . .
1. Action Selection (Exploration)
OK, you've learned a value function Q ≈ Q*. How do you pick actions?

Greedy Action Selection: always select the action that looks best:

  π(s) = argmax_a Q(s,a)

ε-Greedy Action Selection: be greedy most of the time, occasionally take a random action.

Surprisingly, this is the state of the art.

Exploitation vs Exploration!
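The two selection rules above, written out as a small helper; the dictionary-style Q table is an assumption carried over from the earlier sketches:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon explore (uniform random action),
    otherwise exploit: pick argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

Setting epsilon = 0 recovers pure greedy selection.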
2. Function Approximation
[Diagram: a function approximator takes state s and action a as input and outputs Q(s,a); it is trained by gradient-descent methods from targets or errors]

Could be:
• table
• Backprop Neural Network
• Radial-Basis-Function Network
• Tile Coding (CMAC)
• Nearest Neighbor, Memory Based
• Decision Tree
Adjusting Network Weights

e.g., gradient-descent Sarsa, with estimated value Q(s,a) = f(s,a,w) for weight vector w:

w ← w + α [ r_{t+1} + γ Q(s_{t+1},a_{t+1}) − Q(s_t,a_t) ] ∇_w f(s_t,a_t,w)

(target value minus estimated value, times the standard backprop gradient)
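For a linear approximator (one of the options listed above), ∇_w f(s,a,w) is just the feature vector, so the step reduces to the sketch below; the `features` function and the constants are illustrative assumptions:

```python
import numpy as np

def linear_sarsa_update(w, features, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    """One gradient-descent Sarsa step for Q(s,a) = w . features(s,a),
    where grad_w Q(s,a) = features(s,a). `features` returns a numpy vector."""
    x = features(s, a)
    x_next = features(s_next, a_next)
    td_error = r + gamma * np.dot(w, x_next) - np.dot(w, x)   # target minus estimate
    return w + alpha * td_error * x
```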
Sparse Coarse Coding
[Diagram: input features pass through a fixed, expansive re-representation, followed by a linear last layer]

Coarse: large receptive fields
Sparse: few features present at one time
The Mountain Car Problem

Minimum-Time-to-Goal Problem (Moore, 1990)

[Figure: car in a valley, Goal at the top of the right hill; gravity wins, so the engine alone cannot drive straight up]

SITUATIONS: car's position and velocity
ACTIONS: three thrusts: forward, reverse, none
REWARDS: always −1 until car reaches the goal
No Discounting
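For concreteness, one step of the mountain-car dynamics as commonly formulated; the specific constants follow the usual Sutton & Barto version of the task and are assumptions, not given on the slide:

```python
import math

def mountain_car_step(position, velocity, action):
    """One step of the standard mountain-car dynamics (constants are assumed,
    following the common formulation). action: -1 reverse, 0 none, +1 forward."""
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position += velocity
    if position < -1.2:                # hitting the left wall stops the car
        position, velocity = -1.2, 0.0
    done = position >= 0.5             # goal at the top of the right hill
    reward = -1.0                      # always -1 until the goal is reached
    return position, velocity, reward, done
```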
RL Applied to the Mountain Car (Sutton, 1995) [Video Demo]

1. TD Method = Sarsa Backup
2. Action Selection = Greedy, with initial optimistic values Q_0(s,a) = 0
3. Function approximator = CMAC
4. Eligibility traces = Replacing
Tile Coding applied to Mountain Car, a.k.a. CMACs (Albus, 1980)

[Figure: two offset 5x5 grid tilings (Tiling #1, Tiling #2) laid over the car-position dimension]

Shape/Size of tiles ⇒ Generalization
#Tilings ⇒ Resolution of final approximation
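A rough sketch of tile coding over the two mountain-car state variables: several offset grids (tilings) are laid over the state space, and the index of the one active tile per tiling forms a sparse, coarse feature set. Grid sizes, offsets, and ranges here are illustrative assumptions:

```python
def tile_indices(position, velocity, num_tilings=5, tiles_per_dim=9,
                 pos_range=(-1.2, 0.5), vel_range=(-0.07, 0.07)):
    """Return one active tile index per tiling (a sparse, coarse code)."""
    active = []
    for t in range(num_tilings):
        offset = t / num_tilings                       # each tiling shifted by a fraction of a tile
        p = (position - pos_range[0]) / (pos_range[1] - pos_range[0])
        v = (velocity - vel_range[0]) / (vel_range[1] - vel_range[0])
        pi = int(p * (tiles_per_dim - 1) + offset)
        vi = int(v * (tiles_per_dim - 1) + offset)
        active.append(t * tiles_per_dim * tiles_per_dim + pi * tiles_per_dim + vi)
    return active
```

With a linear approximator, Q(s,a) is then just the sum of the weights of the active tiles for that action: larger tiles give broader generalization, more tilings give finer resolution, as the slide notes.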
The Acrobot Problem
e.g., DeJong & Spong, 1994; Sutton, 1995

Minimum-Time-to-Goal problem:

[Figure: two-link acrobot with joint angles θ1 and θ2; torque is applied at the second joint; Goal: raise the tip above the line]

4 state variables: 2 joint angles, 2 angular velocities
Reward = −1 per time step
CMAC of 48 layers
RL same as Mountain Car
3. Eligibility Traces (*)

• The λ in TD(λ), Q(λ), Sarsa(λ) . . .
• A more primitive way of dealing with delayed reward
• The link to Monte Carlo methods
• Passing credit back farther than one action
• The general form is:

  change in Q(s,a) at time t  =  α · (TD error at time t) · (eligibility of s,a at time t)

  The TD error is global; the eligibility is specific to s,a and tracks visits to s,a: e.g., was s,a taken at t, or shortly before t?

Eligibility Traces

[Figure: over TIME, at each visit to s,a the accumulating trace is incremented and then decays, while the replacing trace is reset to 1 and then decays]
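A dictionary-based sketch of the two trace types, assuming the tabular Q setting from earlier: at each step every trace decays by γλ, the pair just visited is either incremented (accumulating) or reset to 1 (replacing), and every pair is then updated by α · δ_t · e(s,a):

```python
def decay_and_mark(e, visited_sa, gamma=1.0, lam=0.9, replacing=True):
    """Decay all eligibility traces by gamma*lambda, then mark the pair just visited."""
    for sa in list(e):
        e[sa] *= gamma * lam
    if replacing:
        e[visited_sa] = 1.0                               # replacing trace
    else:
        e[visited_sa] = e.get(visited_sa, 0.0) + 1.0      # accumulating trace
    return e

def trace_update(Q, e, td_error, alpha=0.1):
    """Credit the TD error to every recently visited pair, in proportion to its trace."""
    for sa, trace in e.items():
        Q[sa] += alpha * td_error * trace
    return Q
```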
Complex TD Backups (*)

[Figure: within a trial, the successive primitive TD backups are weighted 1−λ, (1−λ)λ, (1−λ)λ², . . . , with the final (full-trial, Monte Carlo) backup taking the remaining weight so that the weights sum to 1]

e.g., TD(λ): incrementally computes a weighted mixture of these backups as states are visited

λ = 0: Simple TD
λ = 1: Simple Monte Carlo
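Written out (a standard formulation, not shown on the slide), the weighted mixture for an episode ending at time T is the λ-return built from n-step returns G_t^{(n)}:

```latex
G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(s_{t+n})

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} \; + \; \lambda^{T-t-1} G_t
```

The weights sum to 1; λ = 0 leaves only the 1-step TD target, and λ = 1 leaves only the full Monte Carlo return, matching the two limiting cases above.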
Kinds of Backups
[Diagram: the space of backups spanned by two dimensions. Width: full backups (Dynamic programming, Exhaustive search) vs sample backups (Temporal-difference learning, Monte Carlo). Depth: shallow backups, i.e. bootstrapping (Dynamic programming, Temporal-difference learning) vs deep backups (Exhaustive search, Monte Carlo).]
Outline
• Core RL Theory
  – Value functions
  – Markov decision processes
  – Temporal-Difference Learning
  – Policy Iteration
• The Space of RL Methods
  – Function Approximation
  – Traces
  – Kinds of Backups
• Lessons for Artificial Intelligence
• Concluding Remarks
World-Class Applications of RL
• TD-Gammon and Jellyfish (Tesauro, Dahl): world's best backgammon player
• Elevator Control (Crites & Barto): world's best down-peak elevator controller
• Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis): 10-15% improvement over industry standard methods
• Dynamic Channel Assignment (Singh & Bertsekas, Nie & Haykin): world's best assigner of radio channels to mobile telephone calls
Lesson #1: Approximate the Solution, Not the Problem
• All these applications are large, stochastic, optimal control problems
– Too hard for conventional methods to solve exactly
– Require problem to be simplified
• RL just finds an approximate solution
  – . . . which can be much better!
Backgammon (Tesauro, 1992, 1995)

SITUATIONS: configurations of the playing board (about 10^20)
ACTIONS: moves
REWARDS: win: +1, lose: −1, else: 0

Pure delayed reward
TD-Gammon (Tesauro, 1992–1995)

Start with a random network
Play millions of games against self
Learn a value function from this simulated experience
Action selection by 2–3 ply search

TD error: V_{t+1} − V_t

This produces arguably the best player in the world

(Neurogammon: same network, but trained from 15,000 expert-labeled examples)
Lesson #2: The Power of Learning from Experience

[Plot: performance against gammontool, roughly 50% to 70%, versus number of hidden units (0, 10, 20, 40, 80), for TD-Gammon trained by self-play; Tesauro, 1992]

Expert examples are expensive and scarce
Experience is cheap and plentiful!
And teaches the real solution
Lesson #3: The Central Role of Value Functions

. . . of modifiable moment-by-moment estimates of how well things are going
All RL methods learn value functions
All state-space planners compute value functions
Both are based on "backing up" value
Recognizing and reacting to the ups and downs of life is an important part of intelligence
Lesson #4: Learning and Planning can be radically similar

Historically, planning and trial-and-error learning have been seen as opposites

But RL treats both as processing of experience

[Diagram: on-line interaction with the environment produces experience, which learning turns into a value/policy; simulation from an environment model also produces experience, which planning turns into a value/policy]
1-Step Tabular Q-Planning

1. Generate a state, s, and action, a
2. Consult model for next state, s', and reward, r
3. Learn from the experience, s,a,r,s':

   Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ]

4. Go to 1
With function approximation and cleverness in search control (Step 1), this is a viable, perhaps even superior, approach to state-space planning
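A minimal tabular sketch of the four steps above; the model is assumed to be a deterministic table mapping (s, a) to (s', r), and uniform random choice over modelled pairs stands in for Step 1's search control:

```python
import random
from collections import defaultdict

def q_planning(model, actions, n_updates=1000, alpha=0.1, gamma=0.95):
    """1-step tabular Q-planning: learn from imagined transitions drawn from a model.
    `model` maps (s, a) -> (s_next, r); both it and the constants are assumptions."""
    Q = defaultdict(float)
    pairs = list(model.keys())
    for _ in range(n_updates):
        s, a = random.choice(pairs)                                 # 1. generate a state and action
        s_next, r = model[(s, a)]                                   # 2. consult the model
        best_next = max(Q[(s_next, a_)] for a_ in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])    # 3. learn from s, a, r, s'
    return Q                                                        # 4. (loop)
```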
Lesson #5: Accretive Computation "Solves" Planning Dilemmas

Processes only loosely coupled
Can proceed in parallel, asynchronously
Quality of solution accumulates in value & model memories

Reactivity/Deliberation dilemma "solved" simply by not opposing search and memory
Intractability of planning "solved" by anytime improvement of solution

[Diagram: the Dyna architecture: acting generates experience; direct RL and model learning both consume experience; planning works from the model; everything accumulates in the value/policy and the model]
Dyna Algorithm

1. s ← current state
2. Choose an action, a, and take it
3. Receive next state, s', and reward, r
4. Apply RL backup to s, a, s', r (e.g., Q-learning update)
5. Update Model(s, a) with s', r
6. Repeat k times:
   - select a previously seen state-action pair s, a
   - s', r ← Model(s, a)
   - Apply RL backup to s, a, s', r
7. Go to 1
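Putting the seven steps together in Python, under the same illustrative environment interface and hyperparameter assumptions as the earlier sketches, with a deterministic table as the model:

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=50, k=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Dyna sketch: one direct RL backup per real step, plus k planning backups
    from a learned deterministic model Model(s, a) -> (s', r)."""
    Q, model = defaultdict(float), {}

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a_: Q[(s, a_)])

    def backup(s, a, r, s_next):
        best_next = max(Q[(s_next, a_)] for a_ in env.actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False                 # 1. s <- current state
        while not done:
            a = eps_greedy(s)                        # 2. choose an action and take it
            s_next, r, done = env.step(a)            # 3. receive next state and reward
            backup(s, a, r, s_next)                  # 4. direct RL backup
            model[(s, a)] = (s_next, r)              # 5. update the model
            for _ in range(k):                       # 6. k planning backups
                ps, pa = random.choice(list(model))
                ps_next, pr = model[(ps, pa)]
                backup(ps, pa, pr, ps_next)
            s = s_next                               # 7. continue from the new state
    return Q
```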
Actions → Behaviors

MDPs seem too flat, too low-level
Need to learn/plan/design agents at a higher level: at the level of behaviors rather than just low-level actions
e.g., open-the-door rather than twitch-muscle-17; walk-to-work rather than 1-step-forward

Behaviors are based on entire policies, directed toward subgoals
They enable abstraction, hierarchy, modularity, transfer...
Rooms Example

[Figure: a gridworld of 4 rooms connected by 4 hallways; two hallway subgoals per room are marked (O1, O2), and candidate goal locations are marked G?]

• 4 rooms, 4 hallways
• 8 multi-step options (to each room's 2 hallways)
• 4 unreliable primitive actions (up, down, left, right): fail 33% of the time
• Goal states are given a terminal value of 1; γ = .9
• All rewards zero
• Given goal location, quickly plan shortest route
Planning by Value Iteration

[Figure: value-iteration sweeps (Iteration #1, #2, #3) with V(goal) = 1, shown once with primitive actions (cell-to-cell) and once with behaviors (room-to-room); with behaviors, value propagates across whole rooms per iteration]
Behaviors are much like Primitive Actions

• Define a higher-level Semi-MDP, overlaid on the original MDP
• Can be used with the same planning methods
  – faster planning
  – guaranteed correct results
• Interchangeable with primitive actions
• Models and policies can be learned by TD methods
Lesson #6: Generality is no impediment to working with higher-level knowledge

Behaviors provide a clear semantics for multi-level, re-usable knowledge in the general MDP context
Summary of Lessons
1. Approximate the Solution, Not the Problem
2. The Power of Learning from Experience
3. The Central Role of Value Functions in finding optimal sequential behavior
4. Learning and Planning can be Radically Similar
5. Accretive Computation "Solves Dilemmas"
6. A General Approach need not be Flat, Low-level
All stem from trying to solve MDPs
Outline
• Core RL Theory
  – Value functions
  – Markov decision processes
  – Temporal-Difference Learning
  – Policy Iteration
• The Space of RL Methods
  – Function Approximation
  – Traces
  – Kinds of Backups
• Lessons for Artificial Intelligence
• Concluding Remarks
Strands of History of RL

• Trial-and-error learning: Thorndike (1911), Minsky, Klopf, Barto et al.
• Temporal-difference learning: secondary reinforcement, Samuel, Witten, Sutton
• Optimal control, value functions: Hamilton (physics, 1800s), Shannon, Bellman/Howard (OR), Werbos
• Brought together by: Watkins, Holland
Radical Generality of the RL/MDP problem

Classical: determinism, goals of achievement, complete knowledge, closed world, unlimited deliberation

RL/MDPs: general stochastic dynamics, general goals, uncertainty, reactive decision-making
Getting the degree of abstraction right
• Actions can be low level (e.g. voltages to motors), high level (e.g., accept a job offer), "mental" (e.g., shift in focus of attention)
• Situations: direct sensor readings, or symbolic descriptions, or "mental" situations (e.g., the situation of being lost)
• An RL agent is not like a whole animal or robot, which may consist of many RL agents as well as other components
• The environment is not necessarily unknown to the RL agent, only incompletely controllable
• Reward computation is in the RL agent's environment because the RL agent cannot change it arbitrarily
Some of the Open Problems
• Incomplete state information
• Exploration
• Structured states and actions
• Incorporating prior knowledge
• Using teachers
• Theory of RL with function approximators
• Modular and hierarchical architectures
• Integration with other problem-solving and planning methods
Dimensions of RL Algorithms
• TD --- Monte Carlo (λ)
• Model --- Model free
• Kind of Function Approximator
• Exploration method
• Discounted --- Undiscounted
• Computed policy --- Stored policy
• On-policy --- Off-policy
• Action values --- State values
• Sample model --- Distribution model
• Discrete actions --- Continuous actions
A Unified View
• There is a space of methods, with tradeoffs and mixtures, rather than discrete choices
• Value functions are key
  – all the methods are about learning, computing, or estimating values
  – all the methods store a value function
• All methods are based on backups
  – different methods just have backups of different shapes and sizes, sources and mixtures
Psychology and RL
PAVLOVIAN CONDITIONING
The single most common and important kind of learning
Ubiquitous, from people to paramecia
Learning to predict and anticipate, learning affect
TD methods are a compelling match to the data
Most modern, real-time methods are based on Temporal-Difference Learning

INSTRUMENTAL LEARNING (OPERANT CONDITIONING)
"Actions followed by good outcomes are repeated"
Goal-directed learning, The Law of Effect
Biological version of Reinforcement Learning
But careful comparisons to data have not yet been made

Hawkins & Kandel, Hopfield & Tank, Klopf, Moore et al., Byrne et al., Sutton & Barto, Malaka
Neuroscience and RL
RL interpretations of brain structures are being explored in several areas:
Basal Ganglia, Dopamine Neurons, Value Systems
Final Summary
• RL is about a problem:
  – goal-directed, interactive, real-time decision making
  – all hail the problem!
• Value functions seem to be the key to solving it, and TD methods the key new algorithmic idea
• There are a host of immediate applications
  – large-scale stochastic optimal control and scheduling
• There are also a host of scientific problems
• An exciting time
Bibliography

Sutton, R.S. and Barto, A.G. Reinforcement Learning: An Introduction. MIT Press, 1998.
Kaelbling, L.P., Littman, M.L., and Moore, A.W. "Reinforcement learning: A survey". Journal of Artificial Intelligence Research, 4, 1996.
Special issues on Reinforcement Learning: Machine Learning, 8(3/4), 1992, and 22(1/2/3), 1996.
Neuroscience: Schultz, W., Dayan, P., and Montague, P.R. "A neural substrate of prediction and reward". Science, 275:1593-1598, 1997.
Animal Learning: Sutton, R.S. and Barto, A.G. "Time-derivative models of Pavlovian reinforcement". In Gabriel, M. and Moore, J., Eds., Learning and Computational Neuroscience, pp. 497-537. MIT Press, 1990.
See also web pages, e.g., starting from http://www.cs.umass.edu/~rich.