Reinforcement Learning
Based on slides by Avi Pfeffer and David Parkes
Closed Loop InteractionsEnvironment
Sensors Actuators
Reward
Agent
Reinforcement Learning
• When mechanism(=model) is unknown
• When mechanism is known, but model is too hard to solve
Basic Idea
• Select an action using some sort of action selection process
• If it leads to a reward, reinforce taking that action in future
• If it leads to a punishment, avoid taking that action in future
But It’s Not So Simple
• Rewards and punishments may be delayed– credit assignment problem: how do you
figure out which actions were responsible?• How do you choose an action?
– exploration versus exploitation• What if the state space is very large so
you can’t visit all states?
Model-Based RL
Model-Based Reinforcement Learning
• Mechanism is an MDP • Approach:
– learn the MDP– solve it to determine the optimal policy
• Works when model is unknown, but it is not too large to store and solve
Learning the MDP• We need to learn the parameters of the reward and
transition models• We assume the agent plays every action in every
state a number of times• Let Ra
i = total reward received for playing a in state i• Let Na
i = number of times played a in state i• Let Na
ij = number of times j was reached when played a in state i
• R(i,a) = Rai / Na
i
• Taij = Na
ij / Nai
Note
• Learning and solving the MDP need not be a one-off thing
• Instead, we can repeatedly solve the MDP to get better and better policies
• How often should we solve the MDP?– depends how expensive it is compared to
acting in the world
Model-Based Reinforcement Learning Algorithm
Let 0 be arbitraryk 0Experience Repeat k k + 1 Begin in state i For a while: Choose action a based on k-1
Receive reward r and transition to j Experience Experience < i, a, r, j > i j Learn MDP M from Experience Solve M to obtain k
Credit Assignment• How does model-based RL deal with the
credit assignment problem?
• By learning the MDP, the agent knows which states lead to which other states
• Solving the MDP ensures that the agent plans ahead and takes the long run effects of actions into account
• So the problem is solved optimally
Action Selection
• The line in the algorithm Choose action a based on k-1
is not specific• How do we choose the action?
Action Selection
• The line in the algorithm Choose action a based on k-1
is not specific• How do we choose the action?• Obvious answer: the policy tells us the
action to perform• But is that always what we want to do?
Exploration versus Exploitation
• Exploit: use your learning results to play the action that maximizes your expected utility, relative to the model you have learned
• Explore: play an action that will help you learn the model better
Questions• When to explore• How to explore
– simple answer: play an action you haven’t played much yet in the current state
– more sophisticated: play an action that will probably lead you to part of the space you haven’t explored much
• How to exploit– we know the answer to this: follow the learned
policy
Conditions for Optimality
To ensure that the optimal policy will eventually be reached, we need to ensure that
1. Every action is taken in every state infinitely often in the long run
2. The probability of exploitation tends to 1
Possible Exploration Strategies: 1
• Explore until time T, then exploit• Why is this bad?
Possible Exploration Strategies: 1
• Explore until time T, then exploit• Why is this bad?
– We may not explore long enough to get an accurate model
– As a result, the optimal policy will not be reached
Possible Exploration Strategies: 1
• Explore until time T, then exploit• Why is this bad?
– We may not explore long enough to get an accurate model
– As a result, the optimal policy will not be reached
• But it works well if we’re planning to learn the MDP once, then solve it, then play according to the learned policy
Possible Exploration Strategies: 1
• Explore until time T, then exploit• Why is this bad?
– We may not explore long enough to get an accurate model
– As a result, the optimal policy will not be reached• But it works well if we’re planning to learn the
MDP once, then solve it, then play according to the learned policy
• Works well for learning from simulation and performing in the real world
Possible Exploration Strategies: 2
• Explore with a fixed probability of p• Why is this bad?
Possible Exploration Strategies: 2
• Explore with a fixed probability of p• Why is this bad?
– Does not fully exploit when learning has converged to optimal policy
Possible Exploration Strategies: 2
• Explore with a fixed probability of p• Why is this bad?
– Does not fully exploit when learning has converged to optimal policy
• When could this approach be useful?
Possible Exploration Strategies: 2
• Explore with a fixed probability of p• Why is this bad?
– Does not fully exploit when learning has converged to optimal policy
• When could this approach be useful?– If world is changing gradually
Boltzmann Exploration• In state i, choose action a
with probability
• T is a temperature• High temperature: more exploration• T should be cooled down to reduce amount of
exploration over time • Sensitive to cooling schedule
'
)',(
),(
a
TaiQ
TaiQ
e
e
Guarantee
• If:– every action is taken in every state infinitely
often– probability of exploration tends to zero
• Then:– Model-based reinforcement learning will
converge to the optimal policy with probability 1
Pros and Cons
• Pro: – makes maximal use of experience– solves model optimally given experience
• Con: – assumes model is small enough to solve– requires expensive solution procedure
R-Max• Assume R(s,a)=R-max (the maximal possible
reward– Called optimism bias
• Assume any transition probability• Solve and act optimally• When Na
i > c, update R(i,a)• After each update, resolve• If you choose c properly, converges to the
optimal policy
Model-Free RL
Monte Carlo Sampling• If we want to estimate y = Ex~D[f(x)] we can
– Generate random samples x1,…,xN from D– Estimate
– Guaranteed to converge to correct estimate with sufficient samples
– Requires keeping count of # of samples• Alternative, update average:
– Generate random samples x1,…,xN from D– Estimate
N
i
ixfN
y1
)(1ˆ
Estimating the Value of a Policy
• Fix a policy • When starting in state i, taking action a
according to , getting reward r and transitioning to j, we get a sample of
• So we can updateV(i) (1-)V(i) + (r + V(j))
• But where does V(j) comes from?– Guess (this is called bootstrapping)
Temporal Difference Algorithm
For each state i: V(i) 0Begin in state iRepeat: Apply action a based on current policy
Receive reward r and transition to j i j
))(()()1()( jVriViV
Credit Assignment
• By linking values to those of the next state, rewards and punishments are eventually propagated backwards
• We wait until end of game and then propagate backwards in reverse order
But how do learn to act
• To improve our policy, we need to have an idea of how good it is to use a different policy
• TD learns the value function– Similar to the value determination step of policy
iteration, but without a model• To improve, we need an estimate of the Q
function:
TD for Control: SARSA
Initialize Q(s,a) arbitrarilyRepeat (for each episode): Initialize s Choose a from s using policy derived from Q (e.g., ε-greedy) Repeat (for each step of episode): Take action a, observe r, Choose a’ from s’ using policy derived from Q (e.g., ε-greedy) s s’, aa’ until s is terminal
Off-Policy vs. On-Policy• On-policy learning: learn only the value of
actions used in the current policy. SARSA is an example of an on-policy method. Of course, learning of the value of the policy is combined with gradual change of the policy
• Off-policy learning: can learn the value of a policy/action different than the one used – separating learning from control. Q-learning is an example. It learns about the optimal policy by using a different policy (e.g., e-greedy policy).
Q Learning
• Don’t learn the model, learn the Q function directly
• Works particularly well when model is too large to store, to solve or to learn– size of model: O(|States|2)– cost of solution by policy iteration: O(|States|3)– size of Q function: O(|Actions|*|States|)
Recursive Formulation of Q Function
j
aij jVTaiRaiQ )(),(),(
),(max),( bjQTaiRj b
aij
j j b
aij
aij bjQTaiRT ),(max),(
j b
aij bjQaiRT )),(max),((
)],(max),([E~
bjQaiRbTj a
i
Learning the Q Values
• We don’t know Tai and we don’t want to
learn it
Learning the Q Values
• We don’t know Tai and we don’t want to
learn it• If only we knew that our future Q values
were accurate…• …every time we played a in state i and
transitioned to j, receiving reward r, we would get a sample of R(i,a)+maxbQ(j,b)
Learning the Q Values
• We don’t know Tai and we don’t want to
learn it• If only we knew that our future Q values
were accurate…• …every time we played a in state i and
transitioned to j, receiving reward r, we would get a sample of R(i,a)+maxbQ(j,b)
• So we pretend that they are accurate– (after all, they get more and more accurate)
Q Learning Update Rule• On transitioning from i to j, taking action a,
receiving reward r, update
)),(max(),()1(),( bjQraiQaiQb
Q Learning Update Rule• On transitioning from i to j, taking action a,
receiving reward r, update
• is the learning rate• Large :
– learning is quicker– but may not converge
• is often decreased over the course of learning
)),(max(),()1(),( bjQraiQaiQb
Q Learning Algorithm
For each state i and action a: Q(i,a) 0Begin in state iRepeat: Choose action a based on the Q values for state
i for all actions Receive reward r and transition to j
i j)),(max(),()1(),( bjQraiQaiQ
b
Choosing Which Action to Take
• Once you have learned the Q function, you can use it to determine the policy– in state i, choose action a that has highest
estimated Q(i,a)• But we need to combine exploitation with
exploration– same methods as before
Guarantee• If:
– every action is taken in every state infinitely often– is sufficiently small
• Then Q learning will converge to the optimal Q values with probability 1
• If also:– probability of exploration tends to zero
• Then Q learning will converge to the optimal policy with probability 1
Credit Assignment
• By linking Q values to those of the next state, rewards and punishments are eventually propagated backwards
• But may take a long time• Idea: wait until end of game and then
propagate backwards in reverse order
Q-learning ( = 1)
S1
S2 S3 S4 S5
S6 S7 S8 S9
a
b
a,b a,b a,b
a,b a,b a,b
After playing aaaa:Q(S4,a) = 1Q(S4,b) = 0Q(S3,a) = 1Q(S3,b) = 0
Q(S2,a) = 1Q(S2,b) = 0Q(S1,a) = 1Q(S1,b) = 0
After playing bbbb:Q(S8,a) = 0Q(S8,b) = -1Q(S7,a) = 0Q(S7,b) = 0
Q(S6,a) = 0Q(S6,b) = 0Q(S1,a) = 1Q(S1,b) = 0
00 0 1
0 0 -1
0
Bottom Line
• Q learning makes optimistic assumption about the future
• Rewards will be propagated back in linear time, but punishments may take exponential time to be propagated
• But eventually, Q learning will converge to optimal policy
Issue: Generalization
• What if state space is very large?• Then we can’t visit every state• We need to generalize from states we
have seen to states we haven’t seen• This is just like learning from a training set
and generalizing to the future
67
State Space And Variables
• When we looked at reinforcement learning, state space was monolithic– e.g. in darts, just a number
• In many domains, state consists of a number of variables– e.g. in backgammon, number of pieces at each
location• Size of state space is exponential in number
of variables• We also need to consider continuous state
spaces– e.g. helicopter
68
Value Function Approximation
• Define features X1,…,Xn of the state
69
Value Function Approximation
• Define features X1,…,Xn of the state• Instead of learning V(s) for every state,
learn an approximation )(ˆ sV
70
Value Function Approximation
• Define features X1,…,Xn of the state• Instead of learning V(s) for every state,
learn an approximation• depends only on the features
)(ˆ sV)(ˆ sV
71
Value Function Approximation
• Define features X1,…,Xn of the state• Instead of learning V(s) for every state,
learn an approximation• depends only on the features• Represent compactly
– e.g. using neural network
)(ˆ sV)(ˆ sV
)(ˆ sV
72
Value Function Approximation
• Define features X1,…,Xn of the state• Instead of learning V(s) for every state,
learn an approximation• depends only on the features• Represent compactly
– e.g. using neural network • Works when state space is large but
mechanism is known– e.g. backgammon
)(ˆ sV)(ˆ sV
)(ˆ sV
73
Q Function Approximation
• Define features X1,…,Xn of the state
74
Q Function Approximation
• Define features X1,…,Xn of the state• Instead of learning Q(s,a) for every
state, learn an approximation ),(ˆ asQ
75
Q Function Approximation
• Define features X1,…,Xn of the state• Instead of learning Q(s,a) for every
state, learn an approximation• depends only on the features
),(ˆ asQ
),(ˆ asQ
76
Q Function Approximation
• Define features X1,…,Xn of the state• Instead of learning Q(s,a) for every
state, learn an approximation• depends only on the features• Represent compactly
– e.g. using neural network
),(ˆ asQ
),(ˆ asQ),(ˆ asQ
77
Q Function Approximation
• Define features X1,…,Xn of the state• Instead of learning Q(s,a) for every
state, learn an approximation• depends only on the features• Represent compactly
– e.g. using neural network• Works when state space is large and
mechanism is unknown– e.g. helicopter
),(ˆ asQ),(ˆ asQ
),(ˆ asQ
78
Value Function Approximation Update Rule
• On transitioning from i to j, taking action a, receiving reward r:
• Create a training instance in which– Inputs are features of i– Output is
• Run forward propagation and back propagation on this instance
)(ˆ jVr
79
Basic Approach• Define features that summarize the state
– state represented by features X1,…,Xn
• Assume that the value of a state approximately depends only on the features– V’(s) = f(x1,…,xn)
• Assume that f can be compactly represented• Learn f from experience
– how to learn such a function will be a major topic of this course
E.g. Samuel’s Checkers Player
• Features:– x1: number of black pieces on board– x2: number of red pieces on board– x3: number of black kings on board– x4: number of red kings on board– x5: number of black pieces threatened– x6: number of red pieces threatened
• f(x1,…,x6) = w1x1+w2x2+w3x3+w4x4+w5x5+w6x6
• w1,…,w6 are learnable parameters
Training Data
• Each time agent transitions from i to j, taking action a and receiving reward r, we get an estimate v = r + V’(j) for V(i)
• Let the features of i be x1,…,xn
• We get a training instance <x1,…,xn,v>• We use this instance to update our model
of f
Applications of MDPs, POMDPs and Reinforcement Learning
• TD-Gammon: world champion level backgammon player
• Robotics and control: e.g. helicopter
• Industrial: e.g. job shop scheduling• Business: e.g. internet advertising• Military: e.g. target identification• Medical: e.g. testing and diagnosis
Top Related