ECE-517 Reinforcement Learning in Artificial Intelligence
Lecture 7: Finite Horizon MDPs, Dynamic Programming
Dr. Itamar Arel
College of Engineering, Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2011
September 15, 2011
Outline

Finite Horizon MDPs
Dynamic Programming
Finite Horizon MDPs – Value Functions

The duration, or expected duration, of the process is finite.

Let's consider the following return functions:

The expected sum of rewards, where V_N(s) denotes the expected sum of rewards for N steps:

    V(s) = \lim_{N \to \infty} E[ V_N(s) ] = \lim_{N \to \infty} E\left[ \sum_{t=1}^{N} r_t \,\middle|\, s_0 = s \right]

The expected discounted sum of rewards:

    V(s) = \lim_{N \to \infty} E\left[ \sum_{t=1}^{N} \gamma^{t-1} r_t \,\middle|\, s_0 = s \right], \qquad 0 \le \gamma < 1

A sufficient condition for the above to converge is r_t < r_max ...
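A tiny illustration of the discounted-sum return for one sampled reward sequence (the numbers and the snippet are mine, not from the slides):

import numpy as np

# Discounted sum of one sampled reward sequence r_1, r_2, r_3 (illustrative numbers).
gamma = 0.9
rewards = np.array([5.0, 10.0, -1.0])
discounts = gamma ** np.arange(len(rewards))   # gamma^(t-1) for t = 1, 2, 3
print(discounts @ rewards)                     # 5 + 0.9*10 - 0.81 = 13.19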
Return functions

If r_t < r_max holds, then

    V(s) \le \sum_{t=1}^{N} \gamma^{t-1} r_{\max} \le \frac{r_{\max}}{1 - \gamma}

Note that this bound is very sensitive to the value of \gamma.

The expected average reward:

    M(s) = \lim_{N \to \infty} \frac{1}{N} E\left[ \sum_{t=1}^{N} r_t \,\middle|\, s_0 = s \right] = \lim_{N \to \infty} \frac{1}{N} V_N(s)

Note that the above limit does not always exist!
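To see how sensitive the bound is, a quick numerical illustration (numbers chosen only for the example, not from the slides):

    \frac{r_{\max}}{1 - \gamma} = 10\, r_{\max} \ \text{for } \gamma = 0.9, \qquad \frac{r_{\max}}{1 - \gamma} = 100\, r_{\max} \ \text{for } \gamma = 0.99

so a modest change in \gamma changes the worst-case bound by an order of magnitude.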
Relationship between V(s) and V_N(s)

Consider a finite horizon problem where the horizon N is random, i.e. the quantity of interest is

    E_N[ V_N(s) ] = E_N\!\left[ E\left[ \sum_{t=1}^{N} r_t \,\middle|\, s_0 = s \right] \right]

Let's also assume that the final value for all states is zero.

Let N be geometrically distributed with parameter \gamma, such that the probability of stopping at the nth step is

    \Pr\{ N = n \} = (1 - \gamma)\, \gamma^{\,n-1}

Lemma: we'll show that

    E_N[ V_N(s) ] = V(s)

under the assumption that |r_t| < r_max.
Relationship between V(s) and V_N(s) (cont.)

Proof:

    E_N[ V_N(s) ] = \sum_{n=1}^{\infty} \Pr\{ N = n \}\, E\left[ \sum_{t=1}^{n} r_t \,\middle|\, s_0 = s \right]
                  = \sum_{n=1}^{\infty} (1 - \gamma)\, \gamma^{\,n-1}\, E\left[ \sum_{t=1}^{n} r_t \right]
                  = E\left[ \sum_{t=1}^{\infty} r_t \sum_{n=t}^{\infty} (1 - \gamma)\, \gamma^{\,n-1} \right]
                  = E\left[ \sum_{t=1}^{\infty} \gamma^{\,t-1} r_t \right]
                  = V(s)

where exchanging the order of summation is justified by |r_t| < r_max, and the inner sum evaluates to \sum_{n=t}^{\infty} (1 - \gamma)\, \gamma^{\,n-1} = \gamma^{\,t-1}.  ∆
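A minimal numerical sanity check of the lemma (my own sketch, not part of the slides): for a hypothetical one-state MDP that pays reward r_t = 1 on every step, the discounted value is V(s) = 1/(1 - \gamma), while the expected N-step return under the geometric horizon is simply E[N].

import numpy as np

gamma = 0.9
rng = np.random.default_rng(0)

# numpy's geometric(p) returns the index of the first success, i.e.
# Pr(N = n) = (1 - p)**(n - 1) * p; taking p = 1 - gamma gives exactly
# the stopping distribution Pr(N = n) = (1 - gamma) * gamma**(n - 1).
N = rng.geometric(p=1 - gamma, size=200_000)

estimate = N.mean()            # Monte Carlo estimate of E_N[V_N(s)] with r_t = 1
exact = 1.0 / (1.0 - gamma)    # discounted value V(s) for constant unit reward

print(f"E_N[V_N(s)] ~ {estimate:.3f}, V(s) = {exact:.3f}")   # both should be close to 10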
Outline

Finite Horizon MDPs (cont.)
Dynamic Programming
Example of a finite horizon MDP

Consider the following state diagram:

[State diagram: two states S1 and S2; actions a11, a12, a21, a22, with edges annotated by {reward, probability} pairs such as {5, 0.5}, {10, 1}, and {-1, 1}.]
Why do we need DP techniques?

Explicitly solving the Bellman optimality equation is hard.
Computing the optimal policy amounts to solving the RL problem.
It relies on the following three assumptions:
  We have perfect knowledge of the dynamics of the environment
  We have enough computational resources
  The Markov property holds
In reality, all three are problematic.
  e.g. the game of Backgammon: the first and last conditions are fine, but the computational resources are insufficient (approx. 10^20 states).
In many cases we have to settle for approximate solutions (much more on that later ...).
Big Picture: Elementary Solution Methods

During the next few weeks we'll talk about techniques for solving the RL problem:
  Dynamic programming – well developed mathematically, but requires an accurate model of the environment
  Monte Carlo methods – do not require a model, but are not suitable for step-by-step incremental computation
  Temporal difference learning – methods that do not need a model and are fully incremental
    More complex to analyze
    Launched the revisiting of RL as a pragmatic framework (1988)
The methods also differ in efficiency and speed of convergence to the optimal solution.
Dynamic Programming

Dynamic programming is the collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as an MDP.
DP constitutes a theoretically optimal methodology.
In reality it is often of limited use, since DP is computationally expensive.
It is nevertheless important to understand, as a reference for other methods that aim to:
  Do "just as well" as DP
  Require less computation
  Possibly require less memory
Most schemes will strive to achieve the same effect as DP, without the computational complexity involved.
Dynamic Programming (cont.)

We will assume finite MDPs (finite state and action sets).

The agent has knowledge of the transition probabilities and expected immediate rewards, i.e.

    P^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s,\, a_t = a \}

    R^a_{ss'} = E[\, r_{t+1} \mid s_t = s,\, a_t = a,\, s_{t+1} = s' \,]

The key idea of DP (as in RL) is the use of value functions to derive optimal/good policies.

We'll focus on the manner by which values are computed.

Reminder: an optimal policy is easy to derive once the optimal value function (or action-value function) is attained.
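One common way to hold such a model in code is as two tabular arrays indexed by (s, a, s'); the sketch below is only an illustration (the array layout and the example numbers are my own assumptions, not something the slides prescribe):

import numpy as np

# Tabular model of a finite MDP with |S| states and |A| actions.
# P[s, a, s2] = Pr{ s_{t+1} = s2 | s_t = s, a_t = a }
# R[s, a, s2] = E[ r_{t+1}       | s_t = s, a_t = a, s_{t+1} = s2 ]
n_states, n_actions = 3, 2
P = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions, n_states))

# Illustrative entries only:
P[0, 0, :] = [0.5, 0.5, 0.0]     # from state 0 under action 0
R[0, 0, :] = [5.0, 0.0, 0.0]

# Every (s, a) row of P must be a probability distribution over next states.
assert np.isclose(P[0, 0].sum(), 1.0)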
Dynamic Programming (cont.)

Employing the Bellman equation to the optimal value/action-value function yields

    V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]

    Q^*(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]

DP algorithms are obtained by turning the Bellman equations into update rules.
These rules help improve the approximations of the desired value functions.
We will discuss two main approaches: policy iteration and value iteration.
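As an illustration of "turning the Bellman equation into an update rule", here is a sketch of one full sweep of the Bellman optimality backup over the tabular model above (essentially a single value-iteration step; the function and variable names are my own):

import numpy as np

def bellman_optimality_backup(V, P, R, gamma):
    """One sweep of V(s) <- max_a sum_{s'} P[s,a,s'] * (R[s,a,s'] + gamma * V[s'])."""
    n_states, n_actions, _ = P.shape
    V_new = np.empty_like(V)
    for s in range(n_states):
        # Q(s, a) for every action, using the current value estimates.
        q = np.array([(P[s, a] * (R[s, a] + gamma * V)).sum()
                      for a in range(n_actions)])
        V_new[s] = q.max()           # greedy over actions
    return V_new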
Method #1: Policy Iteration

A technique for obtaining the optimal policy.

It comprises two complementary steps:
  Policy evaluation – updating the value function in view of the current policy (which can be sub-optimal)
  Policy improvement – updating the policy given the current value function (which can be sub-optimal)

The process converges by "bouncing" between these two steps:

    \pi_1 \rightarrow V^{\pi_1}(s) \rightarrow \pi_2 \rightarrow V^{\pi_2}(s) \rightarrow \cdots \rightarrow \pi^*,\, V^*(s)
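In code, the alternation can be expressed as a small loop; this is only a skeleton, assuming that the evaluation and improvement routines developed on the following slides are available under the (hypothetical) names used here:

import numpy as np

def policy_iteration(P, R, gamma, evaluate, improve):
    """Alternate policy evaluation and greedy improvement until the policy is stable.

    `evaluate(policy, P, R, gamma)` and `improve(V, P, R, gamma)` are the two
    complementary steps; see the sketches following the next slides.
    """
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)        # arbitrary initial deterministic policy
    while True:
        V = evaluate(policy, P, R, gamma)         # policy evaluation
        new_policy = improve(V, P, R, gamma)      # policy improvement
        if np.array_equal(new_policy, policy):    # policy stable -> done
            return policy, V
        policy = new_policy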
Policy Evaluation

We'll consider how to compute the state-value function V^\pi for an arbitrary policy \pi.

Recall that

    V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]

(this assumes that policy \pi is always followed).

The existence of a unique solution is guaranteed as long as either \gamma < 1 or eventual termination is guaranteed from all states under the policy.

The Bellman equation translates into |S| simultaneous equations with |S| unknowns (the values).

Assuming we have an initial guess, we can use the Bellman equation as an update rule.
Policy Evaluation (cont.)

We can write

    V_{k+1}(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]

The sequence {V_k} converges to the correct value function as k \to \infty.

In each iteration, all state-values are updated – a.k.a. a full backup.

A similar method can be applied to state-action (Q(s, a)) functions.

An underlying assumption: all states are visited each time.
  The scheme is computationally heavy.
  It can be distributed – given sufficient resources (Q: How?)

"In-place" schemes – use a single array and update values based on new estimates.
  These also converge to the correct solution.
  The order in which states are backed up determines the rate of convergence.
Iterative Policy Evaluation algorithm

A key consideration is the termination condition.

A typical stopping condition for iterative policy evaluation is

    \max_{s \in S} \left| V_{k+1}(s) - V_k(s) \right| < \theta

for some small threshold \theta > 0.
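A sketch of iterative policy evaluation with full backups and this stopping condition, for a deterministic policy over the tabular model above (the routine name and array layout are my own assumptions):

import numpy as np

def iterative_policy_evaluation(policy, P, R, gamma, theta=1e-8):
    """Full-backup policy evaluation; `policy` maps each state to an action."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        V_new = np.empty_like(V)
        for s in range(n_states):
            a = policy[s]
            # V_{k+1}(s) = sum_{s'} P[s,a,s'] * (R[s,a,s'] + gamma * V_k(s'))
            V_new[s] = (P[s, a] * (R[s, a] + gamma * V)).sum()
        if np.max(np.abs(V_new - V)) < theta:    # max_s |V_{k+1}(s) - V_k(s)| < theta
            return V_new
        V = V_new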
Policy Improvement

Policy evaluation deals with finding the value function under a given policy.
However, we don't know if the policy (and hence the value function) is optimal.
Policy improvement has to do with the above, and with updating the policy if non-optimal values are reached.

Suppose that for some arbitrary policy, \pi, we've computed the value function (using policy evaluation).

Let policy \pi' be defined such that in each state s it selects the action a that maximizes the first-step value, i.e.

    \pi'(s) \;\overset{\text{def}}{=}\; \arg\max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]

It can be shown that \pi' is at least as good as \pi, and if they are equal they are both the optimal policy.
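A sketch of this greedy improvement step over the tabular model used earlier (the names are my own; it plugs into the policy-iteration skeleton above):

import numpy as np

def policy_improvement(V, P, R, gamma):
    """pi'(s) = argmax_a sum_{s'} P[s,a,s'] * (R[s,a,s'] + gamma * V[s'])."""
    n_states, n_actions, _ = P.shape
    new_policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = np.array([(P[s, a] * (R[s, a] + gamma * V)).sum()
                      for a in range(n_actions)])
        new_policy[s] = int(np.argmax(q))    # greedy w.r.t. the current value estimates
    return new_policy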
Policy Improvement (cont.)

Consider a greedy policy, \pi', that selects the action that would yield the highest expected single-step return:

    \pi'(s) = \arg\max_a Q^\pi(s, a) = \arg\max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]

Then, by definition,

    Q^\pi(s, \pi'(s)) \ge V^\pi(s)

this is the condition for the policy improvement theorem.

The above states that following the new policy for one step is enough to prove that it is a better policy, i.e. that

    V^{\pi'}(s) \ge V^\pi(s)
Proof of the Policy Improvement Theorem