ECE-517 Reinforcement Learning in Artificial Intelligence
Lecture 7: Finite Horizon MDPs, Dynamic Programming
Dr. Itamar Arel
College of Engineering, Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2011
September 15, 2011
Outline

Finite Horizon MDPs
Dynamic Programming
Finite Horizon MDPs – Value Functions

The duration, or expected duration, of the process is finite.

Let's consider the following return functions:

The expected sum of rewards, where V_N(s) denotes the expected sum of rewards for N steps:

    V(s) = \lim_{N \to \infty} E[ V_N(s) ] = \lim_{N \to \infty} E\left[ \sum_{t=1}^{N} r_t \,\middle|\, s_0 = s \right]

The expected discounted sum of rewards:

    V(s) = \lim_{N \to \infty} E\left[ \sum_{t=1}^{N} \gamma^{t-1} r_t \,\middle|\, s_0 = s \right], \qquad 0 \le \gamma < 1

A sufficient condition for the above to converge is r_t < r_max ...
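A tiny illustration of the discounted-sum return for one sampled reward sequence (the numbers and the snippet are mine, not from the slides):

import numpy as np

# Discounted sum of one sampled reward sequence r_1, r_2, r_3 (illustrative numbers).
gamma = 0.9
rewards = np.array([5.0, 10.0, -1.0])
discounts = gamma ** np.arange(len(rewards))   # gamma^(t-1) for t = 1, 2, 3
print(discounts @ rewards)                     # 5 + 0.9*10 - 0.81 = 13.19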
Return functions

If r_t < r_max holds, then

    V(s) \le \sum_{t=1}^{N} \gamma^{t-1} r_{\max} \le \frac{r_{\max}}{1 - \gamma}

Note that this bound is very sensitive to the value of \gamma.

The expected average reward:

    M(s) = \lim_{N \to \infty} \frac{1}{N} E\left[ \sum_{t=1}^{N} r_t \,\middle|\, s_0 = s \right] = \lim_{N \to \infty} \frac{1}{N} V_N(s)

Note that the above limit does not always exist!
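To see how sensitive the bound is, a quick numerical illustration (numbers chosen only for the example, not from the slides):

    \frac{r_{\max}}{1 - \gamma} = 10\, r_{\max} \ \text{for } \gamma = 0.9, \qquad \frac{r_{\max}}{1 - \gamma} = 100\, r_{\max} \ \text{for } \gamma = 0.99

so a modest change in \gamma changes the worst-case bound by an order of magnitude.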
Relationship between V(s) and V_N(s)

Consider a finite horizon problem where the horizon N is random, i.e. the quantity of interest is

    E_N[ V_N(s) ] = E_N\!\left[ E\left[ \sum_{t=1}^{N} r_t \,\middle|\, s_0 = s \right] \right]

Let's also assume that the final value for all states is zero.

Let N be geometrically distributed with parameter \gamma, such that the probability of stopping at the nth step is

    \Pr\{ N = n \} = (1 - \gamma)\, \gamma^{\,n-1}

Lemma: we'll show that

    E_N[ V_N(s) ] = V(s)

under the assumption that |r_t| < r_max.
Relationship between V(s) and V_N(s) (cont.)

Proof:

    E_N[ V_N(s) ] = \sum_{n=1}^{\infty} \Pr\{ N = n \}\, E\left[ \sum_{t=1}^{n} r_t \,\middle|\, s_0 = s \right]
                  = \sum_{n=1}^{\infty} (1 - \gamma)\, \gamma^{\,n-1}\, E\left[ \sum_{t=1}^{n} r_t \right]
                  = E\left[ \sum_{t=1}^{\infty} r_t \sum_{n=t}^{\infty} (1 - \gamma)\, \gamma^{\,n-1} \right]
                  = E\left[ \sum_{t=1}^{\infty} \gamma^{\,t-1} r_t \right]
                  = V(s)

where exchanging the order of summation is justified by |r_t| < r_max, and the inner sum evaluates to \sum_{n=t}^{\infty} (1 - \gamma)\, \gamma^{\,n-1} = \gamma^{\,t-1}.  ∆
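A minimal numerical sanity check of the lemma (my own sketch, not part of the slides): for a hypothetical one-state MDP that pays reward r_t = 1 on every step, the discounted value is V(s) = 1/(1 - \gamma), while the expected N-step return under the geometric horizon is simply E[N].

import numpy as np

gamma = 0.9
rng = np.random.default_rng(0)

# numpy's geometric(p) returns the index of the first success, i.e.
# Pr(N = n) = (1 - p)**(n - 1) * p; taking p = 1 - gamma gives exactly
# the stopping distribution Pr(N = n) = (1 - gamma) * gamma**(n - 1).
N = rng.geometric(p=1 - gamma, size=200_000)

estimate = N.mean()            # Monte Carlo estimate of E_N[V_N(s)] with r_t = 1
exact = 1.0 / (1.0 - gamma)    # discounted value V(s) for constant unit reward

print(f"E_N[V_N(s)] ~ {estimate:.3f}, V(s) = {exact:.3f}")   # both should be close to 10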
Outline

Finite Horizon MDPs (cont.)
Dynamic Programming
Example of a finite horizon MDP

Consider the following state diagram:

[State diagram: two states S1 and S2; actions a11, a12, a21, a22, with edges annotated by {reward, probability} pairs such as {5, 0.5}, {10, 1}, and {-1, 1}.]
Why do we need DP techniques?

Explicitly solving the Bellman optimality equation is hard.
Computing the optimal policy amounts to solving the RL problem.
It relies on the following three assumptions:
  We have perfect knowledge of the dynamics of the environment
  We have enough computational resources
  The Markov property holds
In reality, all three are problematic.
  e.g. the game of Backgammon: the first and last conditions are fine, but the computational resources are insufficient (approx. 10^20 states).
In many cases we have to settle for approximate solutions (much more on that later ...).
Big Picture: Elementary Solution Methods

During the next few weeks we'll talk about techniques for solving the RL problem:
  Dynamic programming – well developed mathematically, but requires an accurate model of the environment
  Monte Carlo methods – do not require a model, but are not suitable for step-by-step incremental computation
  Temporal difference learning – methods that do not need a model and are fully incremental
    More complex to analyze
    Launched the revisiting of RL as a pragmatic framework (1988)
The methods also differ in efficiency and speed of convergence to the optimal solution.
Dynamic Programming

Dynamic programming is the collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as an MDP.
DP constitutes a theoretically optimal methodology.
In reality it is often of limited use, since DP is computationally expensive.
It is nevertheless important to understand, as a reference for other methods that aim to:
  Do "just as well" as DP
  Require less computation
  Possibly require less memory
Most schemes will strive to achieve the same effect as DP, without the computational complexity involved.
Dynamic Programming (cont.)

We will assume finite MDPs (finite state and action sets).

The agent has knowledge of the transition probabilities and expected immediate rewards, i.e.

    P^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s,\, a_t = a \}

    R^a_{ss'} = E[\, r_{t+1} \mid s_t = s,\, a_t = a,\, s_{t+1} = s' \,]

The key idea of DP (as in RL) is the use of value functions to derive optimal/good policies.

We'll focus on the manner by which values are computed.

Reminder: an optimal policy is easy to derive once the optimal value function (or action-value function) is attained.
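One common way to hold such a model in code is as two tabular arrays indexed by (s, a, s'); the sketch below is only an illustration (the array layout and the example numbers are my own assumptions, not something the slides prescribe):

import numpy as np

# Tabular model of a finite MDP with |S| states and |A| actions.
# P[s, a, s2] = Pr{ s_{t+1} = s2 | s_t = s, a_t = a }
# R[s, a, s2] = E[ r_{t+1}       | s_t = s, a_t = a, s_{t+1} = s2 ]
n_states, n_actions = 3, 2
P = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions, n_states))

# Illustrative entries only:
P[0, 0, :] = [0.5, 0.5, 0.0]     # from state 0 under action 0
R[0, 0, :] = [5.0, 0.0, 0.0]

# Every (s, a) row of P must be a probability distribution over next states.
assert np.isclose(P[0, 0].sum(), 1.0)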
Dynamic Programming (cont.)

Employing the Bellman equation to the optimal value/action-value function yields

    V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]

    Q^*(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]

DP algorithms are obtained by turning the Bellman equations into update rules.
These rules help improve the approximations of the desired value functions.
We will discuss two main approaches: policy iteration and value iteration.
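As an illustration of "turning the Bellman equation into an update rule", here is a sketch of one full sweep of the Bellman optimality backup over the tabular model above (essentially a single value-iteration step; the function and variable names are my own):

import numpy as np

def bellman_optimality_backup(V, P, R, gamma):
    """One sweep of V(s) <- max_a sum_{s'} P[s,a,s'] * (R[s,a,s'] + gamma * V[s'])."""
    n_states, n_actions, _ = P.shape
    V_new = np.empty_like(V)
    for s in range(n_states):
        # Q(s, a) for every action, using the current value estimates.
        q = np.array([(P[s, a] * (R[s, a] + gamma * V)).sum()
                      for a in range(n_actions)])
        V_new[s] = q.max()           # greedy over actions
    return V_new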
Method #1: Policy Iteration

A technique for obtaining the optimal policy.

It comprises two complementary steps:
  Policy evaluation – updating the value function in view of the current policy (which can be sub-optimal)
  Policy improvement – updating the policy given the current value function (which can be sub-optimal)

The process converges by "bouncing" between these two steps:

    \pi_1 \rightarrow V^{\pi_1}(s) \rightarrow \pi_2 \rightarrow V^{\pi_2}(s) \rightarrow \cdots \rightarrow \pi^*,\, V^*(s)
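In code, the alternation can be expressed as a small loop; this is only a skeleton, assuming that the evaluation and improvement routines developed on the following slides are available under the (hypothetical) names used here:

import numpy as np

def policy_iteration(P, R, gamma, evaluate, improve):
    """Alternate policy evaluation and greedy improvement until the policy is stable.

    `evaluate(policy, P, R, gamma)` and `improve(V, P, R, gamma)` are the two
    complementary steps; see the sketches following the next slides.
    """
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)        # arbitrary initial deterministic policy
    while True:
        V = evaluate(policy, P, R, gamma)         # policy evaluation
        new_policy = improve(V, P, R, gamma)      # policy improvement
        if np.array_equal(new_policy, policy):    # policy stable -> done
            return policy, V
        policy = new_policy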
Policy Evaluation

We'll consider how to compute the state-value function V^\pi for an arbitrary policy \pi.

Recall that

    V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]

(this assumes that policy \pi is always followed).

The existence of a unique solution is guaranteed as long as either \gamma < 1 or eventual termination is guaranteed from all states under the policy.

The Bellman equation translates into |S| simultaneous equations with |S| unknowns (the values).

Assuming we have an initial guess, we can use the Bellman equation as an update rule.
Policy Evaluation (cont.)

We can write

    V_{k+1}(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]

The sequence {V_k} converges to the correct value function as k \to \infty.

In each iteration, all state-values are updated – a.k.a. a full backup.

A similar method can be applied to state-action (Q(s, a)) functions.

An underlying assumption: all states are visited each time.
  The scheme is computationally heavy.
  It can be distributed – given sufficient resources (Q: How?)

"In-place" schemes – use a single array and update values based on new estimates.
  These also converge to the correct solution.
  The order in which states are backed up determines the rate of convergence.
Iterative Policy Evaluation algorithm

A key consideration is the termination condition.

A typical stopping condition for iterative policy evaluation is

    \max_{s \in S} \left| V_{k+1}(s) - V_k(s) \right| < \theta

for some small threshold \theta > 0.
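A sketch of iterative policy evaluation with full backups and this stopping condition, for a deterministic policy over the tabular model above (the routine name and array layout are my own assumptions):

import numpy as np

def iterative_policy_evaluation(policy, P, R, gamma, theta=1e-8):
    """Full-backup policy evaluation; `policy` maps each state to an action."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        V_new = np.empty_like(V)
        for s in range(n_states):
            a = policy[s]
            # V_{k+1}(s) = sum_{s'} P[s,a,s'] * (R[s,a,s'] + gamma * V_k(s'))
            V_new[s] = (P[s, a] * (R[s, a] + gamma * V)).sum()
        if np.max(np.abs(V_new - V)) < theta:    # max_s |V_{k+1}(s) - V_k(s)| < theta
            return V_new
        V = V_new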
Policy Improvement

Policy evaluation deals with finding the value function under a given policy.
However, we don't know if the policy (and hence the value function) is optimal.
Policy improvement has to do with the above, and with updating the policy if non-optimal values are reached.

Suppose that for some arbitrary policy, \pi, we've computed the value function (using policy evaluation).

Let policy \pi' be defined such that in each state s it selects the action a that maximizes the first-step value, i.e.

    \pi'(s) \;\overset{\text{def}}{=}\; \arg\max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]

It can be shown that \pi' is at least as good as \pi, and if they are equal they are both the optimal policy.
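A sketch of this greedy improvement step over the tabular model used earlier (the names are my own; it plugs into the policy-iteration skeleton above):

import numpy as np

def policy_improvement(V, P, R, gamma):
    """pi'(s) = argmax_a sum_{s'} P[s,a,s'] * (R[s,a,s'] + gamma * V[s'])."""
    n_states, n_actions, _ = P.shape
    new_policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = np.array([(P[s, a] * (R[s, a] + gamma * V)).sum()
                      for a in range(n_actions)])
        new_policy[s] = int(np.argmax(q))    # greedy w.r.t. the current value estimates
    return new_policy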
Policy Improvement (cont.)

Consider a greedy policy, \pi', that selects the action that would yield the highest expected single-step return:

    \pi'(s) = \arg\max_a Q^\pi(s, a) = \arg\max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]

Then, by definition,

    Q^\pi(s, \pi'(s)) \ge V^\pi(s)

this is the condition for the policy improvement theorem.

The above states that following the new policy for one step is enough to prove that it is a better policy, i.e. that

    V^{\pi'}(s) \ge V^\pi(s)
Proof of the Policy Improvement Theorem