
Worksheet C1M4, CMPUT 397

October 2, 2020

1. (Exercise 4.1 S&B) Consider the 4x4 gridworld below, where actions that would take the agent off the grid leave the state unchanged. The task is episodic with γ = 1 and the terminal states are the shaded blocks. Using the precomputed values for the equiprobable policy below, what is q_π(11, down)? What is q_π(7, down)?
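For reference, action values can be read off state values with a one-step lookahead, which is the only relation this question needs:

    q_π(s, a) = Σ_{s',r} p(s',r | s, a) [r + γ v_π(s')].

In this gridworld the dynamics are deterministic, the reward is -1 on every transition, and γ = 1, so the lookahead reduces to q_π(s, a) = -1 + v_π(s'), where s' is the state reached by taking action a in state s.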

[Figure 4.1 from Sutton & Barto appears here in the original worksheet. Left column: v_k for the equiprobable random policy at k = 0, 1, 2, 3, 10, and ∞. Right column: the greedy policy with respect to each v_k.]

The values needed for question 1 are the converged ones, v_k at k = ∞ (i.e., v_π for the random policy), with the shaded terminal corners at 0.0:

   0.0  -14.  -20.  -22.
  -14.  -18.  -20.  -20.
  -20.  -20.  -18.  -14.
  -22.  -20.  -14.   0.0

Figure 4.1: Convergence of iterative policy evaluation on a small gridworld. The left column is the sequence of approximations of the state-value function for the random policy (all actions equally likely). The right column is the sequence of greedy policies corresponding to the value function estimates (arrows are shown for all actions achieving the maximum, and the numbers shown are rounded to two significant digits). The last policy is guaranteed only to be an improvement over the random policy, but in this case it, and all policies after the third iteration, are optimal.
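Below is a minimal Python sketch of the one-step lookup needed for question 1. The flat 0-15 indexing, the assumption that the question's state numbers follow the usual S&B layout (non-terminal states 1-14 read left to right, top to bottom, with the two shaded corners terminal), and the helper name q_pi_down are choices made here for illustration, not part of the worksheet.

# v_pi for the equiprobable random policy (the k = infinity grid above),
# flattened row by row; indices 0 and 15 are the terminal corners.
v_pi = [  0.0, -14.0, -20.0, -22.0,
        -14.0, -18.0, -20.0, -20.0,
        -20.0, -20.0, -18.0, -14.0,
        -22.0, -20.0, -14.0,   0.0]

def q_pi_down(s):
    """One-step lookahead for the 'down' action: the reward is -1 on every
    transition, gamma = 1, and moving down adds 4 to the flat index
    (unless that would leave the grid, in which case the state is unchanged)."""
    s_next = s + 4 if s + 4 <= 15 else s
    return -1.0 + v_pi[s_next]

print(q_pi_down(11))   # -1 + v_pi[15] = -1 + 0 = -1
print(q_pi_down(7))    # -1 + v_pi[11] = -1 + (-14) = -15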

2. In iterative policy evaluation, we seek to find the value function for a policy π by applying the Bellman equation many times to generate a sequence of value functions v_k that will eventually converge to the true value function v_π. How can we modify the update below to generate a sequence of action value functions q_k?

    v_{k+1}(s) = Σ_a π(a|s) Σ_{s',r} p(s',r | s, a) [r + γ v_k(s')]
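For comparison, one correct form of the action-value analogue (a sketch, not necessarily the phrasing the course intends) conditions on the state-action pair on the left-hand side and averages over the policy only at the next state:

    q_{k+1}(s, a) = Σ_{s',r} p(s',r | s, a) [r + γ Σ_{a'} π(a'|s') q_k(s', a')]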



3. (Challenge Question) A gambler has the opportunity to make bets on the outcomes of a sequence of coin flips. If the coin comes up heads, she wins as many dollars as she has staked on that flip; if it is tails, she loses her stake. The game ends when the gambler wins by reaching her goal of $100, or loses by running out of money. On each flip, the gambler must decide what portion of her capital to stake, in integer numbers of dollars. This problem can be formulated as an undiscounted, episodic, finite MDP. The state is the gambler's capital, s ∈ {1, 2, ..., 99}, and the actions are stakes, a ∈ {0, 1, ..., min(s, 100−s)}. The reward is +1 when reaching the goal of $100 and zero on all other transitions. The probability of seeing heads is p_h = 0.4.

(a) What does the value of a state mean in this problem? For example, in a gridworld with a reward of -1 per step, the value represents the negative of the expected number of steps to the goal. What does the value of a state mean in the gambler's problem? Think about the minimum and maximum possible values, and think about the values of state 50 (which is 0.4) and state 99 (which is near 0.95).
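One way to make part (a) concrete, using only the problem statement above: the only nonzero reward is the +1 for reaching $100 and the task is undiscounted, so the return of every episode is either 1 or 0, and

    v_π(s) = E_π[G_t | S_t = s] = Pr{the gambler reaches $100 | capital s, following π}.

For example, from s = 50 the "stake everything" policy reaches the goal exactly when a single flip comes up heads, so its value is p_h = 0.4.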

(b) Modify the pseudocode for value iteration to more efficiently solve this specific problem, by exploiting your knowledge of the dynamics. Hint: Not all states transition to every other state. For example, can you transition from state 1 to state 99?
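A minimal Python sketch of value iteration specialized to this problem. It exploits the fact that a state s can only transition to s + a or s − a; the constants and the decision to skip the zero stake (which never increases a state's value) are choices made here, not part of the worksheet.

import numpy as np

P_H = 0.4      # probability of heads
GOAL = 100     # capital at which the gambler wins
THETA = 1e-9   # convergence threshold

# V[0] and V[GOAL] stay 0; the +1 reward for reaching the goal is added explicitly below.
V = np.zeros(GOAL + 1)

def backup(s):
    """From capital s, a stake a leads only to s + a (heads) or s - a (tails),
    so each state touches O(min(s, GOAL - s)) successors instead of every state."""
    best = 0.0
    for a in range(1, min(s, GOAL - s) + 1):
        heads = (1.0 if s + a == GOAL else 0.0) + V[s + a]   # reward only at the goal
        tails = V[s - a]
        best = max(best, P_H * heads + (1 - P_H) * tails)
    return best

delta = float("inf")
while delta > THETA:
    delta = 0.0
    for s in range(1, GOAL):
        v_old = V[s]
        V[s] = backup(s)
        delta = max(delta, abs(v_old - V[s]))

print(round(V[50], 3), round(V[99], 3))   # about 0.4 and roughly 0.95, matching part (a)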

Excerpt from Sutton & Barto, Section 4.4 (Value Iteration):

One important special case is when policy evaluation is stopped after just one sweep (one update of each state). This algorithm is called value iteration. It can be written as a particularly simple update operation that combines the policy improvement and truncated policy evaluation steps:

    v_{k+1}(s) ≐ max_a E[R_{t+1} + γ v_k(S_{t+1}) | S_t = s, A_t = a]
               = max_a Σ_{s',r} p(s',r | s, a) [r + γ v_k(s')],          (4.10)

for all s ∈ S. For arbitrary v_0, the sequence {v_k} can be shown to converge to v_* under the same conditions that guarantee the existence of v_*.

Another way of understanding value iteration is by reference to the Bellman optimality equation (4.1). Note that value iteration is obtained simply by turning the Bellman optimality equation into an update rule. Also note how the value iteration update is identical to the policy evaluation update (4.5) except that it requires the maximum to be taken over all actions. Another way of seeing this close relationship is to compare the backup diagrams for these algorithms on page 59 (policy evaluation) and on the left of Figure 3.4 (value iteration). These two are the natural backup operations for computing v_π and v_*.
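For reference, since equation (4.1) is not reproduced in this excerpt, the Bellman optimality equation for state values is

    v_*(s) = max_a Σ_{s',r} p(s',r | s, a) [r + γ v_*(s')],

which is exactly the update (4.10) with the iterates v_k replaced by the fixed point v_*.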

Finally, let us consider how value iteration terminates. Like policy evaluation, value iteration formally requires an infinite number of iterations to converge exactly to v_*. In practice, we stop once the value function changes by only a small amount in a sweep. The box below shows a complete algorithm with this kind of termination condition.

Value Iteration, for estimating π ≈ π_*

Algorithm parameter: a small threshold θ > 0 determining accuracy of estimation
Initialize V(s), for all s ∈ S+, arbitrarily except that V(terminal) = 0

Loop:
|   Δ ← 0
|   Loop for each s ∈ S:
|       v ← V(s)
|       V(s) ← max_a Σ_{s',r} p(s',r | s, a) [r + γ V(s')]
|       Δ ← max(Δ, |v − V(s)|)
until Δ < θ

Output a deterministic policy, π ≈ π_*, such that
    π(s) = argmax_a Σ_{s',r} p(s',r | s, a) [r + γ V(s')]
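A direct Python translation of the box above, as a sketch only: the names value_iteration, dynamics, states, and actions, and the representation of the dynamics as a dictionary mapping (s, a) to a list of (probability, next state, reward) triples, are assumptions made here rather than anything the pseudocode prescribes.

def value_iteration(dynamics, states, actions, gamma=1.0, theta=1e-8):
    """dynamics[(s, a)] is a list of (prob, next_state, reward) triples.
    Terminal states should map every action to [(1.0, same_state, 0.0)]
    so that their value stays at 0, matching V(terminal) = 0 in the box."""
    V = {s: 0.0 for s in states}

    def lookahead(s, a):
        # expected one-step return: sum over s', r of p(s', r | s, a) [r + gamma V(s')]
        return sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[(s, a)])

    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            V[s] = max(lookahead(s, a) for a in actions[s])
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break   # the value function changed by less than theta in a full sweep

    # output a deterministic greedy policy with respect to the final V
    policy = {s: max(actions[s], key=lambda a: lookahead(s, a)) for s in states}
    return V, policy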

Value iteration effectively combines, in each of its sweeps, one sweep of policy evaluation and one sweep of policy improvement. Faster convergence is often achieved by interposing multiple policy evaluation sweeps between each policy improvement sweep. In general, the entire class of truncated policy iteration algorithms can be thought of as sequences of sweeps, some of which use policy evaluation updates and some of which use value iteration updates. Because the max operation in (4.10) is the only difference between these updates, this just means that the max operation is added to some sweeps of policy evaluation.



4. (Challenge Question) (Exercise 4.4 S&B) The policy iteration algorithm on page 80 has a subtle bug in that it may never terminate if the policy continually switches between two or more policies that are equally good. This is ok for pedagogy, but not for actual use. Modify the pseudocode so that convergence is guaranteed. Note that there is more than one approach to solve this problem.

Excerpt from Sutton & Barto, Chapter 4 (Dynamic Programming), page 80:

… s ∈ S, illustrating policy improvement. Although in this case the new policy π′ happens to be optimal, in general only an improvement is guaranteed.

4.3 Policy Iteration

Once a policy, π, has been improved using v_π to yield a better policy, π′, we can then compute v_{π′} and improve it again to yield an even better π′′. We can thus obtain a sequence of monotonically improving policies and value functions:

    π_0 →(E) v_{π_0} →(I) π_1 →(E) v_{π_1} →(I) π_2 →(E) ··· →(I) π_* →(E) v_*,

where →(E) denotes a policy evaluation and →(I) denotes a policy improvement. Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations.

This way of finding an optimal policy is called policy iteration. A complete algorithm is given in the box below. Note that each policy evaluation, itself an iterative computation, is started with the value function for the previous policy. This typically results in a great increase in the speed of convergence of policy evaluation (presumably because the value function changes little from one policy to the next).

Policy Iteration (using iterative policy evaluation) for estimating π ≈ π_*

1. Initialization
   V(s) ∈ R and π(s) ∈ A(s) arbitrarily for all s ∈ S

2. Policy Evaluation
   Loop:
       Δ ← 0
       Loop for each s ∈ S:
           v ← V(s)
           V(s) ← Σ_{s',r} p(s',r | s, π(s)) [r + γ V(s')]
           Δ ← max(Δ, |v − V(s)|)
   until Δ < θ (a small positive number determining the accuracy of estimation)

3. Policy Improvement
   policy-stable ← true
   For each s ∈ S:
       old-action ← π(s)
       π(s) ← argmax_a Σ_{s',r} p(s',r | s, a) [r + γ V(s')]
       If old-action ≠ π(s), then policy-stable ← false
   If policy-stable, then stop and return V ≈ v_* and π ≈ π_*; else go to 2
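One way to fix the bug described in question 4, sketched in Python: in the improvement step, keep the old action whenever it already achieves the maximal action value, and break any remaining ties deterministically; two equally good policies then can never alternate. The helper q_value and the array representation of π are assumptions made for illustration, and terminating on value-function convergence instead would be another valid approach.

import numpy as np

def policy_improvement_step(V, pi, n_states, n_actions, q_value, tol=1e-12):
    """pi is an integer array of action indices; q_value(s, a, V) is assumed to
    return the one-step lookahead sum over s', r of p(s', r | s, a)[r + gamma V(s')].
    Returns True when the policy is stable (no state strictly improves)."""
    policy_stable = True
    for s in range(n_states):
        q = np.array([q_value(s, a, V) for a in range(n_actions)])
        best = q.max()
        # Change the action only if the current one is strictly worse than the best.
        # Equally good actions no longer count as a policy change, so the outer
        # policy-iteration loop cannot cycle between equally good policies.
        if q[pi[s]] < best - tol:
            pi[s] = int(q.argmax())   # deterministic tie-breaking: lowest index
            policy_stable = False
    return policy_stable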
