
Policy iteration is strongly polynomial
Based on Yinyu Ye's working paper: http://www.stanford.edu/~yyye/simplexmdp1.pdf

Csaba Szepesvári, University of Alberta
E-mail: [email protected]

Tea-Time Talk, July 29, 2010



Off-policy learning with options and recognizers
Doina Precup, Richard S. Sutton, Cosmin Paduraru, Anna J. Koop, Satinder Singh
McGill University, University of Alberta, University of Michigan

Sections: Ideas and Motivation · Background · Recognizers · Off-policy algorithm for options · Learning w/o the Behavior Policy

Options
• A way of behaving for a period of time:
  – a policy
  – a stopping condition

Models of options
• A predictive model of the outcome of following the option
  – What state will you be in?
  – Will you still control the ball?
  – What will be the value of some feature?
  – Will your teammate receive the pass?
  – What will be the expected total reward along the way?
  – How long can you keep control of the ball?

Options for soccer players could be: Dribble, Keepaway, Pass.

Options in a 2D world

[Figure: a 2D world with a wall and a distinguished region; the experienced trajectory and the red and blue options are shown.]

The red and blue options are mostly executed. Surely we should be able to learn about them from this experience!

Off-policy learning
• Learning about one policy while behaving according to another
• Needed for RL with exploration (as in Q-learning)
• Needed for learning abstract models of dynamical systems (representing world knowledge)
• Enables efficient exploration
• Enables learning about many ways of behaving at the same time (learning models of options)


Non-sequential example

• One state
• Continuous action $a \in [0, 1]$
• Outcomes $z_i = a_i$
• Given samples from a behavior policy $b : [0, 1] \to \mathbb{R}^+$
• Would like to estimate the mean outcome for a sub-region of the action space, here $a \in [0.7, 0.9]$

Problem formulation w/o recognizers

The target policy $\pi : [0, 1] \to \mathbb{R}^+$ is uniform within the region of interest (see the dashed line in the figure below). The estimator is:
$$\hat m_\pi = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i)}{b(a_i)}\, z_i.$$
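To make the estimator concrete, here is a minimal Python sketch (mine, not from the poster), assuming for concreteness that the behavior policy b is uniform on [0, 1]; the target is uniform on the region of interest, as above.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000
a = rng.uniform(0.0, 1.0, size=n)   # actions sampled from the behavior policy b (assumed uniform here)
z = a                               # outcomes z_i = a_i

lo, hi = 0.7, 0.9
b_pdf = np.ones_like(a)                                         # density of Uniform(0, 1)
pi_pdf = np.where((a >= lo) & (a <= hi), 1.0 / (hi - lo), 0.0)  # target: uniform on [0.7, 0.9]

m_hat = np.mean(pi_pdf / b_pdf * z)   # importance-sampling estimate of the mean outcome under pi
print(m_hat)                          # should be close to 0.8, the midpoint of the region
```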

Theorem 1. Let $A = \{a_1, \dots, a_k\} \subseteq \mathcal{A}$ be a subset of all the possible actions. Consider a fixed behavior policy $b$ and let $\Pi_A$ be the class of policies that only choose actions from $A$, i.e., if $\pi(a) > 0$ then $a \in A$. Then the policy induced by $b$ and the binary recognizer $c_A$ is the policy with minimum-variance one-step importance sampling corrections, among those in $\Pi_A$:
$$\pi \text{ as given by (1)} \;=\; \arg\min_{p \in \Pi_A} E_b\!\left[\left(\frac{p(a_i)}{b(a_i)}\right)^2\right] \qquad (2)$$

Proof: Using Lagrange multipliers

Theorem 2. Consider two binary recognizers $c_1$ and $c_2$ such that $\mu_1 > \mu_2$. Then the importance sampling corrections for $c_1$ have lower variance than the importance sampling corrections for $c_2$.

Off-policy learning

Let the importance sampling ratio at time step $t$ be:
$$\rho_t = \frac{\pi(s_t, a_t)}{b(s_t, a_t)}.$$
The truncated $n$-step return, $R_t^{(n)}$, satisfies:
$$R_t^{(n)} = \rho_t\left[r_{t+1} + (1 - \beta_{t+1})\, R_{t+1}^{(n-1)}\right].$$
The update to the parameter vector is proportional to:
$$\Delta\theta_t = \left[R_t^{\lambda} - y_t\right] \nabla_\theta y_t\; \rho_0 (1 - \beta_1) \cdots \rho_{t-1}(1 - \beta_t).$$

Theorem 3. For every time step $t \geq 0$ and any initial state $s$,
$$E_b[\Delta\theta_t \mid s] = E_\pi[\Delta\theta_t \mid s].$$

Proof: By induction on $n$ we show that $E_b\{R_t^{(n)} \mid s\} = E_\pi\{R_t^{(n)} \mid s\}$, which implies that $E_b\{R_t^{\lambda} \mid s\} = E_\pi\{R_t^{\lambda} \mid s\}$. The rest of the proof is algebraic manipulation (see paper).
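As an illustration only (not the poster's code), the corrected truncated returns defined above can be computed backward over a recorded trajectory; the base case R_t^(0) = y_t is my assumption (the usual convention), and rewards, ρ and β are assumed to be given as arrays.

```python
import numpy as np

def off_policy_truncated_returns(rewards, y, rho, beta, n):
    """Corrected truncated returns  R_t^(n) = rho_t [ r_{t+1} + (1 - beta_{t+1}) R_{t+1}^(n-1) ].
    rewards[t] = r_{t+1} and rho[t] = pi(s_t, a_t) / b(s_t, a_t) for t = 0..T-1;
    y[t] is the current model prediction at s_t and beta[t] its termination probability, t = 0..T.
    The base case R_t^(0) = y_t is an assumption (the poster leaves it implicit)."""
    T = len(rewards)
    R = np.asarray(y, dtype=float).copy()          # R^(0)
    for _ in range(n):
        R_next = np.empty(T + 1)
        R_next[T] = y[T]                           # no data beyond the end of the trajectory
        for t in range(T):
            R_next[t] = rho[t] * (rewards[t] + (1.0 - beta[t + 1]) * R[t + 1])
        R = R_next
    return R[:T]
```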

Implementation of off-policy learning for options

In order to avoid $\Delta\theta \to 0$, we use a restart function $g : S \to [0, 1]$ (as in the PSD algorithm). The forward algorithm becomes:
$$\Delta\theta_t = (R_t^{\lambda} - y_t)\, \nabla_\theta y_t \sum_{i=0}^{t} g_i\, \rho_i \cdots \rho_{t-1} (1 - \beta_{i+1}) \cdots (1 - \beta_t),$$
where $g_t$ is the extent of restarting in state $s_t$.

The incremental learning algorithm is the following (a code sketch follows below):
• Initialize $w_0 = g_0$, $e_0 = w_0 \nabla_\theta y_0$, where $w_t$ is the accumulated restart weight
• At every time step $t$:
$$\delta_t = \rho_t\left(r_{t+1} + (1 - \beta_{t+1})\, y_{t+1}\right) - y_t$$
$$\theta_{t+1} = \theta_t + \alpha\, \delta_t\, e_t$$
$$w_{t+1} = \rho_t\, w_t (1 - \beta_{t+1}) + g_{t+1}$$
$$e_{t+1} = \lambda\, \rho_t (1 - \beta_{t+1})\, e_t + w_{t+1} \nabla_\theta y_{t+1}$$
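The code sketch announced above: one pass of the incremental algorithm with linear function approximation, $y_t = \theta^\top \phi(s_t)$. This is my reading of the recursion (α is the step size, λ the trace parameter), not an official implementation.

```python
import numpy as np

def off_policy_option_model_update(phi, rewards, rho, beta, g, theta, alpha=0.1, lam=0.9):
    """Incremental off-policy update over one recorded trajectory.
    phi[t]     : feature vector of s_t                          (t = 0..T)
    rewards[t] : r_{t+1},  rho[t] : IS correction at time t     (t = 0..T-1)
    beta[t]    : termination prob. at s_t, g[t] : restart weight at s_t   (t = 0..T)"""
    theta = np.asarray(theta, dtype=float).copy()
    w = g[0]                       # accumulated restart weight
    e = w * phi[0]                 # eligibility trace (the gradient of y_0 is phi[0])
    for t in range(len(rewards)):
        y_t, y_next = theta @ phi[t], theta @ phi[t + 1]
        delta = rho[t] * (rewards[t] + (1.0 - beta[t + 1]) * y_next) - y_t
        theta = theta + alpha * delta * e
        w = rho[t] * w * (1.0 - beta[t + 1]) + g[t + 1]
        e = lam * rho[t] * (1.0 - beta[t + 1]) * e + w * phi[t + 1]
    return theta
```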


Off-policy learning is tricky
• The Bermuda triangle:
  – Temporal-difference learning
  – Function approximation (e.g., linear)
  – Off-policy updates
• Leads to divergence of iterative algorithms:
  – Q-learning diverges with linear FA
  – Dynamic programming diverges with linear FA

Baird's counterexample

[Figure: Baird's counterexample — a Markov process with a terminal state and the transition probabilities 99%, 1%, and 100% shown; the approximate values are $V_k(s) = \theta(7) + 2\theta(i)$ for states $i = 1, \dots, 5$ and $V_k(s) = 2\theta(7) + \theta(6)$ for the sixth state. The accompanying plot shows the parameter values $\theta_k(i)$ (log scale, broken at ±1) over iterations $k = 0, \dots, 5000$, with the curves for $\theta_k(7)$, $\theta_k(1)$–$\theta_k(5)$, and $\theta_k(6)$ diverging.]

Precup, Sutton & Dasgupta (PSD) algorithm
• Uses importance sampling to convert the off-policy case to the on-policy case
• Convergence assured by a theorem of Tsitsiklis & Van Roy (1997)
• Survives the Bermuda triangle!

BUT:
• Variance can be high, even infinite (slow learning)
• Difficult to use with continuous or large action spaces
• Requires an explicit representation of the behavior policy (probability distribution)

Option formalism

An option is defined as a triple $o = \langle I, \pi, \beta \rangle$:
• $I \subseteq S$ is the set of states in which the option can be initiated
• $\pi$ is the internal policy of the option
• $\beta : S \to [0, 1]$ is a stochastic termination condition

We want to compute the reward model of option $o$:
$$E_o\{R(s)\} = E\{r_1 + r_2 + \dots + r_T \mid s_0 = s, \pi, \beta\}$$
We assume that linear function approximation is used to represent the model:
$$E_o\{R(s)\} \approx \theta^\top \phi_s = y$$
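A tiny sketch (names are mine, purely illustrative) of how an option and its linear reward model could be represented in code.

```python
from dataclasses import dataclass
from typing import Callable, Set
import numpy as np

@dataclass
class Option:
    """An option o = <I, pi, beta>."""
    initiation_set: Set[int]             # I: states where the option can be initiated
    policy: Callable[[int], int]         # pi: internal policy, state -> action
    termination: Callable[[int], float]  # beta: state -> termination probability

def predicted_option_reward(theta: np.ndarray, phi_s: np.ndarray) -> float:
    """Linear model of the option's expected total reward from s:  y = theta^T phi_s."""
    return float(theta @ phi_s)
```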

References

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML.
Precup, D., Sutton, R. S., and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML.
Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, vol. 112, pp. 181–211.
Sutton, R. S., and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17.
Sutton, R. S., Rafols, E., and Koop, A. (2006). Temporal abstraction in temporal-difference networks. In Proceedings of NIPS-18.
Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, vol. 42.
Tsitsiklis, J. N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42.


Theorem 4. If the following assumptions hold:
• The function approximator used to represent the model is a state aggregator
• The recognizer behaves consistently with the function approximator, i.e., $c(s, a) = c(p, a)$ for all $s \in p$
• The recognition probability for each partition, $\mu(p)$, is estimated using maximum likelihood:
$$\mu(p) = \frac{N(p, c = 1)}{N(p)}$$
Then there exists a policy $\pi$ such that the off-policy learning algorithm converges to the same model as the on-policy algorithm using $\pi$.

Proof: In the limit, w.p. 1, $\mu$ converges to $\sum_s d_b(s \mid p) \sum_a c(p, a)\, b(s, a)$, where $d_b(s \mid p)$ is the probability of visiting state $s$ from partition $p$ under the stationary distribution of $b$. Let $\pi$ be defined to be the same for all states in a partition $p$:
$$\pi(p, a) = \rho(p, a) \sum_s d_b(s \mid p)\, b(s, a)$$
$\pi$ is well defined, in the sense that $\sum_a \pi(s, a) = 1$. Using Theorem 3, off-policy updates using importance sampling corrections $\rho$ will have the same expected value as on-policy updates using $\pi$.
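For concreteness, a small sketch (mine) of the maximum-likelihood estimate $\mu(p) = N(p, c = 1)/N(p)$ from logged (partition, recognized) pairs.

```python
from collections import defaultdict

def estimate_recognition_probs(samples):
    """samples: iterable of (partition, recognized) pairs, with recognized in {0, 1}.
    Returns mu_hat[p] = N(p, c = 1) / N(p) for every partition seen in the data."""
    counts = defaultdict(lambda: [0, 0])       # partition -> [N(p), N(p, c = 1)]
    for p, recognized in samples:
        counts[p][0] += 1
        counts[p][1] += int(recognized)
    return {p: n1 / n for p, (n, n1) in counts.items()}

# Example: estimate_recognition_probs([("p0", 1), ("p0", 0), ("p1", 1)]) -> {"p0": 0.5, "p1": 1.0}
```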

Acknowledgements

The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li and other members of the rlai.net group. We thank Csaba Szepesvári and the reviewers of the paper for constructive comments. This research was supported in part by iCORE, NSERC, Alberta Ingenuity, and CFI.

Problem formulation with recognizers

The target policy $\pi$ is induced by a recognizer function $c : [0, 1] \to \mathbb{R}^+$:
$$\pi(a) = \frac{c(a)\, b(a)}{\sum_x c(x)\, b(x)} = \frac{c(a)\, b(a)}{\mu} \qquad (1)$$
(see the blue line in the figure below). The estimator is:
$$\hat m_\pi = \frac{1}{n} \sum_{i=1}^{n} z_i\, \frac{\pi(a_i)}{b(a_i)} = \frac{1}{n} \sum_{i=1}^{n} z_i\, \frac{c(a_i)\, b(a_i)}{\mu} \cdot \frac{1}{b(a_i)} = \frac{1}{n} \sum_{i=1}^{n} \frac{z_i\, c(a_i)}{\mu}$$

[Figure: probability density functions over the action space $a \in [0, 1]$, showing the behavior policy, the target policy without recognizer (dashed), and the target policy with recognizer (blue); the region of interest is $a \in [0.7, 0.9]$.]

[Figure: empirical variance (average of 200 sample variances) as a function of the number of sample actions, with and without the recognizer.]
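A minimal sketch (mine) of the recognizer-based estimator, reusing the uniform-behavior setup from the earlier snippet. With a uniform b the two estimators essentially coincide; the point is that b(a_i) never has to be evaluated — only the binary recognizer c and the estimated recognition probability µ are needed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
a = rng.uniform(0.0, 1.0, size=n)             # behavior policy b: uniform on [0, 1] (an assumption)
z = a                                         # outcomes z_i = a_i

c = ((a >= 0.7) & (a <= 0.9)).astype(float)   # binary recognizer of the region of interest
mu_hat = c.mean()                             # estimate of mu = sum_x c(x) b(x)

m_hat = np.mean(z * c) / mu_hat               # (1/n) sum_i z_i c(a_i) / mu  -- no b(a_i) needed
print(m_hat)                                  # close to 0.8
```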


The importance sampling corrections are:
$$\rho(s, a) = \frac{\pi(s, a)}{b(s, a)} = \frac{c(s, a)}{\mu(s)},$$
where $\mu(s)$ depends on the behavior policy $b$. If $b$ is unknown, instead of $\mu$ we use a maximum likelihood estimate $\hat\mu : S \to [0, 1]$, and the importance sampling corrections are defined as:
$$\rho(s, a) = \frac{c(s, a)}{\hat\mu(s)}$$

On-policy learning

If $\pi$ is used to generate behavior, then the reward model of an option can be learned using TD learning.
The $n$-step truncated return is:
$$R_t^{(n)} = r_{t+1} + (1 - \beta_{t+1})\, R_{t+1}^{(n-1)}.$$
The $\lambda$-return is defined as usual:
$$R_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}.$$
The parameters of the function approximator are updated on every step proportionally to:
$$\Delta\theta_t = \left[R_t^{\lambda} - y_t\right] \nabla_\theta y_t\, (1 - \beta_1) \cdots (1 - \beta_t).$$

Contributions
• Recognizers reduce variance
• First off-policy learning algorithm for option models
• Off-policy learning without knowledge of the behavior distribution
• Observations:
  – Options are a natural way to reduce the variance of importance sampling algorithms (because of the termination condition)
  – Recognizers are a natural way to define options, especially for large or continuous action spaces.


Outline

1 Motivation

2 Background

3 Policy iteration, linear programming and the simplex method

4 The algorithm

5 Proof

6 Conclusions


Summary

Problem. Given a finite MDP with $m$ states, $k$ actions, and $0 \leq \gamma < 1$ fixed, find an optimal policy for the expected total discounted cost criterion. Exact arithmetic is available.
What is the complexity of Howard's policy iteration (PI) algorithm?

Theorem. Howard's PI needs
$$O\!\left(\frac{m^2 (k-1)}{1-\gamma}\, \log\frac{m^2}{1-\gamma}\right)$$
iterations, so its complexity is
$$O\!\left(\frac{m^5 k^2}{1-\gamma}\, \log\frac{m^2}{1-\gamma}\right).$$


An improvement

Theorem. Consider PI where in every iteration the policy is updated only at a single, well-selected state (details later). This PI needs
$$O\!\left(\frac{m^2 (k-1)}{1-\gamma}\, \log\frac{m^2}{1-\gamma}\right)$$
iterations, and its complexity is
$$O\!\left(\frac{m^4 k^2}{1-\gamma}\, \log\frac{m^2}{1-\gamma}\right).$$


MDPs

Definition (MDP). MDP data: $(P_1, \dots, P_k, c_1, \dots, c_k, \gamma)$, where
• $P_i \in \mathbb{R}^{m \times m}$ is the transition matrix for action $i$;
• $P_i$ is column stochastic (i.e., $e^\top P_i = e^\top$), with $e = (1, \dots, 1)^\top \in \mathbb{R}^m$;
• $c_i \in \mathbb{R}^m$ is a cost vector;
• $0 \leq \gamma < 1$ is a discount factor.
$(P_i)_{pq}$ is the probability of arriving at $p$ given that action $i$ is chosen at state $q$.
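To fix notation for the sketches below, here is a small container for the MDP data (my own; nothing in the talk prescribes it): column-stochastic transition matrices P_i, cost vectors c_i, and the discount γ.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    P: list        # P[i]: m x m transition matrix for action i, each column summing to 1
    c: list        # c[i]: length-m cost vector for action i
    gamma: float   # discount factor, 0 <= gamma < 1

    @property
    def m(self):   # number of states
        return self.c[0].shape[0]

    @property
    def k(self):   # number of actions
        return len(self.c)
```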


Polynomial vs. strongly polynomial

Definition. An algorithm is polynomial if its running time (≡ number of arithmetic operations) is polynomial in the total bit-size $L$ of the MDP.

Definition. An algorithm is strongly polynomial if its running time is polynomial in $m + k$, i.e., the running time can be bounded by a polynomial of $m$ and $k$, independently of $L$.

Strongly polynomial ⇒ no difficult instances.


Previous work

• Bellman (1957): value iteration (VI)
• Howard (1960): policy iteration (PI)
• Papadimitriou & Tsitsiklis (1987): when all transitions are deterministic, there exists a strongly polynomial algorithm ("minimum-mean-cost-cycle problem")
• Tseng (1990): VI is polynomial; based on Bertsekas (1987)
• Puterman (1994): PI requires no more iterations than VI
• Ye (2005): an interior point algorithm that is strongly polynomial

Previous results for $k = 2$:
  Value iteration: $\frac{m^2 L}{1-\gamma}$
  Policy iteration: $\min\!\left(2^m m^2,\ \frac{m^2 L}{1-\gamma}\right)$
  Linear programming: $m^3 L$
  Interior point: $m^4 \log\frac{1}{1-\gamma}$


Linear programming

LP standard form:
$$c^\top x \to \min \quad \text{s.t.} \quad Ax = b,\; x \geq 0,$$
where $A \in \mathbb{R}^{m \times n}$, $\mathrm{rank}(A) = m$, $c \in \mathbb{R}^n$, $b \in \mathbb{R}^m$.

LP dual:
$$b^\top y \to \max \quad \text{s.t.} \quad s = c - A^\top y,\; s \geq 0,$$
where $s$ is the dual slack vector.


The LP formulation for MDPs

de Ghellinck (1960), d'Epenoux (1960), Manne (1960)

MDP LP-Dual:
$$\sum_{i=1}^{k} c_i^\top x_i \to \min \quad \text{s.t.} \quad \sum_{i=1}^{k} (I - \gamma P_i)\, x_i = e, \quad x_1, \dots, x_k \geq 0.$$
Note: with $A = [I - \gamma P_1, \dots, I - \gamma P_k]$, $b = e$, and $c^\top = (c_1^\top, \dots, c_k^\top)$, this is a standard LP.

MDP LP-Primal:
$$e^\top y \to \max \quad \text{s.t.} \quad (I - \gamma P_i)^\top y + s_i = c_i,\ i = 1, \dots, k, \quad s_1, \dots, s_k \geq 0.$$
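As a sanity check of the formulation (my sketch, not part of the talk), the MDP LP-Dual can be assembled exactly as in the note above and handed to a generic LP solver; scipy.optimize.linprog and the MDP container from the earlier snippet are assumed.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp_dual(mdp):
    """Solve  min sum_i c_i^T x_i   s.t.  sum_i (I - gamma P_i) x_i = e,  x >= 0."""
    m, k, I = mdp.m, mdp.k, np.eye(mdp.m)
    A_eq = np.hstack([I - mdp.gamma * Pi for Pi in mdp.P])  # A = [I - gamma P_1, ..., I - gamma P_k]
    b_eq = np.ones(m)                                       # b = e
    cost = np.concatenate(mdp.c)                            # c^T = (c_1^T, ..., c_k^T)
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    x = res.x.reshape(k, m)          # x[i, j]: discounted visit frequency of state j under action i
    policy = x.argmax(axis=0)        # at an optimal basic solution, one positive x_ij per state j
    return x, policy, res.fun
```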


Interpretation of MDP LP-Dual

Take any feasible solution $x$ of the MDP LP-Dual. Let $\pi = \pi_x$ be defined by
$$\pi(i \mid j) = \frac{x_{ij}}{\sum_{i'=1}^{k} x_{i'j}}.$$
Follow $\pi$ in the MDP, where $X_0$ is drawn uniformly. Then
$$x_{ij} = \sum_{j'} E\!\left[\sum_{t=0}^{\infty} \gamma^t\, \mathbb{I}\{X_t = j, A_t = i\} \,\Big|\, X_0 = j'\right],$$
i.e., $x_{ij}$ is the expected discounted frequency of visiting the state–action pair $(j, i)$ when using $\pi = \pi_x$. The above transformation gives an optimal policy when starting from the optimal solution $x^*$.
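A one-function sketch (mine) of this read-out of π_x from a feasible x, using the (k × m) array shape of the previous snippet; feasibility guarantees positive column sums.

```python
import numpy as np

def policy_from_x(x):
    """pi[i, j] = x[i, j] / sum_{i'} x[i', j]: the action distribution pi_x(. | j) at each state j."""
    return x / x.sum(axis=0, keepdims=True)
```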


Interpretation of MDP LP-Primal

The primal is called the "primal" due to its connection to optimal values:
• Let $T_i$ be defined by $T_i v = c_i + \gamma P_i^\top v$, and let $T = \min_i T_i$.
• Take any $v$ such that $v \leq Tv$. Then, since $T^n v \to v^*$ (as $n \to \infty$), we have $v \leq v^* = Tv^*$.
• Thus, $v^*$ is the unique solution of: $v \leq Tv$, $e^\top v \to \max$.
• $v \leq Tv = \min_i T_i v \iff v \leq T_i v$ holds for all $i \in \{1, \dots, k\}$.
• Thus, $v^*$ is the unique solution of
$$e^\top v \to \max, \quad \text{s.t. } v + s_i = T_i v,\ i = 1, \dots, k, \quad s_1, \dots, s_k \geq 0.$$
• This is exactly our MDP LP-Primal.
• The MDP LP-Dual is the dual of this primal.


Preparation

Consider $k = 2$ for simplicity:
$$c_1^\top x_1 + c_2^\top x_2 \to \min \quad \text{s.t.} \quad (I - \gamma P_1)\, x_1 + (I - \gamma P_2)\, x_2 = e, \quad x_1, x_2 \geq 0.$$


The algorithm

Step t:
1. We are given a partition of the column indexes $\{1, \dots, 2m\} = B_t \,\dot\cup\, \bar B_t$, e.g. (1 marks membership in $B_t$; here $m = 4$):

   Action 1: 0 0 1 0    Action 2: 1 1 0 1

2. Find $x^t$: set $x^t_{\bar B_t} = 0$ and solve for $x^t_{B_t}$:
$$x^t_{B_t} = (I - \gamma P_{B_t})^{-1} e.$$
3. Compute the reduced costs $c^t$:
$$c^t_{\bar B_t} = c_{\bar B_t} - (I - \gamma P_{\bar B_t})^\top (I - \gamma P_{B_t})^{-\top} c_{B_t}, \qquad c^t_{B_t} = 0.$$
4. $i_t = \arg\min_i c^t_i$.
5. If $c^t_{i_t} < 0$:
   – Add $i_t$ to the basis and remove its "counterpart" (the other action at the same state); e.g., if $i_t = 1$:

     Action 1: 1 0 1 0    Action 2: 0 1 0 1

   – Increment $t$ and continue with Step 1.
6. Else: stop!

(A code sketch of this loop follows below.)
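The sketch announced above (mine, written for general k, using the MDP container from before): one run of the simplex method with the most-negative-reduced-cost rule, representing B_t by the action chosen at each state.

```python
import numpy as np

def simplex_policy_iteration(mdp, max_iter=100_000):
    """Simplex method on the MDP LP-Dual with the most-negative-reduced-cost pivoting rule.
    B[j] is the action whose column is currently basic at state j (i.e., the current policy)."""
    m, k, gamma = mdp.m, mdp.k, mdp.gamma
    I = np.eye(m)
    B = np.zeros(m, dtype=int)                    # start from the "always action 0" policy
    for _ in range(max_iter):
        P_B = np.column_stack([mdp.P[B[j]][:, j] for j in range(m)])
        c_B = np.array([mdp.c[B[j]][j] for j in range(m)])
        y = np.linalg.solve((I - gamma * P_B).T, c_B)   # (I - gamma P_B)^T y = c_B: value of policy B
        # Reduced cost of every column (action i, state j): c_i[j] - (e_j - gamma P_i[:, j])^T y.
        red = np.array([[mdp.c[i][j] - (I[:, j] - gamma * mdp.P[i][:, j]) @ y
                         for j in range(m)] for i in range(k)])
        i_t, j_t = np.unravel_index(np.argmin(red), red.shape)
        if red[i_t, j_t] >= -1e-12:               # no negative reduced cost: stop
            break
        B[j_t] = i_t                              # pivot: entering column replaces its counterpart
    P_B = np.column_stack([mdp.P[B[j]][:, j] for j in range(m)])
    x_B = np.linalg.solve(I - gamma * P_B, np.ones(m))   # basic part of x^t: (I - gamma P_B)^{-1} e
    return B, y, x_B
```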


A little explanation

This is basically the simplex method with the most-negative-reduced-cost pivoting rule.
What is the idea?
• Given a guess at which coefficients should be zero . . . (why is it not limiting to assume that some coefficients will be zero?) . . . compute the rest from the constraint $Ax = e$ (Step 2).
• Then we should check whether the solution is optimal. Given any feasible $x$ and any $B_t$, we must have
$$x_{B_t} = (I - \gamma P_{B_t})^{-1}\left[e - (I - \gamma P_{\bar B_t})\, x_{\bar B_t}\right].$$
Hence,¹
$$c^\top x = c_{B_t}^\top x_{B_t} + c_{\bar B_t}^\top x_{\bar B_t}
= \left(c_{\bar B_t} - (I - \gamma P_{\bar B_t})^\top (I - \gamma P_{B_t})^{-\top} c_{B_t}\right)^\top x_{\bar B_t} + c_{B_t}^\top (I - \gamma P_{B_t})^{-1} e
= (c^t_{\bar B_t})^\top x_{\bar B_t} + c_{B_t}^\top x^t_{B_t}. \qquad (*)$$
Hence, if $c^t_{\bar B_t} \geq 0$, the optimal solution has $x_{\bar B_t} = 0$ (the current basic solution is optimal).

¹ $P_B$ selects the columns with index $i \in B$ from $P = [P_1, \dots, P_k]$.

Quiz

Q: What does the algorithm look like for $k > 2$?
A: $B_t$ represents the policy: for each state, exactly one action is selected.

Q: How much work do we do in an iteration?
A: $O(km^3)$ [$O(km^2)$??]

Q: Again, what is the mysterious stopping condition, $\min_i c^t_i \geq 0$, i.e.,
$$c_{\bar B_t} - (I - \gamma P_{\bar B_t})^\top (I - \gamma P_{B_t})^{-\top} c_{B_t} \geq 0\,?$$
A: It is equivalent to saying that the value function of the current policy ($B_t$) is not worse than what results from changing the action in any one of the states.

Q: What does policy iteration correspond to?
A: Change all the actions with $c^t_i < 0$ (see the sketch after this slide).
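The last answer suggests Howard's PI as a one-line variant of the earlier sketch: instead of switching the single most negative column, switch every state that has a negative reduced cost. Again this is my sketch under the same assumptions as before, not the talk's code.

```python
import numpy as np

def howard_policy_iteration_step(mdp, B):
    """One Howard PI step: evaluate the current policy B, then, at every state,
    switch to the action with the most negative reduced cost (if any)."""
    m, gamma, I = mdp.m, mdp.gamma, np.eye(mdp.m)
    P_B = np.column_stack([mdp.P[B[j]][:, j] for j in range(m)])
    c_B = np.array([mdp.c[B[j]][j] for j in range(m)])
    y = np.linalg.solve((I - gamma * P_B).T, c_B)            # value of the current policy
    red = np.array([[mdp.c[i][j] - (I[:, j] - gamma * mdp.P[i][:, j]) @ y
                     for j in range(m)] for i in range(mdp.k)])
    B_next = B.copy()
    improve = red.min(axis=0) < -1e-12                       # states with a negative reduced cost
    B_next[improve] = red.argmin(axis=0)[improve]            # change all such actions at once
    return B_next, y
```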


Lemma 0. Consider an LP and its dual. Let $s^*$ be an optimal dual slack vector and let $z^*$ be the optimal value of the primal. Then, for any feasible $x$,
$$c^\top x = z^* + (s^*)^\top x.$$

Proof. Remember:
• The LP: $c^\top x \to \min$ s.t. $Ax = b$, $x \geq 0$.
• The dual LP: $b^\top y \to \max$ s.t. $s = c - A^\top y$, $s \geq 0$.
Let $(y^*, s^*)$ be an optimal solution of the dual LP and let $x^*$ be an optimal solution of the primal LP. Complementary slackness gives $(s^*)^\top x^* = 0$. Since $s^* = c - A^\top y^*$ and, by assumption, $Ax = Ax^*$ ($= b$),
$$z^* = c^\top x^* = (y^*)^\top A x^* = (y^*)^\top A x = (A^\top y^*)^\top x = (A^\top y^* - c)^\top x + c^\top x = -(s^*)^\top x + c^\top x.$$
(The second equality uses $c = A^\top y^* + s^*$ and $(s^*)^\top x^* = 0$.)


Lemma 1.
Part 1: If $x$ is feasible, then
$$e^\top x = \frac{m}{1-\gamma}.$$
Part 2: If $x$ is a pure solution (one action for each state), then
$$1 \leq x_i \leq \frac{m}{1-\gamma}$$
holds for any $i$ with $x_i \neq 0$.

Proof of Part 1. The claim follows since any feasible $x$ corresponds to discounted visitation counts. Directly: $Ax = e$, so $e^\top A x = m$. Since $e^\top P_i = e^\top$, we also have $e^\top A x = (1-\gamma) \sum_{i=1}^{k} e^\top x_i = (1-\gamma)\, e^\top x$ (in the last equality, $e$ is the all-ones vector of dimension $mk$). Combining the two gives $e^\top x = m/(1-\gamma)$.


Proof of Part 2.
Lower bound: If $x$ is a pure solution, $x = (I - \gamma P_B)^{-1} e$, where $B$ is the set of column indices with $x_i \neq 0$ and $P_B$ selects the corresponding columns of $P = [P_1, \dots, P_k]$. Hence,
$$x = (I - \gamma P_B)^{-1} e = \left(I + \gamma P_B + \gamma^2 P_B^2 + \dots\right) e \geq e.$$
Upper bound: Since $\|P_B\|_1 = 1$ ($\|P_B\|_1$ is the maximum column sum of $P_B$, which is column stochastic), also $\|P_B^i\|_1 = 1$, hence
$$\|x\|_\infty \leq \|x\|_1 \leq \sum_{i=0}^{\infty} \gamma^i\, \|P_B^i\|_1\, \|e\|_1 = \frac{m}{1-\gamma}.$$
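A quick numerical check of Lemma 1 (mine, on randomly generated MDP data): for a random pure policy, $e^\top x = m/(1-\gamma)$ and every basic entry lies in $[1, m/(1-\gamma)]$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, gamma = 6, 3, 0.9

P = [rng.random((m, m)) for _ in range(k)]
P = [Pi / Pi.sum(axis=0, keepdims=True) for Pi in P]   # make each column sum to 1

B = rng.integers(0, k, size=m)                         # a random pure policy
P_B = np.column_stack([P[B[j]][:, j] for j in range(m)])
x_B = np.linalg.solve(np.eye(m) - gamma * P_B, np.ones(m))

print(np.isclose(x_B.sum(), m / (1 - gamma)))                             # Part 1
print((x_B >= 1 - 1e-9).all() and (x_B <= m / (1 - gamma) + 1e-9).all())  # Part 2
```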


Lemma 1.5.
$$c^\top x^t - c^\top x^{t+1} \geq \Delta_t \stackrel{\text{def}}{=} -\min_i c^t_i.$$

Proof. Take any feasible $x$. Then, by (*),
$$c^\top x = (c^t_{\bar B_t})^\top x_{\bar B_t} + c_{B_t}^\top x^t_{B_t}.$$
Plug in $x^t$ and $x^{t+1}$ and use Lemma 1, Part 2 (lower bound):
$$c^\top x^t = c_{B_t}^\top x^t_{B_t},$$
$$c^\top x^{t+1} = (c^t_{\bar B_t})^\top x^{t+1}_{\bar B_t} + c_{B_t}^\top x^t_{B_t} = (c^t)_{i_t}\, x^{t+1}_{i_t} + c_{B_t}^\top x^t_{B_t} \leq -\Delta_t + c_{B_t}^\top x^t_{B_t}.$$
The second equality follows from the fact that the only index $i \in \bar B_t$ with $x^{t+1}_i \neq 0$ is $i_t$; the inequality uses $(c^t)_{i_t} = -\Delta_t$ and $x^{t+1}_{i_t} \geq 1$.


Lemma 2. Let $z^*$ be the optimal value of the MDP LP-Dual ($c^\top x \to \min$ s.t. $Ax = e$, $x \geq 0$). Then, for any $t \geq 0$,
$$z^* \geq c^\top x^t - \frac{m}{1-\gamma}\, \Delta_t.$$
Further,
$$c^\top x^{t+1} - z^* \leq \left(1 - \frac{1-\gamma}{m}\right)\left(c^\top x^t - z^*\right).$$

Proof of Part 1. For feasible $x$, by (*), $c^\top x = (c^t_{\bar B_t})^\top x_{\bar B_t} + c_{B_t}^\top x^t_{B_t}$. Hence $z^* = c^\top x^* = c_{B_t}^\top x^t_{B_t} + (c^t_{\bar B_t})^\top x^*_{\bar B_t}$. Now, $(c^t_{\bar B_t})^\top x^*_{\bar B_t} \geq -\frac{m}{1-\gamma}\, \Delta_t$, since $x^* \geq 0$, every entry of $c^t$ is at least $\min_i c^t_i = -\Delta_t$, and, by Lemma 1 (Part 1), $\sum_i x^*_i = \frac{m}{1-\gamma}$.


Proof of Part 2. By Lemma 1.5 and Part 1,
$$c^\top x^t - c^\top x^{t+1} \geq \Delta_t \geq \frac{1-\gamma}{m}\left(c^\top x^t - z^*\right).$$
Subtracting $z^*$ from both sides of $c^\top x^{t+1} \leq c^\top x^t - \frac{1-\gamma}{m}(c^\top x^t - z^*)$ gives the claim.


Lemma 3. Assume (for simplicity) that $B_0 = \{1, \dots, m\}$ (the initial policy uses the first action only). Let $s^*$ be an optimal dual slack vector. If the initial solution $x^0$ is not optimal, then there exists $i_0 \in B_0$ such that
$$s^*_{i_0} \geq \frac{1-\gamma}{m^2}\left(c^\top x^0 - z^*\right).$$
Further, for all $t \geq 1$,
$$x^t_{i_0} \leq \frac{m^2}{1-\gamma} \cdot \frac{c^\top x^t - z^*}{c^\top x^0 - z^*}.$$


Proof of Part 1. By Lemma 0 and since, by definition, $x^0_{\bar B_0} = 0$,
$$c^\top x^0 - z^* = (s^*)^\top x^0 = (s^*_{B_0})^\top x^0_{B_0} = \sum_{i \in B_0} s^*_i x^0_i,$$
so there must exist $i_0 \in B_0$ such that
$$s^*_{i_0} x^0_{i_0} \geq \frac{1}{|B_0|}\left(c^\top x^0 - z^*\right) = \frac{1}{m}\left(c^\top x^0 - z^*\right).$$
From Lemma 1, Part 2, $x^0_{i_0} \leq m/(1-\gamma)$. Thus,
$$s^*_{i_0} \geq \frac{1-\gamma}{m^2}\left(c^\top x^0 - z^*\right).$$
Furthermore, for any $t \geq 1$, again by Lemma 0 and using nonnegativity,
$$c^\top x^t - z^* = (s^*)^\top x^t \geq s^*_{i_0} x^t_{i_0},$$
hence
$$x^t_{i_0} \leq \frac{c^\top x^t - z^*}{s^*_{i_0}} \leq \frac{m^2}{1-\gamma} \cdot \frac{c^\top x^t - z^*}{c^\top x^0 - z^*}.$$


By Lemma 2, after $t$ iterations we have
$$\frac{c^\top x^t - z^*}{c^\top x^0 - z^*} \leq \left(1 - \frac{1-\gamma}{m}\right)^t.$$
Thus, for $t \geq t_0 = \frac{m}{1-\gamma} \log\frac{m^2}{1-\gamma}$, due to Lemma 3,
$$x^t_{i_0} \leq \frac{m^2}{1-\gamma} \cdot \frac{c^\top x^t - z^*}{c^\top x^0 - z^*} < 1.$$
So, in fact, by Lemma 1 (Part 2), we must have $x^t_{i_0} = 0$: the index $i_0$ is eliminated after $t_0$ steps (it stays out of the basis from then on).
Altogether there are $(mk - m)$ indices that can be eliminated, so the method cannot run for more than
$$m(k-1)\, t_0$$
iterations. This proves the main result.


Conclusions

• Howard's method converges in at most the same number of iterations. Altogether, however, it is more expensive, since evaluating a policy takes $O(m^3 k)$ operations.²
  – Is Howard's method actually faster or slower? Lower bounds?
• The same result holds when $\gamma = 1$, the MDP is episodic, and all policies are proper.
• Pivoting is important (the smallest-index rule is known not to work!).
• What happens when $\gamma$ is part of the input?

² Or, with the Coppersmith–Winograd algorithm, $O(m^{2.376} k)$ operations only.