
Markov Decision Processes

Definitions; Stationary policies; Value improvement algorithm, Policy improvement algorithm, and linear programming for discounted cost and average cost criteria.


Markov Decision Process

Let X = {X_0, X_1, …} be a system description process on state space E, and let D = {D_0, D_1, …} be a decision process with action space A.

The process (X, D) is a Markov decision process if, for j ∈ E and n = 0, 1, …,

\[ P\{X_{n+1} = j \mid X_0, D_0, \ldots, X_n, D_n\} = P\{X_{n+1} = j \mid X_n, D_n\}. \]

Furthermore, for each k ∈ A, let f_k be a cost vector and P_k be a one-step transition probability matrix. Then the cost f_k(i) is incurred whenever X_n = i and D_n = k, and

\[ P\{X_{n+1} = j \mid X_n = i, D_n = k\} = P_k(i, j). \]

The problem is to determine how to choose a sequence of actions in order to minimize cost.
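To make these objects concrete, here is a minimal sketch (not from the slides) of how the data of an MDP might be encoded, using a hypothetical 2-state, 2-action example: f[k] holds the cost vector and P[k] the one-step transition matrix for action k.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used only for illustration.
# f[k][i] = cost f_k(i) incurred when the state is i and action k is chosen.
f = np.array([[1.0, 4.0],    # costs under action k = 0
              [3.0, 2.0]])   # costs under action k = 1

# P[k][i, j] = P_k(i, j), the one-step transition probability under action k.
P = np.array([[[0.8, 0.2],
               [0.4, 0.6]],  # transition matrix for action k = 0
              [[0.3, 0.7],
               [0.9, 0.1]]]) # transition matrix for action k = 1

assert np.allclose(P.sum(axis=2), 1.0)  # each row of each P_k sums to 1
```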


Policies

A policy is a rule that specifies which action to take at each point in time. Let Π denote the set of all policies.

In general, the decisions specified by a policy may
– depend on the current state of the system description process
– be randomized (depend on some external random event)
– also depend on past states and/or decisions

A stationary policy is defined by a (deterministic) action function that assigns an action to each state, independent of previous states, previous actions, and time n.

Under a stationary policy, the MDP is a Markov chain.
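As a sketch (reusing the hypothetical arrays f and P from the earlier example, redefined here so the snippet stands alone), a stationary policy is just an action function a: E → A, and the induced Markov chain has cost vector f_a(i) = f_{a(i)}(i) and transition matrix P_a(i, j) = P_{a(i)}(i, j):

```python
import numpy as np

f = np.array([[1.0, 4.0], [3.0, 2.0]])        # hypothetical f[k][i]
P = np.array([[[0.8, 0.2], [0.4, 0.6]],
              [[0.3, 0.7], [0.9, 0.1]]])      # hypothetical P[k][i, j]

a = np.array([1, 0])          # stationary policy: action 1 in state 0, action 0 in state 1
states = np.arange(len(a))

# Cost vector and transition matrix of the Markov chain induced by policy a.
f_a = f[a, states]            # f_a(i) = f_{a(i)}(i)
P_a = P[a, states, :]         # P_a(i, j) = P_{a(i)}(i, j)
print(f_a, P_a, sep="\n")
```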


Cost Minimization Criteria

Since an MDP goes on indefinitely, it is likely that the total cost will be infinite. In order to meaningfully compare policies, two criteria are commonly used:

1. The expected total discounted cost computes the present worth of future costs using a discount factor α < 1, such that one dollar obtained at time n = 1 has a present value of α at time n = 0. Typically, if r is the rate of return, then α = 1/(1 + r). The expected total discounted cost is

\[ E\left[ \sum_{n=0}^{\infty} \alpha^n f_{D_n}(X_n) \right]. \]

2. The long run average cost is

\[ \lim_{m \to \infty} \frac{1}{m} \sum_{n=0}^{m-1} f_{D_n}(X_n). \]


Optimization with Stationary Policies

If the state space E is finite, there exists a stationary policy that solves the problem to minimize the discounted cost:

\[ v_\alpha(i) = \min_{d \in D} v_{\alpha d}(i), \quad \text{where } v_{\alpha d}(i) = E\left[ \sum_{n=0}^{\infty} \alpha^n f_{d(X_n)}(X_n) \,\middle|\, X_0 = i \right]. \]

If every stationary policy results in an irreducible Markov chain, there exists a stationary policy that solves the problem to minimize the average cost:

\[ \varphi^* = \min_{d \in D} \varphi_d, \quad \text{where } \varphi_d = \lim_{m \to \infty} \frac{1}{m} \sum_{n=0}^{m-1} f_{d(X_n)}(X_n). \]


Computing Expected Discounted Costs

Let X = {X_0, X_1, …} be a Markov chain with one-step transition probability matrix P, let f be a cost function that assigns a cost to each state of the Markov chain, and let α (0 < α < 1) be a discount factor. Then the expected total discounted cost is

\[ g(i) = E\left[ \sum_{n=0}^{\infty} \alpha^n f(X_n) \,\middle|\, X_0 = i \right], \quad \text{or in matrix form, } g = (I - \alpha P)^{-1} f. \]

Why? Starting from state i, the expected discounted cost can be found recursively as

\[ g(i) = f(i) + \alpha \sum_{j} P(i,j)\, g(j), \quad \text{or } g = f + \alpha P g. \]

Note that the expected discounted cost always depends on the initial state, while for the average cost criterion the initial state is unimportant.
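A minimal numerical sketch (hypothetical 3-state chain and α = 0.9, not from the slides): solve (I − αP)g = f as a linear system rather than forming the inverse, then check the recursion g = f + αPg.

```python
import numpy as np

alpha = 0.9                                   # discount factor, 0 < alpha < 1
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])               # hypothetical one-step transition matrix
f = np.array([2.0, 1.0, 5.0])                 # hypothetical cost for each state

# Expected total discounted cost: g = (I - alpha P)^{-1} f,
# computed by solving the linear system instead of inverting.
g = np.linalg.solve(np.eye(3) - alpha * P, f)

print(g)
assert np.allclose(g, f + alpha * P @ g)      # verifies the recursion g = f + alpha P g
```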


Solution Procedures for Discounted Costs

Let v_α be the (vector) optimal value function whose ith component is

\[ v_\alpha(i) = \min_{d \in D} v_{\alpha d}(i). \]

For each i ∈ E,

\[ v_\alpha(i) = \min_{k \in A} \left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i,j)\, v_\alpha(j) \right\}. \]

These equations uniquely determine v_α.

If we can somehow obtain the values v_α(i) that satisfy the above equations, then the optimal policy is the vector a, where

\[ a(i) = \arg\min_{k \in A} \left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i,j)\, v_\alpha(j) \right\}. \]

(“arg min” is “the argument that minimizes”.)
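Given a value vector, the arg min is easy to evaluate numerically. A sketch with the hypothetical arrays f[k][i] and P[k][i, j] from before and an assumed value vector v (e.g., produced by value iteration):

```python
import numpy as np

alpha = 0.9
f = np.array([[1.0, 4.0], [3.0, 2.0]])               # hypothetical f_k(i)
P = np.array([[[0.8, 0.2], [0.4, 0.6]],
              [[0.3, 0.7], [0.9, 0.1]]])             # hypothetical P_k(i, j)
v = np.array([10.0, 12.0])                            # an assumed value function

# Q[k, i] = f_k(i) + alpha * sum_j P_k(i, j) v(j)
Q = f + alpha * (P @ v)
a = Q.argmin(axis=0)                                  # a(i) = arg min_k Q[k, i]
print(a, Q.min(axis=0))
```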


Value Iteration for Discounted Costs

Make a guess … keep applying the optimal value equations until the fixed point is reached.

Step 1. Choose ε > 0, set n = 0, and let v_0(i) = 0 for each i in E.

Step 2. For each i in E, find v_{n+1}(i) as

\[ v_{n+1}(i) = \min_{k \in A} \left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i,j)\, v_n(j) \right\}. \]

Step 3. Let

\[ \Delta = \max_{i \in E} \left| v_{n+1}(i) - v_n(i) \right|. \]

Step 4. If Δ < ε, stop with v_α = v_{n+1}. Otherwise, set n = n + 1 and return to Step 2.
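A minimal sketch of this loop, assuming the same hypothetical f[k][i] and P[k][i, j] arrays, α = 0.9, and ε = 1e-6:

```python
import numpy as np

alpha, eps = 0.9, 1e-6
f = np.array([[1.0, 4.0], [3.0, 2.0]])               # hypothetical f_k(i)
P = np.array([[[0.8, 0.2], [0.4, 0.6]],
              [[0.3, 0.7], [0.9, 0.1]]])             # hypothetical P_k(i, j)

v = np.zeros(f.shape[1])                              # Step 1: v_0(i) = 0
while True:
    v_new = (f + alpha * (P @ v)).min(axis=0)         # Step 2: apply the optimal value equations
    delta = np.abs(v_new - v).max()                   # Step 3: largest change over all states
    v = v_new
    if delta < eps:                                   # Step 4: stop at the (approximate) fixed point
        break

a = (f + alpha * (P @ v)).argmin(axis=0)              # recover a policy from the final v
print(v, a)
```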


Policy Improvement for Discounted Costs

Start myopic, then consider longer-term consequences.

Step 1. Set n = 0 and let

\[ a_0(i) = \arg\min_{k \in A} f_k(i). \]

Step 2. Adopt the cost vector and transition matrix of the current policy:

\[ f(i) = f_{a_n(i)}(i), \qquad P(i,j) = P_{a_n(i)}(i,j). \]

Step 3. Find the value function

\[ v = (I - \alpha P)^{-1} f. \]

Step 4. Re-optimize:

\[ a_{n+1}(i) = \arg\min_{k \in A} \left\{ f_k(i) + \alpha \sum_{j \in E} P_k(i,j)\, v(j) \right\}. \]

Step 5. If a_{n+1}(i) = a_n(i) for all i, then stop with v_α = v and a*(i) = a_n(i). Otherwise, set n = n + 1 and return to Step 2.
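A sketch of the policy improvement loop for the discounted criterion, again with the hypothetical arrays; Step 3 solves (I − αP)v = f for the current policy:

```python
import numpy as np

alpha = 0.9
f = np.array([[1.0, 4.0], [3.0, 2.0]])               # hypothetical f_k(i)
P = np.array([[[0.8, 0.2], [0.4, 0.6]],
              [[0.3, 0.7], [0.9, 0.1]]])             # hypothetical P_k(i, j)
n_states = f.shape[1]
states = np.arange(n_states)

a = f.argmin(axis=0)                                  # Step 1: myopic policy a_0(i) = arg min_k f_k(i)
while True:
    f_a, P_a = f[a, states], P[a, states, :]          # Step 2: cost vector and matrix of current policy
    v = np.linalg.solve(np.eye(n_states) - alpha * P_a, f_a)   # Step 3: v = (I - alpha P)^{-1} f
    a_new = (f + alpha * (P @ v)).argmin(axis=0)      # Step 4: re-optimize
    if np.array_equal(a_new, a):                      # Step 5: stop when the policy repeats
        break
    a = a_new

print(v, a)
```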


Linear Programming for Discounted Costs

Consider the linear program:

\[ \max \sum_{i \in E} u(i) \]
\[ \text{s.t. } u(i) \le f_k(i) + \alpha \sum_{j \in E} P_k(i,j)\, u(j), \quad \text{for each } i \in E,\ k \in A. \]

The optimal value of u(i) will be v_α(i), and the optimal policy is identified by the constraints that hold as equalities in the optimal solution (slack variables equal 0).

Note: the decision variables are unrestricted in sign!
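A sketch of this LP with scipy.optimize.linprog and the hypothetical data from before. linprog minimizes, so the objective is the negative of Σ u(i), and the variables are left unbounded to keep them unrestricted in sign:

```python
import numpy as np
from scipy.optimize import linprog

alpha = 0.9
f = np.array([[1.0, 4.0], [3.0, 2.0]])               # hypothetical f_k(i)
P = np.array([[[0.8, 0.2], [0.4, 0.6]],
              [[0.3, 0.7], [0.9, 0.1]]])             # hypothetical P_k(i, j)
n_actions, n_states = f.shape

# One "<=" constraint per (state i, action k):  u(i) - alpha * sum_j P_k(i,j) u(j) <= f_k(i)
A_ub, b_ub = [], []
for k in range(n_actions):
    for i in range(n_states):
        row = -alpha * P[k, i, :].copy()
        row[i] += 1.0
        A_ub.append(row)
        b_ub.append(f[k, i])

# Maximize sum_i u(i)  <=>  minimize -sum_i u(i); u is unrestricted in sign.
res = linprog(c=-np.ones(n_states), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
u = res.x                                             # optimal u(i) = v_alpha(i)
slack = np.array(b_ub) - np.array(A_ub) @ u
print(u, slack.round(6))                              # binding constraints identify the optimal actions
```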


Long Run Average Cost per Period

For a given policy d, its long run average cost φ_d can be found from its cost vector f_d and one-step transition probability matrix P_d. First, find the limiting probabilities π_d by solving

\[ \pi_d(j) = \sum_{i \in E} \pi_d(i)\, P_d(i,j), \ j \in E; \qquad \sum_{j \in E} \pi_d(j) = 1. \]

Then

\[ \varphi_d = \lim_{m \to \infty} \frac{1}{m} \sum_{n=0}^{m-1} f_d(X_n) = \sum_{j \in E} \pi_d(j)\, f_d(j). \]

So, in principle we could simply enumerate all policies and choose the one with the smallest average cost… not practical if A and E are large.
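A sketch of evaluating one fixed policy (hypothetical P_d and f_d): solve π P_d = π with Σ π = 1 by replacing one balance equation with the normalization, then compute φ_d = Σ_j π_d(j) f_d(j).

```python
import numpy as np

P_d = np.array([[0.3, 0.7],
                [0.9, 0.1]])                 # hypothetical transition matrix of policy d
f_d = np.array([3.0, 4.0])                   # hypothetical cost vector of policy d
n = len(f_d)

# Solve pi (P_d - I) = 0 together with sum(pi) = 1:
# drop the last balance equation and use the normalization constraint in its place.
A = np.vstack([(P_d - np.eye(n)).T[:-1], np.ones(n)])
b = np.zeros(n); b[-1] = 1.0
pi = np.linalg.solve(A, b)

phi_d = pi @ f_d                             # long run average cost of policy d
print(pi, phi_d)
```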


Recursive Equation for Average Cost

Assume that every stationary policy yields an irreducible Markov chain. There exists a scalar φ* and a vector h such that for all states i in E,

\[ h(i) + \varphi^* = \min_{k \in A} \left\{ f_k(i) + \sum_{j \in E} P_k(i,j)\, h(j) \right\}. \]

The scalar φ* is the optimal average cost, and the optimal policy is found by choosing for each state the action that achieves the minimum on the right-hand side.

The vector h is unique up to an additive constant … as we will see, the difference h(i) − h(j) represents the increase in total cost from starting out in state i rather than j.


Relationships between Discounted Cost and Long Run Average Cost

• If a cost of c is incurred each period and α is the discount factor, then the total discounted cost is

\[ v = \sum_{n=0}^{\infty} \alpha^n c = \frac{c}{1 - \alpha}. \]

• Therefore, a total discounted cost v is equivalent to an average cost of c = (1 − α)v per period, so

\[ \varphi^* = \lim_{\alpha \to 1} (1 - \alpha)\, v_\alpha(i). \]

• Let v_α be the optimal discounted cost vector, φ* be the optimal average cost, and h be the mystery vector from the previous slide. Then

\[ \lim_{\alpha \to 1} \left[ v_\alpha(i) - v_\alpha(j) \right] = h(i) - h(j). \]


Policy Improvement for Average Costs

Designate one state in E to be “state number 1”.

Step 1. Set n = 0 and let

\[ a_0(i) = \arg\min_{k \in A} f_k(i). \]

Step 2. Adopt the cost vector and transition matrix of the current policy:

\[ f(i) = f_{a_n(i)}(i), \qquad P(i,j) = P_{a_n(i)}(i,j). \]

Step 3. With h(1) = 0, solve

\[ \varphi + h(i) = f(i) + \sum_{j \in E} P(i,j)\, h(j), \quad i \in E. \]

Step 4. Re-optimize:

\[ a_{n+1}(i) = \arg\min_{k \in A} \left\{ f_k(i) + \sum_{j \in E} P_k(i,j)\, h(j) \right\}. \]

Step 5. If a_{n+1}(i) = a_n(i) for all i, then stop with φ* = φ and a*(i) = a_n(i). Otherwise, set n = n + 1 and return to Step 2.
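The only new computation relative to the discounted case is Step 3. A sketch (hypothetical f and P for the current policy): with h(1) = 0, the unknowns are φ and the remaining components of h, so the evaluation equations form a square linear system.

```python
import numpy as np

# Hypothetical cost vector and transition matrix of the current policy a_n.
f = np.array([3.0, 4.0, 1.0])
P = np.array([[0.3, 0.7, 0.0],
              [0.5, 0.1, 0.4],
              [0.2, 0.2, 0.6]])
n = len(f)

# Unknowns x = (phi, h(2), ..., h(n)) with h(1) = 0 (index 0 plays "state number 1").
# Equation for state i:  phi + h(i) - sum_j P(i,j) h(j) = f(i)
A = np.zeros((n, n))
A[:, 0] = 1.0                                  # coefficient of phi in every equation
A[:, 1:] = (np.eye(n) - P)[:, 1:]              # coefficients of h(2), ..., h(n)
x = np.linalg.solve(A, f)

phi, h = x[0], np.concatenate(([0.0], x[1:]))  # average cost and relative value function
print(phi, h)
```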


Linear Programming for Average Costs

Consider randomized policies: let w_i(k) = P{D_n = k | X_n = i}. A stationary policy has w_i(k) = 1 for k = a(i) and 0 otherwise. The decision variables are x(i,k) = w_i(k) π(i), where π(i) is the limiting probability of state i.

The objective is to minimize the expected value of the average cost (expectation taken over the randomized policy):

\[ \min \sum_{i \in E} \sum_{k \in A} x(i,k)\, f_k(i) \]
\[ \text{s.t. } \sum_{k \in A} x(j,k) = \sum_{i \in E} \sum_{k \in A} x(i,k)\, P_k(i,j), \quad \text{for each } j \in E, \]
\[ \sum_{i \in E} \sum_{k \in A} x(i,k) = 1, \qquad x(i,k) \ge 0. \]

Note that one constraint will be redundant and may be dropped.
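A sketch of this LP with scipy.optimize.linprog, using the hypothetical f[k][i] and P[k][i, j] arrays from before. The balance equations and the normalization are equality constraints, x(i,k) ≥ 0 is linprog's default bound, one redundant balance equation is dropped as noted above, and the optimal policy can be read off from which x(i,k) are positive in each state.

```python
import numpy as np
from scipy.optimize import linprog

f = np.array([[1.0, 4.0], [3.0, 2.0]])               # hypothetical f_k(i)
P = np.array([[[0.8, 0.2], [0.4, 0.6]],
              [[0.3, 0.7], [0.9, 0.1]]])             # hypothetical P_k(i, j)
n_actions, n_states = f.shape
idx = lambda i, k: i * n_actions + k                 # position of x(i, k) in the flat variable vector

c = np.array([f[k, i] for i in range(n_states) for k in range(n_actions)])  # objective coefficients

A_eq, b_eq = [], []
for j in range(n_states - 1):                        # balance equations (one is redundant and dropped)
    row = np.zeros(n_states * n_actions)
    for k in range(n_actions):
        row[idx(j, k)] += 1.0                        # sum_k x(j, k)
        for i in range(n_states):
            row[idx(i, k)] -= P[k, i, j]             # - sum_i sum_k x(i, k) P_k(i, j)
    A_eq.append(row); b_eq.append(0.0)
A_eq.append(np.ones(n_states * n_actions)); b_eq.append(1.0)   # normalization: sum x(i, k) = 1

res = linprog(c=c, A_eq=np.array(A_eq), b_eq=np.array(b_eq))   # x >= 0 is the default bound
x = res.x.reshape(n_states, n_actions)
print(x)
print(x.argmax(axis=1))                              # the action with positive x(i, k) in each state
```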