
  • Markov Decision Processes

    Lirong Xia

  • Today

    1

    Ø Markov decision processes
    • search with uncertain moves and “infinite” space

    Ø Computing optimal policy
    • value iteration
    • policy iteration

  • Grid World

    2

    Ø The agent lives in a grid
    Ø Walls block the agent’s path
    Ø The agent’s actions do not always go as planned (a code sketch of this model follows below):
    • 80% of the time, the action North takes the agent North (if there is no wall there)
    • 10% of the time, North takes the agent West; 10% East
    • If there is a wall in the direction the agent would have taken, the agent stays put for this turn
    Ø Small “living” reward (or cost) at each step
    Ø Big rewards come at the end
    Ø Goal: maximize the sum of rewards
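    The noisy action model above is easy to misread, so here is a minimal Python sketch of it. The wall set, coordinate convention, and function name are illustrative assumptions, not from the slides; the 80/10/10 split follows the bullets above, with the two perpendicular directions taken as the 10% slips for every action.

```python
# Illustrative noisy grid-world action model (names and layout are assumptions).
NOISE = {  # intended action -> [(actual move, probability)]
    'N': [('N', 0.8), ('W', 0.1), ('E', 0.1)],
    'S': [('S', 0.8), ('E', 0.1), ('W', 0.1)],
    'E': [('E', 0.8), ('N', 0.1), ('S', 0.1)],
    'W': [('W', 0.8), ('S', 0.1), ('N', 0.1)],
}
MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}

def grid_transitions(s, a, walls):
    """Return {s': p(s'|s, a)} for square s = (x, y) and intended action a.

    80% intended direction, 10% each perpendicular; bumping into a wall
    leaves the agent where it is (grid borders can be modeled as walls)."""
    out = {}
    for direction, p in NOISE[a]:
        dx, dy = MOVES[direction]
        s_next = (s[0] + dx, s[1] + dy)
        if s_next in walls:
            s_next = s  # blocked: stay put this turn
        out[s_next] = out.get(s_next, 0.0) + p
    return out
```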

  • Grid Feature

    3

    [Figure: Deterministic Grid World vs. Stochastic Grid World]

  • Markov Decision Processes

    4

    Ø An MDP is defined by:
    • A set of states s ∈ S
    • A set of actions a ∈ A
    • A transition function T(s, a, s’)
      • the probability that a from s leads to s’, i.e., p(s’ | s, a)
      • sometimes called the model
    • A reward function R(s, a, s’)
      • sometimes just R(s) or R(s’)
    • A start state (or distribution)
    • Maybe a terminal state

    Ø MDPs are a family of nondeterministic search problems
    • Reinforcement learning (next class): MDPs where we don’t know the transition or reward functions
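    As a concrete reference, here is one minimal way the ingredients above might be laid out in Python. This is only a sketch: the field names and the dictionary layout are assumptions used by the later examples, not anything prescribed by the slides.

```python
# Illustrative MDP container; names and layout are assumptions, not from the slides.
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

@dataclass
class MDP:
    states: List[Any]                           # s in S
    actions: List[Any]                          # a in A
    T: Dict[Tuple[Any, Any], Dict[Any, float]]  # T[(s, a)] = {s': p(s'|s, a)}
    R: Dict[Tuple[Any, Any, Any], float]        # R[(s, a, s')] = reward for s --a--> s'
    gamma: float = 0.9                          # discount (introduced later in the deck)
    start: Any = None                           # start state (or sample from a distribution)
```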

  • What is Markov about MDPs?

    5

    Ø Andrey Markov (1856-1922)

    Ø “Markov” generally means that given the present state, the future and the past are independent

    Ø For Markov decision processes, “Markov” means:

      p(S_{t+1} = s’ | S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, …, S_0 = s_0) = p(S_{t+1} = s’ | S_t = s_t, A_t = a_t)

  • Solving MDPs

    6

    Ø In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from the start to a goal

    Ø In an MDP, we want an optimal policy π*: S → A
    • A policy π gives an action for each state
    • An optimal policy maximizes expected utility if followed

    [Figure: optimal policy when R(s, a, s’) = -0.03 for all non-terminal states]

  • Plan vs. Policy

    7

    Ø Plan
    • a path from the start to a goal

    Ø Policy
    • a collection of optimal actions, one for each state of the world
    • you can start at any state

  • Example Optimal Policies

    8

    [Figures: four optimal policies, for living rewards R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0]

  • Example: High-Low

    9

    Ø Three card types: 2, 3, 4
    Ø Infinite deck, twice as many 2’s
    Ø Start with 3 showing
    Ø After each card, you say “high” or “low”
    Ø If you’re right, you win the points shown on the new card
    Ø If tied, then skip this round
    Ø If you’re wrong, the game ends
    Ø Why not use expectimax?
    • #1: you get rewards as you go
    • #2: you might play forever!

  • High-Low as an MDP

    10

    Ø States: 2, 3, 4, done
    Ø Actions: High, Low
    Ø Model: T(s, a, s’) (the state-4 rows are listed here; a code sketch that generates the full table follows below):
    • p(s’=4 | 4, low) = ¼
    • p(s’=3 | 4, low) = ¼
    • p(s’=2 | 4, low) = ½
    • p(s’=done | 4, low) = 0
    • p(s’=4 | 4, high) = ¼
    • p(s’=3 | 4, high) = 0
    • p(s’=2 | 4, high) = 0
    • p(s’=done | 4, high) = ¾
    • …

    Ø Rewards: R(s, a, s’):
    • the number shown on s’ if s’ ≠ s
    • 0 otherwise

    Ø Start: 3
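    The transition table above can be generated rather than listed by hand. Below is a small, illustrative Python sketch that derives T(s, a, s’) from the card distribution (twice as many 2’s, so p(2) = ½ and p(3) = p(4) = ¼); the function name is an assumption, not from the slides.

```python
# Illustrative High-Low transition model derived from the card distribution.
CARD_PROBS = {2: 0.5, 3: 0.25, 4: 0.25}

def high_low_transitions(s, a):
    """Return {s': p(s'|s, a)} for showing card s and guess a in {'high', 'low'}."""
    out = {}
    for card, p in CARD_PROBS.items():
        if card == s:                        # tie: skip this round, stay on s
            s_next = s
        elif (card > s) == (a == 'high'):    # guess correct: new card is shown
            s_next = card
        else:                                # guess wrong: game ends
            s_next = 'done'
        out[s_next] = out.get(s_next, 0.0) + p
    return out

# Reproduces the slide's numbers, e.g. high_low_transitions(4, 'low')
# -> {2: 0.5, 3: 0.25, 4: 0.25}, and high_low_transitions(4, 'high')
# -> {4: 0.25, 'done': 0.75}.
```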

  • MDP Search Trees

    11

    Ø Each MDP state gives an expectimax-like search tree

  • Utilities of Sequences

    12

    Ø In order to formalize optimality of a policy, we need to understand utilities of sequences of rewards

    Ø Typically consider stationary preferences:

      [r, r_0, r_1, r_2, …] ≻ [r, r’_0, r’_1, r’_2, …]  ⇔  [r_0, r_1, r_2, …] ≻ [r’_0, r’_1, r’_2, …]

    Ø Two coherent ways to define stationary utilities
    • Additive utility: U([r_0, r_1, r_2, …]) = r_0 + r_1 + r_2 + …
    • Discounted utility: U([r_0, r_1, r_2, …]) = r_0 + γ r_1 + γ² r_2 + …

  • Infinite Utilities?!

    13

    Ø Problem: infinite state sequences can have infinite rewards

    Ø Solutions:
    • Finite horizon:
      • terminate episodes after a fixed T steps (e.g., life)
      • gives nonstationary policies (π depends on the time left)
    • Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “done” for High-Low)
    • Discounting: for 0 < γ < 1, the discounted sum of rewards stays finite whenever per-step rewards are bounded (see the bound worked out below)
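    For completeness, the standard geometric-series argument behind the discounting fix, assuming every per-step reward is bounded by some R_max (the bound is a well-known fact, not spelled out on the slide):

```latex
% Assuming |r_t| \le R_{\max} for all t and 0 \le \gamma < 1:
\left|\sum_{t=0}^{\infty} \gamma^{t} r_t\right|
  \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  \;=\; \frac{R_{\max}}{1-\gamma} \;<\; \infty
```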

  • Discounting

    14

    Ø Typically discount rewards by γ < 1 each time step
    • sooner rewards have higher utility than later rewards
    • discounting also helps the algorithms converge

  • Recap: Defining MDPs

    15

    Ø Markov decision processes:
    • States S
    • Start state s0
    • Actions A
    • Transitions p(s’|s, a) (or T(s, a, s’))
    • Rewards R(s, a, s’) (and discount γ)

    Ø MDP quantities so far:
    • Policy = choice of action for each (MAX) state
    • Utility (or return) = sum of discounted rewards

  • Optimal Utilities

    16

    Ø Fundamental operation: compute the values (optimal expectimax utilities) of states s
    • c.f. the evaluation function in expectimax

    Ø Define the value of a state s:
    • V*(s) = expected utility starting in s and acting optimally

    Ø Define the value of a q-state (s, a):
    • Q*(s, a) = expected utility starting in s, taking action a, and thereafter acting optimally

    Ø Define the optimal policy:
    • π*(s) = optimal action from state s

  • The Bellman Equations

    17

    Ø Definition of “optimal utility” leads to a simple one-step lookahead relationship amongst optimal utility values:

      optimal rewards = maximize over first step + value of following the optimal policy

    Ø Formally:

      V*(s) = max_a Q*(s, a)

      Q*(s, a) = Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]

      V*(s) = max_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]
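    Read as code, the two Bellman equations are a pair of one-line functions. This is only a sketch under the dictionary layout assumed earlier (T[(s, a)] is {s’: prob}, R[(s, a, s’)] is a number); the function names are illustrative.

```python
# One-step Bellman backups; data layout as assumed earlier (illustrative, not from the slides).
def q_value(s, a, V, T, R, gamma):
    """Q*(s, a) = sum_{s'} T(s, a, s') * [ R(s, a, s') + gamma * V(s') ]."""
    return sum(p * (R[(s, a, sp)] + gamma * V[sp]) for sp, p in T[(s, a)].items())

def state_value(s, actions, V, T, R, gamma):
    """V*(s) = max_a Q*(s, a)."""
    return max(q_value(s, a, V, T, R, gamma) for a in actions)
```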

  • Solving MDPs

    18

    Ø We want to find the optimal policy π*

    Ø Proposal 1: modified expectimax search, starting from each state s:

      π*(s) = argmax_a Q*(s, a)

      Q*(s, a) = Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]

      V*(s) = max_a Q*(s, a)

  • Why Not Search Trees?

    19

    Ø Why not solve with expectimax?

    Ø Problems:
    • this tree is usually infinite
    • the same states appear over and over
    • we would search once per state

    Ø Idea: value iteration
    • compute optimal values for all states all at once using successive approximations

  • Value Estimates

    20

    Ø Calculate estimates V_k*(s)
    • not the optimal value of s!
    • the optimal value considering only the next k time steps (k rewards)
    • as k → ∞, it approaches the optimal value

    Ø Almost a solution: recursion (i.e., expectimax)

    Ø Correct solution: dynamic programming

  • Computing the optimal policy

    21

    Ø Value iteration
    Ø Policy iteration

  • Value Iteration

    22

    Ø Idea:
    • start with V_1(s) = 0
    • given V_i, calculate the values for all states at depth i+1:

      V_{i+1}(s) ← max_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V_i(s’) ]

    • this is called a value update or Bellman update (implemented in the sketch below)
    • repeat until convergence
    • use V_i as the evaluation function when computing V_{i+1}

    Ø Theorem: value iteration will converge to unique optimal values
    • basic idea: the approximations get refined towards the optimal values
    • the policy may converge long before the values do
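    A compact value-iteration sketch in Python, under the same assumed data layout (T[(s, a)] = {s’: prob}, R[(s, a, s’)] = reward). It also assumes every state has at least one action, so a terminal state would be modeled with a zero-reward self-loop; the names and stopping tolerance are illustrative, not from the slides.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Repeatedly apply the Bellman update V_{i+1}(s) = max_a sum_{s'} T * [R + gamma * V_i]."""
    V = {s: 0.0 for s in states}  # V_1(s) = 0
    while True:
        V_next = {
            s: max(sum(p * (R[(s, a, sp)] + gamma * V[sp])
                       for sp, p in T[(s, a)].items())
                   for a in actions)
            for s in states
        }
        # Stop once the largest change (max-norm) drops below the tolerance.
        if max(abs(V_next[s] - V[s]) for s in states) < tol:
            return V_next
        V = V_next
```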

  • Example: Bellman Updates

    23

    V_{i+1}(s) = max_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V_i(s’) ]

    V_2(3,3) = Σ_{s’} T((3,3), right, s’) [ R(3,3) + 0.9 V_1(s’) ]
             = 0.9 × [0.8 × 1 + 0.1 × 0 + 0.1 × 0] = 0.72

    (the max happens for a = right; other actions are not shown)

    Example: γ = 0.9, living reward = 0, noise = 0.2

  • Example: Value Iteration

    24

    Ø Information propagates outward from terminal states, and eventually all states have correct value estimates

  • Convergence

    25

    Ø Define the max-norm: ‖U‖ = max_s |U(s)|

    Ø Theorem: for any two approximations U and V,

      ‖U_{t+1} − V_{t+1}‖ ≤ γ ‖U_t − V_t‖

    • I.e., any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true U, and value iteration converges to a unique, stable, optimal solution

    Ø Theorem:

      ‖U_{t+1} − U_t‖ ≤ ε  ⇒  ‖U_{t+1} − U‖ ≤ 2εγ / (1 − γ)

    • I.e., once the change in our approximation is small, it must also be close to correct

  • Practice: Computing Actions

    26

    Ø Which action should we choose from state s?
    • Given the optimal values V:

      argmax_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]

    • Given the optimal q-values Q:

      argmax_a Q*(s, a)
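    In code, both recipes are a single argmax. The sketch below implements the V-based version under the assumed data layout; the Q-based version would simply be max(actions, key=lambda a: Q[(s, a)]). Names are illustrative.

```python
def extract_policy(states, actions, T, R, V, gamma=0.9):
    """Greedy one-step lookahead: pi(s) = argmax_a sum_{s'} T(s,a,s') * [R(s,a,s') + gamma*V(s')]."""
    return {
        s: max(actions,
               key=lambda a: sum(p * (R[(s, a, sp)] + gamma * V[sp])
                                 for sp, p in T[(s, a)].items()))
        for s in states
    }
```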

  • Utilities for a Fixed Policy

    27

    Ø Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy

    Ø Define the utility of a state s under a fixed policy π:
    • V^π(s) = expected total discounted rewards (return) starting in s and following π

    Ø Recursive relation (one-step lookahead / Bellman equation):

      V^π(s) = Σ_{s’} T(s, π(s), s’) [ R(s, π(s), s’) + γ V^π(s’) ]

  • Policy Evaluation

    28

    Ø How do we calculate the V^π’s for a fixed policy π? (Both ideas are sketched in code below.)

    Ø Idea one: turn the recursive equations into updates

      V^π_1(s) = 0
      V^π_{i+1}(s) ← Σ_{s’} T(s, π(s), s’) [ R(s, π(s), s’) + γ V^π_i(s’) ]

    Ø Idea two: it’s just a linear system; solve with Matlab (or other tools)
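    Both ideas fit in a few lines of Python under the assumed data layout. The first function does the iterative updates; the second solves the linear system (I − γ P_π) v = r_π directly with NumPy, the moral equivalent of the Matlab solve mentioned above. Function names and the iteration count are illustrative assumptions.

```python
import numpy as np

def policy_evaluation_iterative(states, pi, T, R, gamma=0.9, iters=1000):
    """Idea one: repeated fixed-policy Bellman updates (no max over actions)."""
    V = {s: 0.0 for s in states}  # V^pi_1(s) = 0
    for _ in range(iters):
        V = {s: sum(p * (R[(s, pi[s], sp)] + gamma * V[sp])
                    for sp, p in T[(s, pi[s])].items())
             for s in states}
    return V

def policy_evaluation_exact(states, pi, T, R, gamma=0.9):
    """Idea two: solve (I - gamma * P_pi) v = r_pi, one equation per state."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    P, r = np.zeros((n, n)), np.zeros(n)
    for s in states:
        for sp, p in T[(s, pi[s])].items():
            P[idx[s], idx[sp]] += p                     # P_pi[s, s'] = T(s, pi(s), s')
            r[idx[s]] += p * R[(s, pi[s], sp)]          # expected immediate reward under pi
    v = np.linalg.solve(np.eye(n) - gamma * P, r)
    return {s: v[idx[s]] for s in states}
```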

  • Policy Iteration

    29

    Ø Alternative approach:
    • Step 1: policy evaluation: calculate utilities for some fixed policy (not optimal utilities!)
    • Step 2: policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values
    • Repeat the steps until the policy converges

    Ø This is policy iteration
    • It’s still optimal!
    • It can converge faster under some conditions

  • Policy Iteration

    30

    Ø Policy evaluation: with the fixed current policy π_k, find values with simplified Bellman updates:

      V^{π_k}_{i+1}(s) ← Σ_{s’} T(s, π_k(s), s’) [ R(s, π_k(s), s’) + γ V^{π_k}_i(s’) ]

    Ø Policy improvement: with fixed utilities, find the best action according to one-step look-ahead (both steps are combined in the sketch below):

      π_{k+1}(s) = argmax_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V^{π_k}(s’) ]

  • Comparison

    31

    Ø Both compute the same thing (optimal values for all states)

    Ø In value iteration:
    • every iteration updates both the utilities (explicitly, based on current utilities) and the policy (implicitly, based on current utilities)
    • tracking the policy isn’t necessary; we take the max

      V_{i+1}(s) ← max_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V_i(s’) ]

    Ø In policy iteration:
    • compute utilities with a fixed policy
    • after utilities are computed, a new policy is chosen

    Ø Both are dynamic programs for solving MDPs

  • Preview: Reinforcement Learning

    32

    Ø Reinforcement learning:
    • still have an MDP:
      • a set of states s ∈ S
      • a set of actions (per state) A
      • a model T(s, a, s’)
      • a reward function R(s, a, s’)
    • still looking for a policy π(s)

    • new twist: we don’t know T or R
      • i.e., we don’t know which states are good or what the actions do
      • we must actually try actions and states out to learn