
  • Markov Decision Processes

    Lirong Xia

  • Today

    1

    Ø Markov decision processes
    • search with uncertain moves and “infinite” space

    Ø Computing optimal policy
    • value iteration
    • policy iteration

  • Grid World

    2

    Ø The agent lives in a grid
    Ø Walls block the agent’s path
    Ø The agent’s actions do not always go as planned (a code sketch of this model follows below):
    • 80% of the time, the action North takes the agent North (if there is no wall there)
    • 10% of the time, North takes the agent West; 10% East
    • If there is a wall in the direction the agent would have taken, the agent stays put for this turn
    Ø Small “living” reward (or cost) at each step
    Ø Big rewards come at the end
    Ø Goal: maximize the sum of rewards
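    The noisy action model above is easy to misread, so here is a minimal Python sketch of it. The wall set, coordinate convention, and function name are illustrative assumptions, not from the slides; the 80/10/10 split follows the bullets above, with the two perpendicular directions taken as the 10% slips for every action.

```python
# Illustrative noisy grid-world action model (names and layout are assumptions).
NOISE = {  # intended action -> [(actual move, probability)]
    'N': [('N', 0.8), ('W', 0.1), ('E', 0.1)],
    'S': [('S', 0.8), ('E', 0.1), ('W', 0.1)],
    'E': [('E', 0.8), ('N', 0.1), ('S', 0.1)],
    'W': [('W', 0.8), ('S', 0.1), ('N', 0.1)],
}
MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}

def grid_transitions(s, a, walls):
    """Return {s': p(s'|s, a)} for square s = (x, y) and intended action a.

    80% intended direction, 10% each perpendicular; bumping into a wall
    leaves the agent where it is (grid borders can be modeled as walls)."""
    out = {}
    for direction, p in NOISE[a]:
        dx, dy = MOVES[direction]
        s_next = (s[0] + dx, s[1] + dy)
        if s_next in walls:
            s_next = s  # blocked: stay put this turn
        out[s_next] = out.get(s_next, 0.0) + p
    return out
```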

  • Grid Feature

    3

    [Figure: Deterministic Grid World vs. Stochastic Grid World]

  • Markov Decision Processes

    4

    Ø An MDP is defined by:
    • A set of states s ∈ S
    • A set of actions a ∈ A
    • A transition function T(s, a, s’)
      • the probability that a from s leads to s’, i.e., p(s’ | s, a)
      • sometimes called the model
    • A reward function R(s, a, s’)
      • sometimes just R(s) or R(s’)
    • A start state (or distribution)
    • Maybe a terminal state

    Ø MDPs are a family of nondeterministic search problems
    • Reinforcement learning (next class): MDPs where we don’t know the transition or reward functions
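    As a concrete reference, here is one minimal way the ingredients above might be laid out in Python. This is only a sketch: the field names and the dictionary layout are assumptions used by the later examples, not anything prescribed by the slides.

```python
# Illustrative MDP container; names and layout are assumptions, not from the slides.
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

@dataclass
class MDP:
    states: List[Any]                           # s in S
    actions: List[Any]                          # a in A
    T: Dict[Tuple[Any, Any], Dict[Any, float]]  # T[(s, a)] = {s': p(s'|s, a)}
    R: Dict[Tuple[Any, Any, Any], float]        # R[(s, a, s')] = reward for s --a--> s'
    gamma: float = 0.9                          # discount (introduced later in the deck)
    start: Any = None                           # start state (or sample from a distribution)
```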

  • What is Markov about MDPs?

    5

    Ø Andrey Markov (1856-1922)

    Ø “Markov” generally means that given the present state, the future and the past are independent

    Ø For Markov decision processes, “Markov” means:

      p(S_{t+1} = s’ | S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, …, S_0 = s_0) = p(S_{t+1} = s’ | S_t = s_t, A_t = a_t)

  • Solving MDPs

    6

    Ø In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from the start to a goal

    Ø In an MDP, we want an optimal policy π*: S → A
    • A policy π gives an action for each state
    • An optimal policy maximizes expected utility if followed

    [Figure: optimal policy when R(s, a, s’) = -0.03 for all non-terminal states]

  • Plan vs. Policy

    7

    Ø Plan
    • a path from the start to a goal

    Ø Policy
    • a collection of optimal actions, one for each state of the world
    • you can start at any state

  • Example Optimal Policies

    8

    [Figures: four optimal policies, for living rewards R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0]

  • Example: High-Low

    9

    Ø Three card types: 2, 3, 4
    Ø Infinite deck, twice as many 2’s
    Ø Start with 3 showing
    Ø After each card, you say “high” or “low”
    Ø If you’re right, you win the points shown on the new card
    Ø If tied, then skip this round
    Ø If you’re wrong, the game ends
    Ø Why not use expectimax?
    • #1: you get rewards as you go
    • #2: you might play forever!

  • High-Low as an MDP

    10

    Ø States: 2, 3, 4, done
    Ø Actions: High, Low
    Ø Model: T(s, a, s’) (the state-4 rows are listed here; a code sketch that generates the full table follows below):
    • p(s’=4 | 4, low) = ¼
    • p(s’=3 | 4, low) = ¼
    • p(s’=2 | 4, low) = ½
    • p(s’=done | 4, low) = 0
    • p(s’=4 | 4, high) = ¼
    • p(s’=3 | 4, high) = 0
    • p(s’=2 | 4, high) = 0
    • p(s’=done | 4, high) = ¾
    • …

    Ø Rewards: R(s, a, s’):
    • the number shown on s’ if s’ ≠ s
    • 0 otherwise

    Ø Start: 3
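    The transition table above can be generated rather than listed by hand. Below is a small, illustrative Python sketch that derives T(s, a, s’) from the card distribution (twice as many 2’s, so p(2) = ½ and p(3) = p(4) = ¼); the function name is an assumption, not from the slides.

```python
# Illustrative High-Low transition model derived from the card distribution.
CARD_PROBS = {2: 0.5, 3: 0.25, 4: 0.25}

def high_low_transitions(s, a):
    """Return {s': p(s'|s, a)} for showing card s and guess a in {'high', 'low'}."""
    out = {}
    for card, p in CARD_PROBS.items():
        if card == s:                        # tie: skip this round, stay on s
            s_next = s
        elif (card > s) == (a == 'high'):    # guess correct: new card is shown
            s_next = card
        else:                                # guess wrong: game ends
            s_next = 'done'
        out[s_next] = out.get(s_next, 0.0) + p
    return out

# Reproduces the slide's numbers, e.g. high_low_transitions(4, 'low')
# -> {2: 0.5, 3: 0.25, 4: 0.25}, and high_low_transitions(4, 'high')
# -> {4: 0.25, 'done': 0.75}.
```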

  • MDP Search Trees

    11

    Ø Each MDP state gives an expectimax-like search tree

  • Utilities of Sequences

    12

    Ø In order to formalize optimality of a policy, we need to understand utilities of sequences of rewards

    Ø Typically consider stationary preferences:

      [r, r_0, r_1, r_2, …] ≻ [r, r’_0, r’_1, r’_2, …]  ⇔  [r_0, r_1, r_2, …] ≻ [r’_0, r’_1, r’_2, …]

    Ø Two coherent ways to define stationary utilities
    • Additive utility: U([r_0, r_1, r_2, …]) = r_0 + r_1 + r_2 + …
    • Discounted utility: U([r_0, r_1, r_2, …]) = r_0 + γ r_1 + γ² r_2 + …

  • Infinite Utilities?!

    13

    Ø Problem: infinite state sequences can have infinite rewards

    Ø Solutions:
    • Finite horizon:
      • terminate episodes after a fixed T steps (e.g., life)
      • gives nonstationary policies (π depends on the time left)
    • Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “done” for High-Low)
    • Discounting: for 0 < γ < 1, the discounted sum of rewards stays finite whenever per-step rewards are bounded (see the bound worked out below)
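    For completeness, the standard geometric-series argument behind the discounting fix, assuming every per-step reward is bounded by some R_max (the bound is a well-known fact, not spelled out on the slide):

```latex
% Assuming |r_t| \le R_{\max} for all t and 0 \le \gamma < 1:
\left|\sum_{t=0}^{\infty} \gamma^{t} r_t\right|
  \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  \;=\; \frac{R_{\max}}{1-\gamma} \;<\; \infty
```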

  • Discounting

    14

    Ø Typically discount rewards by γ < 1 each time step
    • sooner rewards have higher utility than later rewards
    • discounting also helps the algorithms converge

  • Recap: Defining MDPs

    15

    Ø Markov decision processes:
    • States S
    • Start state s0
    • Actions A
    • Transitions p(s’|s, a) (or T(s, a, s’))
    • Rewards R(s, a, s’) (and discount γ)

    Ø MDP quantities so far:
    • Policy = choice of action for each (MAX) state
    • Utility (or return) = sum of discounted rewards

  • Optimal Utilities

    16

    Ø Fundamental operation: compute the values (optimal expectimax utilities) of states s
    • c.f. the evaluation function in expectimax

    Ø Define the value of a state s:
    • V*(s) = expected utility starting in s and acting optimally

    Ø Define the value of a q-state (s, a):
    • Q*(s, a) = expected utility starting in s, taking action a, and thereafter acting optimally

    Ø Define the optimal policy:
    • π*(s) = optimal action from state s

  • The Bellman Equations

    17

    Ø Definition of “optimal utility” leads to a simple one-step lookahead relationship amongst optimal utility values:

      optimal rewards = maximize over first step + value of following the optimal policy

    Ø Formally:

      V*(s) = max_a Q*(s, a)

      Q*(s, a) = Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]

      V*(s) = max_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]
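    Read as code, the two Bellman equations are a pair of one-line functions. This is only a sketch under the dictionary layout assumed earlier (T[(s, a)] is {s’: prob}, R[(s, a, s’)] is a number); the function names are illustrative.

```python
# One-step Bellman backups; data layout as assumed earlier (illustrative, not from the slides).
def q_value(s, a, V, T, R, gamma):
    """Q*(s, a) = sum_{s'} T(s, a, s') * [ R(s, a, s') + gamma * V(s') ]."""
    return sum(p * (R[(s, a, sp)] + gamma * V[sp]) for sp, p in T[(s, a)].items())

def state_value(s, actions, V, T, R, gamma):
    """V*(s) = max_a Q*(s, a)."""
    return max(q_value(s, a, V, T, R, gamma) for a in actions)
```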

  • Solving MDPs

    18

    Ø We want to find the optimal policy π*

    Ø Proposal 1: modified expectimax search, starting from each state s:

      π*(s) = argmax_a Q*(s, a)

      Q*(s, a) = Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]

      V*(s) = max_a Q*(s, a)

  • Why Not Search Trees?

    19

    Ø Why not solve with expectimax?

    Ø Problems:
    • this tree is usually infinite
    • the same states appear over and over
    • we would search once per state

    Ø Idea: value iteration
    • compute optimal values for all states all at once using successive approximations

  • Value Estimates

    20

    Ø Calculate estimates V_k*(s)
    • not the optimal value of s!
    • the optimal value considering only the next k time steps (k rewards)
    • as k → ∞, it approaches the optimal value

    Ø Almost a solution: recursion (i.e., expectimax)

    Ø Correct solution: dynamic programming

  • Computing the optimal policy

    21

    Ø Value iteration
    Ø Policy iteration

  • Value Iteration

    22

    Ø Idea:
    • start with V_1(s) = 0
    • given V_i, calculate the values for all states at depth i+1:

      V_{i+1}(s) ← max_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V_i(s’) ]

    • this is called a value update or Bellman update (implemented in the sketch below)
    • repeat until convergence
    • use V_i as the evaluation function when computing V_{i+1}

    Ø Theorem: value iteration will converge to unique optimal values
    • basic idea: the approximations get refined towards the optimal values
    • the policy may converge long before the values do
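    A compact value-iteration sketch in Python, under the same assumed data layout (T[(s, a)] = {s’: prob}, R[(s, a, s’)] = reward). It also assumes every state has at least one action, so a terminal state would be modeled with a zero-reward self-loop; the names and stopping tolerance are illustrative, not from the slides.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Repeatedly apply the Bellman update V_{i+1}(s) = max_a sum_{s'} T * [R + gamma * V_i]."""
    V = {s: 0.0 for s in states}  # V_1(s) = 0
    while True:
        V_next = {
            s: max(sum(p * (R[(s, a, sp)] + gamma * V[sp])
                       for sp, p in T[(s, a)].items())
                   for a in actions)
            for s in states
        }
        # Stop once the largest change (max-norm) drops below the tolerance.
        if max(abs(V_next[s] - V[s]) for s in states) < tol:
            return V_next
        V = V_next
```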

  • Example: Bellman Updates

    23

    V_{i+1}(s) = max_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V_i(s’) ]

    V_2(3,3) = Σ_{s’} T((3,3), right, s’) [ R(3,3) + 0.9 V_1(s’) ]
             = 0.9 × [0.8 × 1 + 0.1 × 0 + 0.1 × 0] = 0.72

    (the max happens for a = right; other actions are not shown)

    Example: γ = 0.9, living reward = 0, noise = 0.2

  • Example: Value Iteration

    24

    Ø Information propagates outward from terminal states, and eventually all states have correct value estimates

  • Convergence

    25

    Ø Define the max-norm: ‖U‖ = max_s |U(s)|

    Ø Theorem: for any two approximations U and V,

      ‖U_{t+1} − V_{t+1}‖ ≤ γ ‖U_t − V_t‖

    • I.e., any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true U, and value iteration converges to a unique, stable, optimal solution

    Ø Theorem:

      ‖U_{t+1} − U_t‖ ≤ ε  ⇒  ‖U_{t+1} − U‖ ≤ 2εγ / (1 − γ)

    • I.e., once the change in our approximation is small, it must also be close to correct

  • Practice: Computing Actions

    26

    Ø Which action should we choose from state s?
    • Given the optimal values V:

      argmax_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]

    • Given the optimal q-values Q:

      argmax_a Q*(s, a)
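    In code, both recipes are a single argmax. The sketch below implements the V-based version under the assumed data layout; the Q-based version would simply be max(actions, key=lambda a: Q[(s, a)]). Names are illustrative.

```python
def extract_policy(states, actions, T, R, V, gamma=0.9):
    """Greedy one-step lookahead: pi(s) = argmax_a sum_{s'} T(s,a,s') * [R(s,a,s') + gamma*V(s')]."""
    return {
        s: max(actions,
               key=lambda a: sum(p * (R[(s, a, sp)] + gamma * V[sp])
                                 for sp, p in T[(s, a)].items()))
        for s in states
    }
```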

  • Utilities for a Fixed Policy

    27

    Ø Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy

    Ø Define the utility of a state s under a fixed policy π:
    • V^π(s) = expected total discounted rewards (return) starting in s and following π

    Ø Recursive relation (one-step lookahead / Bellman equation):

      V^π(s) = Σ_{s’} T(s, π(s), s’) [ R(s, π(s), s’) + γ V^π(s’) ]

  • Policy Evaluation

    28

    Ø How do we calculate the V^π’s for a fixed policy π? (Both ideas are sketched in code below.)

    Ø Idea one: turn the recursive equations into updates

      V^π_1(s) = 0
      V^π_{i+1}(s) ← Σ_{s’} T(s, π(s), s’) [ R(s, π(s), s’) + γ V^π_i(s’) ]

    Ø Idea two: it’s just a linear system; solve with Matlab (or other tools)
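    Both ideas fit in a few lines of Python under the assumed data layout. The first function does the iterative updates; the second solves the linear system (I − γ P_π) v = r_π directly with NumPy, the moral equivalent of the Matlab solve mentioned above. Function names and the iteration count are illustrative assumptions.

```python
import numpy as np

def policy_evaluation_iterative(states, pi, T, R, gamma=0.9, iters=1000):
    """Idea one: repeated fixed-policy Bellman updates (no max over actions)."""
    V = {s: 0.0 for s in states}  # V^pi_1(s) = 0
    for _ in range(iters):
        V = {s: sum(p * (R[(s, pi[s], sp)] + gamma * V[sp])
                    for sp, p in T[(s, pi[s])].items())
             for s in states}
    return V

def policy_evaluation_exact(states, pi, T, R, gamma=0.9):
    """Idea two: solve (I - gamma * P_pi) v = r_pi, one equation per state."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    P, r = np.zeros((n, n)), np.zeros(n)
    for s in states:
        for sp, p in T[(s, pi[s])].items():
            P[idx[s], idx[sp]] += p                     # P_pi[s, s'] = T(s, pi(s), s')
            r[idx[s]] += p * R[(s, pi[s], sp)]          # expected immediate reward under pi
    v = np.linalg.solve(np.eye(n) - gamma * P, r)
    return {s: v[idx[s]] for s in states}
```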

  • Policy Iteration

    29

    Ø Alternative approach:
    • Step 1: policy evaluation: calculate utilities for some fixed policy (not optimal utilities!)
    • Step 2: policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values
    • Repeat the steps until the policy converges

    Ø This is policy iteration
    • It’s still optimal!
    • It can converge faster under some conditions

  • Policy Iteration

    30

    Ø Policy evaluation: with the fixed current policy π_k, find values with simplified Bellman updates:

      V^{π_k}_{i+1}(s) ← Σ_{s’} T(s, π_k(s), s’) [ R(s, π_k(s), s’) + γ V^{π_k}_i(s’) ]

    Ø Policy improvement: with fixed utilities, find the best action according to one-step look-ahead (both steps are combined in the sketch below):

      π_{k+1}(s) = argmax_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V^{π_k}(s’) ]

  • Comparison

    31

    Ø Both compute the same thing (optimal values for all states)

    Ø In value iteration:
    • every iteration updates both the utilities (explicitly, based on current utilities) and the policy (implicitly, based on current utilities)
    • tracking the policy isn’t necessary; we take the max

      V_{i+1}(s) ← max_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V_i(s’) ]

    Ø In policy iteration:
    • compute utilities with a fixed policy
    • after utilities are computed, a new policy is chosen

    Ø Both are dynamic programs for solving MDPs

  • Preview: Reinforcement Learning

    32

    Ø Reinforcement learning:
    • still have an MDP:
      • a set of states s ∈ S
      • a set of actions (per state) A
      • a model T(s, a, s’)
      • a reward function R(s, a, s’)
    • still looking for a policy π(s)

    • new twist: we don’t know T or R
      • i.e., we don’t know which states are good or what the actions do
      • we must actually try actions and states out to learn