-
Lirong Xia
Markov Decision Processes
-
ØMarkov decision processes
• search with uncertain moves and “infinite” space
ØComputing the optimal policy
• value iteration
• policy iteration
1
Today
-
Grid World
2
ØThe agent lives in a grid
ØWalls block the agent’s path
ØThe agent’s actions do not always go as planned:
• 80% of the time, the action North takes the agent North (if there is no wall there)
• 10% of the time, North takes the agent West; 10% East
• If there is a wall in the direction the agent would have taken, the agent stays put for this turn
ØSmall “living” reward (or cost) each step
ØBig rewards come at the end
ØGoal: maximize the sum of rewards
-
Grid Feature
3
Deterministic Grid World Stochastic Grid World
-
Markov Decision Processes
4
ØAn MDP is defined by:
• A set of states s∈S
• A set of actions a∈A
• A transition function T(s, a, s’)
• Prob. that a from s leads to s’, i.e., p(s’|s, a)
• sometimes called the model
• A reward function R(s, a, s’)
• Sometimes just R(s) or R(s’)
• A start state (or distribution)
• Maybe a terminal state
ØMDPs are a family of nondeterministic search problems
• Reinforcement learning (next class): MDPs where we don’t know the transition or reward functions
-
What is Markov about MDPs?
5
ØAndrey Markov (1856-1922)
Ø “Markov” generally means that given the present state, the future and the past are independent
ØFor Markov decision processes, “Markov” means:
p(S_{t+1} = s’ | S_t = s_t, A_t = a_t, S_{t−1} = s_{t−1}, A_{t−1} = a_{t−1}, …, S_0 = s_0) = p(S_{t+1} = s’ | S_t = s_t, A_t = a_t)
-
Solving MDPs
6
Ø In deterministic single-agent search problems, want an optimal plan, or sequence of actions, from start to a goal
Ø In an MDP, we want an optimal policy π*: S → A
• A policy π gives an action for each state
• An optimal policy maximizes expected utility if followed
Optimal policy when R(s, a, s’) = -0.03 for all non-terminal states
-
ØPlan
• A path from the start to a GOAL
ØPolicy
• a collection of optimal actions, one for each state of the world
• you can start at any state
7
Plan vs. Policy
-
Example Optimal Policies
8
R(s) = -0.01 R(s) = -0.03
R(s) = -0.4 R(s) = -2.0
-
Example: High-Low
9
ØThree card types: 2, 3, 4
Ø Infinite deck, twice as many 2’s
ØStart with 3 showing
ØAfter each card, you say “high” or “low”
Ø If you’re right, you win the points shown on the new card
Ø If tied, then skip this round
Ø If you’re wrong, the game ends
ØWhy not use expectimax?
• #1: you get rewards as you go
• #2: you might play forever!
-
High-Low as an MDP
10
ØStates: 2, 3, 4, done
ØActions: High, Low
ØModel: T(s,a,s’):
• p(s’=4 | 4, low) = 1/4
• p(s’=3 | 4, low) = 1/4
• p(s’=2 | 4, low) = 1/2
• p(s’=done | 4, low) = 0
• p(s’=4 | 4, high) = 1/4
• p(s’=3 | 4, high) = 0
• p(s’=2 | 4, high) = 0
• p(s’=done | 4, high) = 3/4
• …
ØRewards: R(s,a,s’):
• the number shown on s’ if s’ ≠ s
• 0 otherwise
ØStart: 3
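The model above can be generated from the deck composition rather than enumerated case by case. A minimal Python sketch, assuming the slide’s state names (2, 3, 4, done) and actions (high, low); `P_CARD` and `transition` are illustrative names, not from the lecture:

```python
# Infinite deck with twice as many 2's: p(2) = 1/2, p(3) = p(4) = 1/4.
P_CARD = {2: 0.5, 3: 0.25, 4: 0.25}

def transition(s, a):
    """Return T(s, a, .) as a dict {s': prob} for guessing a with card s showing."""
    out = {}
    for card, p in P_CARD.items():
        if card == s:
            nxt = s          # tie: skip this round, same card stays showing
        elif (card > s) == (a == 'high'):
            nxt = card       # correct guess: the new card becomes the state
        else:
            nxt = 'done'     # wrong guess: game ends
        out[nxt] = out.get(nxt, 0.0) + p
    return out
```

For example, `transition(4, 'low')` reproduces the probabilities listed on the slide: 1/4 each for s’ = 4 and s’ = 3, and 1/2 for s’ = 2.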
-
MDP Search Trees
11
ØEach MDP state gives an expectimax-like search tree
-
Utilities of Sequences
12
Ø In order to formalize optimality of a policy, need to understand utilities of sequences of rewards
ØTypically consider stationary preferences:
[r, r_0, r_1, r_2, …] ≻ [r, r’_0, r’_1, r’_2, …] ⇔ [r_0, r_1, r_2, …] ≻ [r’_0, r’_1, r’_2, …]
ØTwo coherent ways to define stationary utilities
• Additive utility: U([r_0, r_1, r_2, …]) = r_0 + r_1 + r_2 + ⋯
• Discounted utility: U([r_0, r_1, r_2, …]) = r_0 + γ r_1 + γ² r_2 + ⋯
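The two definitions can be sketched in a few lines of Python (the function names are illustrative):

```python
def additive_utility(rewards):
    # U([r0, r1, r2, ...]) = r0 + r1 + r2 + ...
    return sum(rewards)

def discounted_utility(rewards, gamma):
    # U([r0, r1, r2, ...]) = r0 + gamma*r1 + gamma^2*r2 + ...
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

For instance, the reward sequence [1, 1, 1] with γ = 0.5 has additive utility 3 but discounted utility 1 + 0.5 + 0.25 = 1.75.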
-
Infinite Utilities?!
13
ØProblem: infinite state sequences have infinite rewards
ØSolutions:
• Finite horizon:
• Terminate episodes after a fixed T steps (e.g. life)
• Gives nonstationary policies (π depends on time left)
• Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “done” for High-Low)
• Discounting: use 0 < γ < 1
-
Discounting
14
ØTypically discount rewards by γ < 1 each time step
• Sooner rewards have higher utility than later rewards
• Also helps the algorithms converge
-
Recap: Defining MDPs
15
ØMarkov decision processes:
• States S
• Start state s_0
• Actions A
• Transition p(s’|s,a) (or T(s,a,s’))
• Reward R(s,a,s’) (and discount γ)
ØMDP quantities so far:
• Policy = Choice of action for each (MAX) state
• Utility (or return) = sum of discounted rewards
-
Optimal Utilities
16
ØFundamental operation: compute the values (optimal expectimax utilities) of states s
• c.f. evaluation function in expectimax
ØDefine the value of a state s:
• V*(s) = expected utility starting in s and acting optimally
ØDefine the value of a q-state (s,a):
• Q*(s,a) = expected utility starting in s, taking action a, and thereafter acting optimally
ØDefine the optimal policy:
• π*(s) = optimal action from state s
-
The Bellman Equations
17
ØDefinition of “optimal utility” leads to a simple one-step lookahead relationship amongst optimal utility values:
Optimal rewards = maximize over first step + value of following the optimal policy
ØFormally:
V*(s) = max_a Q*(s,a)
Q*(s,a) = Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
V*(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
-
Solving MDPs
18
ØWe want to find the optimal policy π*
ØProposal 1: modified expectimax search, starting from each state s:
π*(s) = argmax_a Q*(s,a)
Q*(s,a) = Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
V*(s) = max_a Q*(s,a)
-
Why Not Search Trees?
19
ØWhy not solve with expectimax?
ØProblems:
• This tree is usually infinite
• Same states appear over and over
• We would search once per state
Ø Idea: value iteration
• Compute optimal values for all states all at once using successive approximations
-
Value Estimates
20
ØCalculate estimates V_k*(s)
• Not the optimal value of s!
• The optimal value considering only the next k time steps (k rewards)
• As k → ∞, it approaches the optimal value
ØAlmost solution: recursion (i.e. expectimax)
ØCorrect solution: dynamic programming
-
ØValue iterationØPolicy iteration
21
Computing the optimal policy
-
Value Iteration
22
Ø Idea:
• Start with V_1(s) = 0
• Given V_i, calculate the values for all states for depth i+1:
V_{i+1}(s) ← max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V_i(s’) ]
• This is called a value update or Bellman update
• Repeat until convergence
• Use V_i as the evaluation function when computing V_{i+1}
ØTheorem: will converge to unique optimal values
• Basic idea: approximations get refined towards optimal values
• Policy may converge long before values do
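Putting the Bellman update and a convergence test together, here is a minimal value-iteration sketch in Python, run on the High-Low MDP from the earlier slides. The discount γ = 0.9, the stopping tolerance, and all names are assumptions for illustration, not from the lecture:

```python
GAMMA = 0.9
P_CARD = {2: 0.5, 3: 0.25, 4: 0.25}  # infinite deck, twice as many 2's

def transition(s, a):
    """T(s, a, .) as a dict {s': prob} for guessing a with card s showing."""
    out = {}
    for card, p in P_CARD.items():
        if card == s:
            nxt = s                               # tie: skip this round
        elif (card > s) == (a == 'high'):
            nxt = card                            # correct guess: keep playing
        else:
            nxt = 'done'                          # wrong guess: game over
        out[nxt] = out.get(nxt, 0.0) + p
    return out

def reward(s, a, s2):
    # Points shown on the new card s' if s' != s; 0 on ties and at 'done'.
    return s2 if s2 != 'done' and s2 != s else 0

def value_iteration(eps=1e-8):
    V = {2: 0.0, 3: 0.0, 4: 0.0, 'done': 0.0}
    while True:
        newV = dict(V)
        for s in (2, 3, 4):                       # one Bellman update per state
            newV[s] = max(
                sum(p * (reward(s, a, s2) + GAMMA * V[s2])
                    for s2, p in transition(s, a).items())
                for a in ('high', 'low'))
        if max(abs(newV[s] - V[s]) for s in V) < eps:
            return newV
        V = newV
```

Because γ < 1, successive sweeps are a contraction, so the loop terminates with values close to V* for every state.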
-
Example: Bellman Updates
23
V_{i+1}(s) ← max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V_i(s’) ]
V_2(3,3) = Σ_s’ T((3,3), right, s’) [ R(3,3) + 0.9 V_1(s’) ]
= 0.9 × (0.8×1 + 0.1×0 + 0.1×0) = 0.72
max happens for a = right; other actions not shown
Example: γ = 0.9, living reward = 0, noise = 0.2
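This single update can be checked numerically. A sketch assuming, as on the slide, that V_1 is 1 in the square the successful move reaches and 0 elsewhere; the variable names are illustrative:

```python
# One Bellman update for state (3,3) with action "right":
# gamma = 0.9, R(3,3) = 0 (no living reward), noise = 0.2.
gamma = 0.9
outcomes = [(0.8, 1.0),   # intended move succeeds; V1 there is 1
            (0.1, 0.0),   # slips to one side; V1 there is 0
            (0.1, 0.0)]   # slips to the other side; V1 there is 0
V2_33 = sum(p * (0.0 + gamma * v1) for p, v1 in outcomes)
# V2_33 = 0.9 * 0.8 = 0.72 (up to floating-point rounding)
```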
-
Example: Value Iteration
24
Ø Information propagates outward from terminal states and eventually all states have correct value estimates
-
Convergence
25
ØDefine the max-norm: ||U|| = max_s |U(s)|
ØTheorem: for any two approximations U and V,
||U^{t+1} − V^{t+1}|| ≤ γ ||U^t − V^t||
• I.e. any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true U, and value iteration converges to a unique, stable, optimal solution
ØTheorem:
||U^{t+1} − U^t|| < ε ⇒ ||U^{t+1} − U|| < 2εγ / (1 − γ)
• I.e. once the change in our approximation is small, it must also be close to correct
-
Practice: Computing Actions
26
ØWhich action should we choose from state s:
• Given optimal values V?
argmax_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
• Given optimal q-values Q?
argmax_a Q*(s,a)
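Both recipes can be sketched generically in Python, assuming a transition function `T(s, a)` returning `{s': prob}`, a reward function `R(s, a, s')`, and value/q-value tables as dicts; all names are illustrative:

```python
def greedy_action_from_v(s, actions, T, R, V, gamma):
    """One-step lookahead: argmax_a sum_s' T(s,a,s')[R(s,a,s') + gamma*V[s']]."""
    def q(a):
        return sum(p * (R(s, a, s2) + gamma * V[s2])
                   for s2, p in T(s, a).items())
    return max(actions, key=q)

def greedy_action_from_q(s, actions, Q):
    """With q-values no lookahead is needed: just argmax_a Q[s, a]."""
    return max(actions, key=lambda a: Q[(s, a)])
```

The asymmetry is the point of q-values: extracting a policy from V requires knowing T and R, while extracting it from Q does not.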
-
Utilities for a Fixed Policy
27
ØAnother basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy
ØDefine the utility of a state s, under a fixed policy π:
• V^π(s) = expected total discounted rewards (return) starting in s and following π
ØRecursive relation (one-step lookahead / Bellman equation):
V^π(s) = Σ_s’ T(s, π(s), s’) [ R(s, π(s), s’) + γ V^π(s’) ]
-
Policy Evaluation
28
ØHow do we calculate the V’s for a fixed policy?
Ø Idea one: turn the recursive equations into updates:
V^π_1(s) = 0
V^π_{i+1}(s) ← Σ_s’ T(s, π(s), s’) [ R(s, π(s), s’) + γ V^π_i(s’) ]
Ø Idea two: it’s just a linear system; solve with Matlab (or other tools)
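Idea two can be sketched with NumPy: in matrix form V^π = R_π + γ T_π V^π, so (I − γ T_π) V^π = R_π is a single linear solve. The matrix names below are illustrative:

```python
import numpy as np

def evaluate_policy_exact(T_pi, R_pi, gamma):
    """Solve (I - gamma * T_pi) V = R_pi for the fixed-policy values V.

    T_pi[i, j] = p(s_j | s_i, pi(s_i)); R_pi[i] = expected one-step reward in s_i.
    """
    n = T_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
```

For example, a two-state chain where state 0 moves to absorbing state 1 with reward 1 (and state 1 earns nothing) gives V = [1, 0] for any γ < 1.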
-
Policy Iteration
29
ØAlternative approach:
• Step 1: policy evaluation: calculate utilities for some fixed policy (not the optimal utilities!)
• Step 2: policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values
• Repeat the steps until the policy converges
ØThis is policy iteration• It’s still optimal!• Can converge faster under some conditions
-
Policy Iteration
30
ØPolicy evaluation: with the current policy π_k fixed, find values with simplified Bellman updates:
V^{π_k}_{i+1}(s) ← Σ_s’ T(s, π_k(s), s’) [ R(s, π_k(s), s’) + γ V^{π_k}_i(s’) ]
ØPolicy improvement: with the utilities fixed, find the best action according to one-step look-ahead:
π_{k+1}(s) = argmax_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V^{π_k}(s’) ]
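The two steps combine into a compact policy-iteration sketch for a finite MDP given as dicts, `T[s][a] = {s': prob}` and a reward function `R(s, a, s')`. All names, and the use of a simple iterative inner evaluation rather than an exact linear solve, are assumptions for illustration:

```python
def policy_iteration(states, actions, T, R, gamma, eval_iters=200):
    pi = {s: actions[0] for s in states}          # arbitrary initial policy
    while True:
        # Step 1: policy evaluation with the fixed policy pi
        V = {s: 0.0 for s in states}
        for _ in range(eval_iters):
            V = {s: sum(p * (R(s, pi[s], s2) + gamma * V[s2])
                        for s2, p in T[s][pi[s]].items())
                 for s in states}
        # Step 2: policy improvement by one-step lookahead
        new_pi = {}
        for s in states:
            new_pi[s] = max(actions, key=lambda a: sum(
                p * (R(s, a, s2) + gamma * V[s2])
                for s2, p in T[s][a].items()))
        if new_pi == pi:                          # policy converged: done
            return pi, V
        pi = new_pi
```

On a toy two-state MDP where moving from A to B pays 1, improvement flips the initial "stay" policy to "go" in one round and the loop then terminates.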
-
Comparison
31
ØBoth compute same thing (optimal values for all states)
Ø In value iteration:
• Every iteration updates both utilities (explicitly, based on current utilities) and the policy (implicitly, based on current utilities)
• Tracking the policy isn’t necessary; we take the max:
V_{i+1}(s) ← max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V_i(s’) ]
Ø In policy iteration:
• Compute utilities with a fixed policy
• After utilities are computed, a new policy is chosen
ØBoth are dynamic programs for solving MDPs
-
Preview: Reinforcement Learning
32
ØReinforcement learning:
• Still have an MDP:
• A set of states s∈S
• A set of actions (per state) A
• A model T(s,a,s’)
• A reward function R(s,a,s’)
• Still looking for a policy π(s)
• New twist: don’t know T or R
• I.e. don’t know which states are good or what the actions do
• Must actually try actions and states out to learn