Chapter 17 – Making Complex Decisions A


Transcript of Chapter 17 – Making Complex Decisions A

Page 1: Chapter 17 – Making Complex Decisions A

Chapter 17 – Making Complex Decisions A

Sequential decision problems: the utility depends on a sequence of decisions, instead of the one-shot or episodic decisions of the last chapter

Stochastic environments are non-deterministic: the outcomes of actions are governed by probabilities

Overview
• Markov decision processes (MDPs)
• Algorithms for finding optimal policies – the solution to an MDP
• Partially observable Markov decision processes (POMDPs)

Page 2: Chapter 17 – Making Complex Decisions A

17.1 – Sequential Decision Problems: Simple Example

• Fully observable environment
• Markovian property: the probability of moving to a new state depends only on the current state
• Actions: [Left, Right, Up, Down]
• The agent moves in the chosen direction with probability 0.8
• Reward = +1 and -1 for the two goal states, -0.04 for non-goal states
• We want to maximize the total reward, or utility

Page 3: Chapter 17 – Making Complex Decisions A

Markov Decision Process (MDP)

We use MDPs to solve sequential decision problems. We eventually want to find the best choice of action for each state.

Consists of:
• a set of actions A(s) – the actions available in each state s
• a transition model P(s' | s, a) – the probability of reaching s' by taking action a in s; transitions are Markovian – they depend only on s, not on earlier states
• a reward function R(s) – the reward the agent receives for arriving in state s (a minimal code sketch of these components follows)
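One minimal way to hold these three components in code is the illustrative container below. The class name, field layout, and the two-state toy numbers are my own, not from the slides.

```python
# A minimal MDP container: states, actions A(s), transition model P(s'|s,a),
# and reward function R(s); gamma is the discount factor used on later slides.
from dataclasses import dataclass

@dataclass
class MDP:
    states: list     # all states s
    actions: dict    # A(s): list of actions available in each state
    P: dict          # P[(s, a)] -> {s': probability of reaching s'}
    R: dict          # R[s]: reward for arriving in state s
    gamma: float = 0.9

# Illustrative two-state toy MDP: from 'a', action 'go' usually reaches 'b'.
toy = MDP(
    states=["a", "b"],
    actions={"a": ["go", "stay"], "b": []},   # treat 'b' as terminal: no actions
    P={
        ("a", "go"):   {"b": 0.8, "a": 0.2},  # Markovian: depends only on s and a
        ("a", "stay"): {"a": 1.0},
    },
    R={"a": -0.04, "b": 1.0},
)
```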

Page 4: Chapter 17 – Making Complex Decisions A

Policies

• Utility U(s): the sum of rewards obtained over the state sequence, including that state
• The agent wants to maximize utility
• Policy π: a function that specifies an action for each state
• Expected utility: the expected sum of rewards when following a policy
• Optimal policy π*: the policy that yields the highest expected utility – the solution to the problem

Page 5: Chapter 17 – Making Complex Decisions A

Utilities Over Time

• A time interval is one step in the sequence
• Finite or infinite horizon: an infinite horizon means unlimited time steps

  – Infinite horizons have stationary policies – the policy doesn't change over time

How to calculate the utility of state sequences?
• assume stationary preferences over state sequences: preferences don't depend on when the sequence starts
• additive rewards

• discounted rewards

• utility is finite over an infinite horizon with discounted rewards
• if γ < 1 then Uh ≤ Rmax / (1 – γ) when summed to infinity
• improper policy: one that is not guaranteed to reach a terminal state
• discounting is used mainly to avoid the problems caused by improper policies

Additive rewards: U_h([s_0, s_1, s_2, ...]) = R(s_0) + R(s_1) + R(s_2) + ...

Discounted rewards: U_h([s_0, s_1, s_2, ...]) = R(s_0) + γ R(s_1) + γ² R(s_2) + ...
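As a quick numeric illustration of the discounted case (the reward sequence and γ below are made up for the sketch, not taken from the slides):

```python
# Discounted utility of a short reward sequence, plus the Rmax/(1-gamma) bound.
gamma = 0.9
rewards = [-0.04, -0.04, -0.04, 1.0]      # illustrative R(s0), R(s1), R(s2), R(s3)

u = sum(gamma**t * r for t, r in enumerate(rewards))
print(u)                                  # R(s0) + γR(s1) + γ²R(s2) + γ³R(s3) ≈ 0.6206

r_max = 1.0
print(r_max / (1 - gamma))                # bound on Uh for an infinite sequence: 10.0
```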

Page 6: Chapter 17 – Making Complex Decisions A

Optimal Policies for utilities of states

Expected utility for some policy π starting in state s

The optimal policy π* has the highest expected utility and will be given by

This sets π*(s) to the action a in A(s) that gives the highest expected utility

The policy is actually independent of the start state:
• the action sequence taken will differ, but the policy itself never changes
• this follows from the nature of a Markovian decision problem with discounted utilities over infinite horizons
• U(s) is likewise independent of the start state

U^π(s) = E[ Σ_{t=0}^{∞} γ^t R(S_t) ]

π*(s) = argmax_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')
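As a sketch, extracting π* from already-computed utilities via the argmax above might look like this (the function name is mine, and it reuses the illustrative MDP container from earlier):

```python
# pi*(s) = argmax_{a in A(s)} sum_{s'} P(s'|s,a) U(s')
def best_policy(mdp, U):
    pi = {}
    for s in mdp.states:
        if not mdp.actions[s]:          # terminal state: nothing to choose
            pi[s] = None
            continue
        pi[s] = max(
            mdp.actions[s],
            key=lambda a: sum(p * U[s2] for s2, p in mdp.P[(s, a)].items()),
        )
    return pi
```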

Page 7: Chapter 17 – Making Complex Decisions A

Optimal Policies for utilities of states

Utilities for states in simple example with γ = 1

Page 8: Chapter 17 – Making Complex Decisions A

17.2 – Value Iteration: Bellman Equation for Utilities

Bellman Equation:

Utility of a state = the reward for that state plus the discounted expected utility of the next state. The equation is non-linear (because of the max operator) and depends on the utilities of future states.

Sample calculation for simple example:

U(s) = R(s) + γ max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')
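A hedged stand-in for the sample calculation, using made-up utilities and the toy MDP container sketched earlier:

```python
# One Bellman backup for a single state s:
# U(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) U(s')
def bellman_backup(mdp, U, s):
    if not mdp.actions[s]:               # terminal state: utility is just its reward
        return mdp.R[s]
    best = max(
        sum(p * U[s2] for s2, p in mdp.P[(s, a)].items())
        for a in mdp.actions[s]
    )
    return mdp.R[s] + mdp.gamma * best

# Illustrative numbers only: with U = {'a': 0.0, 'b': 1.0} and the toy MDP above,
# backing up 'a' gives -0.04 + 0.9 * max(0.8*1.0 + 0.2*0.0, 1.0*0.0) = 0.68.
```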

Page 9: Chapter 17 – Making Complex Decisions A

Value Iteration Algorithm

The Bellman equations are hard to solve directly because they are non-linear, so we use an iterative algorithm. The basic idea: start with initial values for all states, then repeatedly update each state from the utilities of its neighbours until the values reach equilibrium.

Algorithm
Start with an initial value for every state, e.g. zero
Repeat
  For each state
    Update it using the utilities of its neighbours from the previous iteration
Until at equilibrium

Each iteration is called a Bellman update and is given by the equation below (a code sketch of the full loop follows it)

U_{i+1}(s) = R(s) + γ max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U_i(s')
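A sketch of the full loop, again reusing the illustrative MDP container (the tolerance tol is a placeholder cutoff; the principled stopping test appears on the convergence slides):

```python
# Value iteration: repeated Bellman updates until the values stop changing much.
def value_iteration(mdp, tol=1e-6):
    U = {s: 0.0 for s in mdp.states}          # initial value: zero for every state
    while True:
        U_new = {}
        delta = 0.0
        for s in mdp.states:
            if mdp.actions[s]:
                best = max(
                    sum(p * U[s2] for s2, p in mdp.P[(s, a)].items())
                    for a in mdp.actions[s]
                )
                U_new[s] = mdp.R[s] + mdp.gamma * best
            else:
                U_new[s] = mdp.R[s]           # terminal state
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < tol:                       # approximately at equilibrium
            return U
```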

Page 10: Chapter 17 – Making Complex Decisions A

Using Value Iteration on the Simple Example

This shows the utilities over a number of iterations. Notice that they all converge to an equilibrium.

Page 11: Chapter 17 – Making Complex Decisions A

Convergence of Value Iteration

We need to know when to stop iterating

• The algorithm only reaches the true utilities in the limit, so stopping after finitely many iterations leaves some error

• Measure progress with |U'(s) – U(s)|, the change in a state's utility between two successive Bellman updates

• We cut off the algorithm when max over s of |U'(s) – U(s)| < ε(1 – γ) / γ

  – where ε = c·Rmax and c is a constant factor
• This keeps the remaining error in the utilities small

• We also want to stop after a reasonable number of iterations (the stopping test is sketched in code below)
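As a sketch, the stopping test from the bullets above (ε and γ are placeholders to be supplied by the caller):

```python
# Stop value iteration once the largest change between two successive Bellman
# updates drops below epsilon * (1 - gamma) / gamma.
def converged(U_old, U_new, epsilon, gamma):
    delta = max(abs(U_new[s] - U_old[s]) for s in U_new)
    return delta < epsilon * (1 - gamma) / gamma
```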

Page 12: Chapter 17 – Making Complex Decisions A

Convergence of Value Iteration

Why use c·Rmax·(1 – γ) / γ as the cutoff?

• Recall: if γ < 1 and the horizon is infinite, then Uh is bounded by Rmax / (1 – γ) when summed to infinity

• So we stop when the change drops below ε(1 – γ) / γ = c·Rmax·(1 – γ) / γ, where c is a constant factor less than 1

• Normally c is fairly small, around 0.1

• Even with some residual error in the utilities, the policy equation (shown before and repeated below) almost always produces an optimal policy

π*(s) = argmax_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')

Page 13: Chapter 17 – Making Complex Decisions A

Convergence of Value Iteration

Number of Bellman update iterations required to converge vs the discount factor for different values of c.

Page 14: Chapter 17 – Making Complex Decisions A

17.3 – Policy Iteration

Uses the idea that, in each state, one action is often clearly better than the others, so exact utility values aren't needed

Algorithm
start with an initial policy π_0
repeat
  Policy evaluation: for each state, calculate U_i given the current policy π_i
    – a simplified version of the Bellman update equation – no max needed, since the policy fixes the action
  Policy improvement: for each state
    – if the maximum expected utility over the actions is better than that of π(s)
    – set π(s) to that better action
until the policy is unchanged (a code sketch follows the update equation below)

U_{i+1}(s) = R(s) + γ Σ_{s'} P(s' | s, π_i(s)) U_i(s')
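A sketch of the whole loop, again on top of the illustrative MDP container (the number of evaluation sweeps is an arbitrary choice for the sketch):

```python
# Policy iteration: alternate policy evaluation (simplified Bellman updates, no
# max) with policy improvement, until the policy stops changing.
def policy_iteration(mdp, eval_sweeps=20):
    # start with an arbitrary policy (terminal states get no action)
    pi = {s: (mdp.actions[s][0] if mdp.actions[s] else None) for s in mdp.states}
    U = {s: 0.0 for s in mdp.states}
    while True:
        # Policy evaluation: U_{i+1}(s) = R(s) + gamma * sum_{s'} P(s'|s, pi(s)) U_i(s')
        for _ in range(eval_sweeps):
            new_U = {}
            for s in mdp.states:
                if pi[s] is None:                     # terminal state
                    new_U[s] = mdp.R[s]
                else:
                    expected = sum(p * U[s2] for s2, p in mdp.P[(s, pi[s])].items())
                    new_U[s] = mdp.R[s] + mdp.gamma * expected
            U = new_U
        # Policy improvement: switch to a strictly better action wherever one exists
        unchanged = True
        for s in mdp.states:
            if not mdp.actions[s]:
                continue
            def q(a):
                return sum(p * U[s2] for s2, p in mdp.P[(s, a)].items())
            best = max(mdp.actions[s], key=q)
            if q(best) > q(pi[s]):
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi, U
```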

Page 15: Chapter 17 – Making Complex Decisions A

17.3 – Policy Iteration Algorithm

Page 16: Chapter 17 – Making Complex Decisions A

17.4 – Partially Observable MDPs (POMDPs): Properties

Like an MDP, it has:
• a set of actions A(s)
• a transition model P(s' | s, a)
• a reward function R(s)

It also has:
• a sensor model P(e | s)
  – gives the probability of perceiving evidence e in state s
• a belief state b(s) (both additions are sketched in code below)
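A sketch of those additions on top of the earlier illustrative MDP container (the class, field names, and toy sensor numbers are mine):

```python
# A POMDP adds a sensor model P(e|s); the agent then maintains a belief state
# b(s), a probability distribution over states, instead of knowing the state.
from dataclasses import dataclass

@dataclass
class POMDP:
    mdp: MDP         # underlying states, A(s), P(s'|s,a), R(s), gamma
    percepts: list   # possible evidence values e
    sensor: dict     # sensor[(e, s)] = P(e | s)

# Illustrative sensor for the two-state toy MDP: a noisy detector for state 'b'.
toy_pomdp = POMDP(
    mdp=toy,
    percepts=["beep", "quiet"],
    sensor={("beep", "b"): 0.9, ("quiet", "b"): 0.1,
            ("beep", "a"): 0.2, ("quiet", "a"): 0.8},
)
```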

Page 17: Chapter 17 – Making Complex Decisions A

Belief States

• Belief state is a probability distribution over all possible states

• Similar to probability distributions from chapter 13

• The second term (the sum over s) projects every possible state s forward to the next state s'

• The first term, P(e | s'), updates that prediction with the evidence e

• α is a normalization constant (the update is sketched in code after the equation)

b'(s') = α P(e | s') Σ_s P(s' | s, a) b(s)
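A sketch of this update using the illustrative POMDP container from the previous page (the function name and the handling of unavailable actions are my choices):

```python
# Belief update: b'(s') = alpha * P(e|s') * sum_s P(s'|s,a) * b(s)
def belief_update(pomdp, b, a, e):
    new_b = {}
    for s2 in pomdp.mdp.states:
        # forward-project every believed state s under action a ...
        predicted = sum(pomdp.mdp.P.get((s, a), {}).get(s2, 0.0) * p
                        for s, p in b.items())
        # ... then weight by how likely the percept e is in s'
        new_b[s2] = pomdp.sensor[(e, s2)] * predicted
    # normalize so the belief sums to 1 (assumes the percept e is possible)
    alpha = 1.0 / sum(new_b.values())
    return {s2: p * alpha for s2, p in new_b.items()}
```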

Page 18: Chapter 17 – Making Complex Decisions A

Decision Cycle for POMDP Agent

General decision cycle is as follows:

• Given the current belief state b, execute the action given by the policy, a = π*(b)

• Receive percept e

• Update the belief state using the equation below

Neat fact: solving a POMDP over a physical state space can be reduced to solving an MDP over the belief-state space. This isn't very practical, though, because the belief-state space is continuous and high-dimensional.

b'(s') = α P(e | s') Σ_s P(s' | s, a) b(s)
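One pass of that cycle as a sketch (the policy, action-execution, and percept-source hooks are stand-ins, and it reuses belief_update from the previous page):

```python
# One decision cycle: act from the current belief, perceive, update the belief.
def decision_cycle(pomdp, b, policy, execute_action, get_percept):
    a = policy(b)                   # a = pi*(b), chosen from the belief state alone
    execute_action(a)
    e = get_percept()               # receive evidence e from the sensors
    return belief_update(pomdp, b, a, e), a
```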

Page 19: Chapter 17 – Making Complex Decisions A

Value Iteration for POMDPs

• Generally very inefficient

• Even the simple example discussed previously is too hard

• Better to use an approximate online algorithm based on look-ahead search

Page 20: Chapter 17 – Making Complex Decisions A

Online agents for POMDPs

• Transition and sensor models are represented by dynamic Bayesian networks

• They are extended with decision and utility nodes

• The model is called a dynamic decision network

• Filtering algorithms are used to update the belief state as percepts arrive and actions are taken

• Decisions are made by projecting forward possible action sequences and choosing the best one

This is very similar to the minimax algorithm from chapter 5:

• Builds a tree of possible actions (and percepts) to a limited depth

• Performs the best-looking action

• Repeats the process (a look-ahead sketch follows)
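A depth-limited look-ahead sketch in the expectimax style, built on the illustrative POMDP container and belief_update from earlier (every name here is mine; this is only the projection-and-choose idea in code, not the book's algorithm):

```python
# Project forward: max over actions, expectation over percepts, to a fixed depth;
# then perform the first action with the best look-ahead value.
def lookahead(pomdp, b, depth):
    def actions_in(b):
        return {a for s, p in b.items() if p > 0 for a in pomdp.mdp.actions[s]}

    def percept_prob(b, a, e):
        # P(e | b, a) = sum_{s'} P(e|s') * sum_s P(s'|s,a) * b(s)
        return sum(pomdp.sensor[(e, s2)] *
                   sum(pomdp.mdp.P.get((s, a), {}).get(s2, 0.0) * p for s, p in b.items())
                   for s2 in pomdp.mdp.states)

    def value(b, d):
        r = sum(p * pomdp.mdp.R[s] for s, p in b.items())   # expected immediate reward
        if d == 0 or not actions_in(b):
            return r
        best = max(
            sum(percept_prob(b, a, e) * value(belief_update(pomdp, b, a, e), d - 1)
                for e in pomdp.percepts if percept_prob(b, a, e) > 0)
            for a in actions_in(b)
        )
        return r + pomdp.mdp.gamma * best

    # choose the first action by its expected look-ahead value (the immediate
    # reward and gamma factor are the same for every action, so they are omitted)
    return max(actions_in(b),
               key=lambda a: sum(percept_prob(b, a, e) *
                                 value(belief_update(pomdp, b, a, e), depth - 1)
                                 for e in pomdp.percepts if percept_prob(b, a, e) > 0))
```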