Making complex decisions


Page 1: Making  complex decisions

Making complex decisions

Chapter 17

Page 2: Making  complex decisions

Outline

• Sequential decision problems (Markov Decision Process)
  – Value iteration
  – Policy iteration
• Partially observable MDPs
• Decision theoretic agent
• Dynamic belief networks

Page 3: Making  complex decisions

Sequential Decisions

• Agent’s utility depends on a sequence of decisions

Page 4: Making  complex decisions

Markov Decision Process (MDP)

• Defined as a tuple <S, A, M, R>:
  – S: state
  – A: action
  – M: transition function
    • Table M^a_ij = P(sj | si, a), the probability of sj given action a in state si
  – R: reward
    • R(si, a) = cost or reward of taking action a in state si
    • In our case R = R(si)
• Choose a sequence of actions (not just one decision or one action)
  – Utility is based on a sequence of decisions
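A minimal Python sketch (not from the slides) of how the <S, A, M, R> tuple could be represented; all names, the is_terminal helper, and the gamma field are illustrative assumptions, reused by the later sketches.

```python
# Minimal sketch of an MDP <S, A, M, R>; names and structure are illustrative.
from dataclasses import dataclass
from typing import Callable, Hashable, List, Tuple

State = Hashable
Action = Hashable

@dataclass
class MDP:
    states: List[State]                                     # S
    actions: Callable[[State], List[Action]]                # A(s): actions available in state s
    transition: Callable[[State, Action], List[Tuple[float, State]]]  # M: [(P(s'|s,a), s'), ...]
    reward: Callable[[State], float]                        # R(s); on these slides R = R(si)
    is_terminal: Callable[[State], bool]                    # assumed helper: s has no successor
    gamma: float = 1.0                                      # discount parameter (slides use 1)
```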

Page 5: Making  complex decisions

Generalization

Inputs:

• Initial state s0
• Action model
• Reward R(si) collected in each state si

A state is terminal if it has no successor.

Starting at s0, the agent keeps executing actions until it reaches a terminal state. Its goal is to maximize the expected sum of rewards collected (additive rewards).

Additive rewards: U(s0, s1, s2, …) = R(s0) + R(s1) + R(s2) + …

Discounted rewards (we will not use this reward form):
U(s0, s1, s2, …) = R(s0) + γR(s1) + γ²R(s2) + … (0 < γ < 1)
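As a small worked illustration (not from the slides), assume a constant reward R(si) = r in every state: the additive sum grows without bound on an infinite run, while the discounted sum is a convergent geometric series, which is why discounting is one way to deal with infinite action sequences (next slide).

```latex
% Additive rewards on an infinite run diverge:
U(s_0, s_1, s_2, \ldots) = r + r + r + \cdots \rightarrow \infty
% Discounted rewards stay bounded for 0 < \gamma < 1:
U(s_0, s_1, s_2, \ldots) = r + \gamma r + \gamma^2 r + \cdots = \frac{r}{1 - \gamma}
% e.g. with r = 1 and \gamma = 0.5:  U = 1 / (1 - 0.5) = 2
```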

Page 6: Making  complex decisions

Dealing with infinite sequences of actions

• Use discounted rewards

• The environment contains terminal states that the agent is guaranteed to reach eventually

• Infinite sequences are compared by their average rewards

Page 7: Making  complex decisions

Example

• Fully observable environment
• Non-deterministic actions
  – intended effect: 0.8
  – not the intended effect: 0.2

[Figure: 4×3 grid world with a start cell, a +1 terminal state, and a −1 terminal state; each action moves in the intended direction with probability 0.8 and to each perpendicular direction with probability 0.1.]

Page 8: Making  complex decisions

MDP of example

• S: State of the agent on the 4×3 grid
  – Note that a cell is denoted by (x, y)
• A: Actions of the agent, i.e., N, E, S, W
• M: Transition function
  – E.g., M((4,2) | (3,2), N) = 0.1
  – E.g., M((3,3) | (3,2), N) = 0.8
  – (Robot movement, uncertainty of another agent’s actions, …)
• R: Reward (more comments on the reward function later)
  – R(1,1) = −1/25, R(1,2) = −1/25, …
  – R(4,3) = +1, R(4,2) = −1
• γ = 1
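A hedged Python sketch of the transition function M for this grid example; the helper names are invented, and treating (2,2) as a blocked cell is an assumption suggested by the utility figure later in the deck, which lists utilities for only eleven cells.

```python
# Sketch of the transition model M for the 4x3 grid example.
# Assumptions: cells are (x, y) with x in 1..4, y in 1..3; (2, 2) is blocked;
# bumping into the grid boundary or the block leaves the agent where it is.
MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
SIDEWAYS = {'N': ('E', 'W'), 'S': ('E', 'W'), 'E': ('N', 'S'), 'W': ('N', 'S')}
BLOCKED = {(2, 2)}

def step(s, direction):
    """Deterministic move; stay put if it would leave the grid or hit the block."""
    dx, dy = MOVES[direction]
    x, y = s[0] + dx, s[1] + dy
    if not (1 <= x <= 4 and 1 <= y <= 3) or (x, y) in BLOCKED:
        return s
    return (x, y)

def M(s, a):
    """Return [(probability, next_state), ...]: 0.8 intended, 0.1 to each side."""
    left, right = SIDEWAYS[a]
    return [(0.8, step(s, a)), (0.1, step(s, left)), (0.1, step(s, right))]

# e.g. M((3, 2), 'N') -> [(0.8, (3, 3)), (0.1, (4, 2)), (0.1, (3, 2))],
# which matches M((3,3) | (3,2), N) = 0.8 and M((4,2) | (3,2), N) = 0.1 above.
```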

Page 9: Making  complex decisions

Policy

• A policy π is a mapping from states to actions.
• Given a policy, one may calculate the expected utility of the series of actions produced by the policy.
• The goal: find an optimal policy π*, one that would produce maximal expected utility.

[Figure: example of a policy, shown as an arrow in each nonterminal cell of the 4×3 grid world with its +1 and −1 terminal states.]

Page 10: Making  complex decisions

Utility of a State

Given a state s, we can measure the expected utilities obtained by applying any policy π.

We assume the agent is in state s and define St (a random variable) as the state reached at step t. Obviously S0 = s.

The expected utility of a state s given a policy π is

U^π(s) = E[Σ_t γ^t R(St)]

π*_s will be a policy that maximizes the expected utility of state s.
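To make the expectation concrete, here is a small sketch (not from the slides) that estimates U^π(s) by averaging sampled returns Σ_t γ^t R(St); mdp is assumed to follow the MDP sketch given earlier, and policy is a dict mapping nonterminal states to actions.

```python
import random

def sample_return(mdp, policy, s, horizon=1000):
    """One sampled value of sum_t gamma^t R(S_t), starting from S_0 = s."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        total += discount * mdp.reward(s)
        if mdp.is_terminal(s):                     # assumed helper on the MDP sketch
            break
        outcomes = mdp.transition(s, policy[s])    # [(P(s'|s,a), s'), ...]
        probs = [p for p, _ in outcomes]
        states = [s2 for _, s2 in outcomes]
        s = random.choices(states, weights=probs)[0]
        discount *= mdp.gamma
    return total

def estimate_utility(mdp, policy, s, n=10_000):
    """Monte Carlo estimate of U^pi(s) = E[sum_t gamma^t R(S_t)]."""
    return sum(sample_return(mdp, policy, s) for _ in range(n)) / n
```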

Page 11: Making  complex decisions

Utility of a State

The utility of a state s measures its desirability:

If Si is terminal:
U(Si) = R(Si)

If Si is non-terminal:
U(Si) = R(Si) + γ max_a Σ_j P(Sj | Si, a) U(Sj)    [Bellman equation]

[the reward of s augmented by the expected sum of discounted rewards collected in future states]


The Bellman equations are solved by dynamic programming.

Page 13: Making  complex decisions

Optimal Policy

A policy π is a function that maps each state s into the action to execute if s is reached.

The optimal policy π* is the policy that always leads to maximizing the expected sum of rewards collected in future states (Maximum Expected Utility principle):

π*(Si) = argmax_a Σ_j P(Sj | Si, a) U(Sj)
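A short sketch of this Maximum Expected Utility step in Python, assuming U is a dict of state utilities and mdp follows the earlier sketch; the function name is illustrative.

```python
def extract_policy(mdp, U):
    """pi*(Si) = argmax_a sum_j P(Sj | Si, a) * U(Sj)."""
    pi = {}
    for s in mdp.states:
        if mdp.is_terminal(s):      # no action is needed in a terminal state
            continue
        pi[s] = max(mdp.actions(s),
                    key=lambda a: sum(p * U[s2] for p, s2 in mdp.transition(s, a)))
    return pi
```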

Page 14: Making  complex decisions

Optimal policy and state utilities for MDP

• What will happen if cost of step is very low?

[Figure: two 4×3 grids. The left panel shows the optimal policy (an arrow in each nonterminal cell, with the +1 and −1 terminal states); the right panel shows the corresponding state utilities:]

        1       2       3       4
  3   0.812   0.868   0.912    +1
  2   0.762           0.660    −1
  1   0.705   0.655   0.611   0.388

Page 15: Making  complex decisions

Finding the optimal policy

• The solution must satisfy both equations:
  – π*(Si) = argmax_a Σ_j P(Sj | Si, a) U(Sj)   (1)
  – U(Si) = R(Si) + γ max_a Σ_j P(Sj | Si, a) U(Sj)   (2)

• Value iteration:
  – start with a guess of a utility function and use (2) to get a better estimate

• Policy iteration:
  – start with a fixed policy, and solve for the exact utilities of states; use equation (1) to find an updated policy.

Page 16: Making  complex decisions

Value iteration

function Value-Iteration(MDP) returns a utility function
  inputs: P(Sj | Si, a), a transition model
          R, a reward function on states
          γ, discount parameter
  local variables: U, utility function, initially identical to R
                   U’, utility function, initially identical to R
  repeat
    U ← U’
    for each state Si do
      U’[Si] ← R[Si] + γ max_a Σ_j P(Sj | Si, a) U(Sj)
    end
  until Close-Enough(U, U’)
  return U
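A Python sketch of the same procedure (not the slides' code), written against the MDP structure sketched earlier; Close-Enough is replaced by a simple threshold on the largest utility change.

```python
def value_iteration(mdp, epsilon=1e-4):
    """Repeatedly apply U'[s] <- R[s] + gamma * max_a sum_j P(s_j|s,a) U[s_j]."""
    U1 = {s: mdp.reward(s) for s in mdp.states}   # U', initially identical to R
    while True:
        U = U1.copy()                             # U <- U'
        delta = 0.0
        for s in mdp.states:
            if mdp.is_terminal(s):                # terminal states keep U = R
                continue
            U1[s] = mdp.reward(s) + mdp.gamma * max(
                sum(p * U[s2] for p, s2 in mdp.transition(s, a))
                for a in mdp.actions(s))
            delta = max(delta, abs(U1[s] - U[s]))
        if delta < epsilon:                       # Close-Enough(U, U')
            return U
```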

Page 17: Making  complex decisions

Value iteration - convergence of utility values

Page 18: Making  complex decisions

Policy iteration

function Policy-Iteration(MDP) returns a policy
  inputs: P(Sj | Si, a), a transition model
          R, a reward function on states
          γ, discount parameter
  local variables: U, utility function, initially identical to R
                   π, a policy, initially optimal with respect to U
  repeat
    U ← Value-Determination(π, U, MDP)
    unchanged? ← true
    for each state Si do
      if max_a Σ_j P(Sj | Si, a) U(Sj) > Σ_j P(Sj | Si, π(Si)) U(Sj) then
        π(Si) ← argmax_a Σ_j P(Sj | Si, a) U(Sj)
        unchanged? ← false
  until unchanged?
  return π
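A Python sketch of policy iteration against the same assumed MDP structure. Two simplifications relative to the pseudocode above: the initial policy is arbitrary rather than optimal with respect to U, and Value-Determination is approximated by a few sweeps of fixed-policy Bellman updates (the modified-policy-iteration variant) instead of an exact linear solve.

```python
import random

def policy_iteration(mdp, eval_sweeps=20):
    U = {s: mdp.reward(s) for s in mdp.states}
    pi = {s: random.choice(mdp.actions(s))
          for s in mdp.states if not mdp.is_terminal(s)}
    while True:
        # Approximate Value-Determination: evaluate the current policy pi.
        for _ in range(eval_sweeps):
            for s in pi:
                U[s] = mdp.reward(s) + mdp.gamma * sum(
                    p * U[s2] for p, s2 in mdp.transition(s, pi[s]))
        # Policy improvement step from the pseudocode above.
        unchanged = True
        for s in pi:
            q = {a: sum(p * U[s2] for p, s2 in mdp.transition(s, a))
                 for a in mdp.actions(s)}
            best = max(q, key=q.get)
            if q[best] > q[pi[s]]:
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi
```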

Page 19: Making  complex decisions

Decision Theoretic Agent

• Agent that must act in uncertain environments. Actions may be non-deterministic.

Page 20: Making  complex decisions

Stationarity and Markov assumption

• A stationary process is one whose dynamics, i.e., P(Xt|Xt-1,…,X0) for t>0, are assumed not to change with time.

• The Markov assumption is that the current state Xt is dependent only on a finite history of previous states. In this class, we will only consider first-order Markov processes for which

P(Xt|Xt-1,…,X0) = P(Xt|Xt-1)

Page 21: Making  complex decisions

Transition model and sensor model

• For first-order Markov processes, the laws describing how the process state evolves over time are contained entirely within the conditional distribution P(Xt|Xt-1), which is called the transition model for the process.

• We will also assume that the observable state variables (evidence) Et are dependent only on the state variables Xt. P(Et|Xt) is called the sensor or observational model.
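A small sketch (all names and numbers invented) of what a discrete transition model P(Xt|Xt-1) and sensor model P(Et|Xt) can look like in code, together with the prediction and evidence-update steps a decision theoretic agent would apply to them.

```python
# Illustrative stationary first-order Markov process with a sensor model.
X_VALUES = ['in_lane', 'drifting']

# Transition model P(X_t | X_{t-1}): the same table at every step (stationarity).
TRANSITION = {
    'in_lane':  {'in_lane': 0.9, 'drifting': 0.1},
    'drifting': {'in_lane': 0.6, 'drifting': 0.4},
}

# Sensor model P(E_t | X_t): the evidence depends only on the current state.
SENSOR = {
    'in_lane':  {'centered': 0.8, 'off_center': 0.2},
    'drifting': {'centered': 0.3, 'off_center': 0.7},
}

def predict(belief):
    """P(X_t) = sum_x P(X_t | X_{t-1} = x) * P(X_{t-1} = x)."""
    return {x2: sum(belief[x1] * TRANSITION[x1][x2] for x1 in X_VALUES)
            for x2 in X_VALUES}

def update(belief, evidence):
    """Condition on the evidence E_t = evidence and renormalize."""
    unnorm = {x: SENSOR[x][evidence] * belief[x] for x in X_VALUES}
    z = sum(unnorm.values())
    return {x: unnorm[x] / z for x in X_VALUES}
```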

Page 22: Making  complex decisions

Decision Theoretic Agent - implemented

Page 23: Making  complex decisions

sensor model

Generic example – part of steam train control

Page 24: Making  complex decisions

sensor model II

Page 25: Making  complex decisions

Model for lane position sensor for automated vehicle

Page 26: Making  complex decisions

dynamic belief network – generic structure

• Describes the action model, namely the state evolution model.

Page 27: Making  complex decisions

State evolution model - example

Page 28: Making  complex decisions

Dynamic Decision Network

• Can handle uncertainty
• Deal with continuous streams of evidence
• Can handle unexpected events (they have no fixed plan)
• Can handle sensor noise and sensor failure
• Can act to obtain information
• Can handle large state spaces