Transcript of Markov Decision Processes AIMA: 17.1, 17.2 (excluding 17.2.3), 17.3

Page 1

Markov Decision Processes

AIMA: 17.1, 17.2 (excluding 17.2.3), 17.3

Page 2

Search, planning, MDP

[Diagram relating the three settings: search, extended with factorized state representations, gives planning; adding uncertainty & utility gives MDPs, where actions are unreliable, there are no hard goals, and utility depends on the entire environment history.]

Page 3

Planning and MDPs

• In addition to actions having costs, we might have goals with rewards, with the understanding that if you achieve a goal, you get the corresponding reward
• So now, the objective of planning is to find a plan with the highest net benefit, measured as the difference between the cumulative reward for the goals achieved and the cumulative cost of the actions used (see the sketch after this list)
• This problem is both easy (since an “empty” plan is a solution, just not a very good one) and hard (since now the “quality of the plan” in terms of its net benefit is more important)
• On top of this, we might also want to say that rewards are not limited to goals achieved in the final state, but can also be gathered for visiting certain good states along the way
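The net-benefit objective above is easy to make concrete. The sketch below is only illustrative: the action names, costs, and goal rewards are hypothetical, not taken from the slides.

```python
# Hypothetical action costs and goal rewards (not from the slides).
action_costs = {"move": 1.0, "pickup": 0.5, "drop": 0.5}
goal_rewards = {"package_delivered": 10.0, "visited_cafe": 2.0}

def net_benefit(plan, achieved_goals):
    """Cumulative reward for the goals achieved minus cumulative cost of the actions used."""
    total_reward = sum(goal_rewards[g] for g in achieved_goals)
    total_cost = sum(action_costs[a] for a in plan)
    return total_reward - total_cost

# The "empty" plan is a valid solution, just not a very good one.
print(net_benefit([], []))                                                      # 0
print(net_benefit(["move", "pickup", "move", "drop"], ["package_delivered"]))   # 7.0
```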

Page 4

Example MDP

Page 5

Markov decision process

• A sequential decision problem for a fully observable, stochastic environment with a Markovian transition model and additive rewards is called an MDP. It consists of:
  – A set of states
  – A set of actions available in each state
  – A transition model: P(j | i, a) = the probability that doing action a in state i leads to state j
  – A reward function R(s)
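To make these four components concrete, here is a minimal sketch of an MDP as plain Python dictionaries. The two-state "cool"/"hot" problem and all of its numbers are made up purely for illustration; the slides' own example MDP is on Page 4.

```python
# A tiny, made-up MDP with the four components listed above.
states = ["cool", "hot"]
actions = {"cool": ["work", "rest"],   # the set of actions available in each state
           "hot":  ["rest"]}

# Transition model: P[(i, a)][j] = probability that doing action a in state i
# leads to state j (each row sums to 1).
P = {
    ("cool", "work"): {"cool": 0.7, "hot": 0.3},
    ("cool", "rest"): {"cool": 1.0},
    ("hot",  "rest"): {"cool": 0.6, "hot": 0.4},
}

# Reward function, here a per-state reward R(s) as used later in the slides.
R = {"cool": 2.0, "hot": -1.0}

def expected_utility(i, a, U):
    """Expected utility of doing a in state i, given state utilities U."""
    return sum(p * U[j] for j, p in P[(i, a)].items())
```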

Page 6

What does a solution look like?

• The solution should tell the agent the optimal action to take in each state (this is called a “policy”)
  – A policy is a function from states to actions
  – It is not a sequence of actions anymore, because the actions are non-deterministic
  – If there are |S| states and |A| actions that we can do at each state, then there are |A|^|S| possible policies
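As a quick illustration of "a policy is a function from states to actions" and of the |A|^|S| count, here is a sketch using the same toy action sets as the Page 5 sketch (repeated so the snippet stands on its own):

```python
from itertools import product

# A policy maps each state to one action; it is not a sequence of actions,
# because the outcomes of actions are stochastic.
actions = {"cool": ["work", "rest"], "hot": ["rest"]}

policy = {"cool": "work", "hot": "rest"}   # one particular deterministic policy

# All deterministic policies: one independent action choice per state, so the
# count is the product of |A(s)| over states -- |A|^|S| when every state offers |A| actions.
all_policies = [dict(zip(actions, choice))
                for choice in product(*(actions[s] for s in actions))]
print(len(all_policies))   # 2 for this toy problem (2 choices in "cool" x 1 in "hot")
```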

Page 7

Optimal policies depend on rewards

[Figure: optimal policies for the example MDP under different reward settings, e.g. R(s) = -0.04]

Page 8

Horizon & Policy

• We said a policy is a function from states to actions, but we sort of lied
• The best policy is non-stationary, i.e., it depends on how long the agent has to “live”, which is called the “horizon”
• More generally, a policy is a mapping from <state, time-to-death> to <action> (see the sketch after this list)
  – So, if we have a horizon of k, then we will have k policies
• If the horizon is infinite, then the policies must all be the same (so the infinite-horizon case is easy!)
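A minimal sketch of the <state, time-to-death> view, using the toy states from the earlier sketches and an assumed horizon of k = 3; the action choices in the tables are arbitrary, just to show the shape of a non-stationary policy.

```python
# With horizon k there is one state->action table per number of steps left to "live",
# i.e., the policy maps <state, time-to-death> to an action.
k = 3
policy_for = {                      # policy_for[t]: table to use with t steps remaining
    1: {"cool": "rest", "hot": "rest"},
    2: {"cool": "work", "hot": "rest"},
    3: {"cool": "work", "hot": "rest"},
}

def act(state, steps_left):
    """Non-stationary policy: the action depends on the state and the time-to-death."""
    return policy_for[steps_left][state]

print(act("cool", 1), act("cool", 3))   # rest work
# With an infinite horizon all these tables coincide, so a single
# state->action mapping (a stationary policy) suffices.
```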

Page 9

Horizon & Policy

We will concentrate on infinite-horizon problems, in which the optimal policy is stationary.

[Figure: an infinite-horizon policy next to a finite-horizon policy with k = 3]

Page 10

Stationary preferences

• Preferences between state sequences are stationary:
  – If two sequences [s0, s1, s2, …] and [s0', s1', s2', …] begin with the same state (s0 = s0'), then the two sequences should be preference-ordered the same way as the sequences [s1, s2, …] and [s1', s2', …]
  – In other words, if you prefer future f1 to f2 starting tomorrow, you should prefer them the same way even if they start today

Page 11

Utility of state sequence

• Define the utility of a sequence of states in terms of their rewards
  – Assume “stationarity” of preferences
  – Then, there are only two reasonable ways to define the utility of a sequence of states:
    • Additive rewards: U([s0, s1, s2, …]) = R(s0) + R(s1) + R(s2) + …
    • Discounted rewards: U([s0, s1, s2, …]) = R(s0) + γR(s1) + γ²R(s2) + …, with discount factor γ ≤ 1 (additive rewards are the special case γ = 1)
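A minimal sketch of the two definitions as code; the reward sequence is hypothetical (the -0.04 values echo the per-step reward mentioned on Page 7).

```python
# Utility of a state sequence as the (possibly discounted) sum of its rewards.
def utility_of_sequence(rewards, gamma=1.0):
    """Return the sum over t of gamma^t * R(s_t) for a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [-0.04, -0.04, -0.04, 1.0]             # hypothetical per-state rewards
print(utility_of_sequence(rewards))               # additive (gamma = 1): about 0.88
print(utility_of_sequence(rewards, gamma=0.9))    # discounted: about 0.62
```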

Page 12

The big picture

Compute the utilities of states

Compute the optimal policy
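The slides have not yet said how to carry out these two steps. One standard way for infinite-horizon, discounted MDPs is value iteration followed by greedy policy extraction (the subject of AIMA 17.2); the sketch below assumes the toy states/actions/P/R dictionaries from the Page 5 sketch.

```python
def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """Step 1: compute state utilities by repeatedly applying the Bellman update
    U(s) <- R(s) + gamma * max_a sum_j P(j | s, a) * U(j)."""
    U = {s: 0.0 for s in states}
    while True:
        new_U = {s: R[s] + gamma * max(sum(p * U[j] for j, p in P[(s, a)].items())
                                       for a in actions[s])
                 for s in states}
        if max(abs(new_U[s] - U[s]) for s in states) < eps:
            return new_U
        U = new_U

def best_policy(states, actions, P, U):
    """Step 2: in each state, pick the action with the highest expected utility."""
    return {s: max(actions[s],
                   key=lambda a: sum(p * U[j] for j, p in P[(s, a)].items()))
            for s in states}

U = value_iteration(states, actions, P, R)   # utilities of states
pi = best_policy(states, actions, P, U)      # the optimal policy
```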

Page 13

Utility of a state

• The utility of a state is the expected utility of the state sequences that might follow it
• The state sequences depend on the policy that is executed
• If we let s_t be the state the agent is in after executing policy π for t steps (note that s_t is a random variable), then we have

  U^π(s) = E[ Σ_{t=0 to ∞} γ^t R(s_t) ], with s_0 = s

• The true utility of a state is U(s) = U^π*(s): the expected sum of discounted rewards if the agent executes an optimal policy π*
• This is different from R(s), which is only the short-term reward for being in s
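To connect the formula for U^π to something executable, here is a minimal sketch of iterative policy evaluation, which approximates U^π(s) for a fixed policy π (the true utility U(s) is this quantity for an optimal policy). It assumes the toy states/P/R dictionaries and a policy dictionary of the kind sketched on Pages 5 and 6.

```python
def policy_utility(states, P, R, pi, gamma=0.9, iters=1000):
    """Approximate U^pi(s) = E[ sum_t gamma^t R(s_t) ] for a fixed policy pi, using
    the recurrence U(s) <- R(s) + gamma * sum_j P(j | s, pi(s)) * U(j)."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {s: R[s] + gamma * sum(p * U[j] for j, p in P[(s, pi[s])].items())
             for s in states}
    return U

# Contrast with R(s): R["cool"] is only the short-term reward for being in "cool",
# while policy_utility(states, P, R, policy)["cool"] also accumulates all
# discounted future rewards.
```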

Page 14

Utility of a state

The expected sum of discounted rewards if the agent executes an optimal policy.