CS 188: Artificial Intelligence



Announcements

Homework 3: Games
In observance of Presidents' Day, due Tuesday 2/18 at 11:59pm.

Project 2: Multi-Agent Pacman
Has been released, due Friday 2/21 at 5:00pm.
Optional mini-contest, due Sunday 2/23 at 11:59pm.

edX and Pandagrader grade consolidation survey
Due Tuesday 2/18.

AI in the news

CS 188: Artificial Intelligence

Markov Decision Processes
Instructors: Dan Klein and Pieter Abbeel
University of California, Berkeley
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Please retain proper attribution, including the reference to ai.berkeley.edu. Thanks!

Non-Deterministic Search

Example: Grid World

A maze-like problem:
The agent lives in a grid
Walls block the agent's path

Noisy movement: actions do not always go as planned
80% of the time, the action North takes the agent North (if there is no wall there)
10% of the time, North takes the agent West; 10% East
If there is a wall in the direction the agent would have been taken, the agent stays put
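To make this noise model concrete, here is a minimal sketch (not the course's actual gridworld code) of how the 80/10/10 outcome distribution could be computed; the coordinate scheme, the `walls` set, and the helper tables are assumptions for illustration.

```python
# Sketch of the noisy Grid World transition model described above.
# Assumptions: states are (x, y) tuples, `walls` is a set of blocked cells,
# and actions are compass directions.  Not the actual ai.berkeley.edu code.

DELTAS   = {'North': (0, 1), 'South': (0, -1), 'East': (1, 0), 'West': (-1, 0)}
LEFT_OF  = {'North': 'West', 'West': 'South', 'South': 'East', 'East': 'North'}
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}

def transition_distribution(state, action, walls, noise=0.2):
    """Return {next_state: probability}: 1 - noise in the intended direction,
    noise/2 to each side; bumping into a wall leaves the agent where it is."""
    dist = {}
    for direction, prob in [(action, 1 - noise),
                            (LEFT_OF[action], noise / 2),
                            (RIGHT_OF[action], noise / 2)]:
        dx, dy = DELTAS[direction]
        nxt = (state[0] + dx, state[1] + dy)
        if nxt in walls:          # blocked: the agent stays put
            nxt = state
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist
```

With noise = 0.2 this reproduces the 80/10/10 split described above.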

The agent receives rewards each time step
Small living reward each step (can be negative)
Big rewards come at the end (good or bad)

Goal: maximize sum of rewards

[cut demo of moving around in grid world program]

Grid World Actions

Deterministic Grid World vs. Stochastic Grid World

Markov Decision Processes

An MDP is defined by:
A set of states s ∈ S
A set of actions a ∈ A
A transition function T(s, a, s')
Probability that a from s leads to s', i.e., P(s' | s, a)
Also called the model or the dynamics
A reward function R(s, a, s')
Sometimes just R(s) or R(s')
A start state
Maybe a terminal state

MDPs are non-deterministic search problems
One way to solve them is with expectimax search
We'll have a new tool soon

[Demo gridworld manual intro (L8D1)]

In search problems we did not talk about actions, but about successor functions --- now the information inside the successor function is unpacked into actions, transitions, and rewards.

Write out S, A, an example entry in T, and an entry in R (a sketch follows below).
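Following that note, here is a hedged sketch of what writing out S, A, and example entries of T and R might look like for the grid world; the cell coordinates, the -0.03 living reward (taken from the policies slide later on), and the dictionary encoding are illustrative assumptions, not a fixed course format.

```python
# Illustrative write-out of the MDP pieces for a small grid world (a sketch).
S = [(x, y) for x in range(4) for y in range(3)]       # states: grid cells
A = ['North', 'South', 'East', 'West']                 # actions

# Example transition entries T(s, a, s') = P(s' | s, a) under 80/10/10 noise:
T = {
    ((1, 1), 'North', (1, 2)): 0.8,   # intended move
    ((1, 1), 'North', (0, 1)): 0.1,   # slipped West
    ((1, 1), 'North', (2, 1)): 0.1,   # slipped East
}

# Example reward entry R(s, a, s'): a small living reward each step.
R = {((1, 1), 'North', (1, 2)): -0.03}
```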

Reward function is different from the book: we use R(s, a, s')
The book's form is simpler for the equations, but not as useful for the projects.

We need to modify expectimax a tiny bit to account for rewards along the way, but that's something you should be able to do, and so you can already solve MDPs (though not in the most efficient way).

Video of Demo Gridworld Manual Intro

What is Markov about MDPs?

"Markov" generally means that given the present state, the future and the past are independent

For Markov decision processes, Markov means action outcomes depend only on the current state

This is just like search, where the successor function could only depend on the current state (not the history)

Andrey Markov (1856-1922)

Like search: successor function only depended on current state

We can make this happen by stuffing more into the state.

Very similar to search problems: when solving a maze with food pellets, we stored which food pellets were eaten
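As a toy illustration of "stuffing more into the state", here is a hypothetical sketch in the spirit of the food-pellet example: the set of remaining pellets is folded into the state, so the successor depends only on the current (enlarged) state and not on the history.

```python
# Hypothetical sketch: make pellet-eating Markov by enlarging the state.
# A state is (agent position, frozenset of remaining pellet positions).

def successor(state, move):
    """Successor depends only on the current enlarged state, not on history."""
    (x, y), pellets = state
    dx, dy = move
    new_pos = (x + dx, y + dy)
    return (new_pos, pellets - {new_pos})   # eat a pellet if we step onto one
```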

Policies

Optimal policy when R(s, a, s') = -0.03 for all non-terminal states s

In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal

For MDPs, we want an optimal policy π*: S → A
A policy gives an action for each state
An optimal policy is one that maximizes expected utility if followed
An explicit policy defines a reflex agent

Expectimax didn't compute entire policies
It computed the action for a single state only
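A minimal sketch of what an explicit policy looks like in code: a lookup table mapping every state to an action, which is exactly a reflex agent. The specific states and actions below are placeholders, not an optimal policy.

```python
# An explicit policy: one action per state (the entries here are placeholders).
policy = {
    (0, 0): 'North',
    (0, 1): 'North',
    (0, 2): 'East',
    # ... one entry for every non-terminal state
}

def reflex_agent(state):
    """Act by table lookup -- no search at decision time."""
    return policy[state]
```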

Dan has a DEMO for this.

Optimal Policies

Panels show the optimal policy for R(s) = -2.0, R(s) = -0.4, R(s) = -0.03, and R(s) = -0.01, where R(s) is the living reward.

Example: Racing

Example: Racing

A robot car wants to travel far, quickly
Three states: Cool, Warm, Overheated
Two actions: Slow, Fast
Going faster gets double reward

Cool, Slow → Cool (1.0), reward +1
Cool, Fast → Cool (0.5) or Warm (0.5), reward +2
Warm, Slow → Cool (0.5) or Warm (0.5), reward +1
Warm, Fast → Overheated (1.0), reward -10
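One way to write the racing MDP down explicitly is as a pair of Python dictionaries; the nested-dict layout is a choice of this sketch, not course code. Rewards here depend only on (s, a), which matches the diagram above.

```python
# The racing MDP written out as data (a sketch).
# T[s][a] maps each possible next state s' to P(s' | s, a);
# R[s][a] is the reward for taking a in s; "Overheated" is terminal.

T = {
    'Cool': {'Slow': {'Cool': 1.0},
             'Fast': {'Cool': 0.5, 'Warm': 0.5}},
    'Warm': {'Slow': {'Cool': 0.5, 'Warm': 0.5},
             'Fast': {'Overheated': 1.0}},
    'Overheated': {},          # terminal: no actions available
}

R = {
    'Cool': {'Slow': +1, 'Fast': +2},
    'Warm': {'Slow': +1, 'Fast': -10},
    'Overheated': {},
}
```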

Racing Search Tree

MDP Search Trees

Each MDP state projects an expectimax-like search tree:
s is a state
(s, a) is a q-state
(s, a, s') is called a transition, with T(s, a, s') = P(s' | s, a) and reward R(s, a, s')

In an MDP, a chance node represents uncertainty about what might happen based on (s, a) [as opposed to being a random adversary]

A q-state (s, a) is when you were in a state and took an action.

Utilities of Sequences

Utilities of Sequences

What preferences should an agent have over reward sequences?

More or less?

Now or later?

[1, 2, 2] or [2, 3, 4]?
[0, 0, 1] or [1, 0, 0]?

Discounting

It's reasonable to maximize the sum of rewards
It's also reasonable to prefer rewards now to rewards later
One solution: values of rewards decay exponentially

Worth now: 1. Worth next step: γ. Worth in two steps: γ².

Discounting

How to discount?
Each time we descend a level, we multiply in the discount once

Why discount?
Sooner rewards probably do have higher utility than later rewards
Also helps our algorithms converge

Example: discount of 0.5
U([1, 2, 3]) = 1*1 + 0.5*2 + 0.25*3
U([1, 2, 3]) < U([3, 2, 1])
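A quick sketch to check the arithmetic with a discount of 0.5:

```python
# Discounted utility of a reward sequence: U([r0, r1, ...]) = sum_t gamma^t * r_t
def discounted_utility(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3], 0.5))   # 1*1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_utility([3, 2, 1], 0.5))   # 3*1 + 0.5*2 + 0.25*1 = 4.25
# So U([1, 2, 3]) < U([3, 2, 1]) under discounting, as claimed above.
```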

Rewards in the future (deeper in the tree) matter less

Interesting: when running expectimax, if we have to truncate the search, we do not lose much; e.g., the error is less than \gamma^d / (1-\gamma) times the largest reward magnitude.

Stationary Preferences

Theorem: if we assume stationary preferences:
[a_1, a_2, ...] ≻ [b_1, b_2, ...]  ⇔  [r, a_1, a_2, ...] ≻ [r, b_1, b_2, ...]

Then: there are only two ways to define utilities

Additive utility: U([r_0, r_1, r_2, ...]) = r_0 + r_1 + r_2 + ...

Discounted utility: U([r_0, r_1, r_2, ...]) = r_0 + γ r_1 + γ² r_2 + ...

Quiz: Discounting

Given:

Actions: East, West, and Exit (only available in exit states a, e)
Transitions: deterministic

Quiz 1: For γ = 1, what is the optimal policy?

Quiz 2: For γ = 0.1, what is the optimal policy?

Quiz 3: For which γ are West and East equally good when in state d?

Infinite Utilities?!

Problem: What if the game lasts forever? Do we get infinite rewards?

Solutions:
Finite horizon: (similar to depth-limited search)
Terminate episodes after a fixed T steps (e.g., life)
Gives nonstationary policies (π depends on the time left)

Discounting: use 0 < γ < 1 (see the bound sketched after this list)

Smaller γ means a smaller horizon: shorter-term focus

Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like overheated for racing)
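Why discounting tames infinite reward sums: a short sketch, writing R_max for the largest per-step reward magnitude (a symbol introduced here just for this bound).

```latex
% Geometric-series bound: with 0 < \gamma < 1 and |r_t| \le R_{\max},
\Bigl| \sum_{t=0}^{\infty} \gamma^{t} r_t \Bigr|
  \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  \;=\; \frac{R_{\max}}{1-\gamma}
```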

Recap: Defining MDPs

Markov decision processes:
Set of states S
Start state s0
Set of actions A
Transitions P(s' | s, a) (or T(s, a, s'))
Rewards R(s, a, s') (and discount γ)

MDP quantities so far:
Policy = choice of action for each state
Utility = sum of (discounted) rewards

Solving MDPs

Optimal Quantities

The value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally

The value (utility) of a q-state (s, a):
Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

The optimal policy:
π*(s) = optimal action from state s

(s is a state, (s, a) is a q-state, (s, a, s') is a transition)

[Demo gridworld values (L8D4)]
[Note: this demo doesn't show anything in action, it just pops up V and Q values.]

Snapshot of Demo Gridworld V Values: Noise = 0, Discount = 1, Living reward = 0

(Each of the following snapshots just shows the V and Q values; the demo doesn't show anything in action.)

Snapshot of Demo Gridworld Q Values: Noise = 0, Discount = 1, Living reward = 0
Snapshot of Demo Gridworld V Values: Noise = 0.2, Discount = 1, Living reward = 0
Snapshot of Demo Gridworld Q Values: Noise = 0.2, Discount = 1, Living reward = 0
Snapshot of Demo Gridworld V Values: Noise = 0.2, Discount = 0.9, Living reward = 0
Snapshot of Demo Gridworld Q Values: Noise = 0.2, Discount = 0.9, Living reward = 0
Snapshot of Demo Gridworld V Values: Noise = 0.2, Discount = 0.9, Living reward = -0.1
Snapshot of Demo Gridworld Q Values: Noise = 0.2, Discount = 0.9, Living reward = -0.1

Values of States

Fundamental operation: compute the (expectimax) value of a state
Expected utility under optimal action
Average sum of (discounted) rewards
This is just what expectimax computed!

Recursive definition of value:
V*(s) = max_a Q*(s, a)
Q*(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
V*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]

Racing Search Tree

Racing Search Tree

Racing Search Tree

We're doing way too much work with expectimax!

Problem: States are repeated
Idea: Only compute needed quantities once

Problem: Tree goes on forever
Idea: Do a depth-limited computation, but with increasing depths until the change is small
Note: deep parts of the tree eventually don't matter if γ < 1

Time-Limited Values

Key idea: time-limited values

Define Vk(s) to be the optimal value of s if the game ends in k more time steps
Equivalently, it's what a depth-k expectimax would give from s
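A direct transcription of this definition as recursive depth-k expectimax code (a sketch; it assumes the T[s][a] / R[s][a] dictionary encoding used in the racing sketch above):

```python
# V_k(s): the optimal value of s if the game ends in k more time steps.
# Assumes T[s][a] = {s': P(s'|s,a)} and R[s][a] = reward, as in the racing sketch.

def time_limited_value(s, k, T, R, gamma):
    if k == 0 or not T[s]:                       # out of time, or terminal state
        return 0.0
    return max(
        sum(p * (R[s][a] + gamma * time_limited_value(s2, k - 1, T, R, gamma))
            for s2, p in T[s][a].items())
        for a in T[s]
    )
```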

[Demo time-limited values (L8D6)]

This demo generates the time-limited values for gridworld as snapshotted onto the next slides.

(Snapshots of time-limited values Vk for k = 0, 1, 2, ..., 12, and k = 100, each assuming zero living reward; Noise = 0.2, Discount = 0.9.)

Computing Time-Limited Values

Value Iteration

Value Iteration

Start with V0(s) = 0: no time steps left means an expected reward sum of zero

Given a vector of Vk(s) values, do one ply of expectimax from each state:
V_{k+1}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ Vk(s') ]

Repeat until convergence

Complexity of each iteration: O(S²A)

Theorem: will converge to unique optimal values
Basic idea: approximations get refined towards optimal values
Policy may converge long before values do

Discuss computational complexity: S * A * S times the number of iterations.

Note: updates are not in place [if done in place, it means something else, and it's not even clear what that means]

Example: Value Iteration

Values for states (Cool, Warm, Overheated):
V0 = [0, 0, 0]
V1 = [2, 1, 0]
V2 = [3.5, 2.5, 0]
Assume no discount!
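A minimal value iteration sketch that reproduces the numbers above on the racing MDP with no discount (γ = 1). The dictionary encoding repeats the earlier racing sketch and is an assumption of this example, not the course's implementation; the last function also shows how a greedy policy would be read off the values.

```python
# Value iteration on the racing MDP (undiscounted, as on the slide).
T = {'Cool': {'Slow': {'Cool': 1.0}, 'Fast': {'Cool': 0.5, 'Warm': 0.5}},
     'Warm': {'Slow': {'Cool': 0.5, 'Warm': 0.5}, 'Fast': {'Overheated': 1.0}},
     'Overheated': {}}
R = {'Cool': {'Slow': 1, 'Fast': 2},
     'Warm': {'Slow': 1, 'Fast': -10},
     'Overheated': {}}

def value_iteration_step(V, gamma=1.0):
    """One ply of expectimax from every state: V_{k+1}(s)."""
    return {s: max((sum(p * (R[s][a] + gamma * V[s2])
                        for s2, p in T[s][a].items())
                    for a in T[s]), default=0.0)
            for s in T}

def greedy_policy(V, gamma=1.0):
    """Read off pi(s) = argmax_a Q(s, a) from the current value estimates."""
    return {s: max(T[s], key=lambda a: sum(p * (R[s][a] + gamma * V[s2])
                                           for s2, p in T[s][a].items()))
            for s in T if T[s]}

V = {s: 0.0 for s in T}                  # V_0 = 0 everywhere
for k in range(1, 3):
    V = value_iteration_step(V)
    print(k, V)
# k=1: {'Cool': 2.0, 'Warm': 1.0, 'Overheated': 0.0}
# k=2: {'Cool': 3.5, 'Warm': 2.5, 'Overheated': 0.0}
print(greedy_policy(V))                  # {'Cool': 'Fast', 'Warm': 'Slow'}
```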

Convergence*

How do we know the Vk vectors are going to converge?

Case 1: If the tree has maximum depth M, then VM holds the actual untruncated values

Case 2: If the discount is less than 1
Sketch: For any state, Vk and Vk+1 can be viewed as depth-(k+1) expectimax results in nearly identical search trees
The difference is that on the bottom layer, Vk+1 has actual rewards while Vk has zeros
That last layer is at best all R_max; it is at worst R_min
But everything is discounted by γ^k that far out
So Vk and Vk+1 are at most γ^k max|R| different
So as k increases, the values converge
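The same sketch written out as a bound (with max|R| the largest reward magnitude); combined with the geometric series, the total remaining change after step k also shrinks like γ^k:

```latex
% Consecutive value-iteration estimates differ only in the discounted last layer:
|V_{k+1}(s) - V_k(s)| \;\le\; \gamma^{k} \max_{s,a,s'} |R(s,a,s')|
% and summing over all later steps, the total remaining change is at most
% \gamma^{k} \max|R| \,/\, (1-\gamma), which goes to 0 as k \to \infty.
```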

Next Time: Policy-Based Methods

Non-Deterministic Search