CS 188: Artificial Intelligence
Transcript of CS 188: Artificial Intelligence
Announcements
- Homework 3: Games. In observance of Presidents' Day, due Tuesday 2/18 at 11:59pm.
- Project 2: Multi-Agent Pacman. Has been released; due Friday 2/21 at 5:00pm. Optional mini-contest due Sunday 2/23 at 11:59pm.
- edX and Pandagrader grade consolidation survey. Due Tuesday 2/18.
AI in the News
CS 188: Artificial Intelligence
Markov Decision Processes
Instructors: Dan Klein and Pieter Abbeel, University of California, Berkeley
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu. Please retain proper attribution, including the reference to ai.berkeley.edu. Thanks!]
Non-Deterministic Search
Example: Grid World
A maze-like problem: the agent lives in a grid, and walls block the agent's path.
Noisy movement: actions do not always go as planned.
- 80% of the time, the action North takes the agent North (if there is no wall there).
- 10% of the time, North takes the agent West; 10% of the time, East.
- If there is a wall in the direction the agent would have been taken, the agent stays put.
The agent receives rewards each time step: a small living reward each step (which can be negative), and big rewards at the end (good or bad).
Goal: maximize the sum of rewards.
[cut demo of moving around in grid world program]
Grid World Actions
Deterministic Grid World vs. Stochastic Grid World
Markov Decision Processes
An MDP is defined by:
- A set of states s ∈ S
- A set of actions a ∈ A
- A transition function T(s, a, s′): the probability that a from s leads to s′, i.e., P(s′ | s, a). Also called the model or the dynamics.
- A reward function R(s, a, s′). Sometimes just R(s) or R(s′).
- A start state
- Maybe a terminal state
MDPs are non-deterministic search problems. One way to solve them is with expectimax search; we'll have a new tool soon.
[Demo: gridworld manual intro (L8D1)]
In search problems we did not talk about actions, but about successor functions; now the information inside the successor function is unpacked into actions, transitions, and rewards.
Write out S, A, example entry in T, entry in R
The reward function differs from the book's: R(s, a, s′). The book's version is simpler for equations, but not useful for the projects.
We need to modify expectimax a tiny little bit to account for rewards along the way, but that's something you should be able to do, so you can already solve MDPs (though not in the most efficient way).
Video of Demo Gridworld Manual Intro
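The modification just described can be sketched in a few lines: a depth-limited expectimax whose chance nodes add the transition reward as they average over outcomes. This is an illustrative sketch, not course code; the tiny two-state MDP and the name mdp_value are made up for the example.

```python
# transitions[s][a] = list of (probability, next_state, reward) triples.
# This two-state MDP is hypothetical, just to make the sketch runnable.
transitions = {
    "A": {"stay": [(1.0, "A", 1.0)],
          "go":   [(0.8, "B", 0.0), (0.2, "A", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)]},
}

def mdp_value(state, depth, gamma=1.0):
    """Depth-limited expectimax for an MDP: the max node picks an action,
    and the chance node averages reward-plus-discounted-future-value."""
    if depth == 0:
        return 0.0
    return max(
        sum(p * (r + gamma * mdp_value(s2, depth - 1, gamma))
            for p, s2, r in outcomes)
        for outcomes in transitions[state].values()
    )

print(mdp_value("A", 2))  # best 2-step expected reward sum starting from A
```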
What is Markov about MDPs?
"Markov" generally means that given the present state, the future and the past are independent.
For Markov decision processes, Markov means action outcomes depend only on the current state
This is just like search, where the successor function could only depend on the current state (not the history)
Andrey Markov (1856-1922)
Like search: the successor function only depended on the current state.
We can make this happen by stuffing more into the state; this is very similar to search problems, where, when solving a maze with food pellets, we stored which food pellets had been eaten.
Policies
Optimal policy when R(s, a, s′) = -0.03 for all non-terminal states s
In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal
For MDPs, we want an optimal policy π*: S → A.
- A policy π gives an action for each state.
- An optimal policy is one that maximizes expected utility if followed.
- An explicit policy defines a reflex agent.
Expectimax didn't compute entire policies; it computed the action for a single state only.
Dan has a DEMO for this.
Optimal Policies
[Panels: optimal policies for living rewards R(s) = -0.01, -0.03, -0.4, and -2.0, where R(s) is the living reward]
Example: Racing
Example: Racing
A robot car wants to travel far, quickly. Three states: Cool, Warm, Overheated. Two actions: Slow, Fast. Going faster gets double reward.
[Diagram: Cool, Slow → Cool (prob 1.0, +1); Cool, Fast → Cool (0.5, +2) or Warm (0.5, +2); Warm, Slow → Cool (0.5, +1) or Warm (0.5, +1); Warm, Fast → Overheated (1.0, -10)]
Racing Search Tree
MDP Search Trees
Each MDP state projects an expectimax-like search tree.
[Diagram: s is a state; (s, a) is a q-state; (s, a, s′) is called a transition, with T(s, a, s′) = P(s′ | s, a) and reward R(s, a, s′)]
In an MDP, a chance node represents uncertainty about what might happen based on (s, a) [as opposed to being a random adversary].
A q-state (s, a) is when you were in a state and took an action.
Utilities of Sequences
Utilities of Sequences
What preferences should an agent have over reward sequences?
More or less? [1, 2, 2] or [2, 3, 4]
Now or later? [0, 0, 1] or [1, 0, 0]

Discounting
It's reasonable to maximize the sum of rewards. It's also reasonable to prefer rewards now to rewards later. One solution: values of rewards decay exponentially.
[Diagram: a reward is worth 1 now, γ one step from now, and γ² two steps from now]
Discounting
How to discount? Each time we descend a level, we multiply in the discount once.
Why discount? Sooner rewards probably do have higher utility than later rewards. Discounting also helps our algorithms converge.
Example: discount of 0.5
U([1, 2, 3]) = 1·1 + 0.5·2 + 0.25·3 = 2.75
U([1, 2, 3]) < U([3, 2, 1])
Rewards in the future (deeper in the tree) matter less
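The arithmetic in the discounting example can be checked in a couple of lines. A minimal sketch; the helper name discounted_utility is ours:

```python
def discounted_utility(rewards, gamma):
    """U([r0, r1, r2, ...]) = r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# With a discount of 0.5, later rewards count for exponentially less:
print(discounted_utility([1, 2, 3], 0.5))  # 1*1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_utility([3, 2, 1], 0.5))  # 3*1 + 0.5*2 + 0.25*1 = 4.25
```

So U([1, 2, 3]) < U([3, 2, 1]) under this discount: the same rewards are worth more when the big ones come first.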
Interesting: when running expectimax, if we have to truncate the search, we do not lose much; e.g., less than \gamma^d / (1-\gamma).
Stationary Preferences
Theorem: if we assume stationary preferences:
[a_1, a_2, ...] ≻ [b_1, b_2, ...]  ⇔  [r, a_1, a_2, ...] ≻ [r, b_1, b_2, ...]
Then there are only two ways to define utilities:
- Additive utility: U([r_0, r_1, r_2, ...]) = r_0 + r_1 + r_2 + ...
- Discounted utility: U([r_0, r_1, r_2, ...]) = r_0 + γ r_1 + γ² r_2 + ...
Quiz: Discounting
Given: actions East, West, and Exit (Exit only available in exit states a, e); transitions deterministic.
Quiz 1: For γ = 1, what is the optimal policy?
Quiz 2: For γ = 0.1, what is the optimal policy?
Quiz 3: For which γ are West and East equally good when in state d?
Infinite Utilities?!
Problem: what if the game lasts forever? Do we get infinite rewards?
Solutions:
- Finite horizon (similar to depth-limited search): terminate episodes after a fixed T steps (e.g., life). Gives nonstationary policies (π depends on the time left).
- Discounting: use 0 < γ < 1. A smaller γ means a smaller horizon: a shorter-term focus.
- Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like Overheated for racing).
Recap: Defining MDPs
Markov decision processes:
- Set of states S
- Start state s_0
- Set of actions A
- Transitions P(s′ | s, a) (or T(s, a, s′))
- Rewards R(s, a, s′) (and discount γ)

MDP quantities so far:
- Policy = choice of action for each state
- Utility = sum of (discounted) rewards

Solving MDPs
Optimal Quantities
The value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally
The value (utility) of a q-state (s, a): Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
The optimal policy: π*(s) = optimal action from state s
[Diagram: s is a state; (s, a) is a q-state; (s, a, s′) is a transition]
[Demo: gridworld values (L8D4)] [Note: this demo doesn't show anything in action; it just pops up V and Q values.]
Snapshot of Demo Gridworld V Values: Noise = 0, Discount = 1, Living reward = 0
Snapshot of Demo Gridworld Q Values: Noise = 0, Discount = 1, Living reward = 0
Snapshot of Demo Gridworld V Values: Noise = 0.2, Discount = 1, Living reward = 0
Snapshot of Demo Gridworld Q Values: Noise = 0.2, Discount = 1, Living reward = 0
Snapshot of Demo Gridworld V Values: Noise = 0.2, Discount = 0.9, Living reward = 0
Snapshot of Demo Gridworld Q Values: Noise = 0.2, Discount = 0.9, Living reward = 0
Snapshot of Demo Gridworld V Values: Noise = 0.2, Discount = 0.9, Living reward = -0.1
Snapshot of Demo Gridworld Q Values: Noise = 0.2, Discount = 0.9, Living reward = -0.1
Values of States
Fundamental operation: compute the (expectimax) value of a state.
- Expected utility under optimal action
- Average sum of (discounted) rewards
- This is just what expectimax computed!
Recursive definition of value:
V*(s) = max_a Q*(s, a)
Q*(s, a) = Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ]
V*(s) = max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ]
[Diagram: expectimax tree over s, (s, a), and (s, a, s′)]
Racing Search Tree
Racing Search Tree
Racing Search Tree
We're doing way too much work with expectimax!
Problem: states are repeated. Idea: only compute needed quantities once.
Problem: the tree goes on forever. Idea: do a depth-limited computation, but with increasing depths until the change is small. Note: deep parts of the tree eventually don't matter if γ < 1.
Time-Limited Values
Key idea: time-limited values.
Define V_k(s) to be the optimal value of s if the game ends in k more time steps. Equivalently, it's what a depth-k expectimax would give from s.
[Demo time-limited values (L8D6)]
This demo generates the time-limited values for gridworld as snapshotted onto the next slides.
[Snapshots: time-limited values for k = 0 through 12 and k = 100; Noise = 0.2, Discount = 0.9, Living reward = 0 (assuming zero living reward)]
Computing Time-Limited Values
Value Iteration
Value Iteration
Start with V_0(s) = 0: no time steps left means an expected reward sum of zero.
Given a vector of V_k(s) values, do one ply of expectimax from each state:
V_{k+1}(s) = max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V_k(s′) ]
Repeat until convergence.
Complexity of each iteration: O(S²A).
Theorem: value iteration will converge to the unique optimal values.
- Basic idea: the approximations get refined towards the optimal values.
- The policy may converge long before the values do.
[Diagram: one ply from s through (s, a) and (s, a, s′), computing V_{k+1}(s) from V_k(s′)]
Discuss computational complexity: S · A · S per iteration, times the number of iterations.
Note: updates are not in place [if in place, it means something else, and it's not even clear what it means].
Example: Value Iteration
Values for (Cool, Warm, Overheated), assuming no discount:
V_0: 0, 0, 0
V_1: 2, 1, 0
V_2: 3.5, 2.5, 0
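The iteration shown above can be reproduced with a short script. This is a sketch, assuming the racing transition model from the earlier slide (Cool/Warm/Overheated, Slow/Fast, with the -10 overheating penalty); the dictionary encoding is ours, not course code.

```python
# T[s][a] = list of (probability, next_state, reward).
# Overheated is terminal: no actions, so its value stays 0.
T = {
    "cool": {"slow": [(1.0, "cool", 1)],
             "fast": [(0.5, "cool", 2), (0.5, "warm", 2)]},
    "warm": {"slow": [(0.5, "cool", 1), (0.5, "warm", 1)],
             "fast": [(1.0, "overheated", -10)]},
    "overheated": {},
}

def vi_step(V, gamma=1.0):
    """One ply of expectimax from each state:
    V_{k+1}(s) = max_a sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V_k(s'))."""
    return {s: max((sum(p * (r + gamma * V[s2]) for p, s2, r in outs)
                    for outs in acts.values()), default=0.0)
            for s, acts in T.items()}

V = {s: 0.0 for s in T}   # V_0 = (0, 0, 0)
V = vi_step(V)            # V_1 for (cool, warm, overheated)
V = vi_step(V)            # V_2
print(V)
```

With no discount, the first step picks Fast from Cool (expected +2) and Slow from Warm (+1), and the second step builds on those values, matching the numbers above.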
Convergence*
How do we know the V_k vectors are going to converge?
Case 1: If the tree has maximum depth M, then VM holds the actual untruncated values
Case 2: If the discount is less than 1:
- Sketch: for any state, V_k and V_{k+1} can be viewed as depth-(k+1) expectimax results in nearly identical search trees.
- The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros.
- That last layer is at best all R_max and at worst all R_min, but everything that far out is discounted by γ^k.
- So V_k and V_{k+1} are at most γ^k max|R| different.
- So as k increases, the values converge.
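The bound in the sketch above can be checked numerically. Below is a small experiment of our own (not course code) that runs value iteration on the racing MDP with a discount of 0.9 (an arbitrary choice) and asserts that successive iterates never differ by more than γ^k · max|R|:

```python
# T[s][a] = list of (probability, next_state, reward); the racing MDP.
T = {
    "cool": {"slow": [(1.0, "cool", 1)],
             "fast": [(0.5, "cool", 2), (0.5, "warm", 2)]},
    "warm": {"slow": [(0.5, "cool", 1), (0.5, "warm", 1)],
             "fast": [(1.0, "overheated", -10)]},
    "overheated": {},  # terminal state: no actions
}
gamma = 0.9
max_abs_reward = 10  # the largest |R| in this MDP (the -10 for overheating)

def step(V):
    """One value-iteration update with discount gamma."""
    return {s: max((sum(p * (r + gamma * V[s2]) for p, s2, r in outs)
                    for outs in acts.values()), default=0.0)
            for s, acts in T.items()}

V = {s: 0.0 for s in T}  # V_0
for k in range(30):
    V_next = step(V)
    diff = max(abs(V_next[s] - V[s]) for s in T)
    # The bound from the proof sketch: |V_{k+1} - V_k| <= gamma^k * max|R|
    assert diff <= gamma ** k * max_abs_reward + 1e-9
    V = V_next
print("max change at the last iteration:", diff)  # tiny: values have converged
```

Each iteration shrinks the change by at least a factor of γ, which is exactly why the values converge when γ < 1.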
Next Time: Policy-Based Methods
Non-Deterministic Search