11/22: Conditional Planning & Replanning Current Standings sent Semester project report due 11/30...
Date posted: 19-Dec-2015
11/22: Conditional Planning & Replanning
Current Standings sent
Semester project report due 11/30
Homework 4 will be due before the last class
Next class: Review of MDPs (*please* read Chapter 16 and the class slides)
Sensing Actions
• Sensing actions in essence “partition” a belief state
– Sensing a formula f splits a belief state B into B&f and B&~f
– Both partitions now need to be taken to the goal state
• The plan becomes a tree plan
• The search becomes an AO* search
Heuristics will have to compare two generalized AND branches
• In the figure, the lower branch has an expected cost of 11,000
• The upper branch has a fixed sensing cost of 300 plus, based on the outcome, a cost of 7 or 12,000
– If we consider worst-case cost, we assume the cost is 12,300
– If we consider both outcomes to be equally likely, we assume a cost of 6,303.5 units
– If we know the actual probabilities with which the sensing action returns one result as against the other, we can use them to get the expected cost…
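[Figure: two AND branches. Upper branch: sensing action As (cost 300) with outcome costs 7 and 12,000. Lower branch: action A with expected cost 11,000.]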
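The cost comparison above can be checked with a few lines of arithmetic. This is only a sketch: the function name and the 0.9/0.1 probabilities in the last call are made up for illustration; the 300, 7 and 12,000 figures are the ones from the slide.

```python
def sensing_branch_cost(sense_cost, outcome_costs, probs=None, worst_case=False):
    """Cost of a sensing branch: fixed sensing cost plus the cost of its outcomes."""
    if worst_case:
        return sense_cost + max(outcome_costs)
    if probs is None:
        # No probabilities known: treat all outcomes as equally likely.
        probs = [1.0 / len(outcome_costs)] * len(outcome_costs)
    return sense_cost + sum(p * c for p, c in zip(probs, outcome_costs))

worst = sensing_branch_cost(300, [7, 12000], worst_case=True)
uniform = sensing_branch_cost(300, [7, 12000])
# Hypothetical probabilities: the cheap outcome is observed 90% of the time.
skewed = sensing_branch_cost(300, [7, 12000], probs=[0.9, 0.1])
```

Under worst-case or equally-likely assumptions the sensing branch is compared against the 11,000 of the other branch; with skewed probabilities the comparison can change sharply.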
Sensing: General observations
• Sensing can be thought of in terms of
– Specific state variables whose values can be found, OR
– Sensing actions that evaluate the truth of some boolean formula over the state variables: Sense(p); Sense(p V (q&r))
• A general action may have both causative effects and sensing effects
– A sensing effect changes the agent’s knowledge, and not the world
– A causative effect changes the world (and may give certain knowledge to the agent)
– A pure sensing action only has sensing effects; a pure causative action only has causative effects.
Progression/Regression with Sensing
• When applied to a belief state AT RUN TIME, the sensing effects of an action wind up reducing the cardinality of that belief state, basically by removing all states that are not consistent with the sensed effects
• AT PLAN TIME, sensing actions PARTITION belief states
– If you apply Sense-f? to a belief state B, you get a partition B1: B&f and B2: B&~f
– You will have to make a plan that takes both partitions to the goal state
– This introduces branches in the plan
– If you regress two belief states B&f and B&~f over a sensing action Sense-f?, you get the belief state B
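A minimal sketch of the plan-time partition, the run-time filtering, and the regression over a sensing action. The representation (a belief state as a set of states, a state as the set of its true fluents) and all names here are assumptions made for illustration.

```python
def partition(belief, f):
    """Plan time: split belief state B into (B&f, B&~f)."""
    b_true = frozenset(s for s in belief if f(s))
    return b_true, frozenset(belief) - b_true

def observe(belief, f, outcome):
    """Run time: keep only the states consistent with the sensed outcome."""
    return frozenset(s for s in belief if f(s) == outcome)

def regress_over_sensing(b_true, b_false):
    """Regressing B&f and B&~f over Sense-f? gives back B."""
    return b_true | b_false

# Hypothetical belief state over fluents p, q: every truth assignment is possible.
B = frozenset({frozenset(), frozenset({"p"}), frozenset({"q"}), frozenset({"p", "q"})})

def sense_p(state):
    return "p" in state

B1, B2 = partition(B, sense_p)
```

At plan time both B1 and B2 must be taken to the goal; at run time only one of them survives, via observe.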
(assuming single literal sensing) Here P is the set of all state variables and B is the set of state variables that can be sensed (B is not a belief state on this slide).
If a state variable p is in B, then there is some action Ap that can sense whether p is true or false
• If B = P, the problem is fully observable
• If B is empty, the problem is non-observable
• If B is a nonempty proper subset of P, it is partially observable
Note: Full vs. partial observability is independent of sensing individual fluents vs. sensing formulas.
• Full observability: state space partitioned into singleton observation classes
• Non-observability: entire state space is a single observation class
• Partial observability: more than 1 and fewer than |S| observation classes
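The observation classes can be enumerated directly for a small state space. This sketch assumes single-literal sensing and boolean state variables; the variable names D, I, B are borrowed from the Medicate example and the function name is invented.

```python
from itertools import product

def observation_classes(variables, observable):
    """Group all truth assignments by the values of the observable variables."""
    classes = {}
    for values in product([False, True], repeat=len(variables)):
        state = dict(zip(variables, values))
        key = tuple(state[v] for v in observable)
        classes.setdefault(key, []).append(state)
    return list(classes.values())

variables = ["D", "I", "B"]
full = observation_classes(variables, ["D", "I", "B"])   # every variable sensed
none = observation_classes(variables, [])                # nothing sensed
partial = observation_classes(variables, ["B"])          # only B sensed
```

With every variable sensed each class is a singleton, with none sensed there is a single class, and sensing only B gives two classes of four states each.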
Hardness classes for planning with sensing
Planning with sensing is hard or easy depending on (easy case listed first):
• Whether the sensory actions give us full or partial observability
• Whether the sensory actions sense individual fluents or formulas on fluents
• Whether the sensing actions are always applicable or have preconditions that need to be achieved before the action can be done
A Simple Progression Algorithm in the presence of pure sensing actions
Call the procedure Plan(BI,G,nil), where
Procedure Plan(B,G,P)
  If G is satisfied in all states of B, then return P
  Non-deterministically choose:
    I. Non-deterministically choose a causative action a that is applicable in B. Return Plan(a(B),G,P+a)
    II. Non-deterministically choose a sensing action s that senses a formula f (could be a single state variable)
        Let p’ = Plan(B&f,G,nil); p’’ = Plan(B&~f,G,nil) /* B&f is the set of states of B in which f is true */
        Return P+(s?:p’;p’’)
If we always pick I and never do II, then we will produce conformant plans (if we succeed).
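The procedure can be sketched in Python by replacing the non-deterministic choices with a depth-bounded depth-first search. This is one possible rendering, not the class implementation: the sense_first flag, the depth bound, and the skipping of actions that leave the belief state unchanged are all additions. The toy domain (A1: p => r,~p; A2: ~p => r,p; A3: r => g; O5: observe(p)) is assumed to have actions with conditional effects and no preconditions.

```python
def plan(belief, goal, causative, sensing, depth, sense_first=False):
    """Return a plan taking every state of `belief` to the goal, or None."""
    if all(goal(s) for s in belief):
        return []
    if depth == 0:
        return None

    def try_causative():  # choice I
        for name, act in causative.items():
            nxt = frozenset(act(s) for s in belief)
            if nxt == belief:
                continue  # action changes nothing; skip to avoid trivial loops
            rest = plan(nxt, goal, causative, sensing, depth - 1, sense_first)
            if rest is not None:
                return [name] + rest
        return None

    def try_sensing():  # choice II
        for name, f in sensing.items():
            b_t = frozenset(s for s in belief if f(s))
            b_f = belief - b_t
            if not b_t or not b_f:
                continue  # sensing tells us nothing new here
            p1 = plan(b_t, goal, causative, sensing, depth - 1, sense_first)
            p2 = plan(b_f, goal, causative, sensing, depth - 1, sense_first)
            if p1 is not None and p2 is not None:
                return [(name, p1, p2)]  # s?: p' ; p''
        return None

    order = (try_sensing, try_causative) if sense_first else (try_causative, try_sensing)
    for choice in order:
        result = choice()
        if result is not None:
            return result
    return None

# Toy domain; a state is the frozenset of fluents that are true.
causative = {
    "A1": lambda s: (s - {"p"}) | {"r"} if "p" in s else s,
    "A2": lambda s: s | {"r", "p"} if "p" not in s else s,
    "A3": lambda s: s | {"g"} if "r" in s else s,
}
sensing = {"O5": lambda s: "p" in s}
initial = frozenset({frozenset(), frozenset({"p"})})  # don't know p

def has_g(state):
    return "g" in state

conformant = plan(initial, has_g, causative, sensing, depth=3)
conditional = plan(initial, has_g, causative, sensing, depth=3, sense_first=True)
```

Trying causative actions first yields a conformant plan, while trying sensing first yields a branching plan; a real planner would pick between the two using heuristics rather than a fixed order.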
Remarks on the progression with sensing actions
• Progression is implicitly finding an AND subtree of an AND/OR graph
– If we look for AND subgraphs, we can represent DAGs.
• The amount of sensing done in the eventual solution plan is controlled by how often we pick step I vs. step II (if we always pick I, we get conformant solutions).
• Progression is as clueless about whether to do sensing, and which sensing to do, as it is about which causative action to apply
– Need heuristic support
Heuristics for sensing
• We need to compare the cumulative distance of B1 and B2 to the goal with that of B3 to the goal
• Notice that planning cost is related to plan size, while plan execution cost is related to the length of the deepest branch (or the expected length of a branch)
• If we use the conformant belief state distance (as discussed last class), then we will be overestimating the distance (since sensing may allow us to use a shorter branch)
• Bryce [ICAPS 05, submitted] starts with the conformant relaxed plan and introduces sensory actions into the plan to estimate the cost more accurately
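[Figure: the two branches from the earlier heuristics slide (As with costs 300, then 7 or 12,000; A with cost 11,000), with belief states labeled B1, B2 and B3.]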
Sensing: More things under the mat (which we won’t lift for now)
• Sensing extends the notion of goals (and action preconditions).
– Findout goals: Check if Rao is awake vs. Wake up Rao
• Presents some tricky issues in terms of goal satisfaction…!
– You cannot use “causative” effects to support “findout” goals
– But what if the causative effects are supporting another needed goal and wind up affecting the goal as a side-effect? (e.g. Have-gong-go-off & find-out-if-rao-is-awake)
• Quantification is no longer syntactic sugaring in effects and preconditions in the presence of sensing actions
– rm * can satisfy the effect forall files remove(file), without KNOWING what the files in the directory are!
– This is an alternative to finding each file’s name and doing rm <file-name>
• Sensing actions can have preconditions (as well as other causative effects); they can have cost
• The problem of OVER-SENSING (sort of like a beginning driver who looks in all directions every 3 millimeters of driving; also Sphexishness) [XII/Puccini project]
– Handling over-sensing using local closed-world assumptions
– Listing a file doesn’t destroy your knowledge about the size of a file; but compressing it does. If you don’t recognize it, you will always be checking the size of the file after each and every action
Very simple Example
A1: p => r, ~p
A2: ~p => r, p
A3: r => g
O5: observe(p)
Problem: Init: don’t know p; Goal: g
Plan: O5:p?[A1;A3][A2;A3]
Notice that in this case we also have a conformant plan: A1;A2;A3
Whether or not the conformant plan is cheaper depends on how costly the sensing action O5 is compared to A1 and A2
A more interesting example: Medication
The patient is not Dead and may be Ill. The test paper is not Blue.
We want to make the patient be not Dead and not Ill.
We have three actions:
• Medicate, which makes the patient not ill if he is ill (in the usual version of this domain, medicating a patient who is not ill kills him, which is what rules out simply medicating)
• Stain, which makes the test paper blue if the patient is ill
• Sense-paper, which can tell us if the paper is blue or not.
No conformant plan is possible here. Also, notice that I cannot be sensed directly but only through B.
This domain is partially observable because the states (~D,I,~B) and (~D,~I,~B) cannot be distinguished.
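A small sketch of the Medicate domain shows why sensing only helps after Stain. The lethal side effect of Medicate on a healthy patient is an assumption here (it is how the domain is usually stated, and it is what makes a conformant plan impossible); states are sets of the true fluents D, I, B.

```python
def medicate(s):
    # Assumption: medicating a patient who is not ill kills him.
    return s - {"I"} if "I" in s else s | {"D"}

def stain(s):
    return s | {"B"} if "I" in s else s

def sense_paper(s):
    return "B" in s

def goal(s):
    return "D" not in s and "I" not in s

# Initial belief state: (~D, I, ~B) or (~D, ~I, ~B).
init = {frozenset({"I"}), frozenset()}

obs_before = {sense_paper(s) for s in init}        # same reading in both states
obs_after = {sense_paper(stain(s)) for s in init}  # Stain makes them distinguishable

def conditional_plan(s):
    """Stain; Sense-paper? [Medicate] []"""
    s = stain(s)
    if sense_paper(s):
        s = medicate(s)
    return s

results = [conditional_plan(s) for s in init]
```

Sensing the paper before staining returns the same answer in both initial states, so the illness can only be found out indirectly through B.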
“Goal directed” conditional planning
Recall that regression of two belief states B&f and B&~f over a sensing action Sense-f will result in the belief state B
Search with this definition leads to two challenges:
1. We have to combine search states into single ones (a sort of reverse AO* operation)
2. We may need to explicitly condition a goal formula in the partially observable case (especially when certain fluents can only be indirectly sensed)
– Example is the Medicate domain, where I has to be found through B
– If you have a goal state B, you can always write it as B&f and B&~f for any arbitrary f! (The goal Happy is achieved by achieving the twin goals Happy&rich as well as Happy&~rich)
– Of course, we need to pick the f such that f/~f can be sensed (i.e. f and ~f define an observational class feature)
– This step seems to go against the grain of “goal-directedness”: we may not know what to sense based on what our goal is after all!
Regression for the PO case is still not well understood
Very simple Example
A1: p => r, ~p
A2: ~p => r, p
A3: r => g
O5: observe(p)
Problem: Init: don’t know p; Goal: g
Regression
Handling the “combination” during regression
We have to combine search states into single ones (a sort of reverse AO* operation)
Two ideas:
1. In addition to the normal regression children, also generate children from any pair of regressed states on the search fringe (has a breadth-first feel. Can be expensive!) [Tuan Le does this]
2. Do a contingent regression. Specifically, go ahead and generate B from B&f using Sense-f; but now you have to go “forward” from the “not-f” branch of Sense-f to goal too. [CNLP does this; See the example]
Need for explicit conditioning during regression (not needed for Fully Observable case)
• If you have a goal state B, you can always write it as B&f and B&~f for any arbitrary f! (The goal Happy is achieved by achieving the twin goals Happy&rich as well as Happy&~rich)
• Of course, we need to pick the f such that f/~f can be sensed (i.e. f and ~f define an observational class feature)
• This step seems to go against the grain of “goal-directedness”: we may not know what to sense based on what our goal is after all!
– Consider the Medicate problem. Coming from the goal of ~D&~I, we will never see the connection to sensing blue!
Notice the analogy to conditioning in evaluating a probabilistic query
Similar processing can be done for regression (PO planning is nothing but least-committed regression planning)
We now have yet another way of handling unsafe links: conditioning to put the threatening step in a different world!
11/24
Replanning
MDPs
[HW4 updated; see the paper task; only MDP stuff to be added]
Review
Sensing: Limited Contingency planning
• In many real-world scenarios, having a plan that works in all contingencies is too hard
– An idea is to make a plan for some of the contingencies, and monitor/replan as necessary.
• Qn: What contingencies should we plan for?
– The ones that are most likely to occur… (need likelihoods)
• Qn: What do we do if an unexpected contingency arises?
– Monitor (the observable parts of the world)
– When it goes out of the expected world, replan starting from that state.
• Things are more complicated if the world is partially observable
– Need to insert sensing actions to sense fluents that can only be indirectly sensed
“Triangle Tables”
This involves disjunctive goals!
Replanning: Respecting Commitments
• In the real world, where you make commitments based on your plan, you cannot just throw away the plan at the first sign of failure
• One heuristic is to reuse as much of the old plan as possible while doing replanning.
• A more systematic approach is to
1. Capture the commitments made by the agent based on the current plan
2. Give these commitments as additional soft constraints to the planner
Replanning as a universal antidote…
If the domain is observable and lenient to failures, and we are willing to do replanning, then we can always handle non-deterministic as well as stochastic actions with classical planning!
1. Solve the “deterministic” relaxation of the problem
2. Start executing it, while monitoring the world state
3. When an unexpected state is encountered, replan
A planner called FF-Replan, which did exactly this, won the First Intl. Planning Competition, Probabilistic Track.
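The three steps can be sketched as a plan-execute-monitor loop. This is not FF-Replan itself: the classical planner is replaced by a breadth-first search, the relaxation used is "most likely outcome", and the noisy corridor domain (move succeeds with probability 0.8, otherwise the agent stays put) is invented for illustration.

```python
import random
from collections import deque

def determinized_plan(start, goal, actions, likely):
    """BFS over the most-likely-outcome relaxation; returns [(action, expected_state), ...]."""
    parent = {start: None}
    frontier = deque([start])
    while frontier:
        s = frontier.popleft()
        if s == goal:
            steps = []
            while parent[s] is not None:
                prev, a = parent[s]
                steps.append((a, s))
                s = prev
            return steps[::-1]
        for a in actions:
            t = likely(s, a)
            if t not in parent:
                parent[t] = (s, a)
                frontier.append(t)
    return None

def execute_with_replanning(start, goal, actions, likely, sample, rng, max_steps=1000):
    state, steps, replans = start, 0, 0
    plan = determinized_plan(state, goal, actions, likely)       # step 1
    while state != goal and steps < max_steps:
        if not plan:
            break  # no plan in the relaxation
        action, expected = plan.pop(0)
        state = sample(state, action, rng)                       # step 2: execute and monitor
        steps += 1
        if state != expected:                                    # step 3: unexpected state
            plan = determinized_plan(state, goal, actions, likely)
            replans += 1
    return state, steps, replans

# Hypothetical corridor with cells 0..5; the goal is cell 5.
def likely(s, a):
    return min(5, s + 1) if a == "right" else max(0, s - 1)

def sample(s, a, rng):
    return likely(s, a) if rng.random() < 0.8 else s  # 20% of the time nothing happens

final_state, steps, replans = execute_with_replanning(
    0, 5, ["left", "right"], likely, sample, random.Random(0))
```

The loop only works because the corridor has no dead ends; in a domain that is not lenient to failures, a single unexpected outcome could leave no plan to replan with.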
30 years of research into programming languages, ..and C++ is the result?
20 years of research into decision theoretic planning, ..and FF-Replan is the result?
Models of Planning
                          Uncertainty
Observation   Deterministic   Disjunctive   Probabilistic
Complete      Classical       Contingent    (FO)MDP
Partial       ???             Contingent    POMDP
None          ???             Conformant    (NO)MDP
MDPs as Utility-based problem solving agents
[can generalize to have action costs C(a,s)]
If the Mij matrix is not known a priori, then we have a reinforcement learning scenario.
How about the deterministic case? U(si) is the length of the shortest path to the goal
Think of these as h*() values. Called the value function U*
Policies change with rewards.
What does a solution to an MDP look like?
• The solution should tell the optimal action to do in each state (called a “Policy”)
– Policy is a function from states to actions (*see finite horizon case below*)
– Not a sequence of actions anymore
    • Needed because of the non-deterministic actions
– If there are |S| states and |A| actions that we can do at each state, then there are |A|^|S| policies
• How do we get the best policy?
– Pick the policy that gives the maximal expected reward
– For each policy
    • Simulate the policy (take actions suggested by the policy) to get behavior traces
    • Evaluate the behavior traces
    • Take the average value of the behavior traces.
• How long should behavior traces be?
– Each trace is no longer than k (Finite Horizon case)
    • Policy will be horizon-dependent (optimal action depends not just on what state you are in, but on how far away your horizon is)
    • Eg: Financial portfolio advice for yuppies vs. retirees.
– No limit on the size of the trace (Infinite horizon case)
    • Policy is not horizon dependent
• Qn: Is there a simpler way than having to evaluate |A|^|S| policies?
– Yes…
We will concentrate on infinite horizon problems (infinite horizon doesn’t necessarily mean that all behavior traces are infinite. They could be finite and end in a sink state)
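The brute-force recipe (enumerate every policy, simulate traces, average) can be sketched on a toy problem. The two-state MDP below, its rewards, and the number of traces are all invented for illustration; rewards are attached to transitions and traces end in a sink state.

```python
import random
from itertools import product

SINK = "end"
# Hypothetical MDP: T[state][action] = [(probability, next_state, reward), ...]
T = {
    "s0": {"safe": [(1.0, "s1", 0.0)], "risky": [(1.0, SINK, 1.0)]},
    "s1": {"safe": [(1.0, SINK, 5.0)], "risky": [(0.5, SINK, 20.0), (0.5, SINK, 0.0)]},
}

def all_policies(T):
    """Every mapping from states to actions: |A|^|S| of them."""
    states = sorted(T)
    action_lists = [sorted(T[s]) for s in states]
    return [dict(zip(states, choice)) for choice in product(*action_lists)]

def simulate(policy, start, rng, max_len=100):
    """One behavior trace; returns its total reward."""
    state, total = start, 0.0
    for _ in range(max_len):
        if state == SINK:
            break
        outcomes = T[state][policy[state]]
        _, state, reward = rng.choices(outcomes, weights=[o[0] for o in outcomes])[0]
        total += reward
    return total

def evaluate(policy, start, rng, n_traces=2000):
    """Average value of the behavior traces."""
    return sum(simulate(policy, start, rng) for _ in range(n_traces)) / n_traces

rng = random.Random(0)
policies = all_policies(T)
best = max(policies, key=lambda pi: evaluate(pi, "s0", rng))
```

Even here 2 states and 2 actions give 4 policies to simulate; the count grows exponentially in |S|, which is why the slides move on to value and policy iteration.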
[Figure: action outcome probabilities 0.8, 0.1, 0.1]
Why are values coming down first?
Why are some states reaching optimal value faster?
Updates can be done synchronously OR asynchronously; convergence is guaranteed as long as each state is updated infinitely often
Terminating Value Iteration
• The basic idea is to terminate the value iteration when the values have “converged” (i.e., not changing much from iteration to iteration)
– Set a threshold ε and stop when the change across two consecutive iterations is less than ε
– There is a minor problem since value is a vector
    • We can bound the maximum change that is allowed in any of the dimensions between two successive iterations by ε
    • Max norm ||.|| of a vector is the maximal absolute value among all its dimensions. We are basically terminating when ||Ui – Ui+1|| < ε
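A minimal sketch of synchronous value iteration with the max-norm stopping test. The update used is the standard Bellman backup with rewards on transitions and a discount factor gamma, which is an assumption about notation rather than the exact form on the slides; the toy MDP is invented.

```python
SINK = "end"
# Hypothetical MDP: T[state][action] = [(probability, next_state, reward), ...]
T = {
    "s0": {"safe": [(1.0, "s1", 0.0)], "risky": [(1.0, SINK, 1.0)]},
    "s1": {"safe": [(1.0, SINK, 5.0)], "risky": [(0.5, SINK, 20.0), (0.5, SINK, 0.0)]},
}

def q_value(s, a, U, gamma):
    # The sink state is not in U and is worth 0.
    return sum(p * (r + gamma * U.get(t, 0.0)) for p, t, r in T[s][a])

def value_iteration(gamma=0.9, epsilon=1e-6):
    U = {s: 0.0 for s in T}
    iterations = 0
    while True:
        new_U = {s: max(q_value(s, a, U, gamma) for a in T[s]) for s in T}
        iterations += 1
        # Max norm of the change: largest absolute difference over all states.
        change = max(abs(new_U[s] - U[s]) for s in T)
        U = new_U
        if change < epsilon:
            return U, iterations

U, iterations = value_iteration()
greedy = {s: max(sorted(T[s]), key=lambda a: q_value(s, a, U, 0.9)) for s in T}
```

On this tiny acyclic problem the values settle after a couple of sweeps; the final sweep is only there to confirm that the change has dropped below epsilon.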
Policies converge earlier than values
• There are a finite number of policies but an infinite number of value functions.
• So entire regions of the value vector space are mapped to a specific policy
• So policies may be converging faster than values. Search in the space of policies
• Given a utility vector Ui we can compute the greedy policy πi
• The policy loss of πi is ||U^πi – U*|| (max norm difference of two vectors is the maximum amount by which they differ on any dimension)
Consider an MDP with 2 states and 2 actions
[Figure: value space with axes V(S1) and V(S2), divided into regions for the policies P1, P2, P3 and P4, with U* marked]
We can either solve the linear equations exactly, or solve them approximately by running the value iteration a few times (the update won’t have the “max” operation)
n linear equations with n unknowns.
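A sketch of policy iteration that uses the approximate option: the policy is evaluated by repeating the max-free update for a fixed number of sweeps instead of solving the n linear equations. The toy MDP, the discount factor and the sweep count are assumptions for illustration.

```python
SINK = "end"
# Hypothetical MDP: T[state][action] = [(probability, next_state, reward), ...]
T = {
    "s0": {"safe": [(1.0, "s1", 0.0)], "risky": [(1.0, SINK, 1.0)]},
    "s1": {"safe": [(1.0, SINK, 5.0)], "risky": [(0.5, SINK, 20.0), (0.5, SINK, 0.0)]},
}

def q_value(s, a, U, gamma):
    return sum(p * (r + gamma * U.get(t, 0.0)) for p, t, r in T[s][a])

def evaluate_policy(policy, gamma, sweeps=50):
    """Approximate policy evaluation: the value update without the max."""
    U = {s: 0.0 for s in T}
    for _ in range(sweeps):
        U = {s: q_value(s, policy[s], U, gamma) for s in T}
    return U

def policy_iteration(gamma=0.9):
    policy = {s: sorted(T[s])[0] for s in T}  # arbitrary starting policy
    rounds = 0
    while True:
        U = evaluate_policy(policy, gamma)
        improved = {s: max(sorted(T[s]), key=lambda a: q_value(s, a, U, gamma)) for s in T}
        rounds += 1
        if improved == policy:
            return policy, U, rounds
        policy = improved

policy, U, rounds = policy_iteration()
```

With exact evaluation the loop is guaranteed to stop because each round strictly improves the policy; with only a few evaluation sweeps that guarantee is weaker, so a production version would want an iteration cap.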
Other ways of solving MDPs
• Value and Policy iteration are the bed-rock methods for solving MDPs. Both give optimality guarantees
• Both of them tend to be very inefficient for large (several thousand state) MDPs
• Many ideas are used to improve the efficiency while giving up optimality guarantees
– E.g. Consider the part of the policy for more likely states (envelope extension method)
– Interleave “search” and “execution” (Real Time Dynamic Programming, RTDP)
    • Do limited-depth analysis based on reachability to find the value of a state (and thereby the best action you should be doing, which is the action that is sending you to the best value)
    • The values of the leaf nodes are set to be their immediate rewards
    • If all the leaf nodes are terminal nodes, then the backed up value will be the true optimal value. Otherwise, it is an approximation…
What if you see this as a game?
The expected value computation is fine if you are maximizing “expected” return
What if you are risk-averse (and think “nature” is out to get you)? Then V2 = min(V3,V4)
If you are a perpetual optimist, then V2 = max(V3,V4)
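The limited-depth backup and the three attitudes toward nature can be sketched with one function that takes the outcome-combining rule as a parameter. The toy MDP is invented; rewards sit on transitions, so a node cut off at the depth limit contributes 0 beyond the rewards already collected (a simplification of "leaf value = immediate reward").

```python
SINK = "end"
# Hypothetical MDP: T[state][action] = [(probability, next_state, reward), ...]
T = {
    "s0": {"safe": [(1.0, "s1", 0.0)], "risky": [(1.0, SINK, 1.0)]},
    "s1": {"safe": [(1.0, SINK, 5.0)], "risky": [(0.5, SINK, 20.0), (0.5, SINK, 0.0)]},
}

def expected(outcomes):   # maximize expected return
    return sum(p * v for p, v in outcomes)

def pessimist(outcomes):  # risk-averse: nature picks the worst outcome
    return min(v for _, v in outcomes)

def optimist(outcomes):   # perpetual optimist: nature picks the best outcome
    return max(v for _, v in outcomes)

def lookahead(state, depth, combine):
    """Depth-limited backed-up value of `state`."""
    if state == SINK or depth == 0:
        return 0.0
    return max(
        combine([(p, r + lookahead(t, depth - 1, combine)) for p, t, r in T[state][a]])
        for a in T[state]
    )

v_expected = lookahead("s0", 2, expected)
v_pessimist = lookahead("s0", 2, pessimist)
v_optimist = lookahead("s0", 2, optimist)
v_shallow = lookahead("s0", 1, expected)
```

At depth 2 every leaf is terminal, so the expected-value result is the true optimal value; at depth 1 the search is cut off at s1 and the result is only an approximation.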
Incomplete observability (the dreaded POMDPs)
• To model partial observability, all we need to do is to look at the MDP in the space of belief states (belief states are fully observable even when world states are not)
– Policy maps belief states to actions
• In practice, this causes (humongous) problems
– The space of belief states is “continuous” (even if the underlying world is discrete and finite). {GET IT? GET IT??}
– Even approximate policies are hard to find (PSPACE-hard).
    • Problems with a few dozen world states are hard to solve currently
– “Depth-limited” exploration (such as that done in adversarial games) is the only option…
Belief state = {s1: 0.3, s2: 0.4, s4: 0.3}
This figure basically shows that belief states change as we take actions
[Figure panels: 5 LEFTs; 5 UPs]
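How a belief state changes under actions can be sketched with a one-dimensional stand-in for the grid in the figure. The corridor, its size and the 0.8/0.2 motion model are assumptions for illustration; only the progression step (no observation) is shown.

```python
N = 4  # hypothetical corridor with cells 0..3

def transition(state, action):
    """Noisy move: 0.8 to the intended neighbor, 0.2 stay put. Returns {next_state: prob}."""
    step = -1 if action == "LEFT" else 1
    target = min(N - 1, max(0, state + step))
    dist = {}
    dist[target] = dist.get(target, 0.0) + 0.8
    dist[state] = dist.get(state, 0.0) + 0.2
    return dist

def progress(belief, action):
    """b'(s') = sum over s of b(s) * P(s' | s, action)."""
    new_belief = {}
    for s, p in belief.items():
        for t, q in transition(s, action).items():
            new_belief[t] = new_belief.get(t, 0.0) + p * q
    return new_belief

belief = {s: 1.0 / N for s in range(N)}  # start out completely uncertain
for _ in range(5):                        # 5 LEFTs
    belief = progress(belief, "LEFT")
```

Even without sensing, repeated LEFT actions concentrate almost all of the probability mass on the leftmost cell, which is the effect the figure illustrates.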
MDPs and Deterministic Search
• Problem solving agent search corresponds to what special case of MDP?
– Actions are deterministic; goal states are all equally valued, and are all sink states.
• Is it worth solving the problem using MDPs?
– The construction of the optimal policy is overkill
    • The policy, in effect, gives us the optimal path from every state to the goal state(s)
– The value function, or its approximations, on the other hand are useful. How?
    • As heuristics for the problem solving agent’s search
• This shows an interesting connection between dynamic programming and “state search” paradigms
– DP solves many related problems on the way to solving the one problem we want
– State search tries to solve just the problem we want
– We can use DP to find heuristics to run state search..
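The DP-to-heuristic connection can be sketched on a small deterministic graph: dynamic programming computes the cost-to-go of every state, and A* then uses those values as its heuristic for a single query. The graph and function names are invented; with exact values the heuristic is perfect, whereas in practice the DP would be run on a cheaper relaxation.

```python
import heapq
import math

# Hypothetical deterministic problem: graph[node] = {successor: step_cost}
graph = {"A": {"B": 1, "C": 4}, "B": {"C": 1, "D": 5}, "C": {"D": 1}, "D": {}}

def cost_to_go(graph, goal):
    """DP sweeps: U[s] = min over successors t of cost(s, t) + U[t], for every state."""
    U = {s: math.inf for s in graph}
    U[goal] = 0
    changed = True
    while changed:
        changed = False
        for s in graph:
            if s == goal:
                continue
            best = min((c + U[t] for t, c in graph[s].items()), default=math.inf)
            if best < U[s]:
                U[s] = best
                changed = True
    return U

def astar(graph, start, goal, h):
    """State search for one start/goal pair, guided by heuristic values h."""
    frontier = [(h[start], 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if g > best_g.get(node, math.inf):
            continue  # stale queue entry
        for nxt, c in graph[node].items():
            ng = g + c
            if ng < best_g.get(nxt, math.inf):
                best_g[nxt] = ng
                heapq.heappush(frontier, (ng + h[nxt], ng, nxt, path + [nxt]))
    return None, math.inf

U = cost_to_go(graph, "D")
path, cost = astar(graph, "A", "D", U)
```

The DP solves the problem from every state at once, while A* only answers the one query it was given; the value function is what links the two.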
Modeling Softgoal problems as deterministic MDPs