11/22: Conditional Planning & Replanning Current Standings sent Semester project report due 11/30...

11/22: Conditional Planning & Replanning

Current standings sent
Semester project report due 11/30
Homework 4 will be due before the last class
Next class: Review of MDPs (*please* read Chapter 16 and the class slides)

Sensing Actions

Sensing actions in essence “partition” a belief state.
  Sensing a formula f splits a belief state B into B&f and B&~f.
  Both partitions now need to be taken to the goal state.
    The plan becomes a tree plan.
    The search becomes AO* search.

Heuristics will have to compare two generalized AND branches.
  In the figure, the lower branch has an expected cost of 11,000.
  The upper branch has a fixed sensing cost of 300 plus, depending on the outcome, a cost of 7 or 12,000.
    If we consider the worst-case cost, we assume the cost is 12,300.
    If we consider both outcomes to be equally likely, we assume an expected cost of 6,303.5 units.
    If we know the actual probabilities with which the sensing action returns one result rather than the other, we can use them to get the expected cost…

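[Figure: two branches, labeled As and A. Upper branch: sensing cost 300, then a cost of 7 or 12,000. Lower branch: cost 11,000.]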
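The three ways of scoring the sensing branch can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function name `sensing_branch_cost` and the 0.9/0.1 outcome probabilities in the last call are made up for the example and do not come from the slides.

```python
def sensing_branch_cost(sensing_cost, outcome_costs, probabilities=None, worst_case=False):
    """Cost of a branch that pays a fixed sensing cost and then one of several outcome costs."""
    if worst_case:
        # Pessimistic estimate: assume the most expensive outcome happens.
        return sensing_cost + max(outcome_costs)
    if probabilities is None:
        # No information about the sensor: treat all outcomes as equally likely.
        probabilities = [1.0 / len(outcome_costs)] * len(outcome_costs)
    return sensing_cost + sum(p * c for p, c in zip(probabilities, outcome_costs))


LOWER_BRANCH = 11000  # expected cost of the branch without sensing

worst = sensing_branch_cost(300, [7, 12000], worst_case=True)      # 12300
uniform = sensing_branch_cost(300, [7, 12000])                     # 6303.5
# Hypothetical probabilities: the cheap outcome is observed 90% of the time.
skewed = sensing_branch_cost(300, [7, 12000], probabilities=[0.9, 0.1])  # about 1506.3

print(worst > LOWER_BRANCH)    # under worst-case cost the sensing branch looks worse
print(uniform < LOWER_BRANCH)  # under expected cost it looks better
```

Which branch the heuristic prefers therefore depends on the cost model: the sensing branch loses under the worst-case view and wins under either expected-cost view.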

Sensing: General observations

Sensing can be thought of in terms of
  specific state variables whose values can be found out, OR
  sensing actions that evaluate the truth of some boolean formula over the state variables.
    Sense(p); Sense(p V (q&r))
A general action may have both causative effects and sensing effects.
  A sensing effect changes the agent’s knowledge, and not the world.
  A causative effect changes the world (and may give certain knowledge to the agent).
  A pure sensing action only has sensing effects; a pure causative action only has causative effects.

Progression/Regression with Sensing

When applied to a belief state AT RUN TIME, the sensing effects of an action reduce the cardinality of that belief state, basically by removing all states that are not consistent with the sensed effects.
AT PLAN TIME, sensing actions PARTITION belief states.
  If you apply Sense-f? to a belief state B, you get a partition into B1: B&f and B2: B&~f.
  You will have to make a plan that takes both partitions to the goal state.
    This introduces branches in the plan.
  If you regress two belief states B&f and B&~f over a sensing action Sense-f?, you get the belief state B.

Observability (assuming single literal sensing; here P is the set of state variables and B is the set of sensable variables, not a belief state):
  If a state variable p is in B, then there is some action Ap that can sense whether p is true or false.
  If P=B, the problem is fully observable.
  If B is empty, the problem is non-observable.
  If B is a nonempty proper subset of P, it is partially observable.

Note: Full vs. partial observability is independent of sensing individual fluents vs. sensing formulas.

Full observability: the state space is partitioned into singleton observation classes.
Non-observability: the entire state space is a single observation class.
Partial observability: between 1 and |S| observation classes.
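The run-time and plan-time views of sensing, and the notion of observation classes, can be made concrete with sets of states. The sketch below is an illustration under stated assumptions: a state is represented as the frozenset of variables that are true in it, a formula is a Python predicate, and all function names (`partition`, `observe`, `regress`, `observation_classes`) are hypothetical.

```python
from itertools import product

VARIABLES = ("p", "q", "r")


def all_states(variables):
    """Every truth assignment, each represented as the frozenset of true variables."""
    return frozenset(
        frozenset(v for v, bit in zip(variables, bits) if bit)
        for bits in product([False, True], repeat=len(variables))
    )


def partition(belief, formula):
    """Plan-time effect of Sense-f: split B into (B&f, B&~f)."""
    holds = frozenset(s for s in belief if formula(s))
    return holds, belief - holds


def observe(belief, formula, value):
    """Run-time effect: drop the states that are inconsistent with the sensed value."""
    return frozenset(s for s in belief if formula(s) == value)


def regress(b_true, b_false):
    """Regressing B&f and B&~f over Sense-f gives back B."""
    return b_true | b_false


def observation_classes(states, sensable):
    """Group states that agree on every sensable variable (single literal sensing)."""
    classes = {}
    for s in states:
        classes.setdefault(frozenset(s & sensable), set()).add(s)
    return list(classes.values())


B = all_states(VARIABLES)
f = lambda s: "p" in s or ("q" in s and "r" in s)   # the formula p V (q&r)
b_f, b_not_f = partition(B, f)

print(len(b_f), len(b_not_f))                          # sizes of the two partitions
print(len(observation_classes(B, frozenset("pqr"))))   # fully observable: |S| classes
print(len(observation_classes(B, frozenset())))        # non-observable: one class
print(len(observation_classes(B, frozenset("p"))))     # partially observable: in between
```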

Hardness classes for planning with sensing

Planning with sensing is hard or easy depending on (easy case listed first):
  whether the sensory actions give us full or partial observability
  whether the sensory actions sense individual fluents or formulas on fluents
  whether the sensing actions are always applicable or have preconditions that need to be achieved before the action can be done

A Simple Progression Algorithm in the presence of pure sensing actions

Call the procedure Plan(BI,G,nil), where

Procedure Plan(B,G,P)
  If G is satisfied in all states of B, then return P.
  Non-deterministically choose:
    I. Non-deterministically choose a causative action a that is applicable in B.
       Return Plan(a(B),G,P+a)
    II. Non-deterministically choose a sensing action s that senses a formula f (could be a single state variable).
       Let p’ = Plan(B&f,G,nil); p’’ = Plan(B&~f,G,nil)   /* B&f is the set of states of B in which f is true */
       Return P+(s?:p’;p’’)

If we always pick I and never do II, then we will produce conformant plans (if we succeed).
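One way to turn the non-deterministic procedure into something runnable is to replace each non-deterministic choice with depth-bounded backtracking. The sketch below does that; it is not the procedure from the slides verbatim. It returns the plan suffix instead of threading the prefix P, assumes actions with conditional effects that are always applicable, and the `depth` bound and `sense_first` flag are additions for the illustration. The test domain is the small p/r/g example that appears in these slides.

```python
def plan(belief, goal, causative, sensing, depth, sense_first=False):
    """Return a plan (list of steps) taking every state in `belief` to the goal, or None.

    A step is either an action name or a tuple (sensor, plan_if_true, plan_if_false).
    """
    if all(goal(s) for s in belief):
        return []
    if depth == 0:
        return None

    def try_causative():  # choice I
        for name, act in causative.items():
            nxt = frozenset(act(s) for s in belief)
            if nxt == belief:
                continue  # the action changes nothing, so skip it
            rest = plan(nxt, goal, causative, sensing, depth - 1, sense_first)
            if rest is not None:
                return [name] + rest
        return None

    def try_sensing():  # choice II
        for name, f in sensing.items():
            b_true = frozenset(s for s in belief if f(s))
            b_false = belief - b_true
            if not b_true or not b_false:
                continue  # sensing f would not split this belief state
            p_true = plan(b_true, goal, causative, sensing, depth - 1, sense_first)
            if p_true is None:
                continue
            p_false = plan(b_false, goal, causative, sensing, depth - 1, sense_first)
            if p_false is not None:
                return [(name, p_true, p_false)]
        return None

    choices = (try_sensing, try_causative) if sense_first else (try_causative, try_sensing)
    for choice in choices:
        result = choice()
        if result is not None:
            return result
    return None


# States are frozensets of the variables that are true.
CAUSATIVE = {
    "A1": lambda s: (s - {"p"}) | {"r"} if "p" in s else s,   # p => r, ~p
    "A2": lambda s: s | {"r", "p"} if "p" not in s else s,    # ~p => r, p
    "A3": lambda s: s | {"g"} if "r" in s else s,             # r => g
}
SENSING = {"O5": lambda s: "p" in s}                          # observe(p)
INITIAL = frozenset({frozenset(), frozenset({"p"})})          # p unknown
GOAL = lambda s: "g" in s

print(plan(INITIAL, GOAL, CAUSATIVE, SENSING, depth=3))
print(plan(INITIAL, GOAL, CAUSATIVE, SENSING, depth=3, sense_first=True))
```

Trying choice I first yields the conformant plan, while trying choice II first yields the branching plan. This mirrors the remark that the amount of sensing in the solution depends on how often step II is picked.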

Remarks on the progression with sensing actions

Progression is implicitly finding an AND subtree of an AND/OR graph.
  If we look for AND subgraphs, we can represent DAGs.
The amount of sensing done in the eventual solution plan is controlled by how often we pick step I vs. step II (if we always pick I, we get conformant solutions).
Progression is as clueless about whether to do sensing, and which sensing to do, as it is about which causative action to apply.
  Need heuristic support.

Heuristics for sensing

We need to compare the cumulative distance of B1 and B2 to the goal with that of B3 to the goal.
  Notice that planning cost is related to plan size, while plan execution cost is related to the length of the deepest branch (or the expected length of a branch).
If we use the conformant belief state distance (as discussed last class), then we will be overestimating the distance (since sensing may allow us to take a shorter branch).
Bryce [ICAPS 05, submitted] starts with the conformant relaxed plan and introduces sensory actions into the plan to estimate the cost more accurately.

[Figure: the two branches labeled As and A (costs 300, 7, 12,000 and 11,000), with the belief states marked B1, B2 and B3.]

Sensing: More things under the mat (which we won’t lift for now)

Sensing extends the notion of goals (and action preconditions).
  Findout goals: “Check if Rao is awake” vs. “Wake up Rao”
    Presents some tricky issues in terms of goal satisfaction…!
    You cannot use “causative” effects to support “findout” goals.
    But what if the causative effects are supporting another needed goal and wind up affecting the findout goal as a side effect? (e.g. Have-gong-go-off & find-out-if-rao-is-awake)
Quantification is no longer syntactic sugar in effects and preconditions in the presence of sensing actions.
  rm * can satisfy the effect “forall files: remove(file)” without KNOWING what the files in the directory are!
    This is an alternative to finding each file’s name and doing rm <file-name>.
Sensing actions can have preconditions (as well as other causative effects); they can have cost.
The problem of OVER-SENSING (sort of like a beginning driver who looks in all directions every 3 millimeters of driving; also Sphexishness) [XII/Puccini project]
  Handling over-sensing using local closed-world assumptions
    Listing a file doesn’t destroy your knowledge about the size of the file, but compressing it does. If you don’t recognize this, you will always be checking the size of the file after each and every action.

Very simple example

Actions:
  A1: p => r, ~p
  A2: ~p => r, p
  A3: r => g
  O5: observe(p)

Problem:
  Init: don’t know p
  Goal: g

Plan: O5:p?[A1;A3][A2;A3]

Notice that in this case we also have a conformant plan: A1;A2;A3
  Whether or not the conformant plan is cheaper depends on how costly the sensing action O5 is compared to A1 and A2.
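Both plans can be checked by executing them from each possible initial state. The sketch below is a small hypothetical executor: states are frozensets of true variables, a branching step is written as a tuple `(sensor, plan_if_true, plan_if_false)`, and the unit action cost and the sensing costs tried at the end are assumptions chosen only to show the cost trade-off.

```python
def a1(s):  # A1: p => r, ~p
    return (s - {"p"}) | {"r"} if "p" in s else s


def a2(s):  # A2: ~p => r, p
    return s | {"r", "p"} if "p" not in s else s


def a3(s):  # A3: r => g
    return s | {"g"} if "r" in s else s


ACTIONS = {"A1": a1, "A2": a2, "A3": a3}
SENSORS = {"O5": lambda s: "p" in s}  # O5: observe(p)


def execute(plan, state, action_cost=1.0, sensing_cost=1.0):
    """Run a (possibly branching) plan from one concrete state; return (final_state, cost)."""
    cost = 0.0
    for step in plan:
        if isinstance(step, tuple):
            sensor, if_true, if_false = step
            cost += sensing_cost
            branch = if_true if SENSORS[sensor](state) else if_false
            state, branch_cost = execute(branch, state, action_cost, sensing_cost)
            cost += branch_cost
        else:
            state = ACTIONS[step](state)
            cost += action_cost
    return state, cost


CONTINGENT = [("O5", ["A1", "A3"], ["A2", "A3"])]
CONFORMANT = ["A1", "A2", "A3"]
INITIAL_STATES = [frozenset(), frozenset({"p"})]  # p is unknown initially

for start in INITIAL_STATES:
    for name, candidate in (("contingent", CONTINGENT), ("conformant", CONFORMANT)):
        final, cost = execute(candidate, start, sensing_cost=0.5)
        print(name, sorted(start), "g" in final, cost)
```

With unit action costs, the conformant plan always costs 3 and the contingent plan costs 2 plus the sensing cost, so the contingent plan is cheaper exactly when sensing costs less than one causative action.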

A more interesting example: Medication

The patient is not Dead and may be Ill. The test paper is not Blue.
We want to make the patient be not Dead and not Ill.
We have three actions:
  Medicate, which makes the patient not ill if he is ill
  Stain, which makes the test paper blue if the patient is ill
  Sense-paper, which can tell us if the paper is blue or not

No conformant plan is possible here. Also, notice that I cannot be sensed directly, but only through B.

This domain is partially observable because the states (~D,I,~B) and (~D,~I,~B) cannot be distinguished.

“Goal directed” conditional planning

Recall that regression of two belief states B&f and B&~f over a sensing action Sense-f will result in a belief state B.

Search with this definition leads to two challenges:
1. We have to combine search states into single ones (a sort of reverse AO* operation).
2. We may need to explicitly condition a goal formula in the partially observable case (especially when certain fluents can only be indirectly sensed).
   An example is the Medicate domain, where I has to be found through B.
   If you have a goal state B, you can always write it as B&f and B&~f for any arbitrary f! (The goal Happy is achieved by achieving the twin goals Happy&rich as well as Happy&~rich.)
   Of course, we need to pick the f such that f/~f can be sensed (i.e. f and ~f define an observational class feature).
   This step seems to go against the grain of “goal-directedness”: we may not know what to sense based on what our goal is, after all!

Regression for the PO case is still not well understood.

Very simple example: regression

Actions:
  A1: p => r, ~p
  A2: ~p => r, p
  A3: r => g
  O5: observe(p)

Problem:
  Init: don’t know p
  Goal: g

[Figure: regression search for this problem.]

Handling the “combination” during regression

We have to combine search states into single ones (a sort of reverse AO* operation). Two ideas:

1. In addition to the normal regression children, also generate children from any pair of regressed states on the search fringe (this has a breadth-first feel and can be expensive!) [Tuan Le does this]

2. Do a contingent regression. Specifically, go ahead and generate B from B&f using Sense-f; but now you have to go “forward” from the “not-f” branch of Sense-f to the goal too. [CNLP does this; see the example]

Need for explicit conditioning during regression (not needed for Fully Observable case)

If you have a goal state B, you can always write it as B&f and B&~f for any arbitrary f! (The goal Happy is achieved by achieving the twin goals Happy&rich as well as Happy&~rich.)
  Of course, we need to pick the f such that f/~f can be sensed (i.e. f and ~f define an observational class feature).
  This step seems to go against the grain of “goal-directedness”: we may not know what to sense based on what our goal is, after all!
  Consider the Medicate problem. Coming from the goal of ~D&~I, we will never see the connection to sensing blue!

Notice the analogy to conditioning in evaluating a probabilistic query.

Similar processing can be done for regression (PO planning is nothing but least-committed regression planning).

We now have yet another way of handling unsafe links: conditioning, to put the threatening step in a different world!

11/24

Replanning
MDPs
[HW4 updated; see the paper task; only MDP stuff to be added]


Review

Sensing: Limited Contingency planning

In many real-world scenarios, having a plan that works in all contingencies is too hard.
  An idea is to make a plan for some of the contingencies, and monitor/replan as necessary.
  Qn: What contingencies should we plan for?
    The ones that are most likely to occur… (need likelihoods)
  Qn: What do we do if an unexpected contingency arises?
    Monitor (the observable parts of the world).
    When it goes out of the expected world, replan starting from that state.
  Things are more complicated if the world is partially observable.
    Need to insert sensing actions to sense fluents that can only be indirectly sensed.

“Triangle Tables”

This involves disjunctive goals!

Replanning—Respecting Commitments

In the real world, where you make commitments based on your plan, you cannot just throw away the plan at the first sign of failure.

One heuristic is to reuse as much of the old plan as possible while replanning.

A more systematic approach is to
1. Capture the commitments made by the agent based on the current plan
2. Give these commitments as additional soft constraints to the planner

Replanning as a universal antidote…

If the domain is observable and lenient to failures, and we are willing to do replanning, then we can always handle non-deterministic as well as stochastic actions with classical planning!

1. Solve the “deterministic” relaxation of the problem
2. Start executing it, while monitoring the world state
3. When an unexpected state is encountered, replan

A planner that did this, called FF-Replan, won the first probabilistic track of the International Planning Competition.

30 years of research into programming languages... and C++ is the result?

20 years of research into decision-theoretic planning... and FF-Replan is the result?
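The three-step recipe (solve a deterministic relaxation, execute while monitoring, replan on surprises) fits in a short loop. The sketch below is only an illustration of the idea and is not FF-Replan itself: the corridor domain, the 0.8 success probability, the breadth-first planner standing in for a classical planner, and every name in the code are assumptions made for the example.

```python
import random
from collections import deque

N_CELLS = 6            # hypothetical corridor with cells 0..5
GOAL = N_CELLS - 1
MOVES = {"left": -1, "right": +1}


def intended(state, action):
    """Deterministic relaxation: assume each action has its most likely outcome."""
    return min(max(state + MOVES[action], 0), N_CELLS - 1)


def sample_outcome(state, action, rng):
    """Assumed real dynamics: the move succeeds with probability 0.8, else it goes the other way."""
    if rng.random() < 0.8:
        return intended(state, action)
    opposite = "left" if action == "right" else "right"
    return intended(state, opposite)


def classical_plan(start, goal=GOAL):
    """Breadth-first search over the deterministic relaxation."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if state == goal:
            return path
        for action in MOVES:
            nxt = intended(state, action)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [action]))
    return None


def execute_with_replanning(start, rng, max_steps=10_000):
    state, steps, replans = start, 0, 0
    plan = classical_plan(state)                  # 1. solve the deterministic relaxation
    while state != GOAL and steps < max_steps:
        action = plan.pop(0)
        expected = intended(state, action)
        state = sample_outcome(state, action, rng)  # 2. execute and monitor
        steps += 1
        if state != expected:                     # 3. unexpected state: replan from it
            plan = classical_plan(state)
            replans += 1
    return state, steps, replans


final_state, steps_taken, replans_done = execute_with_replanning(0, random.Random(0))
print(final_state, steps_taken, replans_done)
```

The loop relies on the two conditions named above: the state is observable after every step, and the domain is lenient to failures (a slip never leads to a dead end, so replanning always finds a way back).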

Models of Planning

                         Uncertainty
Observation    Deterministic    Disjunctive    Probabilistic
Complete       Classical        Contingent     (FO)MDP
Partial        ???              Contingent     POMDP
None           ???              Conformant     (NO)MDP

MDPs as Utility-based problem solving agents

[can generalize to have action costs C(a,s)]

If the Mij matrix is not known a priori, then we have a reinforcement learning scenario.

(Value)

How about the deterministic case? U(si) is the length of the shortest path to the goal.

Think of these as h*() values… This is called the value function U*.

Policies change with rewards.

What does a solution to an MDP look like?

• The solution should tell the optimal action to do in each state (called a “Policy”)
  – Policy is a function from states to actions (*see finite horizon case below*)
  – Not a sequence of actions anymore
    • Needed because of the non-deterministic actions
  – If there are |S| states and |A| actions that we can do at each state, then there are |A|^|S| policies
• How do we get the best policy?
  – Pick the policy that gives the maximal expected reward
  – For each policy
    • Simulate the policy (take actions suggested by the policy) to get behavior traces
    • Evaluate the behavior traces
    • Take the average value of the behavior traces.
• How long should behavior traces be?
  – Each trace is no longer than k (Finite Horizon case)
    • Policy will be horizon-dependent (the optimal action depends not just on what state you are in, but also on how far away your horizon is)
    • E.g.: Financial portfolio advice for yuppies vs. retirees.
  – No limit on the size of the trace (Infinite horizon case)
    • Policy is not horizon dependent
• Qn: Is there a simpler way than having to evaluate |A|^|S| policies?
  – Yes…

We will concentrate on infinite horizon problems (infinite horizon doesn’t necessarily mean that all behavior traces are infinite. They could be finite and end in a sink state)
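The brute-force recipe (enumerate every policy, simulate it to get behavior traces, average the trace values) can be written down directly for a tiny problem. The two-state MDP below, its rewards, the horizon of 10 and the 2000 traces per policy are all made-up assumptions for the illustration; for simplicity the sketch only considers policies that do not depend on the remaining horizon.

```python
import random
from itertools import product

STATES = ["A", "B"]
ACTIONS = ["stay", "go"]
REWARD = {"A": 0.0, "B": 1.0}
# M[state][action] maps each next state to its probability (hypothetical numbers).
M = {
    "A": {"stay": {"A": 1.0}, "go": {"B": 0.8, "A": 0.2}},
    "B": {"stay": {"B": 1.0}, "go": {"A": 0.8, "B": 0.2}},
}


def all_policies():
    """Every function from states to actions: |A|^|S| of them."""
    return [dict(zip(STATES, choice)) for choice in product(ACTIONS, repeat=len(STATES))]


def simulate(policy, start, horizon, rng):
    """One behavior trace of length `horizon`; its value is the sum of rewards collected."""
    state, total = start, 0.0
    for _ in range(horizon):
        outcomes = M[state][policy[state]]
        state = rng.choices(list(outcomes), weights=list(outcomes.values()))[0]
        total += REWARD[state]
    return total


def evaluate(policy, start="A", horizon=10, traces=2000, seed=0):
    """Average value of many behavior traces generated by following the policy."""
    rng = random.Random(seed)
    return sum(simulate(policy, start, horizon, rng) for _ in range(traces)) / traces


policies = all_policies()
best = max(policies, key=evaluate)
print(len(policies), best)
```

Even here the number of policies is |A|^|S| = 4; with a few dozen states the enumeration is already hopeless, which is the motivation for looking for a simpler way.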

[Figure: value iteration example; action outcome probabilities .8, .1 and .1.]

Why are values coming down first? Why are some states reaching their optimal value faster?

Updates can be done synchronously OR asynchronously; convergence is guaranteed as long as each state is updated infinitely often.

Terminating Value Iteration

• The basic idea is to terminate the value iteration when the values have “converged” (i.e., are not changing much from iteration to iteration)
  – Set a threshold ε and stop when the change across two consecutive iterations is less than ε
  – There is a minor problem since value is a vector
    • We can bound the maximum change that is allowed in any of the dimensions between two successive iterations by ε
    • The max norm ||.|| of a vector is the maximal absolute value among all its dimensions. We are basically terminating when ||Ui – Ui+1|| < ε
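A minimal sketch of value iteration with the max-norm stopping rule follows. It assumes a discounted update U(s) = R(s) + gamma * max_a sum_j M(s,a,j) U(j); the discount factor `gamma`, the threshold name `epsilon`, and the two-state MDP are assumptions for the illustration, since the update equation itself is not part of this transcript.

```python
def value_iteration(states, actions, M, R, gamma=0.9, epsilon=1e-6):
    """Synchronous value iteration; stops when the max-norm change drops below epsilon."""
    U = {s: 0.0 for s in states}
    iterations = 0
    while True:
        U_next = {
            s: R[s] + gamma * max(
                sum(p * U[t] for t, p in M[s][a].items()) for a in actions
            )
            for s in states
        }
        # Max norm of the difference: the largest change in any single dimension.
        change = max(abs(U_next[s] - U[s]) for s in states)
        U, iterations = U_next, iterations + 1
        if change < epsilon:
            return U, iterations


def greedy_policy(U, states, actions, M):
    """In each state, pick the action with the best expected successor value."""
    return {
        s: max(actions, key=lambda a: sum(p * U[t] for t, p in M[s][a].items()))
        for s in states
    }


# Hypothetical two-state MDP: being in B pays 1 per step, being in A pays 0.
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]
R = {"A": 0.0, "B": 1.0}
M = {
    "A": {"stay": {"A": 1.0}, "go": {"B": 0.8, "A": 0.2}},
    "B": {"stay": {"B": 1.0}, "go": {"A": 0.8, "B": 0.2}},
}

U, n_iterations = value_iteration(STATES, ACTIONS, M, R)
print(U, n_iterations, greedy_policy(U, STATES, ACTIONS, M))
```

For this MDP the fixed point can be worked out by hand: U(B) = 1 + 0.9 U(B) gives U(B) = 10, and U(A) = 0.9 (0.8 U(B) + 0.2 U(A)) gives U(A) = 7.2 / 0.82.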

Policies converge earlier than values
• There are a finite number of policies but an infinite number of value functions.
• So entire regions of the value vector are mapped to a specific policy
• So policies may be converging faster than values. Search in the space of policies
• Given a utility vector Ui we can compute the greedy policy π_Ui
• The policy loss of π_Ui is ||U^{π_Ui} – U*||

(the max norm difference of two vectors is the maximum amount by which they differ on any dimension)

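Consider an MDP with 2 states and 2 actions.

[Figure: value space with axes V(S1) and V(S2), divided into regions for the policies P1, P2, P3 and P4, with U* marked.]

Evaluating a fixed policy gives n linear equations with n unknowns.

We can either solve the linear equations exactly, or solve them approximately by running value iteration a few times (the update won’t have the “max” operation).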
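The approximate option, evaluating a fixed policy by a few sweeps of the max-free update and then improving the policy greedily, can be sketched as below. As before, the discount factor, the number of sweeps and the two-state MDP are assumptions for the illustration, and the function names are hypothetical.

```python
def evaluate_policy(policy, states, M, R, gamma=0.9, sweeps=50):
    """Approximate policy evaluation: the value update without the max over actions."""
    U = {s: 0.0 for s in states}
    for _ in range(sweeps):
        U = {
            s: R[s] + gamma * sum(p * U[t] for t, p in M[s][policy[s]].items())
            for s in states
        }
    return U


def improve_policy(U, states, actions, M):
    """Greedy policy with respect to the utility vector U."""
    return {
        s: max(actions, key=lambda a: sum(p * U[t] for t, p in M[s][a].items()))
        for s in states
    }


def policy_iteration(states, actions, M, R):
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = {s: actions[0] for s in states}
    rounds = 0
    while True:
        U = evaluate_policy(policy, states, M, R)
        new_policy = improve_policy(U, states, actions, M)
        rounds += 1
        if new_policy == policy:
            return policy, U, rounds
        policy = new_policy


# Hypothetical two-state MDP: being in B pays 1 per step, being in A pays 0.
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]
R = {"A": 0.0, "B": 1.0}
M = {
    "A": {"stay": {"A": 1.0}, "go": {"B": 0.8, "A": 0.2}},
    "B": {"stay": {"B": 1.0}, "go": {"A": 0.8, "B": 0.2}},
}

final_policy, final_U, n_rounds = policy_iteration(STATES, ACTIONS, M, R)
print(final_policy, n_rounds)
```

The search moves through the finite space of policies, so it stops after a handful of rounds even though the utilities it computes along the way are only approximate.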

Other ways of solving MDPs
• Value and Policy iteration are the bedrock methods for solving MDPs. Both give optimality guarantees
• Both of them tend to be very inefficient for large (several thousand state) MDPs
• Many ideas are used to improve the efficiency while giving up optimality guarantees
  – E.g. Consider the part of the policy for more likely states (envelope extension method)
  – Interleave “search” and “execution” (Real Time Dynamic Programming, RTDP)
    • Do limited-depth analysis based on reachability to find the value of a state (and thereby the best action you should be doing, which is the action that sends you to the best value)
    • The values of the leaf nodes are set to be their immediate rewards
    • If all the leaf nodes are terminal nodes, then the backed up value will be the true optimal value. Otherwise, it is an approximation…

What if you see this as a game?
The expected value computation is fine if you are maximizing “expected” return.
What if you are risk-averse (and think “nature” is out to get you)? Then V2 = min(V3,V4)
If you are a perpetual optimist, then V2 = max(V3,V4)
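The three attitudes differ only in how the values of the possible outcomes are combined when backing up a node. A small sketch, with hypothetical values for V3 and V4 and a hypothetical function name:

```python
def backed_up_value(child_values, probabilities=None, attitude="expected"):
    """Combine the values of a node's possible outcomes into the node's own value."""
    if attitude == "risk-averse":   # nature is out to get you
        return min(child_values)
    if attitude == "optimist":      # nature is on your side
        return max(child_values)
    if probabilities is None:       # no information: treat outcomes as equally likely
        probabilities = [1.0 / len(child_values)] * len(child_values)
    return sum(p * v for p, v in zip(probabilities, child_values))


V3, V4 = 4.0, 10.0  # made-up outcome values

print(backed_up_value([V3, V4]))                          # expected value: 7.0
print(backed_up_value([V3, V4], attitude="risk-averse"))  # min(V3, V4): 4.0
print(backed_up_value([V3, V4], attitude="optimist"))     # max(V3, V4): 10.0
```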

Incomplete observability (the dreaded POMDPs)
• To model partial observability, all we need to do is to look at the MDP in the space of belief states (belief states are fully observable even when world states are not)
  – Policy maps belief states to actions
• In practice, this causes (humongous) problems
  – The space of belief states is “continuous” (even if the underlying world is discrete and finite). {GET IT? GET IT??}
  – Even approximate policies are hard to find (PSPACE-hard).
    • Problems with a few dozen world states are hard to solve currently
  – “Depth-limited” exploration (such as that done in adversarial games) is the only option…

Belief state = {s1: 0.3, s2: 0.4, s4: 0.3}

[Figure: belief states after 5 LEFTs and after 5 UPs.] This figure basically shows that belief states change as we take actions.
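How a belief state changes under an action can be computed by pushing each state's probability mass through the transition model: b'(j) = sum_i b(i) M(i,j). The sketch below starts from the belief state quoted above; the four-cell corridor, the LEFT action that succeeds with probability 0.8, and the function names are assumptions made for the illustration, and no observations are modeled.

```python
CELLS = ["s1", "s2", "s3", "s4"]  # hypothetical corridor, s1 is the leftmost cell


def left_model(cells, p_move=0.8):
    """Assumed LEFT dynamics: move one cell left with probability p_move, else stay put."""
    model = {}
    for i, cell in enumerate(cells):
        if i == 0:
            model[cell] = {cell: 1.0}  # already at the wall
        else:
            model[cell] = {cells[i - 1]: p_move, cell: 1.0 - p_move}
    return model


M = {"LEFT": left_model(CELLS)}


def progress(belief, action, M):
    """Belief after taking `action`: b'(j) = sum_i b(i) * M[action][i][j]."""
    nxt = {}
    for state, mass in belief.items():
        for successor, p in M[action][state].items():
            nxt[successor] = nxt.get(successor, 0.0) + mass * p
    return nxt


belief = {"s1": 0.3, "s2": 0.4, "s4": 0.3}
after_one_left = progress(belief, "LEFT", M)

after_five_lefts = belief
for _ in range(5):
    after_five_lefts = progress(after_five_lefts, "LEFT", M)

print(after_one_left)
print(after_five_lefts)
```

The components of a belief state are real numbers that always sum to 1, which is why the space of belief states is continuous even though the underlying world has only four states.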

MDPs and Deterministic Search
• Problem solving agent search corresponds to what special case of MDP?
  – Actions are deterministic; goal states are all equally valued, and are all sink states.
• Is it worth solving the problem using MDPs?
  – The construction of an optimal policy is overkill
    • The policy, in effect, gives us the optimal path from every state to the goal state(s)
  – The value function, or its approximations, on the other hand, are useful. How?
    • As heuristics for the problem solving agent’s search
• This shows an interesting connection between dynamic programming and “state search” paradigms
  – DP solves many related problems on the way to solving the one problem we want
  – State search tries to solve just the problem we want
  – We can use DP to find heuristics to run state search.

Modeling Softgoal problems as deterministic MDPs
