Learning for Planning
Sungwook Yoon
Subbarao Kambhampati
Arizona State University
Tutorial presented at ICAPS 2007
History of Learning in Planning
Pre-1995 planning algorithms could synthesize plans of about 6–10 actions in minutes
Massive dependence on speedup learning techniques
Golden age for Speedup Learning in Planning
Realistic encodings of Munich airport!
But KB planners (customized by humans) did even better, opening up renewed interest in learning the kinds of knowledge humans are able to put in..
and there is increasing acknowledgement of the domain-modeling burden,
making it attractive to "learn" domain models from examples and demonstrations
Significant scale-up in the last 6-7 years mostly through powerful reachability heuristics
Now we can synthesize plans of 100 actions in seconds.
Reduced interest in learning as a crutch
Planner Customization(using domain-specific Knowledge)
Domain independent planners tend to miss the regularities in the domain
Domain specific planners have to be built from scratch for every domain
An “Any-Expertise” Solution: Try adding domain specific control knowledge to the domain-independent planners
ACME all-purpose planner
Ronco Blocks-world planner; Ronco logistics planner; Ronco jobshop planner
AC-RO customizable planner
Domain-Specific Knowledge: Learned / Human-Given
Any Expertise Solution
Improve Speed? Don't we have pretty fast planners (and pretty
amazing heuristics driving them) already? [If domains are hard] humans are still able to
generate better hand-coded search control; the KB-planning track was able to show significantly
higher speeds. It would be good to automatically learn what Dana and Fahiem put in by hand
[If domains are easy] the “general purpose” planner should (with learning) customize itself to the complexity of the domain..
Also, need for search control is higher with more expressive domain dynamics (temporal, stochastic etc.)
A “Learning for Planning” Track in IPC
There are now “plans” to hold a learning for planning track in IPC
Structure Same domains as used in IPC Learning time (During which the competitors are
allowed to “learn” or “analyze” the domains and add the learned knowledge to their planner)
Test time—where all planners—learning and non-learning ones attempt to solve test problems
Performance during test time is rated [Contact Alan Fern at OSU for details]
Domain Modeling BURDEN??
There are many scenarios where domain modeling is the biggest obstacle Web Service Composition
Most services have very little formal models attached Workflow management
Most workflows are provided with little information about underlying causal models
Learning to plan from demonstrations We will have to contend with incomplete and evolving
domain models..
..but our techniques assume complete and correct models..
Answer: Model-Lite Planning
Any Model Solution
Model-Lite Planning is planning with incomplete models, where "incomplete" means "not enough domain
knowledge to verify correctness/optimality"
How incomplete is incomplete? Missing a couple of
preconditions/effects?
Knowing no more than I/O types?
Challenges in Realizing Model-Lite Planning
1. Planning support for shallow domain models
2. Plan creation with approximate domain models
3. Learning to improve completeness of domain models
Twin Motivations for exploring Learning Techniques for Planning
[Improve Speed] Even in the age of
efficient heuristic planners, hand-crafted knowledge-based planners seem to perform orders of magnitude better
Explore effective techniques for automatically customizing planners
[Reduce Domain-modeling Burden]
Planning Community tends to focus on speedup given correct and complete domain models
Domain modeling burden, often unacknowledged, is nevertheless a strong impediment to application of planning techniques
Explore effective techniques for automatically learning domain models
Any Expertise Solution Any Model Solution
Industry desperately needs domain model learning and adaptation
Physical system != abstractions
Huge tuning and debugging effort
Physical system wear
Planning with no model is inefficient
Control theory is well ahead of us..
Slide from Wheeler Ruml
Beneficial to both Planning & Learning
From Planning Side To speed up the
solution process Search control
To reduce the domain-modeling burden
Model-lite Planning (Kambhampati, AAAI 2007)
To support planning with partial domain models
From Machine Learning Side Challenging Application
Planning can be seen as an application of machine learning
However, in contrast to a majority of learning applications, planning requires:
sequential decisions,
relational structure,
use of domain knowledge
It is neither just applied learning nor applied planning but rather a worthy fundamental research goal!
Outline
Learning Search Control (Lessons from Knowledge-Based Planning Track)
Control Rules, Macros, Reuse
Improved Heuristics, Policies
Learning Domain Models (Model-lite Planning)
Learning action preconditions/effects
Learning hierarchical schemas
Motivation and the Big Picture
Very Brief Review of planning for learning folks & learning for planning folks
We shall put more focus on the recent and promising developments
Classification Learning
Training examples: typically some are positive examples and some are negative examples
Express with features; fit a classifier to the data
Training examples: multiple-label case
Express with features; fit a classifier to the data (decision tree?)
(model-free) Reinforcement Learning
(Grid-world figure: known and unknown states, goal state G, current policy; explore and learn)
Typically, (model-free) RL constructs a policy (solution) as well as the model
RL and MDP
A foundational approach to planning and learning is Reinforcement Learning (RL)
Model-free RL combines the speed-up and domain-learning aspects; model-based RL achieves speed-up planning
Solution techniques for Markov Decision Process (MDP) problems are related to L2P: finding policies, learning approximate value functions, learning policies
RL and MDP techniques do not scale well: typically the whole state space needs to be enumerated
We need scalable planning to deal with the real world
Important Dimensions of Variation
What is being learned? Search control vs. domain knowledge
What kind of background knowledge is used? Full vs. partial domain models; online vs. offline
How is training data obtained? Self exploration or exercise? From the search tree? User-provided (demonstrations)? Automated planning results?
How is training data represented? Propositional vs. relational
How are features generated?
Spectrum of Approaches Tried [AI Mag, 2003]
PLANNING ASPECTS
  Planning approach: state-space search [conjunctive / disjunctive], plan-space search, compilation approaches (CSP, LP, SAT)
  Learning phase: before planning starts, during the planning process, during plan execution
  Problem type: Classical planning (static world, deterministic, fully observable, instantaneous actions, propositional) … 'Full-scope' planning (dynamic world, stochastic, partially observable, durative actions, asynchronous goals, metric/continuous)
  Planning-learning goal: learn or improve domain theory, speed up planning, improve plan quality
LEARNING ASPECTS
  Type of learning: analogical; Bayesian learning; inductive (decision tree, neural network, 'other' induction); reinforcement learning; inductive logic programming; analytical (EBL, static analysis/abstractions, case-based reasoning with derivational/transformational analogy); multi-strategy (EBL & inductive logic programming, analytical & induction, EBL & reinforcement learning)
Spectrum of Approaches
Target knowledge: Search Control | Policy | Value Function | Macro / Subgoal | Domain Definition | HTN
  Classic (probabilistic) planning: Y Y Y Y Y Y
  Oversubscribed planning: Y
  Temporal planning:
  Partially observable: Y
  ORTS: Y
Learning techniques: EBL | ILP | Perceptron / Least Square | Set Covering | Kernel Method | Bayesian
  Classic (probabilistic) planning: Y Y Y Y Y
  Oversubscribed planning: Y
  Temporal planning:
  Partially observable: Y
  ORTS:
Planning – Domain Definition
(define (domain Blocksworld)
  (:requirements … )
  (:predicates … )
  (:action pickup
    :parameters (?x)
    :precondition (and (clear ?x) (ontable ?x) (armempty))
    :effect (and (holding ?x) (not (clear ?x)) (not (ontable ?x)) (not (armempty)))))
Domain name; requirements (:typed, :negative-precondition); predicate definitions; action definition: schema (name and parameters), precondition, effect
Of course, we need initial state and goal for the problem definition
This model itself should be learned to reduce modeling burden..
Planning – Forward State Space Search
(Figure: from the initial state (ontable yellow) (ontable red) (ontable blue) (clear yellow) (clear red) (clear blue), actions such as Pickup Yellow and Pickup Red generate successor states, searching toward the goal (on Yellow Red) (on Red Blue))
Planning – Backward State Space Search
(Figure: regress the goal (on Yellow Red) through actions such as Stack Yellow Red (vs. Unstack Yellow Red) and Pickup Yellow, back toward the initial state (ontable yellow) (ontable red) (ontable blue) (clear yellow) (clear red) (clear blue))
Search & Control
Which branch should we expand? ..depends on which branch is leading (closer) to the goal
(Search-tree figure: states p, pq, pr, ps, pqr, pqs, psq, pst, … expanded via actions A1–A4)
Progression Search vs. Regression Search
POP Algorithm
1. Plan Selection: select a plan P from the search queue
2. Flaw Selection: choose a flaw f (open condition or unsafe link)
3. Flaw Resolution:
   If f is an open condition, choose an action S that achieves f
   If f is an unsafe link, choose promotion or demotion
   Update P; return NULL if no resolution exists
4. If there is no flaw left, return P
(Partial-plan figure: steps S0, S1, S2, S3, Sinf, with causal links for p, ~p, q1 and open conditions oc1, oc2 for goals g1, g2)
Choice points:
  Flaw selection (open condition? unsafe link? a non-backtracking choice)
  Flaw resolution / plan selection (how to select and rank partial plans?)
1. Initial plan: S0 and Sinf, with open conditions for the goals g1, g2
2. Plan refinement (flaw selection and resolution)
Outline
Learning Search Control (Lessons from Knowledge-Based Planning Track)
Control Rules, Macros, Reuse
Improved Heuristics, Policies
Learning Domain Models (Model-lite Planning)
Learning action preconditions/effects
Learning hierarchical schemas
Motivation and the Big Picture
Very Brief Review of planning for learning folks & learning for planning folks
Planner Customization(using domain-specific Knowledge)
Domain independent planners tend to miss the regularities in the domain
Domain specific planners have to be built from scratch for every domain
An “Any-Expertise” Solution: Try adding domain specific control knowledge to the domain-independent planners
How is the Customization Done?
Given by humans (often, they are quite willing!) [IPC KBPlanning Track]
As declarative rules (HTN Schemas, Tlplan rules)
Don’t need to know how the planner works..
Tend to be hard rules rather than soft preferences…
Whether or not a specific form of knowledge can be exploited by a planner depends on the type of knowledge and the type of planner
As procedures (SHOP)
Direct the planner’s search alternative by alternative..
Through Machine Learning
Learning search control rules: UCPOP+EBL, PRODIGY+EBL, (Graphplan+EBL)
Case-based planning (plan reuse): DerSNLP, Prodigy/Analogy
Learning/adjusting heuristics
Domain pre-processing: invariant detection, relevance detection, choice elimination, type analysis (STAN/TIM, DISCOPLAN, RIFO, ONLP etc.)
Abstraction: ALPINE, ABSTRIPS, STAN/TIM etc.
how easy is it to write control information?
We will start with KB-Planning track to get a feel for what control knowledge has been found to be most useful; and see how to get it..
Types of Guidance
Declarative knowledge about desirable or undesirable solutions and partial solutions (SATPLAN+DOM; Cutting Planes)
Declarative knowledge about desirable/undesirable search paths (TLPlan & TALPlan)
A declarative grammar of desirable solutions (HTN)
Procedural knowledge about how the search for the solution should be organized (SHOP)
Search control rules for guiding choice points in the planner’s search (NASA RAX; UCPOP+EBL; PRODIGY)
Cases and rules about their applicability
Planner specific. Expert needs to understand the specific details of the planner’s search space
(largely) independent of the details of the specific planner[affinities do exist between specific types of guidance and planners)
Task Decomposition (HTN) Planning The OLDEST approach for providing domain-specific knowledge
Most of the fielded applications use HTN planning
Domain model contains non-primitive actions, and schemas for reducing them
Reduction schemas are given by the designer
Can be seen as encoding user-intent
Popularity of HTN approaches a testament of ease with which these schemas are available?
Two notions of completeness:
Schema completeness
(Partial Hierarchicalization) Planner completeness
Travel(S,D)
GobyBus(S,D) GobyTrain(S,D)
Getin(B,S)
BuyTickt(B)
Getout(B,D)
BuyTickt(T)
Getin(T,S)
Getout(T,D)
Hitchhike(S,D)
Modeling Action Reduction
(Figure: the reduction schema for GobyBus(S,D): t1: Getin(B,S), t2: BuyTickt(B), t3: Getout(B,D), with conditions In(B), Hv-Tkt, Hv-Money, At(D); instantiated as GobyBus(Phx,Msn) in a plan alongside Get(Money) and Buy(WiscCheese), with conditions Hv-Money and At(Msn))
Affinity between reduction schemas and plan-space planning
Full procedural control: The SHOP way
Travel by bus only if going by taxi doesn’t work out
SHOP provides a "high-level" programming language in which the user can code his/her domain-specific planner
-- Similarities to HTN planning
-- Not declarative (?)
The SHOP engine can be seen as an interpreter for this language
[Nau et. al., 99]
Blurs the domain-specific/domain-independent divide. How often does one have this level of knowledge about a domain?
Rules on desirable State Sequences: TLPlan approach
TLPlan [Bacchus & Kabanza, 95/98] controls a forward state-space planner
Rules are written on state sequences using the linear temporal logic (LTL)
LTL is an extension of propositional logic with temporal modalities: U (until), [] (always), O (next), <> (eventually)
Example:
If you achieve on(B,A), then preserve it until On(C,B) is achieved:
[] ( on(B,A) => on(B,A) U on(C,B) )
Keep growing “good” towers, and avoid “bad” towers
Good towers are those that do not violate any goal conditions
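To make the rule concrete, here is a minimal Python sketch (not TLPlan itself) that checks the preserve-until rule [] ( on(B,A) => on(B,A) U on(C,B) ) over a finite state sequence; the finite-trace semantics and the fact strings are simplifying assumptions for illustration.

# Minimal sketch: check the control rule  [] ( on(B,A) => on(B,A) U on(C,B) )
# over a finite sequence of states.  Each state is a set of ground facts.
# (Assumption: finite-trace semantics; the "until" must be discharged within the trace.)

def holds_until(states, i, hold_fact, until_fact):
    """From position i, hold_fact must stay true until until_fact becomes true."""
    for state in states[i:]:
        if until_fact in state:
            return True
        if hold_fact not in state:
            return False
    return False  # until_fact never achieved within the trace

def check_preserve_rule(states, hold_fact, until_fact):
    """[] ( hold_fact => hold_fact U until_fact ) on a finite trace."""
    return all(hold_fact not in s or holds_until(states, i, hold_fact, until_fact)
               for i, s in enumerate(states))

if __name__ == "__main__":
    trace = [
        {"on(B,A)"},                 # on(B,A) achieved ...
        {"on(B,A)"},                 # ... preserved ...
        {"on(B,A)", "on(C,B)"},      # ... until on(C,B) is achieved
    ]
    print(check_preserve_rule(trace, "on(B,A)", "on(C,B)"))   # True
    bad = [{"on(B,A)"}, set(), {"on(C,B)"}]                   # on(B,A) dropped too early
    print(check_preserve_rule(bad, "on(B,A)", "on(C,B)"))     # False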
TLPLAN Rules can get quite baroque
How "obvious" are these rules? Can these be learned?
The heart of TLPlan is the ability to incrementally and effectively evaluate the truth of LTL formulas.
What are the lessons of KB Track? If TLPlan did better than SHOP
in ICP, then how are we supposed to interpret it?
That TLPlan is a superior planning technology over SHOP?
That the naturally available domain knowledge in the competition domains is easier to encode as linear temporal logic statements on state sequences than as procedures in the SHOP language?
That Fahiem Bacchus and Jonas Kvarnstrom are way better at coming up with domain knowledge for blocks world (and other competition domains) than Dana Nau?
May be we should “learn” this guidance
ICAPS Workshop on the Competition
Are we comparing Dana & Fahiem or SHOP and TLPlan?
(A Critique of Knowledge-based Planning Track at ICP)
Subbarao KambhampatiDept. of Computer Science & Engg.
Arizona State UniversityTempe AZ 85287-5406
Click here to download TLPlan – Click here to download a Fahiem
Click here to download SHOP – Click here to download a Dana
Approaches for Learning Search Control
Improve an existing planner ("speedup learning"):
  Learn rules to guide choice points
  Learn plans to reuse (macros, annotated cases)
  Learn adjustments to heuristics
Learn "from scratch" how to plan:
  Learn "reactive policies": State x Goal → action
  [Work by Khardon, 99; Givan, Fern, Yoon, 2003]
Outline
Learning Search Control
(Lessons from Knowledge-Based Planning Track)
Control Rules, Macros, Reuse
Improved Heuristics, Policies
Learning Domain Models
(Model-lite Planning)
Learning action preconditions/effects
Learning hierarchical schemas
Motivation and the Big Picture
Very Brief Review of planning for learning folks & learning for planning folks
General Strategy for Inductive Learning of Search Control
Convert to "classification" learning:
  +ve examples: search nodes on the success path
  -ve examples: search nodes one step away from the success path
Learn a classifier
The classifier may depend on the features of the problem (Init, Goal), as well as the current state
Several systems: Grasshopper (Leckie & Zuckerman, 1998); Inductive Logic Programming (Estlin & Mooney, 1993)
If Polished(x)@S & ~Initially-True(Polished(x)) Then REJECT Stepadd(Roll(x),Cylindrical(x)@s)
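A minimal sketch of this strategy, with invented node features and data; real systems such as Grasshopper or the ILP-based learners use relational features of the node, goal, and initial state.

# Sketch: search-control learning as classification (invented features and data).
# +1 = node on the success path, -1 = node one step off the success path.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical node features: [number of open goals,
#                              number of unsatisfied preconditions,
#                              last action deleted a goal fact (0/1)]
X = [[2, 1, 0],   # on the success path
     [1, 0, 0],   # on the success path
     [3, 2, 1],   # one step off the path
     [2, 1, 1]]   # one step off the path
y = [+1, +1, -1, -1]

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Use the classifier as a node selection/rejection rule during search:
candidate = [2, 1, 1]
print("expand" if clf.predict([candidate])[0] == 1 else "reject")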
Explanation-based Learning Start with a labeled example, and some
background domain theory Explain, using the background theory, why the
example deserves the label Think of explanation as a way of picking class-
relevant features with the help of the background knowledge
Use the explanation to generalize the example (so you have a general rule to predict the label)
Used extensively in planning Given a correct plan for an initial and goal state
pair, learn a general plan Given a search tree with failing subtrees, learn
rules that can predict failures Given a stored plan and the situations where it
could not be extended, learn rules to predict applicability of the plan
Issues in EBL for Search Control Rules
Effectiveness of learning depends on the explanation Primitive explanations
of failure may involve constraints that are directly inconsistent
But it would be better if we can unearth hidden inconsistencies
..an open issue is to learn with probably incorrect explanations UCPOP+CFEBL
Status of EBL learning in Planning Explanation-based learning from failures has been ported to modern planners
GP-EBL [Kambhampati, 2000] ports EBL to Graphplan
“Mutual exclusion relations” are learned
(exploits the connection between EBL and “nogood” learning in CSP)
Impressive speed improvements
EBL is considered standard part of Graphplan implementation now..
…but much of the learning was intra problem
Some misconceptions about EBL Misconception 1: EBL needs complete and correct
background knowledge (Confounds “Inductive vs. Analytical” with “Knowledge rich
vs. Knowledge poor”) If you have complete and correct knowledge then the learned
knowledge will be in the deductive closure of the original knowledge;
If not, then the learned knowledge will be tentative (just as in inductive learning)
Misconception 2: EBL is competing with inductive learning In cases where we have weak domain theories, EBL can be
seen as a “feature selection” phase for the inductive learner Misconception 3: Utility problem is endemic to EBL
Search control learning of any sort can suffer from utility problem
E.g. Using inductive learning techniques to learn search control
L2P – Search Control - EBL: Potential Future Approach
Combine with the MDL (Minimum Description Length) paradigm; use the EBL paradigm as a feature-selection approach
Note that the proof structure itself may not be very useful, since only the leaf nodes of the proof tree can be used as features
Simplify the hypothesis space: generally, ILP approaches did not work too well; find an alternative compact and modular KR (description logic?)
Approaches for Learning Search Control
Improve an existing planner ("speedup learning"):
  Learn rules to guide choice points
  Learn plans to reuse (macros, annotated cases)
  Learn adjustments to heuristics
Learn "from scratch" how to plan:
  Learn "reactive policies": State x Goal → action
  [Work by Khardon, 99; Givan, Fern, Yoon, 2003]
Outline
Learning Search Control
(Lessons from Knowledge-Based Planning Track)
Control Rules, Macros, Reuse
Improved Heuristics, Policies
Learning Domain Models
(Model-lite Planning)
Learning action preconditions/effects
Learning hierarchical schemas
Motivation and the Big Picture
Very Brief Review of planning for learning folks & learning for planning folks
L2P – Search Control - Macro
From PDDL: for two actions, when the effect of one is well connected to the precondition of the other, we can construct a macro action
This can be verified from example solutions
A macro is used just like an action during planning
Example: the Push-Start and Push-End actions in the Pipesworld domain (IPC-4)
A learner can find frequent patterns in the solution plans
Learning systems: MacroFF and Marvin
Future approaches: how to find longer macros; learn macros from tagged solution trajectories
L2P – Search Control - MacroFF
(:action UNLOAD
  :parameters (?x - hoist ?y - crate ?t - truck ?p - place)
  :precondition (and (in ?y ?t) (available ?x) (at ?t ?p) (at ?x ?p))
  :effect (and (not (in ?y ?t)) (not (available ?x)) (lifting ?x ?y)))

(:action DROP
  :parameters (?x - hoist ?y - crate ?s - surface ?p - place)
  :precondition (and (lifting ?x ?y) (clear ?s) (at ?s ?p) (at ?x ?p))
  :effect (and (available ?x) (not (lifting ?x ?y)) (at ?y ?p) (not (clear ?s)) (clear ?y) (on ?y ?s)))

(:action UNLOAD|DROP
  :parameters (?h - hoist ?c - crate ?t - truck ?p - place ?s - surface)
  :precondition (and (at ?h ?p) (in ?c ?t) (available ?h) (at ?t ?p) (clear ?s) (at ?s ?p))
  :effect (and (not (in ?c ?t)) (not (clear ?s)) (at ?c ?p) (clear ?c) (on ?c ?s)))
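A rough sketch of the composition idea behind macros like UNLOAD|DROP (a simplified STRIPS-style encoding, not MacroFF's actual code): chain two actions by treating the second action's preconditions that the first does not provide as macro preconditions, and taking the net add/delete effects.

# Sketch: compose two STRIPS-style actions a1;a2 into one macro action.
# Each action is (preconditions, add effects, delete effects) over already-unified parameters.

def make_macro(name, a1, a2):
    pre1, add1, del1 = a1
    pre2, add2, del2 = a2
    pre = set(pre1) | (set(pre2) - set(add1))    # what a1 does not provide must hold initially
    add = (set(add1) - set(del2)) | set(add2)    # net effects of applying a1 then a2
    dele = (set(del1) - set(add2)) | set(del2)
    # (a system like MacroFF would further simplify redundant effects, e.g. re-added preconditions)
    return name, (pre, add, dele)

unload = ({"(in ?c ?t)", "(available ?h)", "(at ?t ?p)", "(at ?h ?p)"},
          {"(lifting ?h ?c)"},
          {"(in ?c ?t)", "(available ?h)"})
drop = ({"(lifting ?h ?c)", "(clear ?s)", "(at ?s ?p)", "(at ?h ?p)"},
        {"(available ?h)", "(at ?c ?p)", "(clear ?c)", "(on ?c ?s)"},
        {"(lifting ?h ?c)", "(clear ?s)"})

name, (pre, add, dele) = make_macro("UNLOAD|DROP", unload, drop)
print(name, "\n pre:", sorted(pre), "\n add:", sorted(add), "\n del:", sorted(dele))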
We can also explain (& generalize) success
Success explanations tend to involve more components of the plan than failure explanations
Case-study: DerSNLP Modifiable derivational traces are reused
Traces are automatically acquired during problem solving
Analyze the interactions among the parts of a plan, and store plans for non-interacting subgoals separately
Reduces retrieval cost Use of EBL failure analysis to detect interactions
All relevant trace fragments are retrieved and replayed before the control is given to from-scratch planner
Extension failures are traced to individual replayed traces, and their storage indices are modified appropriately
Improves retrieval accuracy
( Ihrig & Kambhampati, JAIR 97)
Reuse/Macrops Current Status
Since ~1996 there has been little work on reuse and macrop based improvement of base-planners
People sort of assumed that the planners are already so fast, they can’t probably be improved further
Macro-FF, a system that learns 2-step macros in the context of FF, posted a respectable performance at IPC 2004 (but NOT in the KB-track)
Uses a sophisticated method assessing utility of the learned macrops (& also benefits from the FF enforced hill-climbing search)
Macrops are retained only if they improve performance significantly on a suite of problems
Given that there are several theoretical advantages to reuse and replay compared to Macrops, it would certainly be worth seeing how they fare at IPC [Open]
Macro – Machine Learning
Training example generation: solutions from domain-independent planners (FF)
Positive vs. negative examples: positive = consequent actions in the plans; negative = non-consequent actions
Features: automatically constructed from operator definitions
Background knowledge: domain definition
What I will talk about
Control knowledge for Satplan
Learning value functions: heuristic function, measures of progress
Learning policies: policy learning, RRL, random walk – approximate policy iteration
Learning domain models: logical filtering, probabilistic operator learning, ARMS, Markov Logic Networks
Conclusion & future research
When designing a machine learning algorithm for planning, ask:
How will you represent the target concept? A policy? Search control? If so, how? A decision tree?
What is your feature space? State facts? First-order logic? A kernel?
Where does your training data come from? Automated planning? Random wandering? Human-provided?
How will you learn from the data? Gradient descent? Least squares? Boosting? Set coverage?
Set Covering
(Figure: set-covering over blocksworld training examples, e.g. + Pickup red (ontable, clear), - Pickup blue (ontable, clear), - Pickup yellow (ontable, clear), + Stack red blue (holding, clear); the learned rules reproduce the plan Pickup Red, Stack Red Blue)
Perceptron Update
L2P – Search Control – SAT constraints (controls SAT search, unit propagation)
The performance of a SAT planner can be enhanced with domain background knowledge
For the logistics domain:
  Packages that are already at the goal shouldn't be moved
  Once a package leaves a location, it should not return to it
  A package can only be at its original location or its goal location
Learning system: Huang, Selman and Kautz, ICML 2000; generate training examples from solved plans
How to generate training examples, what are the features, and how to learn?
L2P – Search Control – SAT constraints
(Blocksworld figure: candidate actions such as Pickup Red, Pickup Yellow, Pickup Blue, Stack Red Blue, Putdown Red, Unstack Red Blue, Stack Yellow Red, Putdown Yellow are labeled as static/dynamic "select positive" or "select negative" examples relative to the solution plan for the goal)
L2P – Search Control – SAT constraints
With positive and negative training examples, run FOIL to learn "selection" rules and "rejection" rules
Use the learned rules to generate clauses for SAT: from (pickup ?x) <- (clear ?x), generate (not (clear a)_i V (pickup a)_i) for ground facts and actions at level i
Experiments showed performance enhancement
Future approaches: apply the learning to IP, LP, or CSP approaches; how to use stochastic rules, since learning can be imperfect (MaxSAT?)
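A small illustrative sketch of that clause-generation step; the literal naming and the grounding loop are assumptions, not the actual Huang, Selman and Kautz implementation.

# Sketch: turn a learned selection rule  (pickup ?x) <- (clear ?x)
# into ground clauses  (not clear(a)@i  OR  pickup(a)@i)  for each object a and level i.

def ground_rule_clauses(head, body, objects, horizon):
    """head/body are unary predicate names; returns CNF clauses as lists of literals."""
    clauses = []
    for i in range(horizon):
        for obj in objects:
            # body@i -> head@i   is the clause   (not body(obj)@i) OR head(obj)@i
            clauses.append([f"-{body}({obj})@{i}", f"+{head}({obj})@{i}"])
    return clauses

# Learned selection rule: (pickup ?x) <- (clear ?x)
for clause in ground_rule_clauses("pickup", "clear", ["a", "b"], horizon=2):
    print(clause)        # e.g. ['-clear(a)@0', '+pickup(a)@0']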
Control Knowledge for SatPlan - Summary
Training example generation: solutions from Satplan
Positive vs. negative examples: positive = actions in the solution plans; negative = actions not in the solution plans (reversed for rejection-rule learning)
Features: relational features from FOIL
Background knowledge: predicates in the domain
Target representation: first-order rules
Learning method: greedy set covering
(Potential) future extension: apply to other forms of reduction, IPPlan or CSPPlan
Learning to Improve Heuristics Most modern planners use reachability heuristics
These approximate the reachability by ignoring some types of interactions (usually, negative interactions between subgoals)
While effective in general, ignoring such negative interactions can worsen the heuristic guidance and lead the planners astray
A way out is to “adjust” the reachability information with the information about interactions that were ignored
1. (Static) Adjusted-sum heuristics as popularized in AltAlt
   Increases the heuristic cost (as we need to propagate negative interactions)
   Could be bad for progression planners, which grow the planning graph once for each node
2. (Learn dynamically) Learn to predict the difference between the heuristic estimate from the relaxed plan and the "true" distance
   Yoon et al. show that this is feasible, and manage to improve the performance of FF
[Yoon, Fern and Givan, ICAPS 2006]
Heuristic Value Comparison
Consider deletions of in(CAR,x) when move(x,y) is taken in relaxed plan
Plangraph Length 4
Relaxed Plan Length (RPL) 7
Complementary Heuristic 1
Real Plan Length 8
Learning Adjustments to Heuristics
Start with a set of training examples [Problem, Plan]
  Use a standard planner, such as FF, to generate these
For each example [(I,G), Plan]
  For each state S on the plan:
    Compute the relaxed plan heuristic SR
    Measure the actual distance S* of S from the goal (easy since we have the current plan, assuming it is optimal)
Inductive learning problem
  Training examples: features of the relaxed plan of S (Yoon et al. use a taxonomic feature representation)
  Class labels: S* - SR (the adjustment)
  Learn the classifier: a finite linear combination of features of the relaxed plan
[Yoon et al., ICAPS 2006]
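A minimal sketch of the regression step, with invented relaxed-plan features and targets; Yoon et al. use taxonomic features, and here a plain least-squares fit stands in for their learner.

# Sketch: learn an additive adjustment  delta(S) ~ w . f(S)  so that
# h(S) = RPL(S) + delta(S) better matches the true distance-to-go.
import numpy as np

# Each row: feature values of the relaxed plan of a state S (illustrative features).
F = np.array([[1., 0., 2.],
              [0., 1., 3.],
              [2., 1., 1.],
              [1., 2., 0.]])
rpl  = np.array([4., 6., 5., 3.])     # relaxed-plan length at S
true = np.array([5., 8., 7., 3.])     # actual remaining plan length (from the training plan)

target = true - rpl                              # the adjustment to learn
w, *_ = np.linalg.lstsq(F, target, rcond=None)   # least-squares fit

def adjusted_h(features, relaxed_plan_length):
    return relaxed_plan_length + float(features @ w)

print(adjusted_h(np.array([1., 1., 1.]), 4.0))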
Learning Heuristic Functions from Relaxed Plans
Feature evaluation
RDB from a state S: in(p1,CAR), in(p2,CAR), in(p3,z), in(p4,k), gin(p3,z), gin(p4,z), gin(p2,z), gin(p1,z), move(a,b), unload(p4,z), a_in(CAR,p4), a_in(p4,z), d_in(CAR,a), d_in(p4,CAR)
Enumerated taxonomic syntax (from the domain definition): (in * car), (cin * location), (unload * location), (d_in * CAR)
Taxonomic expressions evaluated on the RDB: (in * car) = {p1, p2}, value 2; (d_in * CAR) = {p4}, value 1
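A small sketch of how such taxonomic expressions could be evaluated against a ground relational database (restricted, for illustration, to expressions of the form (pred * Class)).

# Sketch: evaluate taxonomic expressions like (in * car) against a relational DB.
# (in * car) denotes the objects x such that in(x, y) holds for some y of class car.

def evaluate(pred, cls, rdb, classes):
    """Objects x with pred(x, y) for some y in the given class."""
    return {x for (p, x, y) in rdb if p == pred and y in classes[cls]}

rdb = {("in", "p1", "CAR"), ("in", "p2", "CAR"),
       ("in", "p3", "z"), ("in", "p4", "k"),
       ("d_in", "p4", "CAR")}
classes = {"car": {"CAR"}, "location": {"z", "k"}}

print(evaluate("in", "car", rdb, classes))         # {'p1', 'p2'}  -> feature value 2
print(len(evaluate("d_in", "car", rdb, classes)))  # 1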
Heuristic (Value) Function Learning - Summary
Training example generation: solutions from domain-independent planners (FF)
Target value: the difference between the remaining plan (real plan length) and the relaxed plan length
Features: taxonomic syntax automatically constructed from the state and plangraph
Background knowledge: predicates in the domain
Target representation: linear combination of features
Learning method: least-squares optimization
(Potential) future extension: beam-search learning, oversubscribed planning
L2P – Search Control – Heuristic Learning (EBL with an incomplete-domain aspect)
The idea of using the relaxed plan in the plangraph as the feature space is related to EBL
The plangraph is a partial explanation of the potential plan
The learning finds flaws in the plangraph (or explanation) from training examples
Thus, this approach is a good example of EBL with a weak or incomplete domain theory
Here the relaxed operators are the incomplete domain theory
L2P – Search Control – Beam Search
(Figure: beam search of size 3: expand the neighbors of the nodes in the current beam, sort, and keep the best 3; S2 is the next state on the solution trajectory)
H(s) = Σ wi * fi(s); increase wi where fi is true for S2 and decrease wi where fi is false for S2
Xu et al., IJCAI 2007
Heuristic (Value) Function Learning – Beam Search - Summary
Training example generation: solutions from domain-independent planners (FF)
Target value: a value function that can induce the plan with beam search – we prefer smaller beams
Features: taxonomic syntax automatically constructed from the state and plangraph
Background knowledge: predicates in the domain
Target representation: linear combination of features
Learning method: perceptron update
Stage for Oversubscribed Planning
Oversubscribed planning has a lot in common with optimization (AAAI 07 tutorial by Do, Zimmerman and Kambhampati)
Research question: can we use machine learning for optimization techniques in oversubscribed planning problems?
How can learning be involved? What will be the feature space? Will this be domain learning, or problem-specific learning?
STAGE Algorithm
Given a search trajectory S1, S2, ……, Sn, what is the training value for Si?
V(Si) = min Obj(Sk), where k > i. This is a bit similar to no-discount TD learning … the difference is ….
For oversubscribed planning, we can use the following scheme: V(Si) = min Obj(Sk), where Sk is in the subtree below Si.
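A tiny sketch of the two target-value schemes, with toy objective values: along a linear trajectory, V(Si) is the best (minimum) objective value after Si; in the oversubscribed-planning variant it is the best value in the subtree below Si.

# Sketch: STAGE-style training targets.

def targets_along_trajectory(obj_values):
    """V(S_i) = min over k > i of Obj(S_k) for a linear trajectory."""
    out, best = [], float("inf")
    for v in reversed(obj_values[1:]):          # values strictly after each position
        best = min(best, v)
        out.append(best)
    return list(reversed(out)) + [None]         # last state has no successor

def target_in_subtree(tree, node, obj):
    """V(S_i) = min Obj over the subtree below S_i (oversubscribed-planning variant)."""
    best = obj[node]
    for child in tree.get(node, []):
        best = min(best, target_in_subtree(tree, child, obj))
    return best

print(targets_along_trajectory([9, 7, 8, 4, 6]))   # [4, 4, 4, 6, None]
tree = {"r": ["a", "b"], "a": ["c"]}
obj = {"r": 9, "a": 7, "b": 8, "c": 4}
print(target_in_subtree(tree, "r", obj))           # 4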
Stage for Oversubscribed Planning
Stage for optimization vs. Stage for oversubscribed planning:
  Original search: guided by an objective function, e.g., the number of bins used in a bin-packing problem (provided by a human) vs. guided by a reachability heuristic
  Features for the new value function: engineered by a human vs. automated features from the domain definition, e.g., state facts or taxonomic features
  Problem-specific adaptation? Yes vs. Yes
  Target value: the best value following the current state vs. the best value in the subtree of the current search node
  Learning: least-squares fit vs. least-squares fit
Heuristic (Value) Function Learning – Oversubscribed Planning - Summary
Training example generation: trajectories generated from heuristic search
Target value: the best utility values found in the subtree under the current node (state)
Features: state facts
Background knowledge: predicates in the domain
Target representation: linear combination of features
Learning method: least-squares optimization
(Potential) future extension: other features, temporal extension
L2P – Search Control – Measures Of Progress
A measure of progress in planning is some measure that monotonically increases (or decreases) along good plans – Parmar, AAAI 02
Examples: the number of blocks in good towers; the number of packages at their goal locations
Planning can be easier if we know such a measure: we can safely use an enforced hill-climbing approach
The question is how we automatically find such a measure
L2P – Search Control – Measures Of Progress
- Given solution trajectories, a training example is a pair of consecutive states in the trajectories; let the set of such pairs be J
- Find a measure l that increases most over J
- Add l to the tail of the measure list L
- Remove the pairs of states covered by l from J, set the new J, and go back until J is empty
- Again the trick is using a KR that is well suited for planning (Yoon et al., AAAI 2005)
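A rough sketch of that greedy loop, with invented numeric measures standing in for the taxonomic ones used in the cited work.

# Sketch: greedily learn an ordered list of "measures of progress" from solution trajectories.
# A candidate measure is a function state -> number; a pair (s, s') is covered if m(s') > m(s).

def learn_measures(pairs, candidates):
    ordered, remaining = [], list(pairs)
    while remaining:
        best = max(candidates, key=lambda m: sum(m(s2) > m(s1) for s1, s2 in remaining))
        covered = [(s1, s2) for s1, s2 in remaining if best(s2) > best(s1)]
        if not covered:
            break                       # no candidate makes progress on the rest
        ordered.append(best)
        remaining = [p for p in remaining if p not in covered]
    return ordered

# Illustrative blocksworld-like states: (#blocks in good towers, #blocks being held)
pairs = [((0, 1), (1, 0)), ((1, 0), (1, 1)), ((1, 1), (2, 0))]
good_tower = lambda s: s[0]
holding    = lambda s: s[1]
measures = learn_measures(pairs, [good_tower, holding])
print(len(measures))   # 2: good_tower covers two pairs, holding covers the remaining one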
Heuristic (Value) Function Learning – Measure of Progress - Summary
Training example generation: solutions from domain-independent planners (FF)
Target value: a monotonic function that increases along the plan
Features: taxonomic features
Background knowledge: predicates in the domain
Target representation: ordered list of value functions
Learning method: greedy set covering
(Potential) future extension: hierarchical decomposition
Approaches for Learning Search Control
Improve an existing planner ("speedup learning"):
  Learn rules to guide choice points
  Learn plans to reuse (macros, annotated cases)
  Learn adjustments to heuristics
Learn "from scratch" how to plan:
  Learn "reactive policies": State x Goal → action
  [Work by Khardon, 99; Winner & Veloso, 2002; Fern, Yoon and Givan, 2003; Gretton & Thiebaux, 2004]
Outline
Learning Search Control
(Lessons from Knowledge-Based Planning Track)
Control Rules, Macros, Reuse
Improved Heuristics, Policies
Learning Domain Models
(Model-lite Planning)
Learning action preconditions/effects
Learning hierarchical schemas
Motivation and the Big Picture
Very Brief Review of planning for learning folks & learning for planning folks
L2P – Learning Policy
What is a policy? A state-to-action mapping
What does a policy mean for planning problems? If the policy applies to any problem, then it is a domain-specific planner
Any problem in a planning domain is a state; the resulting domain is then not connected
Can we then apply standard MDP techniques to planning domains? No
L2P – Policy Learning
L2P – Policy Learning – Rule Learning
Khardon, MLJ '99, provided a theoretical proof [L2ACT]: if one can find a deterministic strategy that produces trajectories close to the training trajectories, then the strategy can perform as well as the provider of the training trajectories
One can view the strategy as a deterministic policy in an MDP; a policy is a mapping from states to actions
Martin and Geffner, KR 2000, developed a policy learning system for Blocksworld
  Showed the importance of KR; used description logic to compactly represent the "good tower" concept
Yoon, Fern and Givan, UAI '02, developed a policy learning system for first-order MDPs
These systems use decision-rule representations
L2P – Policy Learning – Rule Learning
(Blocksworld figure, goal: any tower; state-action pairs along the solution trajectory, e.g., Pickup Red then Stack Red Blue, are positive examples, while alternative actions in those states, e.g., Pickup Blue, Pickup Yellow, Putdown Red, Unstack Red Blue, Stack Yellow Red, Putdown Yellow, are negative examples)
Though Pickup Red was selected in the first state, other actions are equally good
L2P – Policy Learning – Rule Learning
Treat each state-action pair from the solution trajectories separately; let the set of pairs be J
Try to find the action-selection rule l that covers J best
Add l to the tail of L, the decision list
Remove the state-action pairs covered by rule l; set the new J and go back until there are no remaining state-action pairs
Rule-learning techniques are among the most successful learning techniques for planning in the modern era
Martin and Geffner / Yoon, Fern and Givan showed the importance of KR
The learning technique can be applied to any reactive-style control, so it can be applied to POMDPs and stochastic planning as well as conformant or temporal planning
Future approaches: develop a KR that suits conformant or temporal planning well and apply the learning technique
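A minimal sketch of this covering loop as a Rivest-style decision list, with toy propositional conditions standing in for the taxonomic/description-logic features used in the cited systems.

# Sketch: learn a decision list of action-selection rules from solution trajectories.
# A rule is (condition_feature, action): "if the feature holds in the state, do the action".

def learn_decision_list(pairs, candidate_rules):
    """pairs: list of (state_features, chosen_action); returns an ordered rule list."""
    decision_list, remaining = [], list(pairs)
    while remaining:
        best = max(candidate_rules,
                   key=lambda r: sum(r[0] in feats and act == r[1] for feats, act in remaining))
        covered = [(f, a) for f, a in remaining if best[0] in f and a == best[1]]
        if not covered:
            break
        decision_list.append(best)
        remaining = [p for p in remaining if p not in covered]
    return decision_list

def apply_policy(decision_list, state_features):
    for cond, action in decision_list:
        if cond in state_features:
            return action
    return None

pairs = [({"holding(x)", "clear(y)"}, "stack"),
         ({"ontable(x)", "clear(x)", "armempty"}, "pickup"),
         ({"holding(x)"}, "stack")]
rules = [("holding(x)", "stack"), ("armempty", "pickup"), ("clear(x)", "stack")]
dl = learn_decision_list(pairs, rules)
print(dl)                                    # [('holding(x)', 'stack'), ('armempty', 'pickup')]
print(apply_policy(dl, {"holding(x)"}))      # stack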
L2P – Search Control – How to Use
Machine-generated policies may not be complete
Is there a way to intelligently leverage machine-learned policies?
  Discrepancy search: follow the policy most of the time, except for a limited number of deviations
  MaxSAT approach to learned SATPlan control knowledge
Using discrepancy search within heuristic search produced better results
  When a node A is being expanded, the learned policy is applied to the node
  All the nodes that occur along the policy from A are added to the search queue
  This is different from YAHSP- or MacroFF-style search; the intention is to find flaws of the input policy and fix them
  Yoon, Fern and Givan, IJCAI 07, reported successful experimental results
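A small sketch of the queue-insertion idea (an illustration, not the cited implementation): when a node is expanded, also roll out the learned policy from it for a fixed horizon and push every visited state onto the open list.

# Sketch: best-first search that also inserts states visited by a learned policy rollout.
import heapq

def search_with_policy(initial, goal_test, successors, h, policy, horizon=5, limit=10000):
    open_list = [(h(initial), initial, [])]
    seen = set()
    while open_list and limit:
        limit -= 1
        _, state, plan = heapq.heappop(open_list)
        if goal_test(state):
            return plan
        if state in seen:
            continue
        seen.add(state)
        # normal expansion
        for a, s2 in successors(state):
            heapq.heappush(open_list, (h(s2), s2, plan + [a]))
        # policy rollout from the expanded node: add every state along the rollout
        s, p = state, plan
        for _ in range(horizon):
            a = policy(s)
            if a is None:
                break
            s = dict(successors(s)).get(a)
            if s is None:
                break
            p = p + [a]
            heapq.heappush(open_list, (h(s), s, p))
    return None

# Toy domain: walk from 0 to 10; policy says "always +1", heuristic is distance to 10.
succ = lambda s: [("+1", s + 1), ("-1", s - 1)]
plan = search_with_policy(0, lambda s: s == 10, succ, lambda s: abs(10 - s),
                          policy=lambda s: "+1")
print(plan)   # a ten-step plan of '+1' actions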
Using Policy in Heuristic Search
(Figure: node S1 is expanded into neighbors S1-1 … S1-6; the input policy π is also executed from it for some horizon H, and the visited nodes, e.g., S1-H-6, are added to the search queue and sorted)
Challenges and Solutions
• Domain knowledge in multiple modalities: use multiple ILRs customized to different types of knowledge
• Learning at multiple time-scales: combine eager (e.g., EBL-style) and lazy (e.g., CBR-style) learning techniques
• Handling partially correct domain models and explanations: use local closed-world assumptions
• Avoiding balkanization of learned knowledge: use structured explanations as a unifying "glue"
• Meeting explicit learning goals: use goal-driven meta-learning techniques
• The goal-driven, explanation-based learning approach of GILA gleans and exploits knowledge in multiple natural formats: a single example is analyzed under the lenses of multiple "ILRs" to learn/improve planning operators, task networks, planning cases, and domain uncertainty
GILA - Performance Review
Scenario | DTL | SPL | CBL | QOS Average | QOS Median
1 | 39% | 47% | 8% | 61.4 | 70
2 | 49% | 27% | 19% | 40.0 | 10
3 | 26% | 52% | 17% | 55.6 | 70
4 | 23% | 77% | 0% | 76.7 | 80
5 | 43% | 51% | 6% | 65.0 | 70
6 | 32% | 63% | 0% | 83.8 | 90
DTL, SPL, and CBL numbers: the percent contribution of each ILR component, i.e., the percentage of the used solution actually suggested by that component
QOS means the quality of each Pstep
As SPL (the policy learner) participates more in the solution, the QOS improves
Policy Learning - Summary
Training example generation: solutions from an automated planner (or all the optimal actions)
Positive vs. negative examples: positive = actions in the solution plans; negative = actions not in the solution plans (reversed for rejection-rule learning)
Features: relational features, taxonomic features
Background knowledge: predicates in the domain
Target representation: decision list
Learning method: Rivest-style decision list learning
(Potential) future extension: apply to temporal planning, oversubscribed planning
L2P – Policy Learning- RRL [RRL] is a relational version of
reinforcement learning – Dzeroski, De Raedt, and Driessens, MLJ, 2001
Has been successfully applied to some versions of Blocksworld and games
Used TILDE, a relational tree-regression technique, to learn Q-value functions, which score state-action pairs
Later, direct policy learning was shown to be better than Q-value learning
L2P – Policy Learning - RRL
(Blocksworld example with discounted values propagated back from the goal:
 1.0: (stack Yellow Red) in state (holding Yellow) (on Red Blue) (clear Red)
 0.9: (pickup Yellow) in state (on Red Blue) (ontable Yellow) (clear Red) (clear Yellow)
 0.81: (stack Red Blue) in state (holding Red) (ontable Blue) (ontable Yellow) (clear Blue)
 0.72: (pickup Red) in state (ontable Red) (ontable Blue) (ontable Yellow))
L2P – Policy Learning - RRL
Merit: we don't have to worry about positive/negative examples from trajectories
Model-free: RRL does not need a domain model, and does not need a teacher
Slow convergence: especially when it is very hard to find goals (we will deal with this in the next technique)
Future approaches: concept-language-based reinforcement learning techniques
Policy Learning – RRL - Summary
Training example generation: trajectories of the current exploration
Target value: discounted reward
Features: relational features
Background knowledge: predicates in the domain
Target representation: relational decision tree
Learning method: TILDE-RT
(Potential) future extension: extend the approach with a richer feature space
L2P – Policy Learning – Automated Domain Solver
What if we have a good policy learner but no teacher who provides solution trajectories for the target domain? How do we get training data?
Fern, Yoon and Givan, JAIR '06, developed an interesting technique based on the random-walk idea
  Easy problems can be generated with a small random walk: the end of the random walk is the goal state
  As the random-walk length increases, the problems become harder
  It is not guaranteed that the random-walk-generated problem set is close to the real distribution of planning problems (consider FreeCell); however, in practice this idea produced good results across benchmark domains
Use the approximate policy iteration technique
The technique has been successfully applied to both deterministic and probabilistic planning domains
L2P – Policy Learning – Automated Domain Solver
8-puzzle problem
(Figure: taking one random action from the goal configuration yields an easy problem whose initial state is one step from the goal; taking many random actions yields a hard, scrambled initial state)
L2P – Policy Learning – Automated Domain Solver
Increase the random-walk length until the current policy can only solve less than 80% of the random-walk-generated problem set
Update the current policy using the approximate policy iteration technique
(Loop: random-walk length RWL → current policy → updated policy)
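A compact sketch of this bootstrapping loop on a toy domain; the 80% threshold comes from the slide, while the domain, policy, and doubling schedule are illustrative assumptions (the real system updates the policy with approximate policy iteration at the point marked below).

# Sketch: random-walk problem generation with length control (toy reversible domain).
import random

def random_walk_problem(goal_state, actions, walk_len):
    """Start from the goal and apply walk_len random actions; the end is the initial state."""
    s = goal_state
    for _ in range(walk_len):
        s = random.choice(actions)(s)
    return s, goal_state                      # (initial state, goal state)

def solve_rate(policy, problems, apply_action, max_steps=50):
    solved = 0
    for init, goal in problems:
        s = init
        for _ in range(max_steps):
            if s == goal:
                solved += 1
                break
            s = apply_action(s, policy(s, goal))
    return solved / len(problems)

# Toy 1-D domain: states are integers, actions shift by +/-1, policy walks toward the goal.
actions = [lambda s: s + 1, lambda s: s - 1]
apply_action = lambda s, a: s + a
policy = lambda s, goal: 1 if goal > s else -1

walk_len = 1
while walk_len <= 32:
    probs = [random_walk_problem(0, actions, walk_len) for _ in range(100)]
    if solve_rate(policy, probs, apply_action) >= 0.8:
        walk_len *= 2                          # problems are easy: make them harder
    else:
        break                                  # here we would update the policy (the API step)
print("reached walk length", walk_len)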
Approximate Policy Iteration: Fern, Yoon and Givan, NIPS '03
(Figure: the approximate policy iteration loop: from the planning domain (problem distribution), draw problems; generate trajectories of the improved policy π'; learn an approximation of π' as the new control policy π; repeat)
Computing π' trajectories from π
Given: the current policy π and a problem
(Figure: from a state s, candidate actions a1, a2, … are evaluated by simulating trajectories under π, using the FF heuristic at the cut-off states)
Output: a trajectory under the improved policy π'
API – Random Walk - Summary
Training example generation: policy rollout with random-walk length control
Positive vs. negative examples: positive = actions deemed to be best in the policy rollout simulation from the current state; negative = actions deemed to be worse than the best action
Features: taxonomic features
Background knowledge: predicates in the domain
Target representation: decision list
Learning method: Rivest-style decision list learning
(Potential) future extension: apply to temporal planning, oversubscribed planning, and ORTS
Outline
Learning Search Control (Lessons from Knowledge-Based Planning Track)
Control Rules, Macros, Reuse
Improved Heuristics, Policies
Learning Domain Models (Model-lite Planning)
Learning action preconditions/effects
Learning hierarchical schemas
Motivation and the Big Picture
Very Brief Review of planning for learning folks & learning for planning folks
Learning Domain Knowledge
Learning from scratch
Operator Learning
Operationalizing existing knowledge
EBL-based operationalization [Levine/DeJong; 2006]
RL for focusing on "interesting parts" of the model … lots of people including [Aberdeen et al., 06]
Outline
Learning Search Control
(Lessons from Knowledge-Based Planning Track)
Control Rules, Macros, Reuse
Improved Heuristics, Policies
Learning Domain Models
(Model-lite Planning)
Learning action preconditions/effects
Learning hierarchical schemas
Motivation and the Big Picture
Very Brief Review of planning for learning folks & learning for planning folks
Question: you have a super-fast planner and a target application domain, say FreeCell. What is the first problem you have to solve? Is it the first FreeCell problem?
(Gently) Questioning the Assumption
There are many scenarios where domain modeling is the biggest obstacle Web Service Composition
Most services have very little formal models attached Workflow management
Most workflows are provided with little information about underlying causal models
Learning to plan from demonstrations We will have to contend with incomplete and evolving
domain models..
..but our applications assume complete and correct models..
The way to get more applications is to tackle more and more expressive domains
Model-Lite Planning is planning with incomplete models, where "incomplete" means "not enough domain
knowledge to verify correctness/optimality"
How incomplete is incomplete? Missing a couple of
preconditions/effects?
Knowing no more than I/O types?
Challenges in Realizing Model-Lite Planning
1. Planning support for shallow domain models
2. Plan creation with approximate domain models
3. Learning to improve completeness of domain models
Learning Domain Knowledge (from observation): Learning Operators (Action Models)
Given a set of [Problem; Plan (operator sequence)] examples and the space of domain predicates (fluents), induce operator descriptions
Operators will have more parameters in expressive domains: durations and time points, probabilities of outcomes, etc.
Dimensions of variation:
  Availability of intermediate states (complete or partial): makes the problem easy, since we can learn each action separately; unrealistic (especially "complete" states)
  Availability of partial action models: makes the problem easier by biasing the hypotheses (we can partially explain the correctness of the plans); reasonably realistic
  Interactive learning in the presence of humans: makes it easy for the human in the loop to quickly steer the system away from patently wrong models
Spectrum of Approaches
Target knowledge: Search Control | Policy | Value Function | Macro / Subgoal | Domain Definition | HTN
  Classic (probabilistic) planning: Y Y Y Y Y Y
  Oversubscribed planning: Y
  Temporal planning:
  Partially observable: Y
  ORTS: Y
Learning techniques: EBL | ILP | Perceptron / Least Square | Set Covering / EM | Kernel Method | Bayesian
  Classic (probabilistic) planning: Y Y Y Y Y Y
  Oversubscribed planning: Y
  Temporal planning:
  Partially observable: Y
  ORTS:
L2P – Domain Learning – Logical Filtering
Chang and Amir, ICAPS ‘05, applied logical filtering approach to learning domain transition models
Maintain both Belief state and Domain Transition Models
Update the belief state and domain transition models with logical filtering
Thus a belief state in this work is a pair of belief state and transition model
The approach has been successfully applied to propositional domains
Logical filtering can be a good candidate for domain learning
L2P – Domain Learning – Learn Probabilistic Operators
Zettlemoyer, Pasula and Kaelbling, AAAI 2005, learned probabilistic planning operators from a simulated blocksworld; the operators include preconditions and effects and use a deictic representation
pickup(X):
  deictic references: Y: on(X, Y); Z: table(Z)
  precondition: inhand-nil
  outcomes:
    .80 : ¬on(X, Y), inhand(X), ¬inhand-nil, clear(Y)
    .10 : ¬on(X, Y), on(X, Z), clear(Y)
    .10 : no change
(where Y is defined as a deictic variable)
L2P – Domain Learning – Learn Probabilistic Operators
LearnRuleSet(E)
Inputs: training examples E
Computation:
  Initialize rule set R to contain only the default rule
  While better rule sets are found:
    For each search operator O:
      Create new rule sets with O: RO = O(R, E)
      For each rule set R0 in RO:
        If the score improves (S(R0) > S(R)):
          Update the new best rule set, R = R0
Output: the final rule set R
The learned operators were tested by planning with the operators. With learned operators, the planner could perform well on the task of stacking blocks
There are 8 methods for the enumeration of the new rule set. One of them is EBL
The learning and planning system involves visual interpretation and rigid body models. Thus very close to real world environment
(Figure: Learn Rule Set: starting from the initial rule set, 8 search operators generate candidate rule sets; the best rule set at each step is decided by a learning heuristic scored against the training examples)
S(R) = Σ_{(s,a,s') ∈ E} log P̂(s' | s, a, r_(s,a)) − α Σ_{r ∈ R} PEN(r)
Probabilistic Operator Learning - Summary
Training example generation: trajectories from random wandering
Positive vs. negative examples: positive = observed facts; negative = non-observed facts
Features: relational-deictic representation
Background knowledge: predicates in the domain
Target representation: deictic operator representation
Learning method: heuristic search
ARMS (Doesn’t assume intermediate states; but requires action parameters)
Idea: See the example plans as “constraining” the hypothesis space of action models
The constraints can be modeled as SAT constraints (with variable weights)
Best hypotheses can be generated by solving the MAX-SAT instances
Performance judged in terms of whether the learned action model can explain the correctness of the observed plans (in the test set)
Constraints:
Actions' preconditions and effects must share action parameters
Actions must have non-empty preconditions and effects; actions cannot add back what they require; actions cannot delete what they didn't ask for
For every pair of frequently co-occurring actions ai-aj, there must be some causal reason, e.g., ai must be giving something to aj, OR ai is deleting something that aj gives
[Yang et al., 2005]
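As a rough illustration of this encoding idea (not the actual ARMS encoding), the sketch below introduces Boolean variables such as pre(a, p), add(a, p), del(a, p), writes a few plan-derived constraints as weighted clauses, and solves the resulting weighted MAX-SAT instance by brute force. The variables, clauses, and weights are invented for the example; a real system would use a dedicated MAX-SAT solver.

# Toy weighted MAX-SAT over action-model hypotheses.
from itertools import product

variables = [
    "pre(unstack, arm-empty)",
    "del(unstack, arm-empty)",
    "add(putdown, arm-empty)",
]

# Each clause is (weight, list of literals); a literal (var, polarity) is
# satisfied when the assignment gives var that polarity.
weighted_clauses = [
    # unstack and putdown co-occur in observed plans: if unstack deletes
    # arm-empty, putdown should re-establish it (a causal-link style clause)
    (2.0, [("del(unstack, arm-empty)", False), ("add(putdown, arm-empty)", True)]),
    # actions cannot delete what they did not require
    (3.0, [("del(unstack, arm-empty)", False), ("pre(unstack, arm-empty)", True)]),
    # soft evidence gathered from the example plans
    (1.0, [("pre(unstack, arm-empty)", True)]),
    (1.0, [("del(unstack, arm-empty)", True)]),
]

def satisfied(literals, assignment):
    return any(assignment[var] == polarity for var, polarity in literals)

def brute_force_maxsat(variables, clauses):
    best, best_weight = None, float("-inf")
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        w = sum(weight for weight, lits in clauses if satisfied(lits, assignment))
        if w > best_weight:
            best, best_weight = assignment, w
    return best, best_weight

if __name__ == "__main__":
    model, weight = brute_force_maxsat(variables, weighted_clauses)
    print(weight, model)   # the best-scoring action-model hypothesis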
Algorithm Execution
(unstack ?x ?y) Precondition: (on ?x ?y) (clear ?x) (arm-empty) (on ?y ?z)
Effect: (clear ?x)
(Putdown ?x) Precondition:
Effect:
(on a b)(on b c)(clear a)(on-table c)(arm-empty)
(clear c)
Unstack a b Putdown a Unstack b c
Algorithm Execution
(on a b)(on b c)(clear a)(on-table c)(arm-empty)
(clear c)
Unstack a b Putdown a Unstack b c
(on b c)(clear b)(arm-empty)
(clear b)(on-table c)(on a b)(on b c)(arm-empty)(clear a)
(unstack ?x ?y) Precondition: (on ?x ?y) (clear ?x) (arm-empty)
Effect: (clear ?y) (not (clear ?x))
(Putdown ?x) Precondition:
Effect:
Since (on a b) holds, a and b cannot both be clear
Algorithm Execution
(on a b)(on b c)(clear a)(on-table c)(arm-empty)
(clear c)
Unstack a b Putdown a Unstack b c
(on b c)(clear b)(arm-empty)
(clear b)(on-table c)(on a b)(on b c)(arm-empty)
(unstack ?x ?y) Precondition: (on ?x ?y) (clear ?x) (arm-empty)
Effect: (clear ?y) (not (clear ?x)) (not (arm-empty))
(Putdown ?x) Precondition:
Effect:
If Unstack b c could already be executed in the second state, Putdown a would be unnecessary; so some precondition of Unstack b c must not be met there
Algorithm Execution
(on a b)(on b c)(clear a)(on-table c)(arm-empty)
(clear c)
Unstack a b Putdown a Unstack b c
(on b c)(clear b)(arm-empty)
(clear b)(on-table c)(on a b)(on b c)
(unstack ?x ?y) Precondition: (on ?x ?y) (clear ?x) (arm-empty)
Effect: (clear ?y) (not (clear ?x)) (not (arm-empty))
(Putdown ?x) Precondition: (not (arm-empty))
Effect: (arm-empty) (on-table ?x)
An action with arguments must have, among its effects, predicates mentioning those arguments
Algorithm Execution
(on a b)(on b c)(clear a)(on-table c)(arm-empty)
(clear c)
Unstack a b Putdown a Unstack b c
(on b c)(clear b)(arm-empty)(on-table a)
(clear b)(on-table c)(on a b)(on b c)
(unstack ?x ?y) Precondition: (on ?x ?y) (clear ?x) (arm-empty)
Effect: (clear ?y) (not (clear ?x)) (not (arm-empty))
(Putdown ?x) Precondition: (not (arm-empty)) (holding ?x)
Effect: (arm-empty) (on-table ?x)
Algorithm Execution
(on a b)(on b c)(clear a)(on-table c)(arm-empty)
(clear c)
Unstack a b Putdown a Unstack b c
(on b c)(clear b)(arm-empty)(on-table a)(holding a)
(clear b)(on-table c)(on a b)(on b c)(holding a)
(unstack ?x ?y) Precondition: (on ?x ?y) (clear ?x) (arm-empty)
Effect: (clear ?y) (not (clear ?x)) (not (arm-empty))
(Putdown ?x) Precondition: (not (arm-empty)) (holding ?x)
Effect: (arm-empty) (on-table ?x) (not (holding ?x))
(on-table ?x) and (holding ?x) cannot hold simultaneously
ARMS - Summary
Training Example Generation: Problems and solution plans
EM: Observed: actions, initial state and goal; Non-observed: state facts
Features: Predicates
Background Knowledge: Action schema
Target Representation: PDDL (STRIPS)
Learning Method: EM
L2P – Domain Learning - MLN
We can use an existing machine learning package such as Markov Logic Networks (MLN) to learn domain operators. Yoon and Kambhampati (ICAPS '07 workshop) showed learning and planning approaches based on MLN.
Learning:
Separate precondition axioms and effect axioms (this style of axiomatization has been used by Kautz and Selman)
Update the axioms from observations using an MLN tool: Action -> Precondition (in the current state); Action -> Effect (next state)
Can use the readily available MLN package, Alchemy
Planning:
Construct a probabilistic plangraph with the learned axioms and view the plangraph as a Bayes net
Preconditions and effects are conditional upon actions; prior action probabilities are specified as 0.5
View the initial state and goal state as evidence variables and solve for the MPE
L2P – Domain Learning - MLN
Operators can be represented with probabilistic logic, i.e., a Markov Logic Network (MLN):
Precondition axiom: Action -> Precondition (relation between the current state and the action taken in that state)
Effect axiom: Action -> Effect (relation between the current action and the next state)
After training, the axioms carry weights: frequently verified axioms will have higher weights, while never-observed axioms will have lower weights
If random wandering produced the trajectory S1, A1, ..., Sn:
(S1, A1), ..., (Sn-1, An-1) are training examples for the precondition axioms
(A1, S2), ..., (An-1, Sn) are training examples for the effect axioms
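A small Python sketch of assembling these two kinds of training examples from a single trajectory is shown below. Converting the pairs into grounded MLN axioms (the job of a tool like Alchemy) is not shown, and the helper name and toy states are assumptions for illustration.

def make_training_pairs(trajectory):
    """Split a trajectory S1, A1, S2, ..., Sn into the two kinds of examples
    described above.

    trajectory: list [s1, a1, s2, a2, ..., sn] alternating states and actions.
    Returns (precondition_examples, effect_examples): precondition examples
    pair each action with the state it was taken in; effect examples pair
    each action with the state that followed it.
    """
    states = trajectory[0::2]
    actions = trajectory[1::2]
    precondition_examples = list(zip(states[:-1], actions))   # (S_i, A_i)
    effect_examples = list(zip(actions, states[1:]))           # (A_i, S_{i+1})
    return precondition_examples, effect_examples

if __name__ == "__main__":
    s1 = {"armempty", "ontable R", "clear R"}
    s2 = {"holding R", "ontable Y"}          # toy successor state
    pre, eff = make_training_pairs([s1, "pickup R", s2])
    print(pre)   # [( {armempty, ...}, 'pickup R')]
    print(eff)   # [('pickup R', {holding R, ...})]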
L2P - Domain Learning - MLN
(armempty)(ontable Y)(ontable R)(ontable B)(clear R)(clear B)(clear Y)
(holding R)(clear Y)(clear B)(ontable Y)(ontable B)
Pickup R
Precondition Axiom: (Pickup ?x) → (armempty), weight 0.5 → 0.7
Effect Axiom: (Pickup ?x) → NOT (armempty), weight 0.5 → 0.7
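The weight movement shown above (0.5 to 0.7) can be mimicked with a simple perceptron-style update, which is also the learning method listed in the summary below. The sketch here is an assumed illustration of such an update, not the actual learner; the learning rate is chosen so that one verified observation moves a weight from 0.5 to 0.7.

def perceptron_update(weights, axioms_verified, axioms_violated, rate=0.2):
    """Toy perceptron-style weight update for precondition/effect axioms.

    weights:         dict axiom-name -> current weight
    axioms_verified: axioms consistent with the observed transition
    axioms_violated: axioms contradicted by the observed transition
    """
    for a in axioms_verified:
        weights[a] = weights.get(a, 0.5) + rate
    for a in axioms_violated:
        weights[a] = weights.get(a, 0.5) - rate
    return weights

if __name__ == "__main__":
    w = {"(Pickup ?x) -> (armempty)": 0.5,
         "(Pickup ?x) -> NOT (armempty) [next state]": 0.5}
    # Observed: (armempty) held when Pickup R was taken, and did not hold after
    w = perceptron_update(w, axioms_verified=list(w.keys()), axioms_violated=[])
    print(w)   # both weights move from 0.5 to 0.7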
Planning for Model-lite domains
Even when the underlying domain is deterministic, planning with an incomplete (model-lite) model can be probabilistic:
Diverse plans
Conformant planning
Toward Model-lite Planning - Summary
Training Example Generation: Trajectories generated from random walks
Positive Examples vs. Negative Examples: Positive examples are facts observed; negative examples are facts not observed
Features: Automatically constructed from the predicate definitions and action schema
Background Knowledge: Can be provided, if needed
Target Representation: Weighted logic
Learning Method: Perceptron-based update
Outline
Learning Search Control (lessons from the Knowledge-Based Planning Track): control rules, macros, reuse, improved heuristics, policies
Learning Domain Models (Model-lite Planning): learning action preconditions/effects, learning hierarchical schemas
Motivation and the Big Picture
Very brief review of planning for learning folks & learning for planning folks
Summary
Learning methods have been used in planning both for improving search and for learning domain physics
Most early work concentrated on search; most recent work is concentrating on learning domain physics, largely because we seem to have a very good handle on search
The most effective learning methods for planning seem to be:
Knowledge based (variants of explanation-based learning have been very popular)
Relational
Many neat open problems...
Spectrum of Approaches
Table (reprise of the Spectrum of Approaches matrix): target knowledge (search control: policy, value function, macro/subgoal; domain definition; HTN) and learning techniques (EBL, ILP, perceptron/least squares, set covering, kernel methods, Bayesian) mapped against planning settings (classical/probabilistic, oversubscribed, temporal, partially observable, ORTS). Classical (probabilistic) planning has entries across all target-knowledge types and learning techniques; oversubscribed, temporal, partially observable, and ORTS planning each have only isolated entries.
Twin Motivations for exploring Learning Techniques for Planning
[Improve Speed] Even in the age of
efficient heuristic planners, hand-crafted knowledge-based planners seem to perform orders of magnitude better
Explore effective techniques for automatically customizing planners
[Reduce Domain-modeling Burden]
Planning Community tends to focus on speedup given correct and complete domain models
Domain modeling burden, often unacknowledged, is nevertheless a strong impediment to application of planning techniques
Explore effective techniques for automatically learning domain models
Any Expertise Solution Any Model Solution
Reprise
Beneficial to both Planning & Learning
From the Planning side:
To speed up the solution process (search control)
To reduce the domain-modeling burden; Model-lite Planning (Kambhampati, AAAI 2007) to support planning with partial domain models
From the Machine Learning side:
A challenging application: planning can be seen as an application of machine learning
However, in contrast to a majority of learning applications, planning requires sequential decisions, relational structure, and use of domain knowledge
It is neither just applied learning nor applied planning but rather a worthy fundamental research goal!
Reprise
References
[DerSNLP] (Ihrig and Kambhampati, AAAI, 1994)
[MLP] Model-lite Planning (Kambhampati, AAAI, 2007)
[RL] Reinforcement Learning: A Survey (Kaelbling, Littman and Moore, JAIR, 1996)
[NDP] Neuro-Dynamic Programming (Bertsekas and Tsitsiklis, Athena Scientific)
Learning-Assisted Automated Planning: Looking Back, Taking Stock, Going Forward (Zimmerman and Kambhampati, AI Magazine, 2003)
STRIPS (Fikes and Nilsson, 1971)
[HAMLET] Lazy incremental learning of control knowledge for efficiently obtaining quality plans. AI Review Journal. Special Issue on Lazy Learning, (Borrajo and Veloso) February 1997
Learning by experimentation: The operator refinement method. (Carbonell and Gil) Machine Learning: An Artificial Intelligence Approach, Volume III, 1990.
[RRL] Relational reinforcement learning. Machine Learning, (Dzeroski, De Raedt and Driessens) 2001.
Learning to improve both efficiency and quality of planning. (Estlin and Mooney) IJCAI, 1997
[TIM] The automatic inference of state invariants in tim. (Fox and Long), JAIR, 1998.
[DISCOPLAN] Discovering state constraints in DISCOPLAN: Some new results. (Gerevini and Schubert), AAAI 2000
[Camel] Camel: Learning method preconditions for HTN planning (Ilghami, Nau, Munoz-Avila and Aha), AIPS, 2002
[SNLP+EBL] Learning explanation-based search control rules for partial order planning. (Katukam and Kambhampati), AAAI, 1994
[L2ACT] Learning action strategies for planning domains. (Khardon) Artificial Intelligence, 1999.
[ALPINE] Learning abstraction hierarchies for problem solving. (Knoblock), AAAI, 1990
[SOAR] Chunking in SOAR: The anatomy of a general learning mechanism. (Laird, Rosenbloom and Newell) 1986.
Machine Learning Methods for Planning. (Minton and Zweben) Morgan Kaufmann, 1993.
[DOLPHIN] Combining FOIL and EBG to speed-up logic programs (Zelle and Mooney), IJCAI, 1993
[TLPlan] Using Temporal Logics to Express Search Control Knowledge for Planning, (Bacchus and Kabanza), AI, 2000
[PDDL] The Planning Domain Definition Language (McDermott)
[Graphplan] Fast Planning Through Planning Graph Analysis (Blum and Furst), AI, 1997
[FF] The FF Planning System: Fast Plan Generation Through Heuristic Search, (Hoffmann and Nebel) JAIR, 2001
[Satplan] Planning as Satisfiability, (Kautz and Selman), ECAI, 1992
[IPPlan] On the use of integer programming Models in AI Planning, (Vossen, Ball, Lotem and Nau), IJCAI, 1999
[SGPlan] Hsu, Wah, Huang and Chen
[Yahsp] A Lookahead Strategy for Heuristic Search Planning (Vidal), ICAPS, 2004
[Macro-FF] Improving AI planning with automatically learned macro operators (Botea, Enzenberger, Muller, and Schaeffer), JAIR, 2005
[Marvin] Online Identification of Useful Macro-Actions for Planning (Coles and Smith), ICAPS, 2007
Learning Declarative Control Rules for Constraint-Based Planning, (Huang, Selman and Kautz), ICML, 2000
[FOIL], FOIL: A Midterm Report, (Quinlan and Cameron-Jones), ECML, 1993
[Martin and Geffner] Learning Generalized Policies in Planning Using Concept Languages, KR, 2000
Inductive Policy Selection for First-Order MDPs, (Yoon, Fern, Givan), UAI, 2002
Learning Measures of Progress for Planning Domains, (Yoon, Fern and Givan), AAAI, 2005
Approximate Policy Iteration with a Policy Language Bias: Learning to Solve Relational Markov Decision Processes, (Fern, Yoon and Givan), JAIR, 2006
Learning Heuristic Functions from Relaxed Plans , (Yoon, Fern and Givan), ICAPS, 2006
Using Learned Policies in Heuristic-Search Planning , (Yoon, Fern and Givan), IJCAI, 2007
Goal Achievement in Partially Known, Partially Observable Domains (Chang and Amir), ICAPS, 2006
Learning Planning Rules in Noisy Stochastic Worlds (Zettlemoyer, Pasula, and Kaelbling), AAAI, 2005
[ARMS] Learning Action Models from Plan Examples with Incomplete Knowledge, (Yang, Wu and Jiang), ICAPS, 2005
Towards Model-lite Planning: A Proposal For Learning & Planning with Incomplete Domain Models (Yoon and Kambhampati), ICAPS Workshop on Learning and Planning, 2007
Markov Logic Networks (Richardson and Domingos), 2006, MLJ
Learning Recursive Control Programs for Problem Solving (Langley and Choi), 2006, JMLR
[HDL] Learning to do HTN Planning (Ilghami, Nau and Munoz-Avila), 2006, ICAPS
Task Decomposition Planning with Context Sensitive Actions (Barrett), 1997