Optimizing Recursive Information Gathering Plans
Eric Lambrecht, Subbarao Kambhampati
Senthil Gnanaprakasam
Arizona State University
Tempe, USA
rakaposhi.eas.asu.edu/yochan.html
Information Gathering
[Diagram: user, Gatherer, wrappers, and sources (<html>, cgi, db)]
EMERAC Query Planning System
Build query plan using source inversion [Duschka (with Genesereth & Levy) 97]
Logical Optimizations: Redundancy removal
Execution Optimizations: Source call ordering
Execute query plan
[Optimization steps]
Organization
• Optimization challenges in EMERAC
• Building Source Complete Plans: Review
• Logical optimization
– Minimization of recursive IG plans by removing redundant source calls
• Execution optimization
– Ordering source calls to minimize both access and tuple-transfer costs
• Implementation and Results
• Contributions
Modeling Information Gathering
Information sources:
•relational
•answer ‘select’ queries (possibly a restricted set of query patterns)
•autonomous
World model:
•relational
Query on the world model:
Reformulate the query as calls on information sources. Optimize. Execute.
“Local as View” model
Modeling Sources
Sources related to world model by describing them as views over world model:
Required binding: the $ in house-of-movies($X, Y) marks X as an attribute that must be bound.
movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)
house-of-movies($X, Y) -> title-time(X, Y), title-actor(X, Z)
query(X, Y) :- title-time(X, Y)
Optimization challenges in EMERAC
Traditional:
• Each relation is exported in toto by a single database
• All sources are assumed to be fully relational
Information Gathering:
• Multiple sources export partial and overlapping portions of a relation
– Need to minimize plans to remove redundancy
• Sources are rarely fully relational
– Only limited types of queries allowed
• Wrapped web-pages
• Form-interfaced databases
• Certain forms of join computation may be precluded
– Need to model query capabilities
Optimization challenges in EMERAC [Continued]
Traditional:
• Tuple-transfer costs are assumed to dominate the query-execution costs
– Use of “Bound-is-easier” assumption
• Assume availability of full source statistics
– Selectivity indices, histograms etc.
Information Gathering:
• Access cost & source latencies tend to equal or dominate the transfer cost
– Need to consider number of source calls
– Need for considering bushy joins (instead of just left-linear join trees)
• Full statistics are rarely available about internet sources
– Sources are decentralized and autonomous
– Difficult to do systematic optimization
Source Access Limitations
• Sources can have a variety of access limitations
– Form-interfaced databases may require certain attributes to be bound
• Whitepages may require the name of the person
– To get the numbers of a set of n people, we will have to access the source n times
– and may be unable to handle bindings of other attributes
• A Whitepages database may not take the address of a person as a bound attribute
– To get the number of John Doe, who lives on Lemon St, we will have to get the numbers of all John Does, and locally filter out the ones not living on Lemon Street
– Wrapped web-pages cannot select over any attributes
Representing Source Access Limitations
• Use annotations on the attributes of the source relation
– “$” annotation identifies attributes that must be bound
– “%” annotation identifies un-selectable attributes
• S($X,%Y,Z)
– A form-interfaced web-page that requires bindings for X and is able to do selections only on Z.
• $ and % annotations help identify feasible binding patterns for sources
– S^b-- are feasible; S^f-- are infeasible
– S^bbf must be modeled as S^bff filtered locally with binding on Y
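The way $ and % annotations constrain binding patterns can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `plan_source_call` and the list encoding of annotations are hypothetical, not part of EMERAC.

```python
def plan_source_call(annotations, pattern):
    """Map a requested binding pattern onto what the source can actually do.

    annotations: one entry per attribute: "$" (must be bound),
                 "%" (un-selectable), or "" (no restriction).
    pattern:     requested binding pattern, e.g. "bbf" (b = bound, f = free).
    Returns (pattern_sent_to_source, positions_filtered_locally),
    or None when a "$" attribute is left free (infeasible pattern).
    """
    sent, local = [], []
    for pos, (ann, flag) in enumerate(zip(annotations, pattern)):
        if ann == "$" and flag == "f":
            return None              # a required binding is missing
        if ann == "%" and flag == "b":
            sent.append("f")         # the source cannot select on this attribute
            local.append(pos)        # so the binding is applied locally afterwards
        else:
            sent.append(flag)
    return "".join(sent), local


# S($X,%Y,Z) from the slide, encoded as a list of annotations (assumed encoding)
S = ["$", "%", ""]
print(plan_source_call(S, "bbf"))    # ('bff', [1])
print(plan_source_call(S, "fbf"))    # None
```

The first call mirrors the slide: a bbf request is sent to the source as bff and the binding on Y (position 1) is applied as a local filter.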
Properties of optimal information gathering plans
• Source-complete: no other plan returns more information using the available sources
• Source-minimal: a plan from which no information source can be removed while still returning the same answer.
• Access-cost minimal: a plan which reduces the number of separate accesses to individual sources
• Bandwidth-minimal: a plan that, when executed, transfers the smallest amount of data over the network yet is still source complete
Ensuring properties of optimal information gathering plans
Build query plan [Source completeness]
Logical Optimizations [Source-minimality]
Execution Optimizations [Access cost and bandwidth minimality]
Execute query plan
Building Source Complete Plans
Source descriptions:
movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)
house-of-movies($X, Y) -> title-time(X, Y), title-actor(X, Z)
Source Inversion Rules [Duschka, Genesereth 97]:
title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
Binding restrictions lead to recursion in the plan
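A rough Python sketch of how inversion rules of this shape could be generated from a source description. It makes several assumptions: the tuple encoding of atoms, the function names `invert_view` and `show`, and the Skolem-term naming scheme `f_<source>_<var>` are all hypothetical, and this is not EMERAC's actual code.

```python
def invert_view(source, head_vars, required, body):
    """Build inverse rules for one source description.

    source:    source name, e.g. "house-of-movies"
    head_vars: variables of the source relation, e.g. ["X", "Y"]
    required:  variables carrying a $ annotation (must be bound)
    body:      world-model atoms on the right of "->", as tuples
    Returns a list of (head_atom, body_atoms) rules.
    """
    # Required variables must first be drawn from the dom predicate.
    call = [("dom", v) for v in head_vars if v in required]
    call.append((source, *head_vars))
    skolem_args = ", ".join(head_vars)
    rules = []
    for pred, *args in body:
        # Variables the source does not export become Skolem terms.
        new_args = tuple(
            a if a in head_vars else f"f_{source}_{a}({skolem_args})"
            for a in args
        )
        rules.append(((pred, *new_args), call))
    # Every value the source returns can feed later bound calls.
    for v in head_vars:
        if v not in required:
            rules.append((("dom", v), call))
    return rules


def show(rule):
    head, body = rule
    fmt = lambda atom: f"{atom[0]}({', '.join(atom[1:])})"
    return f"{fmt(head)} :- {', '.join(fmt(a) for a in body)}"


view_body = [("title-time", "X", "Y"), ("title-actor", "X", "Z")]
hom_rules = invert_view("house-of-movies", ["X", "Y"], {"X"}, view_body)
for rule in hom_rules:
    print(show(rule))
```

For house-of-movies the sketch yields a dom(Y) rule whose body itself needs dom(X), which is where the recursion in the plan comes from.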
Problems with Plans derived from source inversion rules
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
query(X, Y) :- title-time(X, Y)
title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
If both movie-hut and house-of-movies have the same information:
• both sources are not necessary
• the recursion is not necessary
Every source that is remotely relevant to the query is made part of the plan
•Many of these sources may be overlapping
Minimizing information gathering plans
Model source overlaps
– Use LCW statements
Rewrite the source-complete plan:
– Greedily remove rules from plan with uniform equivalence and LCW statements (= make the plan source-minimal)
• Uniform containment checks [Sagiv, 88]
• Use heuristics to guide removal and pull out recursion first
LCW Statements
View: movie-hut(X, Y) -> title-time(X, Y), title-actor(X, Z)
LCW: movie-hut(X, Y) <- title-time(X, Y), title-actor(X, Z)
To check if one rule, r1, with information source predicates contains another rule, r2, see if
r1[s ↦ s ∨ l] contains r2[s ↦ s ∧ v]
[Etzioni et al 97], [Duschka 97]
Inter-source subsumption relations [Mirror sources] can also be handled
Uniform Equivalence
Equivalence:
• Two datalog programs X and Y are equivalent if, for every set of extensional predicates, the two programs produce the same output.
• Undecidable
Uniform Equivalence:
• X and Y are uniformly equivalent if, for every set of extensional and intensional predicates, the two programs produce the same output
• Decidable
• Implies equivalence [Sagiv 88]
Testing for Uniform Containment
Does
p(X, Y) :- q(X, Y)
q(X, Y) :- r(X, Y)
uniformly contain
p(W, X) :- r(W, X)
?
Test: assert r(“W”, “X”) and try to derive p(“W”, “X”)
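The freeze-and-derive test can be sketched in Python as follows. This is a minimal illustration, not EMERAC's code: the tuple encoding of atoms, the naive bottom-up evaluator, and the convention that variables start with an uppercase letter are all assumptions made for the example.

```python
def is_var(term):
    # Assumed convention: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()


def match(atom, fact, subst):
    """Extend subst so that atom equals fact, or return None if impossible."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    s = dict(subst)
    for term, value in zip(atom[1:], fact[1:]):
        if is_var(term):
            if s.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return s


def derive(program, facts):
    """Naive bottom-up evaluation: apply rules until no new fact appears."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in program:
            substs = [{}]
            for atom in body:
                extended = []
                for s in substs:
                    for fact in facts:
                        s2 = match(atom, fact, s)
                        if s2 is not None:
                            extended.append(s2)
                substs = extended
            for s in substs:
                new_fact = (head[0],) + tuple(s.get(t, t) for t in head[1:])
                if new_fact not in facts:
                    facts.add(new_fact)
                    changed = True
    return facts


def uniformly_contains(program, rule):
    """Freeze the rule's variables into constants, assert its body, derive its head."""
    head, body = rule
    freeze = lambda atom: (atom[0],) + tuple(
        f'"{t}"' if is_var(t) else t for t in atom[1:]
    )
    return freeze(head) in derive(program, {freeze(a) for a in body})


program = [
    (("p", "X", "Y"), [("q", "X", "Y")]),
    (("q", "X", "Y"), [("r", "X", "Y")]),
]
print(uniformly_contains(program, (("p", "W", "X"), [("r", "W", "X")])))  # True
```

Freezing turns W and X into the constants "W" and "X", so asserting r("W", "X") and deriving p("W", "X") is exactly the check described on the slide.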
Greedily Minimizing Information Gathering Plans
Remove non-recursive IDB predicates
Sort the rules so those with dom predicates come before those without dom predicates
[Source costs can be used]
for each rule r do
let r be a rule of P that has not yet been considered
let P̂ be the program obtained by deleting rule r from P
if P̂[s ↦ s ∨ l] uniformly contains r[s ↦ s ∧ v] then
replace P with P̂. Prune unreachable rules.
Uniform containment check is exponential in the worst case
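A simplified Python sketch of the greedy loop. It deliberately leaves out the LCW and view substitutions, the removal of non-recursive IDB predicates, and the pruning of unreachable rules, so it only removes rules that are redundant under plain uniform containment. The atom encoding and all function names are assumptions for illustration.

```python
def is_var(term):
    # Assumed convention: variables start with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()


def match(atom, fact, subst):
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    s = dict(subst)
    for term, value in zip(atom[1:], fact[1:]):
        if is_var(term):
            if s.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return s


def derive(program, facts):
    # Naive bottom-up evaluation to a fixpoint.
    facts, changed = set(facts), True
    while changed:
        changed = False
        for head, body in program:
            substs = [{}]
            for atom in body:
                substs = [s2 for s in substs for f in facts
                          for s2 in [match(atom, f, s)] if s2 is not None]
            for s in substs:
                new_fact = (head[0],) + tuple(s.get(t, t) for t in head[1:])
                if new_fact not in facts:
                    facts.add(new_fact)
                    changed = True
    return facts


def uniformly_contains(program, rule):
    head, body = rule
    freeze = lambda a: (a[0],) + tuple(f'"{t}"' if is_var(t) else t for t in a[1:])
    return freeze(head) in derive(program, {freeze(a) for a in body})


def minimize(program):
    """Greedily drop every rule that the remaining rules uniformly contain."""
    # Rules mentioning dom are considered first, as on the slide.
    rules = sorted(program, key=lambda r: not any(a[0] == "dom" for a in r[1]))
    for rule in list(rules):
        rest = [r for r in rules if r is not rule]
        if uniformly_contains(rest, rule):
            rules = rest
    return rules


plan = [
    (("p", "X", "Y"), [("q", "X", "Y")]),
    (("q", "X", "Y"), [("r", "X", "Y")]),
    (("p", "X", "Y"), [("r", "X", "Y")]),   # redundant: follows from the two rules above
]
minimal = minimize(plan)
print(len(minimal))  # 2
```

Because the loop is greedy, the result can depend on the order in which rules are considered, which is one reason the slide sorts the rules before the loop.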
Minimization example
title-time(X, Y) :- dom(X), house-of-movies(X, Y)
title-actor(X, f2(X, Y)) :- dom(X), house-of-movies(X, Y)
dom(Y) :- dom(X), house-of-movies(X, Y)
title-time(X, Y) :- movie-hut(X, Y)
title-actor(X, f1(X, Y)) :- movie-hut(X, Y)
dom(X) :- movie-hut(X, Y)
dom(Y) :- movie-hut(X, Y)
query(X, Y) :- title-time(X, Y)
LCW: movie-hut(X, Y) <- title-time(X, Y), title-actor(X, Z)
EMERAC
Build query plan [Source completeness]
Logical Optimizations [Source-minimality]
Execution Optimizations [Access cost and bandwidth minimality]
Execute query plan
Issues in ordering source calls
• Execution cost is a function of both access cost and the tuple-transfer cost (ignoring local processing costs…)
• Tension between access costs & traffic costs
– E.g. Execute “S1(W,X) & S2(X,Y)” where the query binds W
– Tuple-transfer cost reduction motivates calling sources with the least general binding patterns possible
• Bound-is-easier (S1 first, and then feed X bindings to S2)
– Access cost reduction motivates calling sources with the most general binding patterns possible
• Feeding X bindings for S2 will generate many separate accesses, increasing the access cost
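Minimize $\sum_{s} \left( C^{a}_{s} \cdot n_{s} + C^{t}_{s} \cdot D_{s} \right)$, where the first term is the access cost and the second term is the transfer cost.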
Our Approach: Assumptions
• Exact optimization is not worth it…
– Lack of full source statistics
– NP-hardness of the optimization problem
• Join-ordering, which is a special case, is already NP-Complete
• Source access costs dominate tuple-transfer costs by default
– Reasonable given the large setup and latency costs for internet sources
Our Approach: Overview
• A greedy approach (along the lines of “bound-is-easier” type procedures)
• By default, attempts to access each source with the most general feasible binding pattern
– Reasonable given the assumption that access costs dominate transfer costs
• The default is overridden if a binding pattern is known to produce too much traffic
– Binding patterns producing high traffic are stored in a table called HTBP
• Implicitly produces bushy join trees
The HTBP Table
• The HTBP table contains, for every source S, the least general binding patterns of S which are known to produce “high” traffic
– A call to source S with binding pattern B is considered high-traffic producing if HTBP contains S^B’ and B is either equal to or more general than B’
– E.g. Book(Author,Title,ISBN,Subj,Price,Pages)
• HTBP may contain all binding patterns that do not bind at least one of the first four attributes
– Book^ffffbb listed explicitly in HTBP
– Book^fffffb, Book^ffffbf, Book^ffffff would be considered to be implicitly in HTBP
• Advantage: HTBP should be easy to specify even if full source statistics are not available
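The HTBP lookup rule can be sketched in Python. The dictionary layout of the table and the function names are assumptions made for illustration; binding patterns are written as strings of b (bound) and f (free).

```python
def at_least_as_general(pattern, listed):
    """True if `pattern` is equal to or more general than `listed`.

    A pattern is more general when it binds a subset of the attributes
    that the listed pattern binds.
    """
    return all(p == "f" or q == "b" for p, q in zip(pattern, listed))


def is_high_traffic(source, pattern, htbp):
    """A call is high-traffic if HTBP lists a pattern that `pattern` generalizes."""
    return any(at_least_as_general(pattern, listed)
               for listed in htbp.get(source, ()))


# Book(Author,Title,ISBN,Subj,Price,Pages): only ffffbb is listed explicitly.
htbp = {"Book": ["ffffbb"]}
print(is_high_traffic("Book", "fffffb", htbp))  # True  (implicitly in HTBP)
print(is_high_traffic("Book", "bfffff", htbp))  # False (binds Author)
```

Storing only the least general patterns keeps the table short, since every more general pattern is covered by the subset test.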
The Algorithm
For each stage i from 1 to m do
  For each unchosen subgoal S
    pick the most general & feasible BP B of S w.r.t. V & FBP such that B is not in HTBP.
    If such a B exists, [Default case: Reduce accesses]
      Push S^B into C[i]. Mark S chosen. Add all variables of S to V
    If no such B exists, but there is a feasible binding pattern for S [HTBP case: Reduce transfer costs]
      Pick the BP B’ with most bound variables (in terms of #(.))
      Push S^B’ into P[i]
  If no subgoal has been chosen at this level (C[i] is empty), and there are some postponed sources (P[i] is non-empty)
    Choose Sk^B in P[i] with the maximum #(B) value
    Push Sk^B into C[i]
    Add all variables of Sk to V
Return the array C[1…m]
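The staged ordering procedure can be sketched in Python as below. This is an interpretation, not EMERAC's implementation. The data layout and function names are hypothetical, % annotations are not modeled, and variables of chosen subgoals are added to V only when a stage ends, an assumption made so that calls within one stage do not depend on each other.

```python
from itertools import product


def bound_count(bp):
    return bp.count("b")


def is_high_traffic(source, bp, htbp):
    # bp is high-traffic if it is equal to or more general than a listed pattern.
    return any(all(x == "f" or y == "b" for x, y in zip(bp, listed))
               for listed in htbp.get(source, ()))


def feasible_patterns(args, bound_vars, required=()):
    """All patterns that bind only known variables and every required position."""
    choices = []
    for pos, var in enumerate(args):
        if pos in required:
            if var not in bound_vars:
                return []
            choices.append("b")
        else:
            choices.append("fb" if var in bound_vars else "f")
    return ["".join(p) for p in product(*choices)]


def order_source_calls(subgoals, bound_vars, htbp, required=None):
    required = required or {}
    V = set(bound_vars)
    unchosen = dict(subgoals)
    stages = []
    while unchosen:
        chosen, postponed = [], []
        for name, args in unchosen.items():
            fbp = feasible_patterns(args, V, required.get(name, ()))
            low_traffic = [b for b in fbp if not is_high_traffic(name, b, htbp)]
            if low_traffic:    # default case: most general pattern, fewest accesses
                chosen.append((name, min(low_traffic, key=bound_count)))
            elif fbp:          # HTBP case: postpone, preferring the most bound pattern
                postponed.append((name, max(fbp, key=bound_count)))
        if not chosen and postponed:
            chosen.append(max(postponed, key=lambda call: bound_count(call[1])))
        if not chosen:
            raise ValueError("no feasible binding pattern for the remaining subgoals")
        for name, _ in chosen:
            V.update(unchosen.pop(name))
        stages.append(chosen)
    return stages


# Q(A,T,U,1998) :- DP(A,T,1998) & SM98(T,U); the query binds Y.
plan = [("DP", ("A", "T", "Y")), ("SM98", ("T", "U"))]
print(order_source_calls(plan, {"Y"}, {"DP": ["ffb"]}))
```

With HTBP = {DP^ffb} the sketch calls SM98 with ff first and then DP with fbf; with an empty HTBP both sources land in the same stage, which is how bushy plans arise.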
Example
•Sources: DP(A:Author,T:Title,Y:Year)
SM98(T:Title,U:URL)
•Query: Q(A,T,U,1998)
•Plan: Q(A,T,U,1998) :- DP(A,T,1998) & SM98(T,U)
HTBP: {DP^bbb, SM98^bb}
Step 1. V={Y}
Cand: DP^fff, DP^ffb, SM98^ff (all three crossed out)
P[1] = {DP^ffb, SM98^ff}
C[1] = DP^ffb
Step 2. V={A,T,Y}
Cand: SM98^ff, SM98^bf (both crossed out)
P[2] = {SM98^bf}
C[2] = SM98^bf
HTBP: {DP^ffb}
Step 1. V={Y}
Cand: DP^fff, DP^ffb, SM98^ff (two crossed out)
C[1] = SM98^ff
Step 2. V={Y, U, T}
Cand: DP^fff, DP^ffb, DP^fbf, DP^fbb (three crossed out)
C[2] = DP^fbf
HTBP: {}
Step 1. V={Y}
Cand: DP^fff, DP^ffb, SM98^ff
C[1] = SM98^ff, DP^fff
Bound-is-easier
Implementation
The Emerac Information Gatherer
•written in Java
•incorporates rewriting and execution ordering techniques
•executes plans in parallel
•returns partial results during plan execution
•object-oriented design makes it easy to modify
Experiments
• Experimented with simulated sources derived from DBLP data
– Our minimization approach reduces access costs by removing redundant recursive sources
• Minimization cost offset by the improvements in execution time
– Our source ordering approach tended to reduce the total cost over the bound-is-easier approach whenever there were a significant number of binding patterns that are not subsumed by HTBP
LCW vs. Naïve [Artificial Sources]
[Plot: time to plan & execute (ms, log scale, 1.00E+03 to 1.00E+05) against # redundant constrained sources (1 to 16); series: Naïve d=1, LCW d=1, Naïve d=3, LCW d=3, Naïve d=5, LCW d=5]
LCW vs. Naïve [DBLP Sources]
[Plot: time to plan & execute (ms, log scale, 1.00E+03 to 1.00E+08) against # redundant constrained sources (1 to 8); series: Naive 256 (1), LCW 256 (1), Naive 256 (3), LCW 256 (3)]
Graceful degradation
Contributions
•An approach for minimizing recursive information gathering plans
•An approach for ordering source calls in information gathering plans
•Attempts at minimizing both access cost and tuple-transfer cost
•Implementation & Evaluation in EMERAC
Current directions
• Integrate minimization & source-call ordering phases
• Model cost-quality tradeoffs
• Handling run-time exceptions
– unavailability of sources etc.
• Tracking time and solution quality statistics
– Improve the granularity of the HTBP table