Robust Query Processing through Progressive Optimization
-
Upload
janna-mcconnell -
Category
Documents
-
view
37 -
download
1
description
Transcript of Robust Query Processing through Progressive Optimization
Robust Query Processing through Progressive Optimization
SIGMOD 2004
Volker Markl, Vijayshankar Raman, David Simmen, Guy Lohman, Hamid Pirahesh, Miso Cilimdzic
Modified by S. Sudarshan from talk by Raja Agrawal
Motivation
Current optimizers depend heavily upon the cardinality estimations
What if there errors in those estimations?
Errors can occur due to … Inaccurate statistics Invalid assumptions (e.g. attribute
independence)
Progressive Query Optimization
Idea: lazily trigger reoptimization during execution if cardinality counts indicate current plan is suboptimal introduces checkpoint (CHECK) operator
to compare actual vs estimated cardinality
key idea: precompute cardinality ranges for which plan is optimal
Evaluating a re-optimization scheme Risk Vs Opportunity Risk:
Extent to which re-optimization is not worthwhile reoptimization chooses another bad plan work redone cardinality errors may even cancel, and fixing
one may give an even worse plan! Opportunity:
Refers to the aggressiveness more CHECK operators..
Background
Redbrick Star schema with fact table and multiple dimension
tables First apply selections on dimension tables Then decide what plan to use
Kabra & DeWitt 98 (KD98) Introduced idea of mid-query reoptimization Allow partial results to be use like materialized views But ad-hoc cardinality threshold, and only reoptimize
fully materialized plans
Opportunity
Ris
k
Background
Tukwila data integration system optimizer may have no idea of statistics interleave optimization and query execution
partial query plans Fragment: fully pipelined tree with doubly
pipelined hash join Query Scrambling
reorder query to deal with delayed sources
Opportunity
Ris
k
Background
Eddies (Telegraph) Ingres/DEC Rdb: run multiple access methods
competitively then choose Parametric Query Optimization (PQO)
e.g. Cole and Graefe 94, Hulgeri and Sudarshan 02 Choose from a set of plans, each optimal for
selectivity range POP: converse: find optimal cardinality range for a
give plan
Opportunity
Ris
k
Architecture of POP CHECK operator to find if a plan is
suboptimal At optimization time, find out cardinality range
(at CHECK location) for which plan is optimal At run time, ensure cardinality within [l,u] If violated, stop plan execution and reoptimize
Location of CHECKs Re-optimize
taking observed cardinality into account, and exploiting intermediate results where beneficial Heuristic: limit number of reoptimizations
(default: 3)
Validity Ranges Consider a plan edge e that flows rows into
operator o, let P be the subplan rooted at o. The validity range for e is an upper and lower
bound on the number of rows flowing through e, such that if the range is violated at runtime, we can guarantee P is suboptimal
Ad-hoc thresholds (proposed earlier) are a bad idea E.g. even a 100x error on very small relation
may not make a difference in optimal plan
Finding Optimality Ranges
Plan Popt with root operator oopt is being compared with another plan Palt different only in the root operator oalt.
Finding Optimality Ranges
Need to solve cost(Palt , c) – cost(Popt , c) = 0 where c is the cardinality on edge e
Cost functions can be complex/non-linear/non-continuous
What does this achieve? Detects suboptimality of the root operator where Popt
and Palt share the same input edges. Validity range might miss a cross-over point with a
plan that uses a different join order (and hence has different input edges).
Two plans are structurally equivalent if they share the same set of edges where an edge is defined by the set of rows flowing
through it during query execution. Allows different algorithms, and flipping inner/outer
Optimality wrt structurally equivalent plans
Theorem: …. Suppose edges edges ei1 , ei2 , … , eik are seen to be “erroneous” wrt cardinality. Then the following statements are equivalent:1. P is suboptimal with respect to another plan P'
that has the same set of edges {e1 , e2 , … , em}
2. At least one of Pi1 , Pi2 , … , Pik is suboptimal given the cardinality errors in those edges in {e1 , e2 , … , em } that lie under them.
3. At least one of oi1 , oi2 , … , oik is a suboptimal operator given the cardinality errors in {e1 , e2 , … , em} that are in its input edges.
Conservative detection of suboptimality Suppose we “detect” suboptimality of (R
Join S) Join T wrt estimated costs of (R Join T) Join S During run time, we can never observe the
cardinality of R Join T We would be making an arbitrary guess as to the
correlation of the predicates on the R and T tables
Best not to infer suboptimality wrt such estimates
However, reoptimization may result in a different join order
Exploiting Intermediate Results All the intermediate results are stored as
temporary MVs with cardinalities available to the optimizer can be reused if it leads to a better plan
but not necessarily used, e.g. if join result is very large, and a different join order is preferred
must be reused if it has performed side-effects Reoptimization done as part of same
transaction
Variants of CHECK Variants applicable in different cases,
trade off risk for opportunity Variants
Lazy checking Lazy checking with eager materialization Eager checking without compensation Eager checking with buffering Eager checking with deferred
compensation
Lazy Checking Adding CHECKs above a materialization
point (SORT, TEMP etc) No results have been output yet And materialized results can be re-used very low overhead
Lazy checking with eager materialization
Can insert materialization point if it does not exists already Risk: overhead of materialization Typically done only for outer input of
indexed nested-loop join low cost if outer is small (as estimated by
optimizer) and INL is in trouble anyway if outer is large
Eager Checking
Lazy checking may be too late e.g. if very bad join order chosen, with
huge intermediate results Idea: check even before entire result
is materialized, and stop early Problem: what if some results have
already been output? Compensation
Eager Checking
EC without Compensation: CHECK is pushed down the
materialization point, into pipeline
Eager Checking EC with buffering
CHECK and buffer output from buffer once sure about bound
e.g. [0,b), or [b,infinity] else reoptimize “delayed pipelining”
EC with Deferred Compensation Only SPJ queries Identifier of all rows returned to the user
are stored in a table S, which is used later in the new plan for anti-join with the new-result stream
CHECK Placement
LCEM and ECB – outer side of nested-loop join
LC – above materialization points ECWC and ECDC – anywhere Do not place CHECKs if
no alternative plan above CHECK simple queries with low estimated cost
Performance Analysis: Robustness TPC-H Q10: Replace constant in selection on lineitem
by parameter marker, so optimizer doesn’t know actual selectivity
5 different optimal plans
Opportunity Analysis Goal: how often does opportunity to reoptimize arise? Introduce LC/LCEM/ECB checkpoints But turn off reoptimization, and run same plan Opportunity region for ECB: dotted line
POP in (in)action Real world workload (DMV data and queries) Complex predicates leading to cardinality estimation
errors substring comparison, like, IN, ..
POP in (in)action (contd.)
Re-optimization may result in the choice of worse plan due to: Two estimation errors canceling out each
other Re-using intermediate results