Database Techniek – Query Optimization (chapter 14)

Author: bonnierice

Database Techniek
Query Optimization (chapter 14)
Notes 4

Lecture 3
- Query Rewriting
- Equivalence Rules
- Query Optimization
  - Dynamic Programming (System R/DB2)
  - Heuristics (Ingres/Postgres)
- Decorrelation of nested queries
- Result Size Estimation
- Practicum Assignment 2

Transformation of Relational Expressions
Two relational algebra expressions are said to be equivalent if on every legal database instance the two expressions generate the same set of tuples. Note: the order of tuples is irrelevant. In SQL, inputs and outputs are bags (multisets) of tuples; two expressions in the bag version of the relational algebra are said to be equivalent if on every legal database instance the two expressions generate the same bag of tuples. An equivalence rule says that expressions of two forms are equivalent: we can replace an expression of the first form by the second, or vice versa.

Equivalence Rules
1. Conjunctive selection operations can be deconstructed into a sequence of individual selections: σ_{θ1 ∧ θ2}(E) = σ_{θ1}(σ_{θ2}(E)).
2. Selection operations are commutative: σ_{θ1}(σ_{θ2}(E)) = σ_{θ2}(σ_{θ1}(E)).
3. Only the last in a sequence of projection operations is needed; the others can be omitted.
4. Selections can be combined with Cartesian products and theta joins:
   (a) σ_θ(E1 × E2) = E1 ⋈_θ E2
   (b) σ_{θ1}(E1 ⋈_{θ2} E2) = E1 ⋈_{θ1 ∧ θ2} E2
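Rule 4(a) can be tried on concrete data. The sketch below models relations as lists of dicts; the relations `emp` and `dept`, their attributes, and the helper names are made-up examples, not part of the slides.

```python
# Checking rule 4a on sample data: selecting on a Cartesian product yields
# the same tuples as the corresponding theta join.
emp = [{"eid": 1, "dept": 10}, {"eid": 2, "dept": 20}]
dept = [{"did": 10, "dname": "sales"}, {"did": 30, "dname": "hr"}]

def cartesian(r1, r2):
    # E1 x E2: concatenate every pair of tuples
    return [{**t1, **t2} for t1 in r1 for t2 in r2]

def select(rel, pred):
    # sigma_pred(rel)
    return [t for t in rel if pred(t)]

def theta_join(r1, r2, pred):
    # E1 join_pred E2: pairs satisfying the join condition
    return [{**t1, **t2} for t1 in r1 for t2 in r2 if pred({**t1, **t2})]

pred = lambda t: t["dept"] == t["did"]
assert select(cartesian(emp, dept), pred) == theta_join(emp, dept, pred)
```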

Algebraic Rewritings for Selection (figure):
σ_{cond1 ∧ cond2}(R) ≡ σ_{cond1}(σ_{cond2}(R)) ≡ σ_{cond2}(σ_{cond1}(R))
σ_{cond1 ∨ cond2}(R) ≡ σ_{cond1}(R) ∪ σ_{cond2}(R)

Equivalence Rules (Cont.)
5. Theta-join operations (and natural joins) are commutative: E1 ⋈_θ E2 = E2 ⋈_θ E1.
6. Natural join operations are associative: (E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3).

Equivalence Rules for Joins (figure): joins are commutative and associative.

Equivalence Rules (Cont.)
7. For pushing down a selection into a (theta) join we have the following cases:
(push) When all the attributes in θ0 involve only the attributes of one of the expressions (E1) being joined: σ_{θ0}(E1 ⋈_θ E2) = (σ_{θ0}(E1)) ⋈_θ E2.
(split) When θ1 involves only the attributes of E1 and θ2 involves only the attributes of E2: σ_{θ1 ∧ θ2}(E1 ⋈_θ E2) = (σ_{θ1}(E1)) ⋈_θ (σ_{θ2}(E2)).
(impossible) When θ involves attributes of both E1 and E2, it is a join condition and cannot be pushed down.
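The push case can be illustrated on sample data: applying a selection that mentions only one input's attributes before the join leaves the result unchanged while shrinking the join input. The relations and attribute names below are hypothetical.

```python
# Rule 7 (push) on sample data: select-after-join equals select-before-join
# when the predicate touches only one side, and the pushed-down variant
# feeds a smaller input into the join.
orders = [{"okey": i, "cust": i % 3} for i in range(9)]
items = [{"okey": i % 9, "qty": i} for i in range(20)]

def join(r1, r2, attr):
    # natural join on a single shared attribute
    return [{**t1, **t2} for t1 in r1 for t2 in r2 if t1[attr] == t2[attr]]

def select(rel, pred):
    return [t for t in rel if pred(t)]

pred = lambda t: t["cust"] == 0          # mentions only `orders` attributes

late = select(join(orders, items, "okey"), pred)    # select after the join
early = join(select(orders, pred), items, "okey")   # select pushed down

assert sorted(late, key=str) == sorted(early, key=str)
assert len(select(orders, pred)) < len(orders)      # smaller join input
```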

Pushing Selection through Cartesian Product and Joins (figure):
σ_{cond}(R × S) ≡ R × σ_{cond}(S), and likewise σ_{cond}(R ⋈ S) ≡ R ⋈ σ_{cond}(S).
Pushing the selection to S (the rightward direction) requires that cond refers to S attributes only.

Projection Decomposition (figure): a projection π_{X∪Y} over a join can be decomposed into π_X and π_Y pushed to the join inputs. Example: for total = X.tax * Y.price, the projection π_{total} above the join requires keeping {tax} from X and {price} from Y in the inputs.

More Equivalence Rules
8. The projection operation distributes over the theta join operation as follows:
(a) If θ involves only attributes from L1 ∪ L2:
π_{L1 ∪ L2}(E1 ⋈_θ E2) = (π_{L1}(E1)) ⋈_θ (π_{L2}(E2))
(b) In general, consider a join E1 ⋈_θ E2. Let L1 and L2 be sets of attributes from E1 and E2, respectively. Let L3 be attributes of E1 that are involved in join condition θ but are not in L1 ∪ L2, and let L4 be attributes of E2 that are involved in θ but are not in L1 ∪ L2. Then:
π_{L1 ∪ L2}(E1 ⋈_θ E2) = π_{L1 ∪ L2}((π_{L1 ∪ L3}(E1)) ⋈_θ (π_{L2 ∪ L4}(E2)))

Join Ordering Example
For all relations r1, r2, and r3: (r1 ⋈ r2) ⋈ r3 = r1 ⋈ (r2 ⋈ r3).
If r2 ⋈ r3 is quite large and r1 ⋈ r2 is small, we choose (r1 ⋈ r2) ⋈ r3 so that we compute and store a smaller temporary relation.

Join Ordering Example (Cont.)
Consider the expression π_{customername}((σ_{branchcity = "Brooklyn"}(branch)) ⋈ account ⋈ depositor).
We could compute account ⋈ depositor first and join the result with σ_{branchcity = "Brooklyn"}(branch), but account ⋈ depositor is likely to be a large relation. Since it is more likely that only a small fraction of the bank's customers have accounts in branches located in Brooklyn, it is better to compute σ_{branchcity = "Brooklyn"}(branch) ⋈ account first.

Lecture 3
- Query Rewriting
- Equivalence Rules
- Query Optimization
  - Dynamic Programming (System R/DB2)
  - Heuristics (Ingres/Postgres)
- Decorrelation of nested queries
- Result Size Estimation

The role of Query Optimization (figure):
SQL → (parsing, normalization) → logical algebra → (logical query optimization) → logical algebra → (physical query optimization) → physical algebra → (query execution)

The role of Query Optimization (cont.): logical query optimization compares different relational algebra plans on result size (Practicum 2A).

The role of Query Optimization (cont.): physical query optimization compares different execution algorithms on true cost (I/O, CPU, cache).

Enumeration of Equivalent Expressions
Query optimizers use equivalence rules to systematically generate expressions equivalent to the given expression. Repeat until no more expressions can be found: for each expression found so far, use all applicable equivalence rules and add the newly generated expressions to the set of expressions found so far.
This approach is very expensive in space and time; time and space requirements are reduced by not generating all expressions.

Finding A Good Join Order
Consider finding the best join order for r1 ⋈ r2 ⋈ . . . ⋈ rn.
There are (2(n−1))!/(n−1)! different join orders for the above expression. With n = 7 the number is 665,280; with n = 10 it is greater than 17.6 billion!
No need to generate all the join orders: using dynamic programming, the least-cost join order for any subset of {r1, r2, . . . , rn} is computed only once and stored for future use.

Dynamic Programming in Optimization
To find the best join tree for a set S of n relations: consider all possible plans of the form S1 ⋈ (S − S1), where S1 is any non-empty subset of S. Recursively compute the costs for joining subsets of S to find the cost of each plan, and choose the cheapest of the 2^n − 1 alternatives. When the plan for any subset is computed, store it and reuse it when it is required again, instead of recomputing it: dynamic programming.

Join Order Optimization Algorithm

procedure findbestplan(S)
    if (bestplan[S].cost ≠ ∞)    // bestplan[S] already computed
        return bestplan[S]
    // else bestplan[S] has not been computed earlier; compute it now
    for each non-empty subset S1 of S such that S1 ≠ S
        P1 = findbestplan(S1)
        P2 = findbestplan(S − S1)
        A = best algorithm for joining results of P1 and P2
        cost = P1.cost + P2.cost + cost of A
        if cost < bestplan[S].cost
            bestplan[S].cost = cost
            bestplan[S].plan = execute P1.plan; execute P2.plan;
                               join results of P1 and P2 using A
    return bestplan[S]
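A minimal executable sketch of findbestplan, under a toy cost model: base-relation sizes are given up front, the estimated size of a join is the product of its inputs' sizes, and the "cost of A" is approximated by that same product. A real optimizer would instead pick a join algorithm per pair and use catalog statistics; the relation names and sizes here are hypothetical.

```python
from functools import lru_cache
from itertools import combinations

sizes = {"r1": 1000, "r2": 10, "r3": 100}

@lru_cache(maxsize=None)            # memo: each subset is planned only once
def find_best_plan(S):
    if len(S) == 1:
        (r,) = S
        return (0, r, sizes[r])     # (cost, plan, estimated result size)
    best = (float("inf"), None, None)
    for k in range(1, len(S)):      # every non-empty proper subset S1 of S
        for S1 in combinations(sorted(S), k):
            c1, p1, n1 = find_best_plan(frozenset(S1))
            c2, p2, n2 = find_best_plan(S - frozenset(S1))
            cost = c1 + c2 + n1 * n2    # stand-in for "cost of A"
            if cost < best[0]:
                best = (cost, f"({p1} JOIN {p2})", n1 * n2)
    return best

cost, plan, _ = find_best_plan(frozenset(sizes))
```

Under this model the cheapest plan joins the two small relations first, as the join-ordering example earlier in the slides suggests.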

Left-Deep Join Trees
In left-deep join trees, the right-hand-side input for each join is a relation, not the result of an intermediate join.

Cost of Optimization
With dynamic programming, the time complexity of optimization with bushy trees is O(3^n). With n = 10 this is about 59,000 instead of 17.6 billion! Space complexity is O(2^n).
To find the best left-deep join tree for a set of n relations: consider n alternatives with one relation as the right-hand-side input and the other relations as the left-hand-side input. Using the (recursively computed and stored) least-cost join order for each alternative on the left-hand side, choose the cheapest of the n alternatives.
If only left-deep trees are considered, the time complexity of finding the best join order is O(n 2^n); space complexity remains O(2^n).
Cost-based optimization is expensive, but worthwhile for queries on large datasets (typical queries have small n, generally < 10).

Physical Query Optimization
Minimizes absolute cost: minimize I/Os; minimize CPU and cache-miss cost (main-memory DBMS). Must consider the interaction of evaluation techniques when choosing evaluation plans: choosing the cheapest algorithm for each operation independently may not yield the best overall algorithm. E.g. merge join may be costlier than hash join, but may provide a sorted output which reduces the cost of an outer-level aggregation; nested-loop join may provide an opportunity for pipelining.

Physical Optimization: Interesting Orders
Consider the expression (r1 ⋈ r2 ⋈ r3) ⋈ r4 ⋈ r5. An interesting sort order is a particular sort order of tuples that could be useful for a later operation. Generating the result of r1 ⋈ r2 ⋈ r3 sorted on the attributes common with r4 or r5 may be useful, but generating it sorted on the attributes common only to r1 and r2 is not. Using merge join to compute r1 ⋈ r2 ⋈ r3 may be costlier, but may provide an output sorted in an interesting order. It is therefore not sufficient to find the best join order for each subset of the set of n given relations; we must find the best join order for each subset, for each interesting sort order: a simple extension of the earlier dynamic programming algorithm. Usually the number of interesting orders is quite small and doesn't affect time/space complexity significantly.

Heuristic Optimization
Cost-based optimization is expensive, even with dynamic programming. Systems may use heuristics to reduce the number of choices that must be made in a cost-based fashion. Heuristic optimization transforms the query tree by using a set of rules that typically (but not in all cases) improve execution performance:
- Perform selection early (reduces the number of tuples)
- Perform projection early (reduces the number of attributes)
- Perform the most restrictive selection and join operations before other similar operations
Some systems use only heuristics; others combine heuristics with partial cost-based optimization.

Steps in Typical Heuristic Optimization
1. Deconstruct conjunctive selections into a sequence of single selection operations (equiv. rule 1).
2. Move selection operations down the query tree for the earliest possible execution (equiv. rules 2, 7a, 7b, 11).
3. Execute first those selection and join operations that will produce the smallest relations (equiv. rule 6).
4. Replace Cartesian product operations that are followed by a selection condition by join operations (equiv. rule 4a).
5. Deconstruct and move as far down the tree as possible lists of projection attributes, creating new projections where needed (equiv. rules 3, 8a, 8b, 12).
6. Identify those subtrees whose operations can be pipelined, and execute them using pipelining.

Heuristic Join Order: the Wong-Youssefi algorithm (INGRES)
Sample TPC-H schema:
Nation(NationKey, NName)
Customer(CustKey, CName, NationKey)
Order(OrderKey, CustKey, Status)
Lineitem(OrderKey, PartKey, Quantity)
Product(SuppKey, PartKey, PName)
Supplier(SuppKey, SName)

SELECT SName
FROM Nation, Customer, Order, LineItem, Product, Supplier
WHERE Nation.NationKey = Customer.NationKey
AND Customer.CustKey = Order.CustKey
AND Order.OrderKey = LineItem.OrderKey
AND LineItem.PartKey = Product.PartKey
AND Product.SuppKey = Supplier.SuppKey
AND NName = 'Canada'

Find the names of suppliers that sell a product that appears in a line item of an order made by a customer who is in Canada.

Challenges with Large Natural Join Expressions
For simplicity, assume that in the query all joins are natural: whenever two tables of the FROM clause have common attributes, we join on them. Consider right-index (RI) joins only. (Figure: one possible order chains Nation → Customer → Order → LineItem → Product → Supplier via index-based RI joins, starting from the selection NName = 'Canada'.)

Wong-Youssefi algorithm assumptions and objectives
Assumption 1 (weak): indexes on all join attributes (keys and foreign keys).
Assumption 2 (strong): at least one selection creates a small relation; a join with a small relation results in a small relation.
Objective: create a sequence of index-based joins such that all intermediate results are small.

Hypergraphs
(Figure: the query as a hypergraph over the attributes CName, CustKey, NationKey, NName, Status, OrderKey, Quantity, PartKey, SuppKey, PName, SName.) Each node is an attribute; each relation (Nation, Customer, Order, LineItem, Product, Supplier) is a hyperedge over its attributes. Two hyperedges for the same relation are possible. The representation can be extended to non-natural equality joins by merging nodes.

Small Relations/Hypergraph Reduction
(Figure: the same hypergraph.) Nation is small because it has the equality selection NName = 'Canada' (evaluated via an index). Pick a small relation (and its conditions) to start the plan.

(Figure, cont.)
(1) Remove the small relation (hypergraph reduction) and color as small any relation that joins with the removed small relation (here: Customer).
(2) Pick a small relation (and its conditions, if any) and join it, via a right-index (RI) join, with the small relation that has been reduced.

After a bunch of steps (figure): the full plan chains index-based RI joins Nation → Customer → Order → LineItem → Product → Supplier, starting from the selection NName = 'Canada' and projecting SName.

Some Query Optimizers
- System R/Starburst: dynamic programming on left-deep join orders; also uses heuristics to push selections and projections down the query tree.
- DB2 and SQL Server are cost-based optimizers; SQL Server is transformation-based and also uses dynamic programming.
- The MySQL optimizer is heuristics-based (rather weak).
- Heuristic optimization used in some versions of Oracle: repeatedly pick the best relation to join next, starting from each of n starting points, then pick the best among these.

Lecture 3
- Query Rewriting
- Equivalence Rules
- Query Optimization
  - Dynamic Programming (System R/DB2)
  - Heuristics (Ingres/Postgres)
- Decorrelation of nested queries
- Result Size Estimation
- Practicum Assignment 2

Optimizing Nested Subqueries
SQL conceptually treats nested subqueries in the WHERE clause as functions that take parameters and return a single value or a set of values. The parameters are variables from the outer-level query that are used in the nested subquery; such variables are called correlation variables. E.g.:

select customername
from borrower
where exists (select *
              from depositor
              where depositor.customername = borrower.customername)

Conceptually, the nested subquery is executed once for each tuple in the cross-product generated by the outer-level FROM clause. Such evaluation is called correlated evaluation. Note: other conditions in the WHERE clause may be used to compute a join (instead of a cross-product) before executing the nested subquery.

Optimizing Nested Subqueries (Cont.)
Correlated evaluation may be quite inefficient, since a large number of calls may be made to the nested query and there may be unnecessary random I/O as a result. SQL optimizers attempt to transform nested subqueries to joins where possible, enabling the use of efficient join techniques. E.g. the earlier nested query can be rewritten as

select customername
from borrower, depositor
where depositor.customername = borrower.customername

Note: the above query doesn't correctly deal with duplicates; it can be modified to do so, as we will see. In general, it is not possible/straightforward to move the entire nested subquery FROM clause into the outer-level query FROM clause; a temporary relation is created instead, and used in the body of the outer-level query.

Optimizing Nested Subqueries (Cont.)
In general, SQL queries of the form below can be rewritten as shown.
Rewrite:

select …
from L1
where P1 and exists (select * from L2 where P2)

To:

create table t1 as
    select distinct V
    from L2
    where P21

select …
from L1, t1
where P1 and P22

P21 contains the predicates in P2 that do not involve any correlation variables; P22 reintroduces the predicates involving correlation variables, with relations renamed appropriately; V contains all attributes used in predicates with correlation variables.

Optimizing Nested Subqueries (Cont.)
In our example, the original nested query would be transformed to:

create table t1 as
    select distinct customername
    from depositor

select customername
from borrower, t1
where t1.customername = borrower.customername

The process of replacing a nested query by a query with a join (possibly with a temporary relation) is called decorrelation. Decorrelation is more complicated when the nested subquery uses aggregation, when the result of the nested subquery is used to test for equality, when the condition linking the nested subquery to the other query is not exists, and so on.
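The transformation can be replayed on a toy instance with sqlite3 to confirm that the correlated EXISTS form and the decorrelated join against a distinct temporary table return the same customers. The table contents below are made up.

```python
import sqlite3

# In-memory database with the borrower/depositor example from the slides.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE borrower (customername TEXT);
    CREATE TABLE depositor (customername TEXT);
    INSERT INTO borrower VALUES ('ann'), ('bob'), ('carl');
    INSERT INTO depositor VALUES ('ann'), ('ann'), ('carl'), ('dave');
""")

# Original form: correlated EXISTS subquery.
correlated = con.execute("""
    SELECT customername FROM borrower
    WHERE EXISTS (SELECT * FROM depositor
                  WHERE depositor.customername = borrower.customername)
""").fetchall()

# Decorrelated form: distinct temporary table plus a plain join.
con.execute("""
    CREATE TEMP TABLE t1 AS SELECT DISTINCT customername FROM depositor
""")
decorrelated = con.execute("""
    SELECT borrower.customername FROM borrower, t1
    WHERE t1.customername = borrower.customername
""").fetchall()

assert sorted(correlated) == sorted(decorrelated) == [('ann',), ('carl',)]
```

Note that DISTINCT in the temporary table is what preserves duplicate semantics here: without it, 'ann' (who has two deposits) would appear twice in the join result.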

Practicum Assignment 2A
- Get the XML metadata description for TPC-H
- xslt script for plotting histograms
- Take our solution for your second query (assignment 1)
- For each operator in the tree give: selectivity; intermediate result size; a short description of how you computed this
- Explanation of how to compute histograms on all result columns
- Sum all intermediate result sizes into the total query cost
DEADLINE: March 31

The Big Picture
1. Parsing and translation
2. Optimization
3. Evaluation

Optimization
Query optimization: amongst all equivalent evaluation plans, choose the one with lowest cost. Cost is estimated using statistical information from the database catalog, e.g. the number of tuples in each relation, the size of tuples, etc. In this lecture we study logical cost estimation:
- introduction to histograms
- estimating the number of tuples in the result with perfect and equi-height histograms
- propagation of histograms into result columns
- how to compute result size from width and #tuples

Cost Estimation
Physical cost estimation predicts I/O blocks, seeks, cache misses, RAM consumption, etc., and depends on the execution algorithm. In this lecture we study logical cost estimation: the plan with the smallest intermediate results tends to be best, so we need estimates for intermediate result sizes.
Histogram-based estimation (practicum, assignment 2):
- estimating the number of tuples in the result with perfect and equi-height histograms
- propagation of histograms into result columns
- compute result size as tuple-width * #tuples

Selectivities
- select: expr := X(col, const) with X in {<, ≤, =, ≥, >} | expr && expr | expr || expr; s_op(R) = |R'| / |R|, where R' is the operator's result.
- join: only 1–n / n–1 foreign-key joins; s_join(R1, R2) = |R'| / (|R1| * |R2|).
- aggr: s(A(R; g1; a)) = |A(R; g1; a)| / |R| = distinct(R.g1) / |R|; s(A(R; g1, g2; a)) = (distinct(R.g1) * distinct(R.g2)) / |R|.
- project, order: s_project(R) = s_order(R) = 1.
- topn: s_topN(R) = min(N, |R|) / |R| = min(N / |R|, 1).
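The selectivity formulas above can be sketched as plain functions; the function names are ours, and |R'| is passed in as the observed or estimated result size.

```python
# Selectivity = fraction of the input (or of the cross product, for joins)
# that survives the operator.
def sel_select(n_out, n_in):
    # s_op(R) = |R'| / |R|
    return n_out / n_in

def sel_join(n_out, n1, n2):
    # s_join(R1, R2) = |R'| / (|R1| * |R2|)
    return n_out / (n1 * n2)

def sel_aggr(distincts, n_in):
    # one output tuple per distinct group-by combination
    prod = 1
    for d in distincts:
        prod *= d
    return prod / n_in

def sel_topn(N, n_in):
    # s_topN(R) = min(N / |R|, 1)
    return min(N / n_in, 1)

# e.g. a top-10 over 200 tuples keeps 5% of them
assert sel_topn(10, 200) == 0.05
```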

Result Size
Result size = #tuples_max * selectivity * #columns (we disregard differences in column width).
- project: #columns = |project list|; #tuples_max = |R|
- aggr: #columns = #groupbys + #aggrs; #tuples_max = min(|R|, g1 * .. * gn)
- join: #columns = #child1 + #child2; #tuples_max = |R1| * |R2|
- other: #columns stays equal to the child's; #tuples_max = |R|

Selectivity estimation
We can estimate the selectivities using: domain constraints; min/max statistics; histograms.

Histograms
Buckets: B = [B1, .., Bn], where each bucket records (max, total, distinct). We leave min out, since Bi.min = Bi−1.max.

Different Kinds of Histograms
Perfect, equi-width, equi-height. In the practicum we use:
- perfect histograms, when distinct(R.a) < 25
- equi-height histograms of 10 buckets, otherwise
These are not perfectly even-height: value ranges are disjoint between buckets (i.e. a frequent value is not split over even-height buckets; it may create a bigger-than-height bucket).
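A sketch of how the two practicum histogram kinds could be built. The function names and the simplified bucket layout (one (value-or-max, total) pair per bucket) are our own choices; note how the frequent value 3 below ends up in a single oversized bucket, matching the note above.

```python
from collections import Counter

def perfect_histogram(values):
    # one bucket per distinct value: (value, total)
    return [(v, c) for v, c in sorted(Counter(values).items())]

def equiheight_histogram(values, nbuckets=10):
    # disjoint value ranges: a value's whole frequency stays in one bucket,
    # so a frequent value can produce a bigger-than-height bucket
    freq = sorted(Counter(values).items())
    target = len(values) / nbuckets
    buckets, total = [], 0
    for v, c in freq:
        total += c
        if total >= target:              # close the bucket at value v
            buckets.append((v, total))   # (bucket max, bucket total)
            total = 0
    if total:                            # leftover values form a last bucket
        buckets.append((freq[-1][0], total))
    return buckets

vals = [1] * 5 + [2] * 5 + [3] * 80 + [4] * 5 + [5] * 5
assert perfect_histogram(vals) == [(1, 5), (2, 5), (3, 80), (4, 5), (5, 5)]
assert equiheight_histogram(vals, nbuckets=10) == [(2, 10), (3, 80), (5, 10)]
```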

Perfect Histograms: Equi-Selection
s(R.a = C) = Bk.total * (1 / |R|), in case there is a k with Bk.max = C
s(R.a = C) = 0, otherwise
(Figure: bucket totals for values a, c, d, f; the selection R.a = d hits the bucket with max = d.)
Perfect Histograms: Range-Selection
s(R.a < C) = (sum of Bk.total over all buckets with Bk.max < C) * (1 / |R|)
Equi-Height Histograms: Equi-Selection
s(R.a = C) = avg_freq(Bk) * (1 / |R|), in case there is a k with B(k−1).max < C ≤ Bk.max, where avg_freq(Bk) = Bk.total / Bk.distinct
s(R.a = C) = 0, otherwise
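The two equi-selection estimates can be sketched side by side. The bucket layouts follow the formulas above ((value, total) per perfect bucket; (max, total, distinct) per equi-height bucket), while the function names and the sample histograms are ours.

```python
def sel_eq_perfect(hist, C, n):
    # s(R.a = C) = Bk.total / |R| if some bucket has max = C, else 0
    for value, total in hist:
        if value == C:
            return total / n
    return 0.0

def sel_eq_equiheight(hist, C, n):
    # s(R.a = C) = avg_freq(Bk) / |R| for the bucket containing C, else 0
    prev_max = float("-inf")
    for bmax, total, distinct in hist:
        if prev_max < C <= bmax:             # C falls in this bucket
            return (total / distinct) / n    # avg_freq = total / distinct
        prev_max = bmax
    return 0.0

perfect = [(1, 5), (2, 15)]
assert sel_eq_perfect(perfect, 2, 20) == 0.75

equiheight = [(10, 40, 8), (20, 60, 12)]     # (max, total, distinct)
assert sel_eq_equiheight(equiheight, 15, 100) == 0.05
```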
Equi-Height Histograms: Range-Selection
s(R.a < C) = (sum of Bk.total over all buckets with Bk.max < C, plus a linearly interpolated fraction of the total of the bucket that contains C) * (1 / |R|)

Select with And and Or
Assume no correlation between attributes:
s(a and c) = s(a) * s(c)
s(a or c) = s(a) + (1 − s(a)) * s(c)
Note: we must normalize a, c into non-overlapping conditions.
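As one-line sketches (assuming, as the slide does, uncorrelated and non-overlapping conditions; function names are ours):

```python
def sel_and(sa, sc):
    # independence: P(a and c) = P(a) * P(c)
    return sa * sc

def sel_or(sa, sc):
    # inclusion without double counting: P(a) + P(not a) * P(c)
    return sa + (1 - sa) * sc

assert sel_and(0.5, 0.5) == 0.25
assert sel_or(0.5, 0.5) == 0.75
```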

Foreign-key Join Selectivity/Hit-rate Estimation
Foreign-key constraint: R1 matches at most once with R2. Each order matches on average with 7 lineitems: hit rate = 7.
But what if R2 (e.g. order) is an intermediate result? R2 may have multiple key occurrences due to a previous join, or it may have fewer key occurrences (missing keys) due to a select (or join).
Simple approach (practicum): hitrate *= |R2'| / |R2|, where R2' is the intermediate result derived from R2.

Aggr(R, [g1..gn], [..])
We can only predict the group-by columns and the size. Expected result size = min(|R|, distinct(g1) * … * distinct(gn)).
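The expected-size formula on a concrete case (the helper name and numbers are ours): grouping 1000 tuples by columns with 12 and 100 distinct values caps at the input size, while 12 and 5 distinct values cap at 60 groups.

```python
from math import prod

def aggr_size(n, distincts):
    # expected result size = min(|R|, distinct(g1) * ... * distinct(gn))
    return min(n, prod(distincts))

assert aggr_size(1000, [12, 100]) == 1000   # capped by |R|
assert aggr_size(1000, [12, 5]) == 60       # capped by the group count
```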

Histogram Propagation
- order: histogram stays identical
- project: histogram stays identical. Expressions (e.g. l_tax * l_price) are not required for the practicum; it is possible to use a Cartesian product on histograms, followed by expression evaluation and re-bucketizing.
- topn: not required for the practicum. Use the last bucket (and backwards) to take the highest N distinct values and their frequencies.
- aggr: not required for the practicum. Group-bys: distinct is the multiplication of distincts, freq = 1. Aggregates: only possible for global aggregates (no group-bys).
- fkjoin: multiply totals by the join hit rate; distinct = min(distinct, total) — this is a simplification!
- select: multiply totals by the selectivity; distinct = min(distinct, total)
- select (on the selection attribute): get totals/distincts from the subset of buckets

Practicum Assignment 2
- Get the XML metadata description for TPC-H (ps/pdf histograms also available)
- Take our solution for your second query (assignment 1)
- For each operator in the tree give: selectivity; intermediate result size; a short description of how you computed this
- Explanation of how to compute histograms on all result columns
- Sum all intermediate result sizes into the total query cost
DEADLINE: March 31
From the above we can easily derive that σ_{c1}(σ_{c2}(R)) = σ_{c2}(σ_{c1}(R)). The meaning of the arrows: rewritings versus axioms. What we write are axioms; the left-to-right direction is the beneficial one 99% of the time, although views complicate matters. This generalizes to the theta join.
Example 6.14. Watch out with rules like that: you cannot apply them blindly; they may lead to an explosion of alternative expressions.
Use the rules for duplicates in the exam.