Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

65
Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Transcript of Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Page 1: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Database Techniek

Query Optimization(chapter 14)

Page 2: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Lecture 3

• Query Rewriting– Equivalence Rules

• Query Optimization– Dynamic Programming (System R/DB2)– Heuristics (Ingres/Postgres)

• De-correlation of nested queries• Result Size Estimation

– Practicum Assignment 2

Page 3: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Lecture 3

• Query Rewriting– Equivalence Rules

• Query Optimization– Dynamic Programming (System R/DB2)– Heuristics (Ingres/Postgres)

• De-correlation of nested queries• Result Size Estimation

– Practicum Assignment 2

Page 4: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Transformation of Relational Expressions

• Two relational algebra expressions are said to be equivalent if on every legal database instance the two expressions generate the same set of tuples– Note: order of tuples is irrelevant

• In SQL, inputs and outputs are bags (multi-sets) of tuples– Two expressions in the bag version of the relational algebra

are said to be equivalent if on every legal database instance the two expressions generate the same bag of tuples

• An equivalence rule says that expressions of two forms are equivalent– Can replace expression of first form by second, or vice versa

Page 5: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Equivalence Rules1. Conjunctive selection operations can be

deconstructed into a sequence of individual selections.

2. Selection operations are commutative.

3. Only the last in a sequence of projection operations is needed, the others can be omitted.

4. Selections can be combined with Cartesian products and theta joins.a. (E1 X E2) = E1 E2

b. 1(E1 2 E2) = E1 12 E2

))(())((1221

EE

))(()(2121

EE

)())))((((121

EE ttntt

Page 6: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Algebraic Rewritings for Selection:

cond2

cond1

Rcond1 AND cond2

R

cond1 OR cond2

R

cond2

R

cond1

cond1

cond2

R

Page 7: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Equivalence Rules (Cont.)5.Theta-join operations (and natural joins) are

commutative.E1 E2 = E2 E1

6. Natural join operations are associative: (E1 E2) E3 = E1 (E2 E3)

Page 8: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Equivalence Rules for Joins

commutative

associative

Page 9: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

7. For pushing down selections into a (theta) join we have the following cases:

– (push 1) When all the attributes in 0 involve only the attributes of one of the expressions (E1) being joined.

0E1 E2) = (0(E1)) E2

– (split) When 1 involves only the attributes of E1 and 2 involves only the attributes of E2.

1 E1 E2) = (1(E1)) ( (E2))

– (impossible) When involves both attributes of E1 and E2

(it is a join condition)

Equivalence Rules (Cont.)

Page 10: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Pushing Selection thru Cartesian Product and Join

R S

cond

cond

SR

The right directionrequires that cond refers to S

attributes only

R S

cond

cond

SR

Page 11: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Projection Decomposition

pXYpXpYp

YX

YX

p

pXp pYp

X.tax Y.price

total=X.tax*Y.price

p

{total} {tax} {price}

pXYpXpY

Page 12: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

8. The projections operation distributes over the theta join operation as follows:(a) if involves only attributes from L1 L2:

(b) Consider a join E1 E2.

– Let L1 and L2 be sets of attributes from E1 and E2, respectively.

– Let L3 be attributes of E1 that are involved in join condition , but are not in L1 L2, and

– Let L4 be attributes of E2 that are involved in join condition , but are not in L1 L2.

More Equivalence Rules

))(())(()( 2......12.......1 2121EEEE LLLL

)))(())((().....( 2......121 42312121EEEE LLLLLLLL

Page 13: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Join Ordering Example

• For all relations r1, r2, and r3,

(r1 r2) r3 = r1 (r2 r3 )

• If r2 r3 is quite large and r1 r2 is small, we choose

(r1 r2) r3

so that we compute and store a smaller temporary relation.

Page 14: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Join Ordering Example (Cont.)

• Consider the expressioncustomer-name ((branch-city = “Brooklyn” (branch))

account depositor)• Could compute account depositor first, and join result

with branch-city = “Brooklyn” (branch)

but account depositor is likely to be a large relation.• Since it is more likely that only a small fraction of the

bank’s customers have accounts in branches located in Brooklyn, it is better to compute

branch-city = “Brooklyn” (branch) account

first.

Page 15: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Lecture 3

• Query Rewriting– Equivalence Rules

• Query Optimization– Dynamic Programming (System R/DB2)– Heuristics (Ingres/Postgres)

• De-correlation of nested queries• Result Size Estimation

Page 16: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Lecture 3

• Query Rewriting– Equivalence Rules

• Query Optimization– Dynamic Programming (System R/DB2)– Heuristics (Ingres/Postgres)

• De-correlation of nested queries• Result Size Estimation

Page 17: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

The role of Query Optimization

SQL

physical algebra

logical algebra

parsing, normalization

logical query optimization physical query optimization

query execution

Page 18: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

The role of Query Optimization

SQL

physical algebra

logical algebra

parsing, normalization

logical query optimization physical query optimization

query execution

Compare different relational algebra plan on result size(Practicum 2A)

Page 19: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

The role of Query Optimization

SQL

physical algebra

logical algebra

parsing, normalization

logical query optimization physical query optimization

query execution

Compare different execution algorithms on true cost (IO, CPU, cache)

Page 20: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Enumeration of Equivalent Expressions

• Query optimizers use equivalence rules to systematically generate expressions equivalent to the given expression

• repeated until no more expressions can be found: – for each expression found so far, use all applicable equivalence

rules, and add newly generated expressions to the set of expressions found so far

• The above approach is very expensive in space and time

• Time and space requirements are reduced by not generating all expressions

Page 21: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Finding A Good Join Order

• Consider finding the best join-order for r1 r2 . . . rn.

• There are (2(n – 1))!/(n – 1)! different join orders for above expression. With n = 7, the number is 665280, with n = 10, the number is greater than 176 billion!

• No need to generate all the join orders. Using dynamic programming, the least-cost join order for any subset of {r1, r2, . . . rn} is computed only once and stored for future use.

Page 22: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Dynamic Programming in Optimization

• To find best join tree for a set of n relations:– To find best plan for a set S of n relations,

consider all possible plans of the form: S1 (S – S1) where S1 is any non-empty subset of S.

– Recursively compute costs for joining subsets of S to find the cost of each plan. Choose the cheapest of the 2n – 1 alternatives.

– When plan for any subset is computed, store it and reuse it when it is required again, instead of recomputing it

• Dynamic programming

Page 23: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Join Order Optimization Algorithm

procedure findbestplan(S)if (bestplan[S].cost )

return bestplan[S]// else bestplan[S] has not been computed earlier, compute it nowfor each non-empty subset S1 of S such that S1 S

P1= findbestplan(S1)P2= findbestplan(S - S1)A = best algorithm for joining results of P1 and P2cost = P1.cost + P2.cost + cost of Aif cost < bestplan[S].cost

bestplan[S].cost = costbestplan[S].plan = “execute P1.plan; execute

P2.plan; join results of P1 and P2 using

A”return bestplan[S]

Page 24: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Left Deep Join Trees• In left-deep join trees, the right-hand-side input for

each join is a relation, not the result of an intermediate join.

Page 25: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Cost of Optimization

• With dynamic programming time complexity of optimization with bushy trees is O(3n). – With n = 10, this number is 59000 instead of 176 billion!

• Space complexity is O(2n) • To find best left-deep join tree for a set of n relations:

– Consider n alternatives with one relation as right-hand side input and the other relations as left-hand side input.

– Using (recursively computed and stored) least-cost join order for each alternative on left-hand-side, choose the cheapest of the n alternatives.

• If only left-deep trees are considered, time complexity of finding best join order is O(n 2n)– Space complexity remains at O(2n)

• Cost-based optimization is expensive, but worthwhile for queries on large datasets (typical queries have small n, generally < 10)

Page 26: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Physical Query Optimization

• Minimizes absolute cost– Minimize I/Os – Minimize CPU, cache miss cost (main memory DBMS)

• Must consider the interaction of evaluation techniques when choosing evaluation plans: choosing the cheapest algorithm for each operation independently may not yield best overall algorithm. E.g.– merge-join may be costlier than hash-join, but may provide

a sorted output which reduces the cost for an outer level aggregation.

– nested-loop join may provide opportunity for pipelining

Page 27: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Physical Optimization: Interesting Orders

• Consider the expression (r1 r2 r3) r4 r5

• An interesting sort order is a particular sort order of tuples that could be useful for a later operation.– Generating the result of r1 r2 r3 sorted on the attributes

common with r4 or r5 may be useful, but generating it sorted on the attributes common only r1 and r2 is not useful.

– Using merge-join to compute r1 r2 r3 may be costlier, but may provide an output sorted in an interesting order.

• Not sufficient to find the best join order for each subset of the set of n given relations; must find the best join order for each subset, for each interesting sort order– Simple extension of earlier dynamic programming

algorithms– Usually, number of interesting orders is quite small and

doesn’t affect time/space complexity significantly

Page 28: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Heuristic Optimization

• Cost-based optimization is expensive, even with dynamic programming.

• Systems may use heuristics to reduce the number of choices that must be made in a cost-based fashion.

• Heuristic optimization transforms the query-tree by using a set of rules that typically (but not in all cases) improve execution performance:– Perform selection early (reduces the number of tuples)– Perform projection early (reduces the number of

attributes)– Perform most restrictive selection and join operations

before other similar operations.– Some systems use only heuristics, others combine

heuristics with partial cost-based optimization.

Page 29: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Steps in Typical Heuristic Optimization

1. Deconstruct conjunctive selections into a sequence of single selection operations (Equiv. rule 1.).

2. Move selection operations down the query tree for the earliest possible execution (Equiv. rules 2, 7a, 7b, 11).

3. Execute first those selection and join operations that will produce the smallest relations (Equiv. rule 6).

4. Replace Cartesian product operations that are followed by a selection condition by join operations (Equiv. rule 4a).

5. Deconstruct and move as far down the tree as possible lists of projection attributes, creating new projections where needed (Equiv. rules 3, 8a, 8b, 12).

6. Identify those subtrees whose operations can be pipelined, and execute them using pipelining).

Page 30: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Heuristic Join Order: the Wong-Youssefi algorithm (INGRES)

Sample TPC-H Schema

Nation(NationKey, NName)Customer(CustKey, CName, NationKey)Order(OrderKey, CustKey, Status)Lineitem(OrderKey, PartKey, Quantity)Product(SuppKey, PartKey, PName)Supplier(SuppKey, SName)

SELECT SNameFROM Nation, Customer, Order, LineItem, Product, SupplierWHERE Nation.NationKey = Cuctomer.NationKey

AND Customer.CustKey = Order.CustKeyAND Order.OrderKey=LineItem.OrderKeyAND LineItem.PartKey= Product.PartkeyAND Product.Suppkey = Supplier.SuppKeyAND NName = “Canada”

Find the names of suppliers that sell a

product that appears in a line item of

an order made by a customer who is in Canada

Page 31: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Challenges with Large Natural Join Expressions

For simplicity, assume that in the query1. All joins are natural2. whenever two tables of the FROM clause have common

attributes we join on them1. Consider Right-Index only

Nation Customer Order LineItem Product Supplier

σNName=“Canada”

πSName

One possible order

RI

RI

RI

RI

RI

Index

Page 32: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Wong-Yussefi algorithm assumptions and objectives

• Assumption 1 (weak): Indexes on all join attributes (keys and foreign keys)

• Assumption 2 (strong): At least one selection creates a small relation– A join with a small relation results in a small relation

• Objective: Create sequence of index-based joins such that all intermediate results are small

Page 33: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Hypergraphs

CName

CustKey

NationKey NName

Status OrderKey

Quantity

PartKeySuppKey PName SName

• relation hyperedges• two hyperedges for same relation are possible

• each node is an attribute• can extend for non-natural equality joins by merging nodes

Nation

Customer

OrderLineItem

Product

Supplier

Page 34: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Small Relations/Hypergraph Reduction

CName

CustKey

NationKey NName

Status OrderKey

Quantity

PartKeySuppKey PName SName

Nation

Customer

OrderLineItem

Product

Supplier

NationKey NName

“Nation” is small because it has the equality selection

NName = “Canada”

Nation

σNName=“Canada”Index Pick a small

relation (and its conditions) to start

the plan

Page 35: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

CName

CustKey

NationKey NName

Status OrderKey

Quantity

PartKeySuppKey PName SName

Nation

Customer

OrderLineItem

Product

Supplier

NationKey NName

Nation

σNName=“Canada”Index

RI

(1) Remove small relation (hypergraph reduction) and color

as “small” any relation that joins with the removed “small” relation

Customer

(2) Pick a small relation (and its

conditions if any) and join it with the small relation that has been reduced

Page 36: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

After a bunch of steps…

Nation Customer Order LineItem Product Supplier

σNName=“Canada”

πSName

RI

RI

RI

RI

RI

Index

Page 37: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Some Query Optimizers

• The System R/Starburst: dynamic programming on left-deep join orders. Also uses heuristics to push selections and projections down the query tree.

• DB2, SQLserver are cost-based optimizers– SQLserver is transformation based, also uses dynamic

programming.

• MySQL optimizer is heuristics-based (rather weak)• Heuristic optimization used in some versions of Oracle:

– Repeatedly pick “best” relation to join next • Starting from each of n starting points. Pick best among these.

Page 38: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Lecture 3

• Query Rewriting– Equivalence Rules

• Query Optimization– Dynamic Programming (System R/DB2)– Heuristics (Ingres/Postgres)

• De-correlation of nested queries• Result Size Estimation

– Practicum Assignment 2

Page 39: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Lecture 3

• Query Rewriting– Equivalence Rules

• Query Optimization– Dynamic Programming (System R/DB2)– Heuristics (Ingres/Postgres)

• De-correlation of nested queries• Result Size Estimation

– Practicum Assignment 2

Page 40: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Optimizing Nested Subqueries• SQL conceptually treats nested subqueries in the where clause

as functions that take parameters and return a single value or set of values– Parameters are variables from outer level query that are used in

the nested subquery; such variables are called correlation variables

• E.g.select customer-namefrom borrowerwhere exists (select *

from depositor where depositor.customer-name =

borrower.customer-name)

• Conceptually, nested subquery is executed once for each tuple in the cross-product generated by the outer level from clause– Such evaluation is called correlated evaluation – Note: other conditions in where clause may be used to compute a

join (instead of a cross-product) before executing the nested subquery

Page 41: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Optimizing Nested Subqueries (Cont.)

• Correlated evaluation may be quite inefficient since – a large number of calls may be made to the nested query – there may be unnecessary random I/O as a result

• SQL optimizers attempt to transform nested subqueries to joins where possible, enabling use of efficient join techniques

• E.g.: earlier nested query can be rewritten as select customer-namefrom borrower, depositorwhere depositor.customer-name = borrower.customer-name– Note: above query doesn’t correctly deal with duplicates, can be

modified to do so as we will see

• In general, it is not possible/straightforward to move the entire nested subquery from clause into the outer level query from clause– A temporary relation is created instead, and used in body of

outer level query

Page 42: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Optimizing Nested Subqueries (Cont.)

In general, SQL queries of the form below can be rewritten as shown• Rewrite: select …

from L1

where P1 and exists (select * from L2

where P2)

• To: create table t1 as select distinct V from L2

where P21

select … from L1, t1 where P1 and P2

2

– P21 contains predicates in P2 that do not involve any correlation variables

– P22 reintroduces predicates involving correlation variables, with

relations renamed appropriately– V contains all attributes used in predicates with correlation variables

Page 43: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Optimizing Nested Subqueries (Cont.)

• In our example, the original nested query would be transformed to create table t1 as select distinct customer-name from depositor select customer-name from borrower, t1

where t1.customer-name = borrower.customer-name

• The process of replacing a nested query by a query with a join (possibly with a temporary relation) is called decorrelation.

• Decorrelation is more complicated when– the nested subquery uses aggregation, or– when the result of the nested subquery is used to test for equality,

or – when the condition linking the nested subquery to the other

query is not exists, – and so on.

Page 44: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Practicum Assignment 2A

• Get the XML metadata description for TPC-H– xslt script for plotting histograms

• Take our solution for your second query (assignment 1)

• For each operator in the tree give:– Selectivity– Intermediate Result size– Short description how you computed this– Explanation how to compute histograms on all result columns

• Sum all intermediate result sizes into total query cost

• DEADLINE: march 31

Page 45: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

The Big Picture1.Parsing and translation2.Optimization3.Evaluation

Page 46: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

The Big Picture1.Parsing and translation2.Optimization3.Evaluation

Page 47: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Optimization

• Query Optimization: Amongst all equivalent evaluation plans choose the one with lowest cost. – Cost is estimated using statistical information from the

database catalog• e.g. number of tuples in each relation, size of tuples, etc.

• In this lecture we study logical cost estimation– introduction to histograms– estimating the amount of tuples in the result with

perfect and equi-height histograms– propagation of histograms into result columns– How to compute result size from width and #tuples

Page 48: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Cost Estimation

• Physical cost estimation – predict I/O blocks, seeks, cache misses, RAM consumption, …– Depends in the execution algorithm

• In this lecture we study logical cost estimation– “the plan with smallest intermediate result tends to be best”– need estimations for intermediate result sizes

• Histogram-based estimation (practicum, assignment 2)– estimating the amount of tuples in the result with perfect

and equi-height histograms– propagation of histograms into result columns– compute result size as tuple-width * #tuples

Page 49: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Result Size#tuples_max * selectivity * #columns

We disregard differences in column-width

• project:– #columns = |projectlist|– #tuples_max = |R|

• aggr:– #columns = |groupbys| + |aggrs|– #tuples_max = min(|R|, |g1| * .. * |gn|)

• join:– #columns = |child1| + |child2|– #tuples_max = |R1| * |R2|

• other:– #columns stays equal wrt child– #tuples_max = |R|

Page 50: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Selectivity estimation

We can estimate the selectivities using:• domain constraints• min/max statistics• histograms

Page 51: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Histograms

Buckets:B = <min, max, total, distinct>

Leave min out (Bi.min = Bi-1.max)

Page 52: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Different Kinds of Histograms

• Perfect• Equi-width• Equi-height

In the practicum we use• Perfect histograms, when distinct(R.a) < 25• Equi-height histograms of 10 buckets, otherwise

– Not perfectly even-height: disjunct value ranges between buckets

– (i.e. frequent value is not split over even-height buckets. It may create a bigger-than-height bucket)

Page 53: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Perfect Histograms: Equi-Selection

• s(R.a=C) = Bk.total * (1/|R|)– in case there is a k with Bk.max = C

• s(R.a=C) = 0– otherwise

a c d f

total

s(R.a=d)

Page 54: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Perfect Histograms: Equi-Selection

• s(R.a=C) = Bk.total * (1/|R|)– in case there is a k with Bk.max = C

• s(R.a=C) = 0– otherwise

a c d f

total

s(R.a=d)

Page 55: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Perfect Histograms: Range-Selection

• s(R.a<C) = sum(Bi.total) * (1/|R|), – for all 1 <= i < k with B(k-1).max < C <= Bk.max

a c d f

total

s(R.a<d)

Page 56: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Perfect Histograms: Range-Selection

• s(R.a<C) = sum(Bi.total) * (1/|R|), – for all 1 <= i < k with B(k-1).max < C <= Bk.max

a c d f

total

s(R.a<d)

Page 57: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Equi-Height Histograms: Equi-Selection

• s(R.a=C) = avg_freq(Bk) * (1/|R|)

– in case there is a k with B(k-1).max < C <= Bk.max

– avg_freq(Bk) = Bk.total / Bk.distinct

• s(R.a=C) = 0– otherwise

a d e f

total

s(R.a=c)

Page 58: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Equi-Height Histograms: Equi-Selection

• s(R.a=C) = avg_freq(Bk) * (1/|R|)

– in case there is a k with B(k-1).max < C <= Bk.max

– avg_freq(Bk) = Bk.total / Bk.distinct

• s(R.a=C) = 0– otherwise

a d e f

total

s(R.a=c)

Page 59: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Equi-Height Histograms: Range-Selection

• s(R.a<C) = ( sum(Bi.total) + freq_lt(Bk,C) ) * (1/|R|),

– for all 1 <= i < k with B(k-1).max < C <= Bk.max.

total

s(R.a<c)

a d e f

Page 60: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Equi-Height Histograms: Range-Selection

• s(R.a<C) = ( sum(Bi.total) + freq_lt(Bk,C) ) * (1/|R|),

– for all 1 <= i < k with B(k-1).max < C <= Bk.max.

total

s(R.a<c)

a d e f

Page 61: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Select with And and Or

Assume no correlation between attributes:

• s(θa and θc) = s(θa) * s(θc)

• s(θa or θc) = s(θa) + (1-s(θa)) * s(θc)

Note: must normalize θa , θc into non-overlapping conditions

Page 62: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Foreign-key Join Selectivity/Hitrate Estimation

Foreign-key constraint:R1 matches at most once with R2

“each order matches on average with 7 lineitems” hitrate = 7

But what if R’2 (e.g. order) is an intermediate result?

• R’2 may have multiple key occurrences due to a previous join

• R’2 may have less key occurrences (missing keys) due to a select (or join).

Simple Approach (practicum): Hitrate *= |R’2|/|R2|

Page 63: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Aggr(R,[g1..gn],[..])

Can only predict groupby columns and size:

• Expected result size =min(|R|,distinct(g1) * …. * distinct(gn))

Page 64: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Histogram Propagation• order: histogram stays identical• project: histogram stays identical

» Expression (e.g. l_tax*l_price) not required for the practicum» possible to use cartesian product on histograms, followed by expression

evaluation and re-bucketizing.

• topn: not required for the practicum» Use last bucket (and backwards) to take highest N distinct values and

their frequencies

• aggr: not required for the practicum» Groupbys: distinct is multiplication of distincts, freq=1» Aggregates: only possible for global aggregates (no groupbys)

• fk-join: multiply totals by join hitrate» distinct = min(distinct,total) this is a simplicifation!

• Select: multiply totals by selectivity» distinct = min(distinct,total)

• Select (selection attribute):» Get totals/distincts from subset of buckets

Page 65: Database Techniek – Query Optimization Database Techniek Query Optimization (chapter 14)

Database Techniek – Query Optimization

Practicum Assignment 2

• Get the XML metadata description for TPC-H– ps/pfd histograms also available

• Take our solution for your second query (assignment 1)

• For each operator in the tree give:– Selectivity– Intermediate Result size– Short description how you computed this– Explanation how to compute histograms on all result columns

• Sum all intermediate result sizes into total query cost

• DEADLINE: march 31