CS 440 Database Management Systems Query Optimization 1.


Page 1: CS 440 Database Management Systems Query Optimization 1.


CS 440 Database Management Systems

Query Optimization

Page 2: CS 440 Database Management Systems Query Optimization 1.

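DBMS Architecture

[Figure: DBMS architecture diagram. Users, web forms, applications, and the DBA submit queries and transactions. Components shown: Query Parser, Query Rewriter, Query Optimizer, Query Executor, Files & Access Methods, Buffer Manager, Storage Manager, Storage, Transaction Manager, Logging & Recovery, Lock Manager. Main memory holds Buffers and Lock Tables. Labels in the figure mark the parts covered in past lectures and the part covered in today's lecture.]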

Page 3: CS 440 Database Management Systems Query Optimization 1.

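Many query plans to execute a SQL query

• Compute the join of R(A,B) ⋈ S(B,C) ⋈ T(C,D) ⋈ U(D,E)

[Figure: several alternative join trees over R, S, T, U.]

• Even more plans: multiple algorithms to execute each operation

[Figure: a join tree over R, S, T, U whose joins are annotated with sort-merge, sort-merge, and hash join, and whose inputs are annotated with table-scan or index-scan.]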

Page 4: CS 440 Database Management Systems Query Optimization 1.

Query optimization: picking the fastest plan
• Optimal approach
  – enumerate each possible plan
  – measure its performance by running it
  – pick the fastest one
  – What's wrong?
• Rule-based optimization
  – Use a set of pre-defined rules to generate a fast plan
    • e.g., if there is an index over a table, use it for scan and join.

Page 5: CS 440 Database Management Systems Query Optimization 1.

Definitions
• Statistics on table R:
  – T(R): Number of tuples in R
  – B(R): Number of blocks in R
    • B(R) = T(R) / block size (block size counted in tuples per block)
  – V(R,A): Number of distinct values of attribute A in R

Page 6: CS 440 Database Management Systems Query Optimization 1.

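Review: Clustered index
• The relation is stored on the disk according to the order of the index.

[Figure: an INDEX with keys 10, 30, 50, 70, 90, 110 pointing into DATA blocks that hold the tuples in key order: 10, 20, 30, 40, 50, 60, 70, 80, ...]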

Page 7: CS 440 Database Management Systems Query Optimization 1.

Plans to select tuples from R: σ_{A=a}(R)
• We have a clustered index on R
• Plans:
  – (Clustered) index-based scan
  – Table-scan (sequential access)
• Statistics on R
  – B(R) = 5000, T(R) = 200,000
  – V(R,A) = 2, one value appears in 95% of tuples.
• Clustered index scan vs. table-scan?
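
A back-of-the-envelope comparison for these statistics can be sketched as follows. It is only an illustration under assumptions that the slide does not state: tuples are packed evenly into blocks, a clustered index scan reads only the blocks that hold matching tuples, and the I/O for traversing the index itself is ignored. The value names "frequent" and "rare" are placeholders.

```python
# Statistics taken from the slide.
B_R = 5000      # B(R): blocks in R
V_R_A = 2       # V(R,A): distinct values of A

# Share of tuples per value, in percent (the 95% / 5% split from the slide).
SHARE_PERCENT = {"frequent": 95, "rare": 5}


def table_scan_blocks():
    """A sequential scan always reads every block of R."""
    return B_R


def clustered_index_scan_blocks(value):
    """Assumption: matching tuples are stored contiguously, so the scan
    reads only the matching share of the blocks (index I/O ignored)."""
    return B_R * SHARE_PERCENT[value] // 100


def uniform_estimate_blocks():
    """What an estimate assuming uniformly distributed values predicts."""
    return B_R // V_R_A


for value in SHARE_PERCENT:
    print(value, clustered_index_scan_blocks(value),
          "blocks vs table scan", table_scan_blocks())
print("uniform estimate:", uniform_estimate_blocks())
```

Under these assumptions the index scan reads 4750 blocks for the frequent value, almost as many as the 5000 of a table scan, but only 250 for the rare value. A uniform estimate of 2500 blocks matches neither case, which is why a static "always use the index" rule can mislead.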

Page 8: CS 440 Database Management Systems Query Optimization 1.

Query optimization methods
• Rule-based optimizer fails
  – It uses static rules
  – The rules do not consider the distribution of the data.
• Cost-based optimization
  – predict the cost of each plan
  – search the plan space to find the fastest one
  – do it efficiently
    • Optimization itself should be fast!

Page 9: CS 440 Database Management Systems Query Optimization 1.

Cost-based optimization
• Plan space
  – which plans to consider?
  – it is time-consuming to explore all alternatives.
• Cost estimator
  – how to estimate the cost of each plan without executing it?
  – we would like to have accurate estimation
• Search algorithm
  – how to search the plan space fast?
  – we would like to avoid checking inefficient plans

Page 10: CS 440 Database Management Systems Query Optimization 1.

Space of query plans
• Selection
  – algorithms: sequential, index-based
  – ordering: why does it matter?
• Join
  – algorithms: nested loop, sort-merge, hash
  – ordering
• Ordering/Grouping
  – can an "interesting order" be produced by join/selection?
  – algorithms: sorting, hash-based

Page 11: CS 440 Database Management Systems Query Optimization 1.

Reducing plan space
• Multiple logical query plans for each SQL query

  Star(name, birthdate), StarsIn(movie, name, year)

  SELECT movie
  FROM Star, StarsIn
  WHERE Star.name = StarsIn.name AND year = 1950

• Plan 1: π_{movie}( σ_{year=1950}( StarsIn ⋈_{StarsIn.name = Star.name} Star ) )
• Plan 2 (generally faster): π_{movie}( σ_{year=1950}(StarsIn) ⋈_{StarsIn.name = Star.name} Star )
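
The effect of applying the selection on year before the join can be illustrated with a small sketch. The rows below are invented toy data, not from the lecture, and birthdates are reduced to years for brevity. The point is only that both plans return the same movies while the plan that filters StarsIn first feeds fewer tuples into the join.

```python
# Hypothetical toy data: Star(name, birthdate) and StarsIn(movie, name, year).
star = [("Ann", 1920), ("Bob", 1925), ("Cy", 1930)]
stars_in = [
    ("M1", "Ann", 1950), ("M2", "Bob", 1950), ("M3", "Ann", 1951),
    ("M4", "Cy", 1952), ("M5", "Bob", 1953), ("M6", "Cy", 1954),
]


def join(stars_in_rows, star_rows):
    """Nested-loop join on StarsIn.name = Star.name."""
    return [(movie, name, year, birth)
            for (movie, name, year) in stars_in_rows
            for (star_name, birth) in star_rows
            if name == star_name]


# Plan 1: join first, then select year = 1950, then project movie.
joined_all = join(stars_in, star)
plan1 = [movie for (movie, _, year, _) in joined_all if year == 1950]

# Plan 2: select year = 1950 first, then join, then project movie.
filtered = [row for row in stars_in if row[2] == 1950]
joined_filtered = join(filtered, star)
plan2 = [movie for (movie, _, _, _) in joined_filtered]

print(len(joined_all), len(joined_filtered))  # join output sizes per plan
print(sorted(plan1) == sorted(plan2))         # same answer either way
```

With this toy data the unfiltered join produces 6 tuples and the filtered one only 2.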

Page 12: CS 440 Database Management Systems Query Optimization 1.

Reducing plan space
• Push selection down to reduce # of rows
• Push projection down to reduce # of columns

  SELECT movie, name
  FROM Star, StarsIn
  WHERE Star.name = StarsIn.name

[Figure: two plans for the query. In the first, the projection on movie, name is applied only after the join of StarsIn and Star on StarsIn.name = Star.name. In the second, projections are also pushed below the join onto its inputs.]

• Less effective than pushing down selection.

Page 13: CS 440 Database Management Systems Query Optimization 1.

Reducing plan space
• The algorithm requires exponential computation!
• System-R style considers only left-deep joins

[Figure: example join trees over R, S, T, U.]

• Left-deep trees allow us to generate all fully pipelined plans
  – Intermediate results are not written to temporary files.
  – Not all left-deep trees are fully pipelined (e.g., sort-merge join).
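
To get a feel for how much the left-deep restriction shrinks the space, the two counts can be computed directly. The sketch below assumes that the left and right input of a join are distinguished (so R ⋈ S and S ⋈ R count as different plans) and ignores the choice of join algorithm and access method. Under those assumptions there are n! left-deep orders, and the number of join trees of any shape is n! times the (n-1)-th Catalan number. The function names are made up for this illustration.

```python
from math import comb, factorial


def left_deep_orders(n):
    """A left-deep tree is fixed by the order of its n relations."""
    return factorial(n)


def all_join_trees(n):
    """Binary tree shapes with n leaves (a Catalan number) times the
    number of ways to assign the n relations to the leaves."""
    catalan = comb(2 * (n - 1), n - 1) // n
    return catalan * factorial(n)


for n in (2, 4, 6, 8, 10):
    print(n, left_deep_orders(n), all_join_trees(n))
```

For the four relations R, S, T, U this gives 24 left-deep orders against 120 trees overall, and the gap widens quickly as n grows.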

Page 14: CS 440 Database Management Systems Query Optimization 1.

Reducing plan space
• System R-style avoids the plans with Cartesian products
  – The size of a Cartesian product is generally larger than that of a (natural) join.
• Example: R(A,B), S(B,C), U(C,D)
  – (R ⋈ U) ⋈ S has a Cartesian product
  – pick (R ⋈ S) ⋈ U instead
• If Cartesian products cannot be avoided, delay them.

Page 15: CS 440 Database Management Systems Query Optimization 1.

Cost estimation
• Relative accuracy
  – Goal is to compare plans, not to predict exact cost
  – More of an art than an exact science
• Each operator: input size, cost, output size
  – estimate cost based on input size
    • Example: the cost of a sort-merge join R ⋈ S is 3 B(R) + 3 B(S)
  – estimate output size (for next operator) or selectivity
    • selectivity: ratio of output to input
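
As a minimal illustration of the per-operator view (input size in, cost and output size out), the sort-merge formula and the definition of selectivity can be written as two small functions. The input numbers are hypothetical and chosen only for the example.

```python
def sort_merge_join_cost(blocks_r, blocks_s):
    """I/O cost of a sort-merge join as given above: 3 B(R) + 3 B(S)."""
    return 3 * blocks_r + 3 * blocks_s


def selectivity(output_tuples, input_tuples):
    """Selectivity: ratio of output size to input size."""
    return output_tuples / input_tuples


# Hypothetical inputs: B(R) = 1000 and B(S) = 500 blocks.
print(sort_merge_join_cost(1000, 500))
# Hypothetical operator that keeps 200 of 10,000 input tuples.
print(selectivity(200, 10_000))
```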

Page 16: CS 440 Database Management Systems Query Optimization 1.

Cost estimation: Selinger style
• Input: stats on each table
  – T(R): Number of tuples in R
  – B(R): Number of blocks in R
    • B(R) = T(R) / block size
  – V(R,A): Number of distinct values of attribute A in R
• Assumptions on attribute and predicate independence
• When no estimate is available, use magic numbers.
• New alternative approach
  – Histogram of database

Page 17: CS 440 Database Management Systems Query Optimization 1.

Selectivity factors: selection
• Point selection: S = σ_{A=a}(R)
  – T(S) ranges from 0 to T(R) - V(R,A) + 1
  – consider its mean: F = 1 / V(R,A)
• Range selection: S = σ_{A<a}(R)
  – F = (a - min(A)) / (max(A) - min(A))
  – non-arithmetic inequality: use magic number
    • F = 1 / 3
• Range selection: S = σ_{b<A<a}(R)
  – F = (a - b) / (max(A) - min(A))
  – If not arithmetic, use magic number
    • F = 1 / 4
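
These selectivity factors can be sketched as small functions. The function and parameter names are invented for this illustration, and two details are assumptions rather than part of the slide: the fraction for A < a is taken as the share of the value range lying below a, and results are clamped to the interval [0, 1]. When the bounds are unknown or the attribute is not numeric, the functions fall back to the magic numbers 1/3 and 1/4.

```python
def point_selectivity(distinct_values):
    """A = a: mean selectivity 1 / V(R,A)."""
    return 1 / distinct_values


def _clamp(fraction):
    """Keep an estimate inside [0, 1] (an added safeguard)."""
    return max(0.0, min(1.0, fraction))


def less_than_selectivity(a=None, lo=None, hi=None):
    """A < a, with lo = min(A) and hi = max(A).
    Falls back to the magic number 1/3 when arithmetic is not possible."""
    if a is None or lo is None or hi is None or hi == lo:
        return 1 / 3
    return _clamp((a - lo) / (hi - lo))


def between_selectivity(b=None, a=None, lo=None, hi=None):
    """b < A < a. Falls back to the magic number 1/4."""
    if any(x is None for x in (a, b, lo, hi)) or hi == lo:
        return 1 / 4
    return _clamp((a - b) / (hi - lo))


print(point_selectivity(50))
print(less_than_selectivity(25, 0, 100), less_than_selectivity())
print(between_selectivity(20, 70, 0, 100), between_selectivity())
```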

Page 18: CS 440 Database Management Systems Query Optimization 1.

Selectivity factors: selection
• Range selection: column IN (set of values)
  – F: union of point selections

Page 19: CS 440 Database Management Systems Query Optimization 1.

Selectivity factors: selection
• S = σ_{A=1 AND B<10}(R)
  – multiply 1/V(R,A) for equality and 1/3 for inequality
  – T(R) = 10,000, V(R,A) = 50
  – T(S) = 10,000 / (50 * 3) ≈ 66
• S = σ_{A=1 OR B<10}(R)
  – sum of estimates of predicates minus their product
  – T(R) = 10,000, V(R,A) = 50
  – T(S) = 200 + 3333 - 66 = 3467
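
The two estimates can be checked with exact fractions. This is a small sketch of the arithmetic only. The rounding choices (truncating the AND estimate, rounding the OR estimate to the nearest integer) are assumptions made here to match the numbers on the slide.

```python
from fractions import Fraction
from math import floor

T_R = 10_000               # T(R)
f_eq = Fraction(1, 50)     # A = 1: 1 / V(R,A)
f_lt = Fraction(1, 3)      # B < 10: magic number for an inequality

# AND: multiply the factors (predicates assumed independent).
f_and = f_eq * f_lt
# OR: sum of the factors minus their product.
f_or = f_eq + f_lt - f_eq * f_lt

t_and = floor(T_R * f_and)
t_or = round(T_R * f_or)
print(t_and, t_or)   # 66 3467
```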

Page 20: CS 440 Database Management Systems Query Optimization 1.

Selectivity factors: join predicates
• Containment of values assumption: if V(S,A) <= V(R,A), the A values in S are a subset of the A values in R
• Let's assume V(S,A) <= V(R,A)
  – Each tuple t in S joins x tuple(s) in R
  – consider its mean: x = T(R) / V(R,A)
  – T(R ⋈_A S) = T(S) * T(R) / V(R,A)
• In general: T(R ⋈_A S) = T(R) * T(S) / max(V(R,A), V(S,A))
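
The join size estimate translates into a one-line function. The statistics in the example call are hypothetical and serve only to show the formula in use.

```python
def join_size(t_r, v_r_a, t_s, v_s_a):
    """Estimate T(R join S on A) as T(R) * T(S) / max(V(R,A), V(S,A)),
    which relies on the containment-of-values assumption."""
    return t_r * t_s / max(v_r_a, v_s_a)


# Hypothetical statistics: T(R) = 10,000, V(R,A) = 50, T(S) = 2,000, V(S,A) = 20.
estimate = join_size(10_000, 50, 2_000, 20)
print(estimate)
```

Here each of the 2,000 tuples of S is expected to match 10,000 / 50 = 200 tuples of R, giving 400,000 result tuples.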

Page 21: CS 440 Database Management Systems Query Optimization 1.

Search the plan space
• Baseline: exhaustive search
  – enumerate all combinations and compare their costs
  – enormous space!

[Figure: example join trees over R, S, T, U.]

• Search method parameters
  – plan tree development
    • construction: bottom-up, top-down
    • modification: improve a somehow-connected tree
  – algorithms
    • heuristic selections: make choices based on heuristics
    • hill climbing: find "nearby" plans with lowest cost
    • dynamic programming: construction by greedy selection

Page 22: CS 440 Database Management Systems Query Optimization 1.

Plan search: System-R style
• A.k.a. Selinger-style optimization
• Bottom-up
  – start from the base relations (in FROM)
  – work up the tree to form a plan
  – compute the cost of larger plans based on their sub-trees.
• Dynamic programming
  – greedily remove sub-trees that are costly (useless)

Page 23: CS 440 Database Management Systems Query Optimization 1.

Dynamic programming
• Step 1: For each {Ri}:
  – size({Ri}) = TCARD(Ri)
  – plan({Ri}) = Ri
  – cost({Ri}) = cost of access to Ri
    • e.g. TCARD(Ri) if no index on Ri
• Step 2: For each {Ri, Rj}:
  – size({Ri,Rj}) = estimate of the size of join
  – plan({Ri,Rj}) = join algorithm
  – cost = cost function of size of Ri and Rj
    • #I/O access of the chosen join algorithm
  – plan({Ri,Rj}): the join algorithm with smallest cost

Page 24: CS 440 Database Management Systems Query Optimization 1.

Dynamic programming
• Step i: For each S ⊆ {R1, …, Rn} of cardinality i do:
  – Compute size(S)
  – for every S1, S2 s.t. S = S1 ∪ S2:
      c = cost(S1) + cost(S2) + cost(S1 ⋈ S2)
  – cost(S) = the smallest c
  – plan(S) = the plan for cost(S)
• Return plan({R1, …, Rn})

Page 25: CS 440 Database Management Systems Query Optimization 1.

Dynamic programming: example
• Let's assume that the cost of each join is the size of its intermediate results.
  – to simplify the example
  – other cost measures, such as #I/O accesses, are possible.
• cost(R) = 0 (no intermediate results)
• cost(R ⋈ S) = 0 (no intermediate results)
• cost((R ⋈ S) ⋈ T) = cost(R ⋈ S) + cost(T) + size(R ⋈ S) = size(R ⋈ S)

Page 26: CS 440 Database Management Systems Query Optimization 1.

Dynamic programming: example
• Relations: R, S, T, U
• Number of tuples: 2000, 5000, 3000, 1000
• We use a toy size estimation method:
  – size(A ⋈ B) = 0.01 * T(A) * T(B)

Page 27: CS 440 Database Management Systems Query Optimization 1.

Query   Size    Cost    Plan
RS
RT
RU
ST
SU
TU
RST
RSU
RTU
STU
RSTU

Page 28: CS 440 Database Management Systems Query Optimization 1.

Query   Size    Cost    Plan
RS      100k    0       R ⋈ S
RT      60k     0       R ⋈ T
RU      20k     0       U ⋈ R
ST      150k    0       T ⋈ S
SU      50k     0       U ⋈ S
TU      30k     0       U ⋈ T
RST
RSU
RTU
STU
RSTU

Page 29: CS 440 Database Management Systems Query Optimization 1.

Query   Size    Cost    Plan
RS      100k    0       R ⋈ S
RT      60k     0       R ⋈ T
RU      20k     0       U ⋈ R
ST      150k    0       T ⋈ S
SU      50k     0       U ⋈ S
TU      30k     0       U ⋈ T
RST     3M      60k     S ⋈ (R ⋈ T)
RSU     1M      20k     S ⋈ (U ⋈ R)
RTU     0.6M    20k     T ⋈ (U ⋈ R)
STU     1.5M    30k     S ⋈ (U ⋈ T)
RSTU

Page 30: CS 440 Database Management Systems Query Optimization 1.

Query   Size    Cost    Plan
RS      100k    0       R ⋈ S
RT      60k     0       R ⋈ T
RU      20k     0       U ⋈ R
ST      150k    0       T ⋈ S
SU      50k     0       U ⋈ S
TU      30k     0       U ⋈ T
RST     3M      60k     S ⋈ (R ⋈ T)
RSU     1M      20k     S ⋈ (U ⋈ R)
RTU     0.6M    20k     T ⋈ (U ⋈ R)
STU     1.5M    30k     S ⋈ (U ⋈ T)
RSTU    30M     110k    (U ⋈ S) ⋈ (R ⋈ T)
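
The table can be reproduced with a short bottom-up dynamic program. This is a sketch under the example's own simplifications: the toy estimator keeps 1% of the cross product at every join, and the cost of a plan is the total size of the intermediate results it materializes (base relations cost nothing). The function names and the textual plan format are choices made for this illustration, and the sketch considers every split of a subset, including bushy ones.

```python
from itertools import combinations
from math import prod

# Tuple counts from the example.
TUPLES = {"R": 2000, "S": 5000, "T": 3000, "U": 1000}


def size(rels):
    """Toy estimate: every join keeps 1% of the cross product."""
    return prod(TUPLES[r] for r in rels) // 100 ** (len(rels) - 1)


def intermediate(rels):
    """A base relation is not an intermediate result, so it adds no cost."""
    return size(rels) if len(rels) > 1 else 0


def optimize(names):
    """Map each subset of relations to its cheapest (cost, plan)."""
    best = {frozenset([n]): (0, n) for n in names}
    for k in range(2, len(names) + 1):
        for combo in combinations(names, k):
            subset = frozenset(combo)
            candidates = []
            # Try every split of the subset into two non-empty parts.
            for i in range(1, k // 2 + 1):
                for left in combinations(combo, i):
                    s1 = frozenset(left)
                    s2 = subset - s1
                    cost = (best[s1][0] + best[s2][0]
                            + intermediate(s1) + intermediate(s2))
                    plan = f"({best[s1][1]} {best[s2][1]})"
                    candidates.append((cost, plan))
            best[subset] = min(candidates)
    return best


best = optimize(["R", "S", "T", "U"])
for subset, (cost, plan) in best.items():
    if len(subset) > 1:
        print("".join(sorted(subset)), size(subset), cost, plan)
```

The cheapest plan for all four relations costs 110,000 and joins the pair {R, T} with the pair {S, U}, in agreement with the last row of the table. Ties between mirror-image plans are broken arbitrarily by the plan string, so the printed operand order may differ from the slide.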

Page 31: CS 440 Database Management Systems Query Optimization 1.

Plan search: all operations
• Base relations access
  – find all plans for accessing each base relation
  – push down selections and projections
  – choose good plans, discard bad ones
    • keep the cheapest plan for unordered output and for each interesting order
• Join ordering
  – use the bottom-up dynamic programming
  – consider only left-deep join trees: n! orderings for n tables
  – postpone Cartesian products
• Finally: grouping/ordering
  – use interesting orders
  – additional sorting

Page 32: CS 440 Database Management Systems Query Optimization 1.

Nested subqueries
• Subqueries are optimized separately
• Correlation: order of evaluation
  – uncorrelated queries
    • nested subqueries do not reference outer subqueries
    • evaluate the most deeply nested subquery first
  – correlated queries: nested subqueries reference the outer subqueries

  Select name
  From employee X
  Where salary > (Select salary
                  From employee
                  Where employee_num = X.manager)

Page 33: CS 440 Database Management Systems Query Optimization 1.

Nested subqueries – cont.
• correlated queries: nested subqueries reference the outer subqueries

  Select name
  From employee X
  Where salary > (Select salary
                  From employee
                  Where employee_num = X.manager)

• The nested subquery is evaluated once for each tuple in the outer query.
• If there is a small number of distinct values in the outer relation, it is worth sorting the tuples.
  – reduces the number of evaluations of the nested query.
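
The saving from sorting the outer tuples can be sketched as follows. The employee rows are invented for the illustration, and manager number 0 is a placeholder meaning "no manager". The function counts how often the inner query runs, first once per outer tuple and then, on input sorted by manager, only when the manager value changes.

```python
# Hypothetical employee table: (employee_num, name, salary, manager).
EMPLOYEES = [
    (1, "Ada", 200, 0),
    (2, "Ben", 150, 1),
    (3, "Cal", 250, 1),
    (4, "Dee", 100, 2),
    (5, "Eli", 180, 2),
]


def run(outer_rows, reuse_last):
    """Return (names, number of inner-query evaluations)."""
    evaluations = 0
    names = []
    last_manager, last_salary = object(), None
    for _, name, salary, manager in outer_rows:
        if not (reuse_last and manager == last_manager):
            # Inner query: Select salary From employee
            #              Where employee_num = X.manager
            evaluations += 1
            last_salary = next(
                (s for num, _, s, _ in EMPLOYEES if num == manager), None)
            last_manager = manager
        if last_salary is not None and salary > last_salary:
            names.append(name)
    return names, evaluations


# One inner evaluation per outer tuple.
naive = run(EMPLOYEES, reuse_last=False)
# Sort the outer tuples by manager and reuse the previous inner result.
by_manager = sorted(EMPLOYEES, key=lambda row: row[3])
sorted_run = run(by_manager, reuse_last=True)
print(naive, sorted_run)
```

Both runs return the same names, but the sorted run evaluates the inner query 3 times (once per distinct manager value) instead of 5.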

Page 34: CS 440 Database Management Systems Query Optimization 1.

Summary: optimization
• Plan space
  – Huge number of alternatives, semantically equivalent
• Why important
  – Difference between good and bad plans could be orders of magnitude
• Ideal goal
  – map a declarative query to the most efficient plan
• Conventional wisdom: at least avoid bad plans

Page 35: CS 440 Database Management Systems Query Optimization 1.

State of the art
• Academic: always a core database research topic
  – Optimizing for interactive querying
  – Optimizing for novel parallel frameworks
• Industry: most optimizers use System-R style
  – They started with rule-based.
    • Oracle 7 and its prior versions used rule-based
    • Oracle 7 – 10: rule-based and cost-based
    • Oracle 10g (2003): cost-based

Page 36: CS 440 Database Management Systems Query Optimization 1.

What you should know
• The importance of query optimization
  – difference between fast and slow plans
• Query optimization problem
  – find the fast plans efficiently.
• The components of a cost-based (System R style) query optimizer:
  – plan space definition
  – cost estimation
  – search algorithm