CS 542 -- Query Execution

Page 1: CS 542 -- Query Execution

CS 542 Database Management Systems

Query Execution

J Singh

March 21, 2011

Page 2: CS 542 -- Query Execution

2© J Singh, 2011 2

This meeting

• Data Models for NoSQL Databases

• Preliminaries

– What are we shooting for?

• Reference Material for Benchmarks posted in blog

– Some slides from TPC-C SIGMOD '97 Presentation

• Query Execution

– Sort: Chapter 15

– Join: Sections 16.1 – 16.4

Page 3: CS 542 -- Query Execution


Data Models for NoSQL Databases

• Class Discussion at Next Meeting.

– How would you represent many-to-many relationships? Also many-to-one and one-to-one.

• Cassandra. Brian Card

• MongoDB. Annies Ductan

• Redis. Jonathan Glumac

• Google App Engine. Sahel Mastoureshgh

• Amazon SimpleDB. Zahid Mian

• CouchDB. Robert Van Reenen

– 3-minute presentation (on 3/21) for 20 bonus points

Page 4: CS 542 -- Query Execution


What are we shooting for?

• Good benchmarks

– Define the playing field

– Set the performance agenda

• Measure release-to-release progress

• Set goals (e.g., 10,000 tpmC, < 50 $/tpmC)

• Something managers can understand (!)

• Benchmark abuse

– Benchmarketing

– Benchmark wars

• more $ on ads than development

– To keep abuses to a minimum, Benchmarks are defined with precision and read like they are legal documents (example).

– Some companies include specific prohibitions against publishing benchmark results in their license agreements

Page 5: CS 542 -- Query Execution


Benchmarks have a Lifetime

• Good benchmarks drive industry and technology forward.

• At some point, all reasonable advances have been made.

• Benchmarks can become counterproductive by encouraging artificial optimizations.

• So, even good benchmarks become obsolete over time.

Page 6: CS 542 -- Query Execution


Database Benchmarks

• Relational Database (OLTP) Benchmarks

– TPC = Transaction Processing Performance Council

• De facto industry standards body for OLTP performance

• Most TPC specs, info, results are on the web page: http://www.tpc.org

• TPC-C has been the workhorse of the industry, more in a minute

• TPC-E is more comprehensive

• Different problem spaces require different benchmarks

– Other benchmarks for analytics / decision support systems

– Two papers referenced on the course website on NoSQL / MapReduce

– Benchmarks define the problem set, not the technology

• E.g., if managing documents, create and use a document management benchmark, not one that was created to show off the capabilities of your DB.

Page 7: CS 542 -- Query Execution


TPC-C's Five Transactions

• Workload Definition

– Transactions operate against a database of nine tables

– Transactions:

• New-order: enter a new order from a customer

• Payment: update customer balance to reflect a payment

• Delivery: deliver orders (done as a batch transaction)

• Order-status: retrieve status of customer's most recent order

• Stock-level: monitor warehouse inventory

– Specifies size of each table

– Specifies # of users and workflow (next slide)

– Specifies configuration requirements

• must be ACID, failure tolerant, distributed, …

• Response time requirement:

– 90% of each type of transaction must have a response time <= 5 seconds, except stock-level which is <= 20 seconds.

Result:

– How many TPC-C transactions can be supported?

– What is the cost in $/tpmC?

Page 8: CS 542 -- Query Execution


TPC-C Workflow

1. Select txn from menu (measure menu Response Time):

– New-Order 45%

– Payment 43%

– Order-Status 4%

– Delivery 4%

– Stock-Level 4%

2. Input screen (keying time), then output screen (measure txn Response Time)

3. Think time, then go back to 1

Cycle Time Decomposition (typical values, in seconds, for weighted average txn)

– Menu = 0.3

– Keying = 9.6

– Txn RT = 2.1

– Think = 11.4

– Average cycle time = 23.4
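The cycle-time decomposition above can be checked with a few lines of arithmetic. The sketch below is only an illustration built from the slide's typical values; it also derives how many New-Order transactions (the ones counted in tpmC) a single emulated terminal could submit per minute at that pace. The variable names are made up for this example.

```python
# Typical per-step times from the slide, in seconds.
cycle = {"menu": 0.3, "keying": 9.6, "txn_rt": 2.1, "think": 11.4}
cycle_time = sum(cycle.values())        # seconds per transaction cycle

NEW_ORDER_SHARE = 0.45                  # New-Order fraction of the transaction mix

# Transactions one terminal can submit per minute, and the part of
# those that are New-Order transactions.
txn_per_min = 60.0 / cycle_time
new_orders_per_min = txn_per_min * NEW_ORDER_SHARE

print(round(cycle_time, 1), round(txn_per_min, 2), round(new_orders_per_min, 2))
```

Under these assumptions, throughput scales with the number of terminals rather than with how fast any one terminal is served, which is why the benchmark also specifies the number of users.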

Page 9: CS 542 -- Query Execution


TPC-C Results by DBMS

[Figure: TPC-C Results (by DBMS, as of 5/9/97). Price/Performance ($/tpmC, 0 to 400) plotted against Throughput (tpmC, 0 to 30,000) for Informix, Microsoft, Oracle, and Sybase.]

• Stating the obvious…

– These results are not a comparison of databases

– They are a comparison of databases for the specific problem specified by the TPC-C benchmark

– Ensuring a level playing field is essential when defining a benchmark and conducting measurements

• Witness the Pavlo/Dean debate

Page 10: CS 542 -- Query Execution


Benchmarks for Other Databases

• Class Discussion at Next Meeting.

– What benchmarks are appropriate for

• Key-value stores?

• Document databases?

• Network databases?

• Geospatial databases?

• Genomic databases?

• Time series databases?

• Other?

– General discussion, no bonus points

• Please let me know if I may call on you, and for which?

Page 11: CS 542 -- Query Execution


Overview of Query Execution

SQL query

→ parse → parse tree

→ convert → logical query plan

→ apply laws → “improved” l.q.p.

→ estimate result sizes → l.q.p. + sizes

→ consider physical plans → {P1,P2,…..}

→ estimate costs → {(P1,C1),(P2,C2)...}

→ pick best → Pi

→ execute → answer

Statistics feed the estimation steps.

Page 12: CS 542 -- Query Execution


An example to work with

• But first we must revisit Relational Algebra…

• Database:

– City, Country, CountryLanguage database.

• Example query: All cities in Finland with a population more than double that of Aruba

SELECT [xyz]

FROM City, Country

WHERE

City.CountryCode = 'fin' AND

Country.Code = 'abw' AND

City.population > 2*Country.population;

Page 13: CS 542 -- Query Execution


Relational Operators

• Selection Basics

– Idempotent

– Commutative

• Selection Conjunctions

– Useful when pruning

• Selection Disjunctions

– Equivalent to UNIONS

Page 14: CS 542 -- Query Execution


Selection and Cross Product

• When a Selection is applied to a Cross Product,

– for σ_A(R × S),

– Break A into three conditions such that A = r ⋀ s ⋀ rs where

• r mentions only attributes of R

• s mentions only attributes of S

• rs mentions attributes of both R and S

– Then, the following holds:

• σ_A(R × S) = σ_{r ⋀ s ⋀ rs}(R × S) = σ_rs(σ_r(R) × σ_s(S))

• In case you forgot…

– R ⋈_A S = σ_A(R × S)

– This result helps us compute Theta-joins!
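The identity above can be checked on a toy instance. The following is a minimal sketch, not part of the original slides: the rows, attribute names (`cpop` in particular), and helper functions are invented for illustration, and relations are modeled as lists of dictionaries.

```python
from itertools import product

# Toy relations; all rows are made up for illustration.
R = [{"name": "Helsinki", "cc": "fin", "pop": 555000},
     {"name": "Espoo", "cc": "fin", "pop": 213000},
     {"name": "Oslo", "cc": "nor", "pop": 508000}]
S = [{"code": "abw", "cpop": 103000},
     {"code": "nor", "cpop": 4478000}]


def r(t):            # condition that mentions only attributes of R
    return t["cc"] == "fin"


def s(u):            # condition that mentions only attributes of S
    return u["code"] == "abw"


def rs(t, u):        # condition that mentions attributes of both
    return t["pop"] > 2 * u["cpop"]


# Selection applied to the full cross product.
late = [(t, u) for t, u in product(R, S) if r(t) and s(u) and rs(t, u)]

# Each side filtered first, then the mixed condition on the smaller product.
early = [(t, u)
         for t, u in product(filter(r, R), filter(s, S))
         if rs(t, u)]

assert late == early
print([t["name"] for t, _ in early])
```

Both forms give the same pairs, but the second builds a product of 2 × 1 rows instead of 3 × 2, which is the point of splitting the condition.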

• Review Chapter 2 of the textbook for more; back to the example…

Page 15: CS 542 -- Query Execution


An example to work with

• Database:

– City, Country, CountryLanguage database.

• Example query: All cities in Finland with a population more than double that of Aruba

SELECT [xyz] FROM City, Country

WHERE

City.CountryCode = 'fin' AND

Country.Code = 'abw' AND

City.population > 2*Country.population;

• Algebra Representation

– π_xyz(σ_{T.cc = 'fin' ⋀ Y.cc = 'abw' ⋀ T.pop > 2*Y.pop}(T × Y)), or

– continued…

Page 16: CS 542 -- Query Execution


Example: Algebra Manipulation

• Algebra Representation

– π_xyz(σ_{T.cc = 'fin' ⋀ Y.cc = 'abw' ⋀ T.pop > 2*Y.pop}(T × Y)), or

– π_xyz(σ_{T.pop > 2*Y.pop}(σ_{T.cc = 'fin'}(T) × σ_{Y.cc = 'abw'}(Y)))

• Graphical Representation of Plan

Page 17: CS 542 -- Query Execution


Visualizing Plan Execution

• The plan is a set of 'operators'

– The operators operate in parallel

• On different machines? On different processors? In different processes? In different threads? Yes, depends on the architecture.

– Each operator feeds its output to the next operator

• The “parallel operators” visualization allows for pipelining

– The output of one operator is the input to the next

– An operator can block if its inputs are not ready

– Design goal is for the operators to pipeline (if possible)

• Would like to start operating with partial data

– Takes advantage of as much parallelism as the problem allows

Page 18: CS 542 -- Query Execution


Common Elements

• Key metrics of each component:

– How much RAM does it consume?

– How much Disk I/O does it require?

• Each component is implemented as an Iterator

– Base class for each operator. Three methods:

• Open(). May block if

– Input is not ready

– Unable to proceed till all data has been received

• GetNext(). Returns the next tuple.

– May block if the next tuple is not ready

– Returns NotFound when exhausted

• Close()

– Performs any cleanup and terminates

Page 19: CS 542 -- Query Execution


Example: Table-scan operator

Open():
    pass
GetNext():
    for b in blocks:
        for t in tuples of b:
            if valid t: return t
    return NotFound
Close():
    pass

• Key Metrics:

– RAM: 1 block

– Disk I/O: Number of blocks

• Notes:

– Represents the operations T(=City) and Y(=Country)

– Used only if appropriate indexes don't exist

– Can use prefetching

• Not shown here
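The table-scan pseudocode is written as a plain loop for brevity; an actual iterator has to remember its position between GetNext() calls. A minimal runnable sketch of that idea follows. The class name, the `NOT_FOUND` sentinel, and the convention that `None` marks an invalid (deleted) slot are assumptions made for this example, not something the slides specify.

```python
NOT_FOUND = object()   # sentinel standing in for the slide's NotFound


class TableScan:
    """Iterator over the valid tuples of a relation stored as blocks."""

    def __init__(self, blocks):
        self.blocks = blocks      # list of blocks; each block is a list of tuples
        self._cursor = None

    def open(self):
        # Nothing to precompute: position a lazy cursor before the first tuple.
        # None entries are treated as invalid slots and skipped.
        self._cursor = (t for b in self.blocks for t in b if t is not None)

    def get_next(self):
        # Resume where the previous call stopped.
        return next(self._cursor, NOT_FOUND)

    def close(self):
        self._cursor = None


scan = TableScan([[("Helsinki", "fin"), None], [("Oranjestad", "abw")]])
scan.open()
rows = []
while (t := scan.get_next()) is not NOT_FOUND:
    rows.append(t)
scan.close()
print(rows)
```

Because the cursor is a generator, only the block currently being read needs to be in memory, matching the "RAM: 1 block" metric.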

Page 20: CS 542 -- Query Execution


Summary so far

• Benchmarks are critical for defining performance goals of the database

– TPC-C is a widely used benchmark

– TPC-E is broader in scope but less widespread

– Need to choose benchmarks to fit the problem at hand

• A query can be parsed into primitives for execution

– Parallelism & pipelining are essential for performance

Page 21: CS 542 -- Query Execution

CS-542 Database Management Systems

Query Execution Algorithms

Page 22: CS 542 -- Query Execution


One-pass Algorithms

• Lend themselves nicely to pipelining (with minimum blocking)

• Good for

– Table-scans (as seen)

– Tuple-at-a-time operations (selection and projection)

– Full-relation binary operations (∪, ∩, −, ⋈, ×) as long as one of the operands can fit in memory

– Considering JOIN next, read others from book

Page 23: CS 542 -- Query Execution


Example: JOIN (R,S)

Open():
    read S into memory
GetNext():
    for b in blocks of R:
        for t in tuples of b:
            if t matches some tuple s of S:
                return join(t,s)
    return NotFound
Close():
    pass

• Key Metrics:

– RAM: Blocks(S) + 1 block

– Disk I/O: Blocks(R) + Blocks(S)

• Notes:

– Can use prefetching for R

• Not shown here
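The one-pass JOIN can be sketched as a Python generator, which keeps its place between results the way an iterator's GetNext() would. This is an illustrative sketch only: the function name, the use of a dictionary to organize S in memory, and the sample rows are assumptions, since the slides do not say how the in-memory copy of S is searched.

```python
from collections import defaultdict


def one_pass_join(r_blocks, s_blocks, key_r, key_s):
    """Yield joined tuples of R and S; S must fit in memory."""
    # Open(): read all of S once and index it by its join key.
    s_by_key = defaultdict(list)
    for block in s_blocks:
        for s in block:
            s_by_key[key_s(s)].append(s)

    # GetNext(): stream R one block at a time, probing the in-memory S.
    for block in r_blocks:
        for t in block:
            for s in s_by_key.get(key_r(t), []):
                yield t + s


city_blocks = [[("Helsinki", "fin"), ("Oranjestad", "abw")], [("Tampere", "fin")]]
country_blocks = [[("fin", "Finland"), ("abw", "Aruba")]]
joined = list(one_pass_join(city_blocks, country_blocks,
                            key_r=lambda t: t[1], key_s=lambda s: s[0]))
print(joined)
```

Each block of R and of S is touched exactly once, which is where the Blocks(R) + Blocks(S) disk I/O figure comes from.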

Page 24: CS 542 -- Query Execution


Nested-Loop Joins

• What if all of S won't fit into memory? We can do it chunk-by-chunk; a 'chunk' is as many blocks of S as will fit

• Algorithm sketch:

(I/O operations marked with "# I/O")

GetNext():
    for c in chunks of S:                # I/O
        for b in blocks of R:            # I/O
            for t in tuples of b:
                for s in tuples of c:
                    if t matches s: return join(t,s)
    return NotFound

• Key Metrics

– RAM: M

– Disk I/O: Blocks(S) + k * Blocks(R), where k = #chunks ≈ Blocks(S) / (M − 1), leaving one buffer for R

• Note how quickly performance deteriorates!

• We can do better
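To see how the disk I/O grows with the number of chunks, here is a small sketch of the chunked nested-loop join that also counts block reads. The function name, the `match` callback, and the tiny one-tuple blocks are hypothetical choices made for this illustration.

```python
def nested_loop_join(r_blocks, s_blocks, chunk_blocks, match):
    """Block nested-loop join; returns (result, number of block reads)."""
    result, reads = [], 0
    for i in range(0, len(s_blocks), chunk_blocks):
        chunk = s_blocks[i:i + chunk_blocks]      # read one chunk of S
        reads += len(chunk)
        s_tuples = [s for block in chunk for s in block]
        for block in r_blocks:                    # rescan all of R for every chunk
            reads += 1
            for t in block:
                result.extend(t + s for s in s_tuples if match(t, s))
    return result, reads


r_blocks = [[(i,)] for i in range(4)]                 # 4 blocks of R
s_blocks = [[(j % 4, f"s{j}")] for j in range(6)]     # 6 blocks of S
result, reads = nested_loop_join(r_blocks, s_blocks, chunk_blocks=2,
                                 match=lambda t, s: t[0] == s[0])
print(len(result), reads)
```

With 6 blocks of S read in 3 chunks and 4 blocks of R rescanned per chunk, the count is 6 + 3 × 4 = 18 reads; halving the chunk size would double the R term.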

Page 25: CS 542 -- Query Execution


Two-pass algorithms

• Sort-based two-pass algorithms

– The first pass does a sort on some parameter(s) of each operand

– The second pass algorithm relies on the sort results and can be pipelined

• Hash-based two-pass algorithms

– Do a prep-pass and write the result back to disk

– Compute the result in the second pass

Page 26: CS 542 -- Query Execution


Two-pass idea: sort example

1. For each of C chunks of M blocks, sort each chunk and write it back

– In the example, we have 4 chunks, each 6 blocks

2. Merge the result

• Input (4 chunks):

– Chunk 1: molignano, lo, lam, ductan, zeljkovic, rubright

– Chunk 2: jacek, stolpestad, mian, angulo, ficarra, glumac

– Chunk 3: card, qi, yee, dean, zheng, mastoureshgh

– Chunk 4: bahanshal, van reenen, nemane, bhandare, ai, tran

• After pass 1 (each chunk sorted):

– Chunk 1: ductan, lam, lo, molignano, rubright, zeljkovic

– Chunk 2: angulo, ficarra, glumac, jacek, mian, stolpestad

– Chunk 3: card, dean, mastoureshgh, qi, yee, zheng

– Chunk 4: ai, bahanshal, bhandare, nemane, tran, van reenen

• After pass 2 (merged):

– ai, angulo, bahanshal, bhandare, card, dean, ductan, ficarra, glumac, jacek, lam, lo, mastoureshgh, mian, molignano, nemane, qi, rubright, stolpestad, tran, van reenen, yee, zeljkovic, zheng

• Key Metrics

– For the first pass:

• RAM: M

• Disk I/O: 2 * Blocks(R)

– For the 2nd pass:

• RAM: C

• Disk I/O: Blocks(R)
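The two passes of the sort example can be reproduced in a few lines. This is a sketch under simplifying assumptions: each name stands for one block, the sorted runs are kept in Python lists rather than written to disk, and the standard-library `heapq.merge` plays the role of the second-pass merge.

```python
import heapq

names = ["molignano", "lo", "lam", "ductan", "zeljkovic", "rubright",
         "jacek", "stolpestad", "mian", "angulo", "ficarra", "glumac",
         "card", "qi", "yee", "dean", "zheng", "mastoureshgh",
         "bahanshal", "van reenen", "nemane", "bhandare", "ai", "tran"]
M = 6   # blocks (here: names) that fit in memory at once

# Pass 1: sort each chunk of M blocks; a real system writes each run back to disk.
runs = [sorted(names[i:i + M]) for i in range(0, len(names), M)]

# Pass 2: merge the C sorted runs, needing only the head of each run at a time.
merged = list(heapq.merge(*runs))

print(len(runs), merged[:3])
```

The merge only ever looks at the front of each run, which is why the second pass needs about C blocks of RAM and can be pipelined into the next operator.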

Page 27: CS 542 -- Query Execution


Naïve two-pass JOIN

1. Sort R and S on the common attributes of the JOIN

2. Merge the sorted R and S on the common attributes

– See section 15.4.9 of book for more details

• Also known as Sort-Join

• Key Metrics

– Sort

• RAM: M

• Disk I/O:

4 * (Blocks(R) + Blocks(S))

• 4, not 3 because we wrote the sort results back

– Join

• RAM: 2

• Disk I/O:

(Blocks(R) + Blocks(S))

– Total Operation

• RAM: M

• Disk I/O:

5 * (Blocks(R) + Blocks(S))

Page 28: CS 542 -- Query Execution


Efficient two-pass JOIN

• Key Metrics

– Sort (only pass 1)

• RAM: M

• Disk I/O:

2 * (Blocks(R) + Blocks(S))

– Join

• RAM: 2

• Disk I/O: (Blocks(R) + Blocks(S)) to read the sorted sublists; none additional

– Total Operation

• RAM: M

• Disk I/O:

3 * (Blocks(R) + Blocks(S))

• Main idea:

– Combine pass 2 of the sort with join
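The cost formulas for the join algorithms seen so far are easy to compare numerically. The sketch below plugs in hypothetical sizes (1000 and 500 blocks, 101 buffers); it assumes the nested-loop join reads S in chunks of M − 1 blocks, keeping one buffer for R, which is one reasonable reading of "as many blocks of S as will fit".

```python
import math


def nested_loop_io(b_r, b_s, m):
    # Read S once in chunks of m - 1 blocks; rescan R for every chunk.
    return b_s + math.ceil(b_s / (m - 1)) * b_r


def naive_sort_join_io(b_r, b_s):
    return 5 * (b_r + b_s)      # 4 for the two full sorts + 1 for the merge-join


def efficient_sort_join_io(b_r, b_s):
    return 3 * (b_r + b_s)      # 2 for the sorted runs + 1 for merge-and-join


B_R, B_S, M = 1000, 500, 101    # hypothetical sizes, in blocks
costs = {
    "nested-loop": nested_loop_io(B_R, B_S, M),
    "naive sort-join": naive_sort_join_io(B_R, B_S),
    "efficient sort-join": efficient_sort_join_io(B_R, B_S),
}
print(costs)
```

For these sizes the nested-loop join is still competitive because S needs only 5 chunks; with less memory its cost grows linearly in the number of chunks while the two-pass costs stay fixed.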

Page 29: CS 542 -- Query Execution


Hash Join

• Main Idea:

– Pass 1: Divide tuples in R and S into m hash buckets

• Read a block of R (or S)

• For each tuple in that block, find its hash i and move it to hash bucket i.

– Keep one block for each hash bucket in memory

– Write it out to disk when full

– Pass 2: For each i

• Read buckets Ri and Si and do their join.

• Key Metrics

– RAM: M

– Disk I/O:

3 * (Blocks(R) + Blocks(S))

– Disk I/O can be less if:

• Hash the bigger relation first

• Expect that many of the buckets will still be in memory
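The two passes of the hash join can be sketched as follows. This is an in-memory illustration only: buckets are Python lists rather than disk files, the function names and the bucket count `m=4` are arbitrary, and the per-bucket join uses a dictionary as a stand-in for whatever one-pass join the system would run.

```python
from collections import defaultdict


def partition(tuples, key, m):
    """Pass 1: split tuples into m buckets by hash of the join key."""
    buckets = [[] for _ in range(m)]
    for t in tuples:
        # A real system keeps one block per bucket and spills it when full.
        buckets[hash(key(t)) % m].append(t)
    return buckets


def hash_join(r, s, key_r, key_s, m=4):
    """Pass 2: join bucket R_i with bucket S_i for each i."""
    out = []
    for r_i, s_i in zip(partition(r, key_r, m), partition(s, key_s, m)):
        table = defaultdict(list)
        for u in s_i:
            table[key_s(u)].append(u)
        for t in r_i:
            out.extend(t + u for u in table.get(key_r(t), []))
    return out


cities = [("Helsinki", "fin"), ("Oranjestad", "abw"), ("Tampere", "fin")]
countries = [("fin", "Finland"), ("abw", "Aruba"), ("nor", "Norway")]
joined = hash_join(cities, countries, key_r=lambda t: t[1], key_s=lambda u: u[0])
print(sorted(joined))
```

Tuples that can join always hash to the same bucket number, so each pair of buckets can be joined independently of the others.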

Page 30: CS 542 -- Query Execution


Index-based Algorithms

• Refresher course on indexes and clustering

Concept: Description

• Clustered Relation: Tuples are packed tightly in every block

• Clustering Index: Index used to pack the clustered relation

– How many Clustering Indexes for a Clustered Relation?

– How many Indexes for a Clustered Relation?

• The basic idea:

– Use the index to locate records and thus cut down on I/O

Page 31: CS 542 -- Query Execution


Index-based Selection

• Consider the selection

– σ_{T.cc = 'fin'}(T)

• If the relation T has a clustering index on cc,

– All tuples with cc = 'fin' will be contiguous

– Disk I/O: Blocks(T)/V(T,cc)

• Where V(T,cc) is the number of distinct values of cc in T

• Sort of… (this assumes tuples are spread evenly across the values of cc)

• If the relation T does not have a clustering index on cc,

– Tuples could be scattered

– Disk I/O: Tuples(T)/V(T,cc)

– Big difference!

Page 32: CS 542 -- Query Execution


Index-based JOIN

• If, say, R has an index on Y,

– Same as a two-pass JOIN except that we don't have to first sort/hash on R

– If clustering index, Disk I/O,

Blocks(R)/V(R,Y) + 3 * Blocks(S)

– Otherwise,

Tuples(R)/V(R,Y) + 3 * Blocks(S)

• If both R and S are indexed,

– Disk I/O is reduced even further

• Consider the JOIN

– R(X,Y) ⋈ S(Y,Z), where Y is the common set of attributes of R and S

Page 33: CS 542 -- Query Execution


Summary

• Execution primitives for pipelining

– One-pass algorithms should be used wherever possible

– Two-pass algorithms can usually be used no matter how big the problem

– Indexes help and should be taken advantage of where possible

Page 34: CS 542 -- Query Execution


Query Optimization

SQL query

→ parse → parse tree

→ convert → logical query plan

→ apply laws → “improved” l.q.p.

→ estimate result sizes → l.q.p. + sizes

→ consider physical plans → {P1,P2,…..}

→ estimate costs → {(P1,C1),(P2,C2)...}

→ pick best → Pi

→ execute → answer

Statistics feed the estimation steps.

Based on slides from Prof. Garcia-Molina

Page 35: CS 542 -- Query Execution


Desired Endpoint

Example Physical Query Plans

• σ_{x=1 AND y=2 AND z<5}(R)

– Filter(x=1 AND z<5), applied to the output of IndexScan(R, y=2)

• R ⋈ S ⋈ U

– two-pass hash-join (101 buffers) of TableScan(R) and TableScan(S), with the result materialized

– two-pass hash-join (101 buffers) of that materialized result and TableScan(U)

Page 36: CS 542 -- Query Execution


Outline

• Convert SQL query to a parse tree

– Semantic checking: attributes, relation names, types

• Convert to a logical query plan (relational algebra expression)

– deal with subqueries

• Improve the logical query plan

– use algebraic transformations

– group together certain operators

– evaluate logical plan based on estimated size of relations

• Convert to a physical query plan

– search the space of physical plans

– choose order of operations

– complete the physical query plan

Page 37: CS 542 -- Query Execution


Improving the Logical Query Plan

• There are numerous algebraic laws concerning relational algebra operations

• By applying them to a logical query plan judiciously, we can get an equivalent query plan that can be executed more efficiently

• Next we'll survey some of these laws

Page 38: CS 542 -- Query Execution


Relational Operators (revisited)

• Selection Basics

– Idempotent

– Commutative

• Selection Conjunctions

– Useful when pruning

• Selection Disjunctions

– Equivalent to UNIONS

Page 39: CS 542 -- Query Execution


Laws Involving Selection

• Selections usually reduce the size of the relation

• Usually good to do selections early,

– i.e., "push them down the tree"

• Also can be helpful to break up a complex selection into parts

Page 40: CS 542 -- Query Execution


Selection and Binary Operators

• Must push selection to both arguments:

– σ_C(R ∪ S) = σ_C(R) ∪ σ_C(S)

• Must push to first arg, optional for 2nd:

– σ_C(R − S) = σ_C(R) − S

– σ_C(R − S) = σ_C(R) − σ_C(S)

• Push to at least one arg with all attributes mentioned in C:

– product, natural join, theta join, intersection

– e.g., σ_C(R × S) = σ_C(R) × S, if R has all the attributes in C
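These laws can be sanity-checked on small examples. The sketch below models relations as Python sets of tuples (so it uses set semantics) with a made-up condition on the first attribute; the relation contents and helper names are illustrative only.

```python
R = {(1, "a"), (2, "b"), (3, "c")}
S = {(2, "b"), (3, "x"), (4, "d")}


def select(cond, rel):
    """Selection: keep the tuples of rel that satisfy cond."""
    return {t for t in rel if cond(t)}


def c(t):
    return t[0] >= 2


# Union: the selection must be pushed to both arguments.
assert select(c, R | S) == select(c, R) | select(c, S)

# Difference: push to the first argument; the second is optional.
assert select(c, R - S) == select(c, R) - S
assert select(c, R - S) == select(c, R) - select(c, S)

# Intersection: pushing to one argument is enough.
assert select(c, R & S) == select(c, R) & S

print(select(c, R - S))
```

A check like this does not prove the laws, but it is a quick way to catch a mis-remembered rule, such as pushing a selection only to the second argument of a difference.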

Page 41: CS 542 -- Query Execution


Pushing Selection Up the Tree

• Suppose we have relations

– StarsIn(title,year,starName)

– Movie(title,year,len,inColor,studioName)

• and a view

– CREATE VIEW MoviesOf1996 AS
  SELECT *
  FROM Movie
  WHERE year = 1996;

• and the query

– SELECT starName, studioName
  FROM MoviesOf1996 NATURAL JOIN StarsIn;

Page 42: CS 542 -- Query Execution


The Straightforward Tree

π_{starName,studioName}(σ_{year=1996}(Movie) ⋈ StarsIn)

Remember the rule

σ_C(R ⋈ S) = σ_C(R) ⋈ S ?

Page 43: CS 542 -- Query Execution


The Improved Logical Query Plan

1. Straightforward plan: π_{starName,studioName}(σ_{year=1996}(Movie) ⋈ StarsIn)

2. Push selection up tree: π_{starName,studioName}(σ_{year=1996}(Movie ⋈ StarsIn))

3. Push selection down tree: π_{starName,studioName}(σ_{year=1996}(Movie) ⋈ σ_{year=1996}(StarsIn))

Page 44: CS 542 -- Query Execution


Laws Involving Projections

• Adding a projection lower in the tree can improve performance, since often tuple size is reduced

– Usually not as helpful as pushing selections down

• Consult textbook for details, will not be on the exam

Page 45: CS 542 -- Query Execution


Joins and Products

• Recall from the definitions of relational algebra:

– R ⋈_C S = σ_C(R × S) (theta join)

– For a natural join, C equates same-name attributes in R and S

• To improve a logical query plan, replace a product followed by a selection with a join

– Join algorithms are usually faster than doing product followed by selection

Page 46: CS 542 -- Query Execution


Summary of LQP Improvements

• Selections:

– push down tree as far as possible

– if condition is an AND, split and push separately

– sometimes need to push up before pushing down

• Projections:

– can be pushed down (sometimes, read book)

• Selection/product combinations:

– can sometimes be replaced with join

Page 47: CS 542 -- Query Execution


Outline

• Convert SQL query to a parse tree

– Semantic checking: attributes, relation names, types

• Convert to a logical query plan (relational algebra expression)

– deal with subqueries

• Improve the logical query plan

– use algebraic transformations

– group together certain operators

– evaluate logical plan based on estimated size of relations

• Convert to a physical query plan

– search the space of physical plans

– choose order of operations

– complete the physical query plan

Page 48: CS 542 -- Query Execution


Grouping Assoc/Comm Operators

• Group together adjacent joins, adjacent unions, and adjacent intersections as siblings in the tree

• Sets up the logical QP for future optimization when physical QP is constructed: determine best order for doing a sequence of joins (or unions or intersections)

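[Figure: a tree of nested binary ∪ operators over A, B, C, D, E, F is regrouped so that the operands become siblings under a single ∪ node.]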

Page 49: CS 542 -- Query Execution


Evaluating Logical Query Plans

• The transformations discussed so far intuitively seem like good ideas

• But how can we evaluate them more scientifically?

• Estimate size of relations, also helpful in evaluating physical query plans

• Coming up next…

Page 50: CS 542 -- Query Execution

CS-542 Database Management Systems

Plan Estimation, based on slides from Prof. Garcia-Molina

Page 51: CS 542 -- Query Execution


Estimating Sizes of Relations

• Used in two places:

– to help decide between competing logical query plans

– to help decide between competing physical query plans

• Notation review:

– T(R): number of tuples in relation R

– B(R): minimum number of blocks needed to store R

• So far, we've spelled it out as Blocks(R)

– V(R,a): number of distinct values in R of attribute a

Page 52: CS 542 -- Query Execution


Requirements for Estimation Rules

1. Give accurate estimates

2. Are easy (fast) to compute

3. Are logically consistent: estimated size should not depend on how the relation is computed

Here we describe some simple heuristics.

All we really need is a scheme that properly ranks competing plans.

Page 53: CS 542 -- Query Execution


Estimating Size of Selection (p1)

• Suppose selection condition is A = c, where A is an attribute and c is a constant.

• A reasonable estimate of the number of tuples in the result is:

– T(R)/V(R,A), i.e., original number of tuples divided by number of different values of A

• Good approximation if values of A are evenly distributed

• Also good approximation in some other, common, situations (see textbook)

Page 54: CS 542 -- Query Execution


Estimating Size of Selection (p2)

• If condition is A < c:

– a good estimate is T(R)/3; intuition is that usually you ask about something that is true of less than half the tuples

• If condition is A ≠ c:

– a good estimate is T(R)

• If condition is the AND of several equalities and inequalities, estimate in series.

Page 55: CS 542 -- Query Execution


Example

• Consider relation R(a,b,c) with 10,000 tuples and 50 different values for attribute a.

• Consider selecting all tuples from R with a = 10 and b < 20.

• Estimate of number of resulting tuples is

– 10,000*(1/50)*(1/3) = 67.
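"Estimate in series" means feeding the output of one estimate into the next. A minimal sketch of the example above follows; the function names are invented for illustration, and the two rules are exactly the T(R)/V(R,A) and T(R)/3 heuristics from the previous slides.

```python
def est_equal(t_r, v_ra):
    """Estimated size of a selection A = c: T(R) / V(R,A)."""
    return t_r / v_ra


def est_less(t_r):
    """Estimated size of a selection A < c: T(R) / 3."""
    return t_r / 3


T_R, V_R_a = 10_000, 50

# Chain the two selections: a = 10 first, then b < 20 on what is left.
after_eq = est_equal(T_R, V_R_a)
estimate = est_less(after_eq)
print(round(estimate))
```

The order of the two steps does not matter here because each rule only multiplies the running estimate by a constant factor.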

Page 56: CS 542 -- Query Execution


Estimating Size of Selection (p3)

If condition has the form C1 OR C2, use:

1. sum of estimate for C1 and estimate for C2,

2. the minimum of T(R) and the previous, in case that sum is > T(R), or

3. assuming C1 and C2 are independent,

T(R)*(1 − (1 − f1)*(1 − f2)),

where f1 is fraction of R satisfying C1 and f2 is fraction of R satisfying C2

Page 57: CS 542 -- Query Execution


Example

• Consider relation R(a,b) – 10,000 tuples and 50 different values for a.

• Consider selecting all tuples from R with a = 10 or b < 20.

• Estimate

– Estimate for a = 10 is 10,000/50 = 200

– Estimate for b < 20 is 10,000/3 = 3333

– Estimate for combined condition is

• 200 + 3333 = 3533 or

• 10,000*(1 − (1 − 1/50)*(1 − 1/3)) = 3467

• Different, but not really
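The two OR estimates can be computed side by side. In this sketch the cap at T(R) is written as a `min`, which is one reading of the "unless that sum is > T(R)" rule; the variable names are made up for the example.

```python
T_R = 10_000
f1 = 1 / 50        # fraction of R satisfying a = 10
f2 = 1 / 3         # fraction of R satisfying b < 20

# Sum of the two estimates, capped at the size of R.
by_sum = min(T_R, T_R * f1 + T_R * f2)

# Independence assumption: a tuple is excluded only if it fails both conditions.
independent = T_R * (1 - (1 - f1) * (1 - f2))

print(round(by_sum), round(independent))
```

The independence formula is slightly smaller because it does not count twice the tuples that satisfy both conditions.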

Page 58: CS 542 -- Query Execution


Estimating Size of Natural Join

• Assume join is on a single attribute Y.

• Some possibilities:

1. R and S have disjoint sets of Y values, so size of join is 0

2. Y is the key of S and a foreign key of R, so size of join is T(R)

3. All the tuples of R and S have the same Y value, so size of join is T(R)*T(S)

• We need some assumptions…

Page 59: CS 542 -- Query Execution


Join Estimation Rule

• Expected number of tuples in result is

– T(R)*T(S) / max(V(R,Y),V(S,Y))

• Why? Suppose V(R,Y) ≤ V(S,Y).

– There are T(R) tuples in R.

– Each of them has a 1/V(S,Y) chance of joining with a given tuple of S, so each creates T(S)/V(S,Y) new tuples, for T(R)*T(S)/V(S,Y) in total

Page 60: CS 542 -- Query Execution


Example

• Suppose we have

– R(a,b) with T(R) = 1000 and V(R,b) = 20

– S(b,c) with T(S) = 2000, V(S,b) = 50, and V(S,c) = 100

– U(c,d) with T(U) = 5000 and V(U,c) = 500

• What is the estimated size of R ⋈ S ⋈ U?

– First join R and S (on attribute b):

• estimated size of result, X, is T(R)*T(S)/max(V(R,b),V(S,b)) = 40,000

• number of values of c in X is the same as in S, namely 100

– Then join X with U (on attribute c):

• estimated size of result is T(X)*T(U)/max(V(X,c),V(U,c)) = 400,000
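The three-way example applies the join estimation rule twice. A short sketch that reproduces the numbers follows; `est_join` is a hypothetical helper name, and integer division is used only because the values here divide evenly.

```python
def est_join(t_r, t_s, v_r, v_s):
    """Estimated size of a join on one attribute: T(R)*T(S) / max(V(R,Y), V(S,Y))."""
    return t_r * t_s // max(v_r, v_s)


# R(a,b) joined with S(b,c) on b.
T_X = est_join(1000, 2000, 20, 50)

# The c values in X come from S, so V(X,c) = V(S,c) = 100.
V_X_c = 100

# X joined with U(c,d) on c.
T_result = est_join(T_X, 5000, V_X_c, 500)
print(T_X, T_result)
```

Carrying V(X,c) forward from S is itself an assumption of the estimation scheme: it presumes the first join does not eliminate any c values.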

Page 61: CS 542 -- Query Execution


Summary of Estimation Rules

• Projection: exactly computable

• Product: exactly computable

• Selection: reasonable heuristics

• Join: reasonable heuristics

• The other operators are harder to estimate…

Page 62: CS 542 -- Query Execution


Estimating Size Parameters

• Estimating the size of a relation depended on knowing T(R) and V(R,a)'s

• Estimating cost of a physical algorithm depends on also knowing B(R).

• How can the query compiler learn them?

– Scan relation to learn T, V's, and then calculate B

– Can also keep a histogram of the values of attributes. Makes estimating join results more accurate

– Recomputed periodically, after some time or some number of updates, or if DB administrator thinks optimizer isn't choosing good plans

Page 63: CS 542 -- Query Execution


Heuristics to Reduce Cost of LQP

• For each transformation of the tree being considered, estimate the "cost" before and after doing the transformation

• At this point, "cost" only refers to sizes of intermediate relations (we don't yet know about number of disk I/O's)

• Sum of sizes of all intermediate relations is the heuristic: if this sum is smaller after the transformation, then incorporate it

Page 64: CS 542 -- Query Execution


Why couldn't we…

• A few questions to explore

– NoSQL has also been described as NoJOIN

• Could we use the techniques discussed here to implement JOINs on a NoSQL database?

– Could we implement the parallel operators as MapReduce jobs?

– Suitable topics in case you have not yet chosen a project

Page 65: CS 542 -- Query Execution


Update on Projects

• Consider including benchmark results in your presentation

• There is no need to submit your code

– Key fragments can be included in your report, as seen in numerous papers

– Do include design of the code in your report

– Do not submit code. It will not be evaluated

• Pace yourself

– Plan to finish up your project coding in 2 weeks (by 4/4)

– Plan to write and perfect your report and PPT after that

– Budget your presentation time carefully.

• How is it going?

Page 66: CS 542 -- Query Execution


Next week

• Query Optimization

• Suggested topic?

– We have half-a-lecture open to cover any topics of interest to everyone