CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

35
CSE 444: Lecture 24 Query Execution Monday, March 7, 2005

Transcript of CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Page 1: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

CSE 444: Lecture 24Query Execution

Monday, March 7, 2005

Page 2: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Outline

• External Sorting

• Sort-based algorithms

• An example

Page 3: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

The I/O Model of Computation

• In main memory: CPU time– Big O notation !

• In databases time is dominated by I/O cost– Big O too, but for I/O’s– Often big O becomes a constant

• Consequence: need to redesign certain algorithms

• See sorting next

Page 4: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Sorting

• Problem: sort 1 GB of data with 1MB of RAM.

• Where we need this: – Data requested in sorted order (ORDER BY)– Needed for grouping operations– First step in sort-merge join algorithm– Duplicate removal– Bulk loading of B+-tree indexes.

Page 5: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

2-Way Merge-sort:Requires 3 Buffers in RAM

• Pass 1: Read a page, sort it, write it.

• Pass 2, 3, …, etc.: merge two runs, write them

Main memory buffers

INPUT 1

INPUT 2

OUTPUT

DiskDisk

Runs of length LRuns of length 2L

Page 6: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Two-Way External Merge Sort• Assume block size is B = 4Kb

• Step 1 runs of length L = 4Kb• Step 2 runs of length L = 8Kb• Step 3 runs of length L = 16Kb• . . . . . .• Step 9 runs of length L = 1MB• . . .• Step 19 runs of length L = 1GB (why ?)

Need 19 iterations over the disk data to sort 1GB

Page 7: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Can We Do Better ?

• Hint:

We have 1MB of main memory, but only used 12KB

Page 8: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Cost Model for Our Analysis

• B: Block size ( = 4KB)

• M: Size of main memory ( = 1MB)

• N: Number of records in the file

• R: Size of one record

Page 9: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

External Merge-Sort

• Phase one: load M bytes in memory, sort– Result: runs of length M bytes ( 1MB )

M bytes of main memory

DiskDisk

. .

.. . .

M/R records

Page 10: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Phase Two

• Merge M/B – 1 runs into a new run (250 runs )

• Result: runs of length M (M/B – 1) bytes (250MB)

M bytes of main memory

DiskDisk

. .

.. . .

Input M/B

Input 1

Input 2. . . .

Output

Page 11: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Phase Three

• Merge M/B – 1 runs into a new run

• Result: runs of length M (M/B – 1)2 records (625GB)

M bytes of main memory

DiskDisk

. .

.. . .

Input M/B

Input 1

Input 2. . . .

Output

Page 12: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Cost of External Merge Sort

• Number of passes:

• How much data can we sort with 10MB RAM?– 1 pass 10MB data– 2 passes 25GB data (M/B = 2500)

• Can sort everything in 2 or 3 passes !

⎡ ⎤⎡ ⎤Size/Mlog1 1M/B−+

Page 13: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

External Merge Sort

• The xsort tool in the XML toolkit sorts using this algorithm

• Can sort 1GB of XML data in about 8 minutes

Page 14: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Two-Pass Algorithms Based on Sorting

• Assumption: multi-way merge sort needs only two passes

• Assumption: B(R) <= M2

• Cost for sorting: 3B(R)

Page 15: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Two-Pass Algorithms Based on Sorting

Duplicate elimination (R)

• Trivial idea: sort first, then eliminate duplicates

• Step 1: sort chunks of size M, write– cost 2B(R)

• Step 2: merge M-1 runs, but include each tuple only once– cost B(R)

• Total cost: 3B(R), Assumption: B(R) <= M2

Page 16: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Two-Pass Algorithms Based on Sorting

Grouping: a, sum(b) (R)

• Same as before: sort, then compute the sum(b) for each group of a’s

• Total cost: 3B(R)

• Assumption: B(R) <= M2

Page 17: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Two-Pass Algorithms Based on Sorting

R ∪ Sx = first(R)y = first(S)

While (_______________) do{ case x < y: output(x) x = next(R) case x=y:

case x > y;}

x = first(R)y = first(S)

While (_______________) do{ case x < y: output(x) x = next(R) case x=y:

case x > y;}

Completethe programin class:

Page 18: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Two-Pass Algorithms Based on Sorting

R ∩ Sx = first(R)y = first(S)

While (_______________) do{ case x < y:

case x=y:

case x > y;}

x = first(R)y = first(S)

While (_______________) do{ case x < y:

case x=y:

case x > y;}

Completethe programin class:

Page 19: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Two-Pass Algorithms Based on Sorting

R - S

Completethe programin class:

x = first(R)y = first(S)

While (_______________) do{ case x < y:

case x=y:

case x > y;

}

x = first(R)y = first(S)

While (_______________) do{ case x < y:

case x=y:

case x > y;

}

Page 20: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Two-Pass Algorithms Based on Sorting

Binary operations: R ∪ S, R ∩ S, R – S• Idea: sort R, sort S, then do the right thing• A closer look:

– Step 1: split R into runs of size M, then split S into runs of size M. Cost: 2B(R) + 2B(S)

– Step 2: merge M/2 runs from R; merge M/2 runs from S; ouput a tuple on a case by cases basis

• Total cost: 3B(R)+3B(S)• Assumption: B(R)+B(S)<= M2

Page 21: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Two-Pass Algorithms Based on Sorting

R |x|R.A =S.B S

x = first(R)y = first(S)

While (_______________) do{ case x.A < y.B:

case x.A=y.B:

case x.A > y.B;

}

x = first(R)y = first(S)

While (_______________) do{ case x.A < y.B:

case x.A=y.B:

case x.A > y.B;

}

Completethe programin class:

R(A,C) sorted on AS(B,D) sorted on B

Page 22: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Two-Pass Algorithms Based on Sorting

Join R |x| S• Start by sorting both R and S on the join attribute:

– Cost: 4B(R)+4B(S) (because need to write to disk)

• Read both relations in sorted order, match tuples– Cost: B(R)+B(S)

• Difficulty: many tuples in R may match many in S– If at least one set of tuples fits in M, we are OK– Otherwise need nested loop, higher cost

• Total cost: 5B(R)+5B(S)• Assumption: B(R) <= M2, B(S) <= M2

Page 23: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Two-Pass Algorithms Based on Sorting

Join R |x| S

• If the number of tuples in R matching those in S is small (or vice versa) we can compute the join during the merge phase

• Total cost: 3B(R)+3B(S)

• Assumption: B(R) + B(S) <= M2

Page 24: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Summary of External Join Algorithms

• Block Nested Loop Join:B(S) + B(R)*B(S)/M

• Partitioned Hash Join:3B(R)+3B(S)Assuming min(B(R),B(S)) <= M2

• Merge Join3B(R)+3B(S)Assuming B(R)+B(S) <= M2

• Index JoinB(R) + T(R)B(S)/V(S,a)Assuming…

Page 25: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Example

Product(pname, maker), Company(cname, city)

• How do we execute this query ?

Select Product.pnameFrom Product, CompanyWhere Product.maker=Company.cname and Company.city = “Seattle”

Select Product.pnameFrom Product, CompanyWhere Product.maker=Company.cname and Company.city = “Seattle”

Page 26: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Example

Product(pname, maker), Company(cname, city)

Assume:

Clustered index: Product.pname, Company.cname

Unclustered index: Product.maker, Company.city

Page 27: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

><

city=“Seattle”

Product(pname,maker)

Company(cname,city)

maker=cname

Logical Plan:

Page 28: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

><

city=“Seattle”

Product(pname,maker)

Company(cname,city)

cname=maker

Physical plan 1:

Index-basedselection

Index-basedjoin

Page 29: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

><

city=“Seattle”

Product(pname,maker)

Company(cname,city)

maker=cname

Physical plans 2a and 2b:

Index-scan

Merge-join

Scan and sort (2a)index scan (2b)

Which one is better ??Which one is better ??

Page 30: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

><

city=“Seattle”

Product(pname,maker)

Company(cname,city)

cname=maker

Physical plan 1:

Index-basedselection

Index-basedjoin

T(Company) / V(Company, city)

T(Product) / V(Product, maker)

Total cost:

T(Company) / V(Company, city)

T(Product) / V(Product, maker)

Total cost:

T(Company) / V(Company, city)

T(Product) / V(Product, maker)

Page 31: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

><

city=“Seattle”

Product(pname,maker)

Company(cname,city)

maker=cname

Physical plans 2a and 2b:

Table-scan

Merge-join

Scan and sort (2a)index scan (2b)

B(Company)

3B(Product)

T(Product)

No extra cost(why ?)

Total cost:(2a): 3B(Product) + B(Company) (2b): T(Product) + B(Company)

Total cost:(2a): 3B(Product) + B(Company) (2b): T(Product) + B(Company)

Page 32: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Plan 1: T(Company)/V(Company,city) T(Product)/V(Product,maker)Plan 2a: B(Company) + 3B(Product)Plan 2b: B(Company) + T(Product)

Plan 1: T(Company)/V(Company,city) T(Product)/V(Product,maker)Plan 2a: B(Company) + 3B(Product)Plan 2b: B(Company) + T(Product)

Which one is better ??Which one is better ??

It depends on the data !!It depends on the data !!

Page 33: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Example

• Case 1: V(Company, city) T(Company)

• Case 2: V(Company, city) << T(Company)

T(Company) = 5,000 B(Company) = 500 M = 100T(Product) = 100,000 B(Product) = 1,000

We may assume V(Product, maker) T(Company) (why ?)

T(Company) = 5,000 B(Company) = 500 M = 100T(Product) = 100,000 B(Product) = 1,000

We may assume V(Product, maker) T(Company) (why ?)

V(Company,city) = 2,000V(Company,city) = 2,000

V(Company,city) = 20V(Company,city) = 20

Page 34: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Which Plan is Best ?

Plan 1: T(Company)/V(Company,city) T(Product)/V(Product,maker)Plan 2a: B(Company) + 3B(Product)Plan 2b: B(Company) + T(Product)

Plan 1: T(Company)/V(Company,city) T(Product)/V(Product,maker)Plan 2a: B(Company) + 3B(Product)Plan 2b: B(Company) + T(Product)

Case 1:

Case 2:

Page 35: CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Lessons

• Need to consider several physical plan– even for one, simple logical plan

• No magic “best” plan: depends on the data

• In order to make the right choice– need to have statistics over the data– the B’s, the T’s, the V’s