Ch. 13 (Silberchatz): Query Processing Overview Measures of Query Cost Selection Operation ...

Ch. 13 (Silberchatz):Ch. 13 (Silberchatz): Query Processing Query Processing

Overview Overview Measures of Query CostMeasures of Query Cost Selection Operation Selection Operation Sorting Sorting Join Operation Join Operation Other OperationsOther Operations Evaluation of ExpressionsEvaluation of Expressions

Basic Steps in Query Basic Steps in Query ProcessingProcessing

1.1. Parsing and translationParsing and translation

2.2. OptimizationOptimization

3.3. EvaluationEvaluation

Basic Steps in Query Basic Steps in Query Processing (Cont.)Processing (Cont.)

Parsing and translationParsing and translation Translates the query into a parse tree representation. This is Translates the query into a parse tree representation. This is

then translated into relational algebra.then translated into relational algebra. Parser checks syntax, verifies relationsParser checks syntax, verifies relations

EvaluationEvaluation The query-execution engine takes a query-evaluation plan, The query-execution engine takes a query-evaluation plan,

executes that plan, and returns the answers to the query.executes that plan, and returns the answers to the query.

Optimization and evaluationOptimization and evaluation A relational algebra expression may have many A relational algebra expression may have many

equivalent expressionsequivalent expressions E.g., E.g., balancebalance25002500((balancebalance((account)) account)) is equivalent to is equivalent to

balancebalance((balancebalance25002500((account))account))

Each relational algebra operation can be evaluated Each relational algebra operation can be evaluated using one of using one of several different algorithmsseveral different algorithms Correspondingly, a relational-algebra expression can be Correspondingly, a relational-algebra expression can be

evaluated in many ways. evaluated in many ways.

Annotated expression specifying detailed evaluation Annotated expression specifying detailed evaluation strategy is called an strategy is called an evaluation-planevaluation-plan.. E.g., can use an index on E.g., can use an index on balancebalance to find accounts with to find accounts with

balance < 2500, or can perform complete relation scan and balance < 2500, or can perform complete relation scan and discard accounts with balance discard accounts with balance 2500 2500

Query OptimizationQuery Optimization Amongst all equivalent evaluation plans choose the one with Amongst all equivalent evaluation plans choose the one with

lowest cost. lowest cost. Cost is estimated using statistical information from the database Cost is estimated using statistical information from the database

catalogcatalog• e.g. number of tuples in each relation, size of tuples, etc.e.g. number of tuples in each relation, size of tuples, etc.

In this chapter we studyIn this chapter we study How to measure query costsHow to measure query costs Algorithms for evaluating relational algebra operationsAlgorithms for evaluating relational algebra operations How to combine algorithms for individual operations in order to How to combine algorithms for individual operations in order to

evaluate a complete expressionevaluate a complete expression

In Chapter 14In Chapter 14 We study how to optimize queries, that is, how to find an evaluation We study how to optimize queries, that is, how to find an evaluation

plan with lowest estimated costplan with lowest estimated cost

Measures of Query CostMeasures of Query Cost

Generally measured as the total elapsed time for Generally measured as the total elapsed time for answering queryanswering query Many factors contribute to time costMany factors contribute to time cost

• disk accesses, CPUdisk accesses, CPU, or even network , or even network communicationcommunication

Typically Typically disk accessdisk access is the predominant cost, and is is the predominant cost, and is also relatively easy to estimate. Precise measures also relatively easy to estimate. Precise measures take into accounttake into account:: Number of seeks * average-seek-costNumber of seeks * average-seek-cost Number of blocks read * average-block-read-costNumber of blocks read * average-block-read-cost Number of blocks written * average-block-write-costNumber of blocks written * average-block-write-cost

• Cost to write a block is greater than cost to read a block Cost to write a block is greater than cost to read a block data is read back after being written to ensure that the write was data is read back after being written to ensure that the write was

successfulsuccessful

Measures of Query Cost (Cont.)Measures of Query Cost (Cont.) For simplicity, we just use For simplicity, we just use number of block transfersnumber of block transfers

from diskfrom disk as the cost measure and ignore: as the cost measure and ignore: the difference in cost between the difference in cost between sequentialsequential and and randomrandom I/O I/O CPU costs CPU costs cost to writing output to diskcost to writing output to disk

Costs depends on the Costs depends on the size of the buffer in main size of the buffer in main memorymemory Having more memory reduces need for disk accessHaving more memory reduces need for disk access Amount of real memory available to buffer depends on other Amount of real memory available to buffer depends on other

concurrent OS processes, and hard to determine ahead of concurrent OS processes, and hard to determine ahead of actual execution actual execution

Often use Often use worst case estimatesworst case estimates, assuming only the minimum , assuming only the minimum amount of memory needed for the operation is availableamount of memory needed for the operation is available

Selection OperationSelection OperationFile scanFile scan: search algorithms that locate and retrieve records : search algorithms that locate and retrieve records

that fulfill a selection condition.that fulfill a selection condition.

A1)A1) Linear Linear search:search: Scan each file block and test all records Scan each file block and test all records to see whether they satisfy the selection condition.to see whether they satisfy the selection condition. Cost estimate (number of disk blocks scanned) = Cost estimate (number of disk blocks scanned) = bbr r

• bbr r is the number of blocks containing records from relation is the number of blocks containing records from relation rr

If selection is on a key attribute, If selection is on a key attribute, average cost =average cost = bbr r /2/2

• stop on finding recordstop on finding record Linear search can be applied regardless of selection condition, Linear search can be applied regardless of selection condition,

ordering of records in the file, or availability of indicesordering of records in the file, or availability of indices

Selection Operation (Cont.)Selection Operation (Cont.)

A2)A2) Binary search: Binary search: Applicable if selection is an equality Applicable if selection is an equality comparison on the attribute on which file is ordered. comparison on the attribute on which file is ordered. Assume that the blocks of a relation are stored contiguously Assume that the blocks of a relation are stored contiguously Cost estimate:Cost estimate:

loglog22((bbrr)): cost of locating the first tuple by a binary search on the : cost of locating the first tuple by a binary search on the

blocksblocks• PlusPlus number of blocks containing records that satisfy selection number of blocks containing records that satisfy selection

condition condition Size of the output relation/average nb tuples per blockSize of the output relation/average nb tuples per block

Selections Using IndicesSelections Using IndicesIndex scanIndex scan:: search algorithms that use an index search algorithms that use an index

selection condition must be on search-key of index.selection condition must be on search-key of index.

A3)A3) Primary index on candidate key, equalityPrimary index on candidate key, equality: Retrieve a single : Retrieve a single record that satisfies the corresponding equality condition record that satisfies the corresponding equality condition

CostCost = = HTHTii + 1+ 1, for a B+ tree of height , for a B+ tree of height HTHTii

A4)A4) Primary index on nonkey, equalityPrimary index on nonkey, equality:: Retrieve multiple Retrieve multiple records. records.

Records will be on consecutive blocks Records will be on consecutive blocks CostCost = = HTHTii + number of blocks containing retrieved records+ number of blocks containing retrieved records

A5)A5) Equality on search-key of secondary index: Equality on search-key of secondary index: Retrieve a single record if the search-key is a candidate keyRetrieve a single record if the search-key is a candidate key

• Cost = HTCost = HTii + 1+ 1 Retrieve multiple records if search-key is not a candidate keyRetrieve multiple records if search-key is not a candidate key

• Cost = Cost = HTHTii + + number of records retrieved number of records retrieved Can be very expensive because each record may be on a different block Can be very expensive because each record may be on a different block one block access for each retrieved recordone block access for each retrieved record

Selections with ComparisonsSelections with Comparisons Selections of the form Selections of the form AAV V ((rr) or ) or A A VV((rr)) by using by using

a linear file scan or binary search,a linear file scan or binary search, or indices in the following ways:or indices in the following ways:

A6A6) primary index, comparison) primary index, comparison (relation is sorted on A) (relation is sorted on A)• For For A A V V(r)(r) use index to find first tuple use index to find first tuple v v and scan relation sequentially and scan relation sequentially

from therefrom there• For For AAVV (r) just scan relation sequentially till first tuple > (r) just scan relation sequentially till first tuple > v; v; do not use indexdo not use index

A7)A7) secondary index, comparisonsecondary index, comparison. . • For For A A V V(r)(r) use index to find first index entry use index to find first index entry v v and scan index sequentially and scan index sequentially

from there, to find pointers to records. from there, to find pointers to records.• For For AAV V (r)(r) just scan leaf pages of index finding pointers to records, till first just scan leaf pages of index finding pointers to records, till first

entry > entry > vv• In either case, retrieve records that are pointed to requires an I/O for each In either case, retrieve records that are pointed to requires an I/O for each

record therefore linear file scan may be cheaper if many records are to be record therefore linear file scan may be cheaper if many records are to be fetched!fetched!

Complex selections: conjunctionComplex selections: conjunction11 22. . . . . . nn((r) r)

A8)A8) Conjunctive selection using one indexConjunctive selection using one index Select a combination of Select a combination of ii and algorithms A2 through A7 that results in and algorithms A2 through A7 that results in

the least cost for the least cost for ii ( (r).r). Test other conditions on tuple after fetching it into memory buffer.Test other conditions on tuple after fetching it into memory buffer.

A9)A9) Conjunctive selection using multiple-key index Conjunctive selection using multiple-key index Use appropriate composite (multiple-key) index if available.Use appropriate composite (multiple-key) index if available.

A10)A10) Conjunctive selection by intersection of identifiersConjunctive selection by intersection of identifiers Requires indices with record pointers. Requires indices with record pointers. Use corresponding index for each condition, and take intersection of Use corresponding index for each condition, and take intersection of

all the obtained sets of record pointers. all the obtained sets of record pointers. Then fetch records from fileThen fetch records from file If some conditions do not have appropriate indices, apply test in If some conditions do not have appropriate indices, apply test in

memory.memory.

Complex Selections: disjunctionComplex Selections: disjunction11 2 2 . . . . . . nn ((r). r).

A11)A11) Disjunctive selection by union of identifiersDisjunctive selection by union of identifiers:: Applicable if Applicable if all all conditions have available indices. conditions have available indices.

• Otherwise use linear scan.Otherwise use linear scan. Use corresponding index for each condition, and take union of Use corresponding index for each condition, and take union of

all the obtained sets of record pointers. all the obtained sets of record pointers. Then fetch records from fileThen fetch records from file

NegationNegation: : ((r)r) Use linear scan on fileUse linear scan on file

If very few records satisfy If very few records satisfy , , and an index is applicable to and an index is applicable to • Find satisfying records using index and fetch from fileFind satisfying records using index and fetch from file

SortingSorting

We may build an index on the relation, and then We may build an index on the relation, and then use the index to read the relation in sorted order.use the index to read the relation in sorted order. May lead to one disk block access for each tuple => May lead to one disk block access for each tuple =>

expensive!expensive!

For relations that fit in memory, techniques like For relations that fit in memory, techniques like quicksort can be used. For relations that don’t fit quicksort can be used. For relations that don’t fit in memory, in memory, external sort-mergeexternal sort-merge is a good is a good choice. choice.

External Sort-Merge (1)External Sort-Merge (1)

1.1. Create sorted runsCreate sorted runs. . Let Let ii be 0 initially. be 0 initially. Repeatedly do the following till the end of the relation:Repeatedly do the following till the end of the relation: (a) Read (a) Read MM blocks of relation into memory blocks of relation into memory (b) Sort the in-memory blocks (b) Sort the in-memory blocks (c) Write sorted data to run (c) Write sorted data to run RRii; increment ; increment i.i.

Let the final value ofLet the final value of i i be be NN

Let M denote memory size (in pages).

External Sort-Merge (2)External Sort-Merge (2)

2.2. Merge the runs (N-way merge).Merge the runs (N-way merge). We assume (for We assume (for now) that now) that NN < < MM. . 1.1. Use Use NN blocks of memory to buffer input runs, and 1 blocks of memory to buffer input runs, and 1

block to buffer output. Read the first block of each run block to buffer output. Read the first block of each run into its buffer pageinto its buffer page

2.2. repeatrepeat1.1. Select the first record (in sort order) among all buffer pagesSelect the first record (in sort order) among all buffer pages2.2. Write the record to the output buffer. If the output buffer is full Write the record to the output buffer. If the output buffer is full

write it to disk.write it to disk.3.3. Delete the record from its input buffer page.Delete the record from its input buffer page.

IfIf the buffer page becomes empty the buffer page becomes empty thenthen read the next block (if any) of the run into the buffer. read the next block (if any) of the run into the buffer.

untiluntil all input buffer pages are empty: all input buffer pages are empty:

External Sort-Merge (N>M)External Sort-Merge (N>M)

If If ii MM, several merge , several merge passespasses are required. are required. In each pass, contiguous groups of In each pass, contiguous groups of M M - 1 runs are - 1 runs are

merged. merged. A pass reduces the number of runs by a factor of A pass reduces the number of runs by a factor of MM -1, -1,

and creates runs longer by the same factor. and creates runs longer by the same factor. • E.g. If M=11, and there are 90 runs, one pass reduces the E.g. If M=11, and there are 90 runs, one pass reduces the

number of runs to 9, each 10 times the size of the initial runsnumber of runs to 9, each 10 times the size of the initial runs Repeated passes are performed till all runs have been Repeated passes are performed till all runs have been

merged into one.merged into one.

ExampleExample

Cost AnalysisCost Analysis

Total number of merge passes required: Total number of merge passes required:

loglogMM–1–1((bbrr/M)/M).. Disk accesses for initial run creation as well as Disk accesses for initial run creation as well as

in each pass is in each pass is 22bbrr

for final pass, we don’t count write cost for final pass, we don’t count write cost • we ignore final write cost for all operations since the output of we ignore final write cost for all operations since the output of

an operation may be sent to the parent operation without an operation may be sent to the parent operation without being written to diskbeing written to disk

Thus total number of disk accesses for external sorting:Thus total number of disk accesses for external sorting:

bbr r ( 2 ( 2 loglogMM–1–1((bbr r / M)/ M) + 1) + 1)

Join OperationJoin Operation Several different algorithms to implement joinsSeveral different algorithms to implement joins

Nested-loop joinNested-loop join Block nested-loop joinBlock nested-loop join Indexed nested-loop joinIndexed nested-loop join Merge-joinMerge-join Hash-joinHash-join

Choice based on Choice based on cost estimatecost estimate Examples use the following informationExamples use the following information

Number of records of Number of records of customercustomer: 10,000 : 10,000 depositordepositor: 5000: 5000 Number of blocks of Number of blocks of customercustomer: 400 : 400 depositordepositor: 100: 100

Nested-Loop (NL) JoinNested-Loop (NL) Join To compute the theta join To compute the theta join rr ss

for eachfor each tuple tuple ttrr in in rr do begin do begin

for each tuple for each tuple ttss in in ss do begin do begin

test pair (test pair (ttrr,t,tss) to) to see if they satisfy the join condition see if they satisfy the join condition

if they do, add if they do, add ttrr.t.tss to the result.to the result.

endendendend

rr is called the is called the outerouter relationrelation and and ss the the inner relationinner relation of of the join.the join.

Requires no indices and can be used with any kind of Requires no indices and can be used with any kind of join condition.join condition.

ExpensiveExpensive since it examines every pair of tuples in the since it examines every pair of tuples in the two relations. two relations.

NL Join (Cont.)NL Join (Cont.) In the In the worst caseworst case, the estimated cost is , the estimated cost is

nnrr bbss + + b brr disk accesses if there is enough memory only to hold one disk accesses if there is enough memory only to hold one block of each relation,block of each relation,

If the If the smaller relation fits entirely in memorysmaller relation fits entirely in memory, use that as the , use that as the inner relation. Reduces cost to inner relation. Reduces cost to bbrr + + bbss disk accesses.disk accesses.

ExEx: : Assuming worst case memory availability cost estimate is: Assuming worst case memory availability cost estimate is: 5000 5000 400 + 100 = 400 + 100 = 2,000,1002,000,100 disk accesses with disk accesses with depositor depositor as outer as outer

relation, and relation, and 10000 10000 100 + 400 = 100 + 400 = 1,000,4001,000,400 disk accesses with disk accesses with customer customer as the as the

outer relation.outer relation. If smaller relation (If smaller relation (depositor)depositor) fits entirely in memory, the cost estimate fits entirely in memory, the cost estimate

will be will be 500500 disk accesses. disk accesses.

Block Nested-Loop (BNL) JoinBlock Nested-Loop (BNL) Join Variant of nested-loop join in which every block of inner Variant of nested-loop join in which every block of inner

relation is paired with every block of outer relation.relation is paired with every block of outer relation.

for each for each block block BBrr ofof rr do begin do begin

for eachfor each block block BBss of of s s do begindo begin

for eachfor each tuple tuple ttrr in in BBr r do begindo begin

for each for each tuple tuple ttss in in BBss do begindo begin

Check if (Check if (ttrr,t,tss) ) satisfy the join condition satisfy the join condition

if they do, add if they do, add ttrr • • ttss to the result.to the result.

endendendend

endendendend

BNL Join (Cont.)BNL Join (Cont.)

Worst case estimateWorst case estimate: : bbrr b bss + b + brr block accesses. block accesses. Each block in the inner relation Each block in the inner relation ss is read once for each is read once for each blockblock in in

the outer relation (instead of once for each tuple in the outer the outer relation (instead of once for each tuple in the outer relation)relation)

Best caseBest case: : bbrr ++ b bss block accesses.block accesses. Improvements to NL and BNL algorithms:Improvements to NL and BNL algorithms:

In BNL, use In BNL, use M — M — 2 disk blocks as blocking unit for outer 2 disk blocks as blocking unit for outer relations, where relations, where MM = memory size in blocks; use remaining two = memory size in blocks; use remaining two blocks to buffer inner relation and outputblocks to buffer inner relation and output

• Cost = Cost = bbr r / (M-2)/ (M-2) bbss + b + brr If equi-join attribute forms a key or inner relation, stop inner loop If equi-join attribute forms a key or inner relation, stop inner loop

on first matchon first match Scan inner loop forward and backward alternately, to make use Scan inner loop forward and backward alternately, to make use

of the blocks remaining in buffer (with LRU replacement)of the blocks remaining in buffer (with LRU replacement)

Indexed Nested-Loop (INL) JoinIndexed Nested-Loop (INL) Join Index lookups can replace file scans ifIndex lookups can replace file scans if

join is an join is an equi-joinequi-join or or natural joinnatural join and and an an indexindex is available on the is available on the inner relation’s join attributeinner relation’s join attribute

• Can construct an index just to compute a join.Can construct an index just to compute a join.

For each tuple For each tuple ttrr in the outer relation in the outer relation r,r, use the index to look use the index to look up tuples in up tuples in ss that satisfy the join condition with tuple that satisfy the join condition with tuple ttrr..

Worst caseWorst case: buffer has space for only 1 page of : buffer has space for only 1 page of rr, for each , for each tuple in tuple in rr, perform an index lookup on , perform an index lookup on s.s.

Cost of the join: Cost of the join: bbrr + + nnrr cc cc is the cost of traversing index and fetching all matching is the cost of traversing index and fetching all matching ss tuples for one tuple tuples for one tuple

or or rr cc can be estimated as cost of a single selection on can be estimated as cost of a single selection on ss using the join condition. using the join condition. If indices are available on join attributes of both If indices are available on join attributes of both r r and and s, s, use the relation with use the relation with

fewer tuples as the outer relation.fewer tuples as the outer relation.

Example of INL Join CostsExample of INL Join Costsdepositor customerdepositor customer, , with with depositordepositor as the outer relation. as the outer relation.

Let Let customercustomer have a primary B have a primary B++-tree index on the join attribute -tree index on the join attribute customer-name, customer-name, which contains 20 entries in each index node.which contains 20 entries in each index node.

SinceSince customer customer has 10,000 tuples, the height of the tree is 4, and has 10,000 tuples, the height of the tree is 4, and one more access is needed to find the actual dataone more access is needed to find the actual data

depositordepositor has 5000 tuples has 5000 tuples Cost of block nested loops joinCost of block nested loops join

• 400*100 + 100 = 40,100 disk accesses assuming worst case memory (may 400*100 + 100 = 40,100 disk accesses assuming worst case memory (may be significantly less with more memory)be significantly less with more memory)

Cost of indexed nested loops joinCost of indexed nested loops join• 100 + 5000 * 5 = 25,100 disk accesses.100 + 5000 * 5 = 25,100 disk accesses.

• CPU cost likely to be less than that for block nested loops join CPU cost likely to be less than that for block nested loops join

Merge-JoinMerge-Join1.1. Sort both relations on their join attribute (if not already sorted Sort both relations on their join attribute (if not already sorted

on the join attributes).on the join attributes).

2.2. Merge the sorted relations to join themMerge the sorted relations to join them1.1. Join step is similar to the merge stage of the sort-merge algorithm. Join step is similar to the merge stage of the sort-merge algorithm.

2.2. Main difference is handling of duplicate values in join attribute: Main difference is handling of duplicate values in join attribute: every pair with same value on join attribute must be matchedevery pair with same value on join attribute must be matched

Merge-Join (Cont.)Merge-Join (Cont.)

Can be used only for Can be used only for equi-joinsequi-joins and and natural joinsnatural joins Each block needs to be read only once, assuming all Each block needs to be read only once, assuming all

tuples for any given value of the join attributes fit in tuples for any given value of the join attributes fit in memorymemory

Thus number of block accesses for merge-join is Thus number of block accesses for merge-join is bbrr + b + bss + the cost of sorting if relations are + the cost of sorting if relations are

unsorted.unsorted. Refinement of sort-mergeRefinement of sort-merge

Page 462 Raghu 3rd editionPage 462 Raghu 3rd edition

Hybrid Merge-JoinHybrid Merge-Join

If If one relation is sortedone relation is sorted, and the other has a , and the other has a secondary secondary BB++-tree index-tree index on the join attribute on the join attribute Merge the sorted relation with the leaf entries of the BMerge the sorted relation with the leaf entries of the B++-tree . -tree . Sort the result on the addresses of the unsorted relation’s Sort the result on the addresses of the unsorted relation’s

tuplestuples Scan the unsorted relation in physical address order and Scan the unsorted relation in physical address order and

merge with previous result, to replace addresses by the merge with previous result, to replace addresses by the actual tuplesactual tuples• Sequential scan more efficient than random lookupSequential scan more efficient than random lookup

Hash-JoinHash-Join Applicable for equi-joins and natural joins.Applicable for equi-joins and natural joins. A hash functionA hash function hh is used to partition tuples of both relations is used to partition tuples of both relations hh maps maps JoinAttrsJoinAttrs values to values to {0, 1, ..., {0, 1, ..., nn},}, where where JoinAttrs JoinAttrs

denotes the common attributes of denotes the common attributes of rr and and s s used in the natural used in the natural join. join. rr00, r, r11, . . ., r, . . ., rnn denote partitions of denote partitions of r r tuplestuples

• Each tuple Each tuple ttrr r r is put in partition is put in partition rrii where where i = h(ti = h(trr [JoinAttrs]).[JoinAttrs]).

ss00, s, s11. . ., s. . ., snn denotes partitions of denotes partitions of ss tuples tuples

• Each tuple Each tuple ttss ss is put in partition is put in partition ssii, where , where i = h(ti = h(tss [JoinAttrs]).[JoinAttrs]).

NoteNote: : In book, In book, rri i is denoted as is denoted as HHri, ri, ssi i is denoted as is denoted as HHssi i andand n n is is

denoted as denoted as nnh. h.

Hash-Join (Cont.)Hash-Join (Cont.)

r r tuples in tuples in rrii need only to be compared with need only to be compared with s s tuples in tuples in ssi.i. Need not Need not

be compared with be compared with ss tuples in any other partition, tuples in any other partition, since:since: an an rr tuple and an tuple and an s s tuple that satisfy the join condition will have the same tuple that satisfy the join condition will have the same

value for the join attributes.value for the join attributes. If that value is hashed to some value If that value is hashed to some value ii, the , the rr tuple has to be in tuple has to be in rrii and the and the s s

tuple in tuple in ssii..

Hash-Join AlgorithmHash-Join Algorithm

1.1. Partition the relation Partition the relation ss using hashing function using hashing function hh. When . When partitioning a relation, one block of memory is reserved partitioning a relation, one block of memory is reserved as the output buffer for each partition.as the output buffer for each partition.

2.2. Partition Partition rr similarly. similarly.

3.3. For each For each i:i:(a)(a) LoadLoad ssii into memory and build an in-memory hash index into memory and build an in-memory hash index

on it using the join attribute. This hash index uses a different on it using the join attribute. This hash index uses a different hash function than the earlier one hash function than the earlier one hh..

(b)(b) Read the tuples in Read the tuples in rrii from the disk one by one. For each from the disk one by one. For each

tuple tuple ttrr locate each matching tuple locate each matching tuple ttss in in ssii using the in-memory using the in-memory

hash index. Output the concatenation of their attributes.hash index. Output the concatenation of their attributes.

The hash-join of r and s is computed as follows.

Relation s is called the build input and r is called the probe input.

HJ (Raghu)HJ (Raghu)

Partition both relations using hash fn h: R tuples in partition i will only match S tuples in partition i.

Read in a partition of S, hash it using h2 (<> h!). Scan matching partition of R, search for matches.

Partitionsof R & S

Input bufferfor Ri

Hash table for partitionSi (k < M-1 pages)

M main memory buffersDisk

Output buffer

Disk

Join Result

hashfnh2

h2

M main memory buffers DiskDisk

Original Relation OUTPUT

2INPUT

1

hashfunction

h M-1

Partitions

1

2

M-1

. . .

Hash-Join algorithm (Cont.)Hash-Join algorithm (Cont.)

The value The value nn and the hash function and the hash function hh are chosen such that are chosen such that each each ssii should fit in memory. should fit in memory.

#partitions n <#partitions n <== M-1 M-1, in partitioning phase, in partitioning phase Assuming uniformly sized partitions, each partition of S Assuming uniformly sized partitions, each partition of S

has size has size bbss/(M-1)/(M-1)

In probing phase:In probing phase: the number of pages required for an hash table for each partition the number of pages required for an hash table for each partition

of S is: of S is: bbss/(M-1)/(M-1) * f * f where f is a “ where f is a “fudge factorfudge factor”, typically around ”, typically around

1.21.2 Two more pages are required, one for scanning S and another for Two more pages are required, one for scanning S and another for

outputoutput Therefore,Therefore, M > M > bbss/(M-1)/(M-1) * f * f + 2+ 2, i.e., , i.e., M must be >M must be > sbfx

Cost of Hash-JoinCost of Hash-Join

Cost of hash join is: Cost of hash join is: 3(3(bbrr ++ b bss))

If the entire build input can be kept in main memory, If the entire build input can be kept in main memory, nn can be set to 0 and the algorithm does not partition the can be set to 0 and the algorithm does not partition the relations into temporary files. Cost estimate goes relations into temporary files. Cost estimate goes down to down to bbrr + + b bss..

Example of Cost of Hash-JoinExample of Cost of Hash-Join

Assume that memory size is 20 blocksAssume that memory size is 20 blocks

bbdepositordepositor= 100 and = 100 and bbcustomercustomer = 400.= 400.

depositor depositor is to be used as build input. Partition it is to be used as build input. Partition it into five partitions, each of size 20 blocks. This into five partitions, each of size 20 blocks. This partitioning can be done in one pass.partitioning can be done in one pass.

Similarly, partition Similarly, partition customercustomer into five into five partitions,each of size 80. This is also done in partitions,each of size 80. This is also done in one pass.one pass.

Therefore total cost: 3(100 + 400) = 1500 block Therefore total cost: 3(100 + 400) = 1500 block transfers transfers ignores cost of writing partially filled blocksignores cost of writing partially filled blocks

customer depositor

Recursive partitioningRecursive partitioning

Required if number of partitions Required if number of partitions n n is greater than is greater than number of pages number of pages MM of memory of memory.. Partitioning has to be done in repeated passesPartitioning has to be done in repeated passes Instead of partitioning Instead of partitioning nn ways, use ways, use M – M – 1 partitions for s1 partitions for s Further partition the Further partition the M – M – 1 partitions using a different hash 1 partitions using a different hash

functionfunction Use same partitioning method on Use same partitioning method on rr Rarely required: e.g., recursive partitioning not needed for Rarely required: e.g., recursive partitioning not needed for

relations of 1GB or less with memory size of 2MB, with relations of 1GB or less with memory size of 2MB, with block size of 4KB.block size of 4KB.

Cost of Hash-Join w/ recursive Cost of Hash-Join w/ recursive partitioningpartitioning

If If recursive partitioningrecursive partitioning required, number of passes required, number of passes required for partitioningrequired for partitioning s s is is loglogM–M–11((bbss) – 1) – 1. This is . This is because each final partitionbecause each final partition of of ss should fit in memory. should fit in memory. The number of partitions of probe relation The number of partitions of probe relation rr is the same as is the same as

that for build relation that for build relation ss; the number of passes for partitioning ; the number of passes for partitioning of of rr is also the same as for is also the same as for ss. .

Therefore it is best to choose the smaller relation as the Therefore it is best to choose the smaller relation as the build relation.build relation.

Total cost estimate is: Total cost estimate is: 22(b(brr + b + bs s loglogM–M–11((bbss) – 1) – 1 + + bbrr + b + bss

Handling of OverflowsHandling of Overflows Hash-table overflowHash-table overflow occurs in partition occurs in partition ssii if if ssii does not does not

fit in memory. Reasons could be:fit in memory. Reasons could be: Many tuples in s with same value for join attributesMany tuples in s with same value for join attributes Bad hash functionBad hash function

Partitioning is said to be Partitioning is said to be skewedskewed if some partitions have if some partitions have significantly more tuples than some otherssignificantly more tuples than some others Overflow resolutionOverflow resolution can be done in build phase can be done in build phase

• Partition Partition ssii is further partitioned using different hash function. is further partitioned using different hash function.

• Partition Partition rrii must be similarly partitioned. must be similarly partitioned. Overflow avoidanceOverflow avoidance performs partitioning carefully to avoid performs partitioning carefully to avoid

overflows during build phaseoverflows during build phase• E.g. partition build relation into many partitions, then combine themE.g. partition build relation into many partitions, then combine them

Both approaches fail with large numbers of duplicatesBoth approaches fail with large numbers of duplicates• Fallback option: use block nested loops join on overflowedFallback option: use block nested loops join on overflowed

partitions partitions

Hybrid Hash–JoinHybrid Hash–Join Useful when memory size are relatively large, and the build Useful when memory size are relatively large, and the build

input is bigger than memory.input is bigger than memory. Main feature of hybrid hash joinMain feature of hybrid hash join: Keep the first partition of : Keep the first partition of

the build relation in memory. the build relation in memory. E.g. With memory size of 25 blocks, E.g. With memory size of 25 blocks, depositor depositor can be partitioned into can be partitioned into

five partitions, each of size 20 blocks.five partitions, each of size 20 blocks. Division of memory:Division of memory:

The first partition occupies 20 blocks of memoryThe first partition occupies 20 blocks of memory 1 block is used for input, and 1 block each for buffering the other 4 1 block is used for input, and 1 block each for buffering the other 4

partitions.partitions. customercustomer is similarly partitioned into five partitions each of size 80; is similarly partitioned into five partitions each of size 80;

the first is used right away for probing, instead of being written out the first is used right away for probing, instead of being written out and read back.and read back.

Cost of 3(80 + 320) + 20 +80 = 1300 block transfers for hybrid hash Cost of 3(80 + 320) + 20 +80 = 1300 block transfers for hybrid hash join, instead of 1500 with plain hash-join.join, instead of 1500 with plain hash-join.

Hybrid hash-join most useful if Hybrid hash-join most useful if MM >> >> sbfx

Hash-Join vs Sort Merge JoinHash-Join vs Sort Merge Join

(Assuming refinement of sort merge join) I/O: 3*(B(R)+B(S)) Memory requirement: hash join is lower

sqrt(min(B(R), B(S)) < sqrt(B(R) + B(S)) Hash join wins when two relations have very different sizes

Other factors: Hash join performance depends on the quality of the hash

• Might not get evenly sized buckets SMJ can be adapted for inequality join predicates SMJ wins if R and/or S are already sorted SMJ wins if the result needs to be in sorted order

What about nested-loop join?

May be best if many tuples join Example: non-equality joins that are not very

selective Necessary for black-box predicates

Example: . WHERE user_defined_pred(R.A, S.B)

General Join Conditions (Raghu)General Join Conditions (Raghu) Equalities over several attributes (e.g., Equalities over several attributes (e.g., R.sid=S.sid R.sid=S.sid ANDAND

R.rname=S.snameR.rname=S.sname):): For For Index NLIndex NL, build index on, build index on <<sid, snamesid, sname>> (if S is inner); or use (if S is inner); or use

existing indexes on existing indexes on sidsid or or snamesname.. For For Sort-MergeSort-Merge and and Hash JoinHash Join, sort/partition on combination of , sort/partition on combination of

the two join columns.the two join columns.

Inequality conditions (e.g., Inequality conditions (e.g., R.rname < S.snameR.rname < S.sname):): For For Index NLIndex NL, need (clustered!) B+ tree index., need (clustered!) B+ tree index.

• Range probes on inner; # matches likely to be much higher than for Range probes on inner; # matches likely to be much higher than for equality joins.equality joins.

Hash Join, Sort Merge JoinHash Join, Sort Merge Join not applicable. not applicable. Block NLBlock NL quite likely to be the best join method here. quite likely to be the best join method here.

Complex JoinsComplex Joins Join with a conjunctive conditionJoin with a conjunctive condition::

r r 11 2 2... ... nn ss Either use Either use nested loops/block nested loopsnested loops/block nested loops, or, or Compute the result of one of the simpler joins Compute the result of one of the simpler joins r r ii ss

• final result comprises those tuples in the intermediate result that satisfy the final result comprises those tuples in the intermediate result that satisfy the remaining conditionsremaining conditions

11 . . . . . . i i –1–1 i i +1+1 . . . . . . nn

Join with a disjunctive conditionJoin with a disjunctive condition

r r 1 1 2 2 ... ... nn s s Either use Either use nested loops/block nested loopsnested loops/block nested loops, or, or Compute as the union of the records in individual joins Compute as the union of the records in individual joins r r i i s:s:

((r r 11 ss) ) ( (r r 22 s) s) . . . . . . ( (r r nn s) s)

Duplicate Elimination and ProjectionDuplicate Elimination and Projection

Duplicate eliminationDuplicate elimination can be implemented via hashing or can be implemented via hashing or sorting.sorting. On On sortingsorting duplicates will come adjacent to each other, and duplicates will come adjacent to each other, and

all but one set of duplicates can be deleted. all but one set of duplicates can be deleted. OptimizationOptimization: : duplicates can be deleted during run generation as well as at duplicates can be deleted during run generation as well as at intermediate merge steps in external sort-merge.intermediate merge steps in external sort-merge.

HashingHashing is similar – duplicates will come into the same is similar – duplicates will come into the same bucket.bucket.

ProjectionProjection is implemented by performing projection on is implemented by performing projection on each tuple followed by duplicate elimination. each tuple followed by duplicate elimination.

AggregationAggregation

Can be implemented in a manner similar to duplicate Can be implemented in a manner similar to duplicate elimination.elimination. Sorting or hashingSorting or hashing can be used to bring tuples in the same can be used to bring tuples in the same

group together, and then the aggregate functions can be group together, and then the aggregate functions can be applied on each group.applied on each group.

OptimizationOptimization: : combine tuples in the same group during run combine tuples in the same group during run generation and intermediate merges, by computing partial generation and intermediate merges, by computing partial aggregate valuesaggregate values• For For count, min, max, sumcount, min, max, sum: keep aggregate values on tuples found : keep aggregate values on tuples found

so far in the group. so far in the group. When combining partial aggregate for count, add up the aggregatesWhen combining partial aggregate for count, add up the aggregates

• For For avgavg, keep sum and count, and divide sum by count at the end, keep sum and count, and divide sum by count at the end

Set OperationsSet OperationsSet operationsSet operations ((, , and and ):): can either use variant of can either use variant of

merge-join after sorting, or variant of hash-join.merge-join after sorting, or variant of hash-join.

E.g., Set operations using E.g., Set operations using hashinghashing::1.1. PartitionPartition both relations using the same hash function, thereby creating both relations using the same hash function, thereby creating, , rr1, .., 1, ..,

rrn n and and ss1, 1, ss2.., 2.., ssnn

2.2. Process each partition Process each partition ii as follows. Using a different hashing function, build as follows. Using a different hashing function, build an in-memory hash index on an in-memory hash index on rrii after it is brought into memory. after it is brought into memory.

3. 3. r r ss:: Add tuples in Add tuples in ssii to the hash index if they are not already in it. At end to the hash index if they are not already in it. At end

of of ssii add the tuples in the hash index to the result. add the tuples in the hash index to the result.

rr ss:: output tuples in output tuples in ssii to the result if they are already there in the hash to the result if they are already there in the hash

index.index.

rr – – s:s: for each tuple in for each tuple in ssii, , if it is there in the hash index, delete it from the if it is there in the hash index, delete it from the

index. At end of index. At end of ssii add remaining tuples in the hash index to the result. add remaining tuples in the hash index to the result.

Outer JoinOuter JoinCan be computed either asCan be computed either as:: a join followed by addition of a join followed by addition of

null-padded non-participating tuples or by modifying the null-padded non-participating tuples or by modifying the join algorithms.join algorithms. Modifying Modifying merge joinmerge join to compute to compute r sr s

• In In r sr s, non participating tuples are those in , non participating tuples are those in r r – – RR((r sr s))

• Modify merge-join to compute Modify merge-join to compute r r ss: : During merging, for every tuple During merging, for every tuple ttrr from from rr that do not match any tuple in that do not match any tuple in ss, , output output ttrr padded with nulls. padded with nulls.

• Right outer-join and full outer-join can be computed similarly.Right outer-join and full outer-join can be computed similarly. Modifying Modifying hash joinhash join to compute to compute r sr s

• If If rr is probe relation, output non-matching is probe relation, output non-matching rr tuples padded with nulls tuples padded with nulls

• If If rr is build relation, when probing keep track of which is build relation, when probing keep track of which rr tuples matched tuples matched ss tuples. At end of tuples. At end of ssii output non-matched output non-matched rr tuples padded with nulls tuples padded with nulls

Evaluation of ExpressionsEvaluation of Expressions

So far: we have seen algorithms for individual So far: we have seen algorithms for individual operationsoperations

Alternatives for evaluating an entire expression Alternatives for evaluating an entire expression treetree MaterializationMaterialization: generate results of an expression : generate results of an expression

whose inputs are relations or are already computed, whose inputs are relations or are already computed, materializematerialize (store) it on disk. Repeat. (store) it on disk. Repeat.

PipeliningPipelining: pass on tuples to parent operations even : pass on tuples to parent operations even as an operation is being executedas an operation is being executed

MaterializationMaterializationMaterialized evaluationMaterialized evaluation: : evaluate one operation at a time, starting evaluate one operation at a time, starting

at the lowest-level. Use intermediate results materialized into at the lowest-level. Use intermediate results materialized into temporary relations to evaluate next-level operations.temporary relations to evaluate next-level operations.

E.g., in figure below, compute and storeE.g., in figure below, compute and store

then compute and store its join with then compute and store its join with customer, customer, and finally and finally compute the projections on compute the projections on customer-name. customer-name. )(2500 accountbalance

Materialization (Cont.)Materialization (Cont.)

Materialized evaluation is always applicableMaterialized evaluation is always applicable Cost of writing results to disk and reading them back Cost of writing results to disk and reading them back

can be quite highcan be quite high Our cost formulas for operations ignore cost of writing Our cost formulas for operations ignore cost of writing

results to disk, soresults to disk, so• Overall cost = Sum of costs of individual operations + Overall cost = Sum of costs of individual operations +

cost of writing intermediate results to disk cost of writing intermediate results to disk

Double bufferingDouble buffering: use two output buffers for each : use two output buffers for each operation, when one is full, write it to disk while the operation, when one is full, write it to disk while the other is getting filledother is getting filled Allows overlap of disk writes with computation and reduces Allows overlap of disk writes with computation and reduces

execution timeexecution time

PipeliningPipeliningPipelined evaluationPipelined evaluation:: evaluate several operations evaluate several operations

simultaneously, passing the results of one operation simultaneously, passing the results of one operation on to the next.on to the next. E.g., in previous expression tree, don’t store result ofE.g., in previous expression tree, don’t store result of

instead, pass tuples directly to the join. Similarly, don’t instead, pass tuples directly to the join. Similarly, don’t store result of join, pass tuples directly to projection. store result of join, pass tuples directly to projection.

Much cheaper than materialization: no need to store a Much cheaper than materialization: no need to store a temporary relation to disk.temporary relation to disk.

Pipelining may not always be possible – e.g., sort, hash-join. Pipelining may not always be possible – e.g., sort, hash-join. For pipelining to be effective, use evaluation algorithms that For pipelining to be effective, use evaluation algorithms that

generate output tuples even as tuples are received for inputs generate output tuples even as tuples are received for inputs to the operation. to the operation.

)(2500 accountbalance

Demand-driven or lazy pipeliningDemand-driven or lazy pipelining System repeatedly requests next tuple from top level operationSystem repeatedly requests next tuple from top level operation Each operation Each operation requests next tuple from children operations as requests next tuple from children operations as

requiredrequired, in order to output its next tuple, in order to output its next tuple In between calls, operation has to maintain “In between calls, operation has to maintain “statestate” so it knows ” so it knows

what to return nextwhat to return next Each operation is implemented as an Each operation is implemented as an iteratoriterator implementing the implementing the

following operationsfollowing operations• open()open()

E.g. file scan: initialize file scan, store pointer to beginning of file as stateE.g. file scan: initialize file scan, store pointer to beginning of file as state E.g.merge join: sort relations and store pointers to beginning of sorted relations as E.g.merge join: sort relations and store pointers to beginning of sorted relations as

statestate

• next()next() E.g. for file scan: output next tuple, and advance and store file pointerE.g. for file scan: output next tuple, and advance and store file pointer E.g. for merge join: continue with merge from earlier state till E.g. for merge join: continue with merge from earlier state till

next output tuple is found. Save pointers as iterator state.next output tuple is found. Save pointers as iterator state.

• close()close()

Produce-driven or eager pipeliningProduce-driven or eager pipelining

Operators Operators produce tuples eagerly and pass produce tuples eagerly and pass them up to their parentsthem up to their parents Buffer maintained between operators, child puts Buffer maintained between operators, child puts

tuples in buffer, parent removes tuples from buffertuples in buffer, parent removes tuples from buffer If buffer is full, child waits till there is space in the If buffer is full, child waits till there is space in the

buffer, and then generates more tuplesbuffer, and then generates more tuples

System schedules operations that have space in System schedules operations that have space in output buffer and can process more input tuplesoutput buffer and can process more input tuples

Evaluation Algorithms for Evaluation Algorithms for PipeliningPipelining

Some algorithms are not able to output results even Some algorithms are not able to output results even as they get input tuplesas they get input tuples E.g. merge join, or hash joinE.g. merge join, or hash join These result in intermediate results being written to disk and These result in intermediate results being written to disk and

then read back alwaysthen read back always Algorithm variantsAlgorithm variants are possible to generate (at least are possible to generate (at least

some) results on the fly, as input tuples are read insome) results on the fly, as input tuples are read in E.g. hybrid hash join generates output tuples even as probe E.g. hybrid hash join generates output tuples even as probe

relation tuples in the in-memory partition (partition 0) are relation tuples in the in-memory partition (partition 0) are read inread in

Pipelined join techniquePipelined join technique

Hybrid hash join, modified to buffer partition 0 tuples of Hybrid hash join, modified to buffer partition 0 tuples of both relations in-memory, reading them as they both relations in-memory, reading them as they become available, and output results of any matches become available, and output results of any matches between partition 0 tuplesbetween partition 0 tuples When a new rWhen a new r00 tuple is found, match it with existing s tuple is found, match it with existing s00

tuples, output matches, and save it in rtuples, output matches, and save it in r00 Symmetrically for sSymmetrically for s00 tuples tuples

Complex JoinsComplex Joins loan depositor customerloan depositor customer

Strategy 1.Strategy 1. Compute Compute depositor depositor customercustomer; ; use result to computeuse result to compute loanloan ((depositordepositor customercustomer))

Strategy 2.Strategy 2. Computer Computer loan depositorloan depositor first, and then join the result with first, and then join the result with customer.customer.

Strategy 3.Strategy 3. Perform the pair of joins at once. Build and index on Perform the pair of joins at once. Build and index on loanloan for for loan-number,loan-number, and on and on customer customer forfor customer-name. customer-name.• For each tuple For each tuple t t in in depositor, depositor, look up the corresponding tuples in look up the corresponding tuples in customercustomer and the and the

corresponding tuples in corresponding tuples in loan.loan.

• Each tuple of Each tuple of deposit deposit is examined exactly once.is examined exactly once.

Strategy 3 combines two operations into one special-purpose Strategy 3 combines two operations into one special-purpose operation that is more efficient than implementing two joins of two operation that is more efficient than implementing two joins of two relations.relations.

Ch. 13 (Silberchatz): Query Processing Overview Measures of Query Cost Selection Operation ...

Documents

Transcript of Ch. 13 (Silberchatz): Query Processing Overview Measures of Query Cost Selection Operation ...