CS 4604: Introduction to Database Management Systems
B. Aditya Prakash
Lecture #10: Query Processing
Prakash2016 VTCS4604 2
Outline
§ introduction
§ selection
§ projection
§ join
§ set & aggregate operations

Introduction
§ Today's topic: QUERY PROCESSING
§ Some database operations are EXPENSIVE
§ Can greatly improve performance by being “smart”
– e.g., can speed up 1,000,000x over naïve approach

Introduction (cont'd)
§ Main weapons are:
– clever implementation techniques for operators
– exploiting “equivalencies” of relational operators
– using statistics and cost models to choose among these.

A Really Bad Query Optimizer
§ For each Select-From-Where query block
– do cartesian products first
– then do selections
– etc., i.e.:
• GROUP BY; HAVING
• projections
• ORDER BY
§ Incredibly inefficient
– Huge intermediate results!
[Figure: plan tree with σ_predicates applied on top of a cross product (×) of the tables]

Cost-based Query Sub-System
[Figure: Queries (e.g., Select * From Blah B Where B.blah = blah) pass through the Query Parser, the Query Optimizer (Plan Generator and Plan Cost Estimator), and the Query Plan Evaluator. The Catalog Manager supplies Schema and Statistics.]
Usually there is a heuristics-based rewriting step before the cost-based steps.

The Query Optimization Game
§ “Optimizer” is a bit of a misnomer…
§ Goal is to pick a “good” (i.e., low expected cost) plan.
– Involves choosing access methods, physical operators, operator orders, …
– Notion of cost is based on an abstract “cost model”

Relational Operations
§ We will consider how to implement:
– Selection (σ): Selects a subset of rows from relation.
– Projection (π): Deletes unwanted columns from relation.
– Join (⋈): Allows us to combine two relations.
– Set-difference (-): Tuples in reln. 1, but not in reln. 2.
– Union (∪): Tuples in reln. 1 or in reln. 2.
– Aggregation (SUM, MIN, etc.) and GROUP BY
§ Recall: ops can be composed!
§ Later (after spring break), we'll see how to optimize queries with many ops

Schema for Examples
Sailors(sid: integer, sname: string, rating: integer, age: real)
Reserves(sid: integer, bid: integer, day: dates, rname: string)
§ Similar to old schema; rname added for variations.
§ Sailors:
– Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
– N=500, pS=80.
§ Reserves:
– Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
– M=1000, pR=100.

Simple Selections
§ Of the form σ_{R.attr op value}(R)
§ Question: how best to perform?
SELECT * FROM Reserves R WHERE R.rname < 'C%'

Simple Selections
§ A: Depends on:
– what indexes/access paths are available
– what is the expected size of the result (in terms of number of tuples and/or number of pages)

Simple Selections
§ Size of result approximated as
    size of R * reduction factor
– “reduction factor” is also called selectivity.
– estimate of reduction factors is based on statistics; we will discuss shortly.

Alternatives for Simple Selections
§ With no index, unsorted:
– Must essentially scan the whole relation
– cost is M (#pages in R). For “reserves” = 1000 I/Os.

Simple Selections (cont'd)
§ With no index, sorted:
– cost of binary search + number of pages containing results.
– For reserves = 10 I/Os + ⌈selectivity * #pages⌉
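
The two no-index cost formulas can be written out as a small calculator. This is only a sketch of the slide arithmetic: the function names and the selectivity of 0.25 are illustrative assumptions, not part of the lecture.

```python
import math

def full_scan_cost(pages):
    """No index, unsorted file: every page must be read."""
    return pages

def sorted_file_cost(pages, selectivity):
    """No index, sorted file: binary search plus the pages that hold results."""
    binary_search = math.ceil(math.log2(pages))    # about 10 I/Os for 1000 pages
    result_pages = math.ceil(selectivity * pages)  # pages containing qualifying tuples
    return binary_search + result_pages

M = 1000  # pages in Reserves
print(full_scan_cost(M))          # 1000
print(sorted_file_cost(M, 0.25))  # 10 + 250 = 260 (0.25 is a hypothetical selectivity)
```

The sorted-file estimate only beats the full scan when the selectivity is small enough that the result pages plus the binary search stay below M.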

Simple Selections (cont'd)
§ With an index on selection attribute:
– Use index to find qualifying data entries,
– then retrieve corresponding data records.
– (Hash index useful only for equality selections.)

Using an Index for Selections
§ Cost depends on #qualifying tuples, and clustering.
– Cost:
• finding qualifying data entries (typically small)
• plus cost of retrieving records (could be large w/o clustering).

Selections using Index (cont'd)
[Figure: CLUSTERED vs. UNCLUSTERED index. Index entries (index file) direct the search for data entries, which point to data records (data file).]

Selections using Index (cont'd)
– In example “reserves” relation, if 10% of tuples qualify (100 pages, 10,000 tuples).
• With a clustered index, cost is little more than 100 I/Os;
• if unclustered, could be up to 10,000 I/Os! unless…

Selections using Index (cont'd)
§ Important refinement for unclustered indexes:
1. Find qualifying data entries.
2. Sort the rid's of the data records to be retrieved.
3. Fetch rids in order. This ensures that each data page is looked at just once (though # of such pages likely to be higher than with clustering).
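
A small simulation illustrates why sorting the rids helps. It assumes a rid is a (page, slot) pair and that only one data page is buffered at a time, so a page is re-read whenever the page number changes; both assumptions and the function names are illustrative only.

```python
def page_fetches(rids):
    """Count page reads when rids are fetched in the given order,
    assuming a single buffer frame for data pages."""
    fetches, current_page = 0, None
    for page, _slot in rids:
        if page != current_page:
            fetches += 1
            current_page = page
    return fetches

def page_fetches_sorted(rids):
    """Sort rids by (page, slot) first, so each page is read exactly once."""
    return page_fetches(sorted(rids))

# rids as they might come out of an unclustered index, in key order
rids = [(3, 1), (1, 4), (3, 2), (1, 0), (2, 5)]
print(page_fetches(rids))         # 5: one read per rid
print(page_fetches_sorted(rids))  # 3: one read per distinct page
```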

General Selection Conditions
§ Q: What would you do?
§ A: try to find a selective (clustering) index. Specifically:
(day

General Selection Conditions
§ Convert to conjunctive normal form (CNF):
– (day

General Selection Conditions
§ A B-tree index matches (a conjunction of) terms that involve only attributes in a prefix of the search key.
– Index on matches a=5 AND b=3, but not b=3.
§ For Hash index, must have all attributes in search key
(day
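
The matching rules can be sketched as two small checks. The search keys used here, (a, b, c) and (bid, sid), are hypothetical examples chosen for illustration; the function names are assumptions as well.

```python
def btree_matched_prefix(index_key, term_attrs):
    """Return the longest prefix of the search key whose attributes all
    appear in the conjunction; an empty list means the B-tree does not match."""
    matched = []
    for attr in index_key:
        if attr not in term_attrs:
            break
        matched.append(attr)
    return matched

def hash_index_usable(index_key, equality_attrs):
    """A hash index needs an equality term on every attribute of its search key."""
    return all(attr in equality_attrs for attr in index_key)

key = ("a", "b", "c")                         # hypothetical composite search key
print(btree_matched_prefix(key, {"a", "b"}))  # ['a', 'b']: a=5 AND b=3 matches
print(btree_matched_prefix(key, {"b"}))       # []: b=3 alone does not match
print(hash_index_usable(("bid", "sid"), {"bid", "sid"}))  # True
print(hash_index_usable(("bid", "sid"), {"bid"}))         # False
```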

Two Approaches to General Selections
§ First approach: Find the cheapest access path, retrieve tuples using it, and apply any remaining terms that don't match the index:
– Cheapest access path: An index or file scan with fewest I/Os.
– Terms that match this index reduce the number of tuples retrieved; other terms help discard some retrieved tuples, but do not affect number of tuples/pages fetched.
§ Second approach: get rids from first index; rids from second index; intersect and fetch.

Cheapest Access Path - Example
§ Consider day<8/9/94 AND bid=5 AND sid=3.
§ A B+ tree index on day can be used;
– then, bid=5 and sid=3 must be checked for each retrieved tuple.
§ Similarly, a hash index on could be used;
– Then, day

Cheapest Access Path - cont'd
§ Consider day<8/9/94 AND bid=5 AND sid=3.
§ How about a B+ tree on ?
§ How about a B+ tree on ?
§ How about a Hash index on ?

Intersection of RIDs
§ Second approach: if we have 2 or more matching indexes (w/ Alternatives (2) or (3) for data entries):
– Get sets of rids of data records using each matching index.
– Then intersect these sets of rids.
– Retrieve the records and apply any remaining terms.
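
The rid-intersection steps above might look as follows. This is a minimal sketch that assumes each index lookup has already produced a set of (page, slot) rids; the sample rids are made up for illustration.

```python
def intersect_rids(*rid_sets):
    """Intersect the rid sets returned by each matching index and
    return them sorted, ready to be fetched in page order."""
    result = set(rid_sets[0])
    for rids in rid_sets[1:]:
        result &= set(rids)
    return sorted(result)

# hypothetical rids returned by two different indexes
from_first_index = {(1, 0), (2, 3), (5, 1)}
from_second_index = {(2, 3), (5, 1), (7, 7)}
print(intersect_rids(from_first_index, from_second_index))  # [(2, 3), (5, 1)]
```

Only the records behind the surviving rids are fetched; any terms not covered by an index are then checked on those records.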

Intersection of RIDs (cont'd)
§ EXAMPLE: Consider day

The Projection Operation
SELECT DISTINCT R.sid, R.bid FROM Reserves R
§ Issue is removing duplicates.
§ Basic approach: sorting
– 1. Scan R, extract only the needed attrs (why?)
– 2. Sort the resulting set
– 3. Remove adjacent duplicates
Cost: Reserves with size ratio 0.25 = 250 pages. With 20 buffer pages can sort in 2 passes, so 1000 + 250 + 2*2*250 + 250 = 2500 I/Os
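
The sort-based approach and its cost can be sketched in a few lines. The in-memory `sorted` call stands in for an external sort, and the sample rows and function names are assumptions made for illustration.

```python
def project_distinct(rows, columns):
    """Steps 1-3: keep only the wanted columns, sort, drop adjacent duplicates."""
    projected = sorted(tuple(row[c] for c in columns) for row in rows)
    result = []
    for t in projected:
        if not result or result[-1] != t:
            result.append(t)
    return result

def sort_projection_cost(pages, size_ratio, sort_passes):
    """Scan R + write the projected temp + sort it + read it to drop duplicates."""
    temp = int(pages * size_ratio)
    return pages + temp + 2 * sort_passes * temp + temp

reserves = [
    {"sid": 28, "bid": 103, "day": "12/4/96"},
    {"sid": 28, "bid": 103, "day": "11/3/96"},
    {"sid": 31, "bid": 101, "day": "10/10/96"},
]
print(project_distinct(reserves, ["sid", "bid"]))  # [(28, 103), (31, 101)]
print(sort_projection_cost(1000, 0.25, 2))         # 2500
```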

Projection
§ Can improve by modifying external sort algorithm (see chapter 13):
– Modify Pass 0 of external sort to eliminate unwanted fields.
– Modify merging passes to eliminate duplicates.
Cost: for above case: read 1000 pages, write out 250 in runs of 40 pages, merge runs = 1000 + 250 + 250 = 1500.

Discussion of Projection
§ If an index on the relation contains all wanted attributes in its search key, can do index-only scan.
– Apply projection techniques to data entries (much smaller!)

Discussion of Projection
§ If an ordered (i.e., tree) index contains all wanted attributes as prefix of search key, can do even better:
– Retrieve data entries in order (index-only scan), discard unwanted fields, compare adjacent tuples to check for duplicates.
A B-tree index matches (a conjunction of) terms that involve only attributes in a prefix of the search key.
– Index on matches a=5 AND b=3, but not b=3.
For Hash index, must have all attributes in search key

Joins
§ Joins are very common.
§ Joins can be very expensive (cross product in worst case).
§ Many approaches to reduce join cost.

Joins
§ Join techniques we will cover:
– Nested-loops join
– Index-nested loops join
– Sort-merge join
– Hash join

Equality Joins With One Join Column
SELECT * FROM Reserves R1, Sailors S1 WHERE R1.sid = S1.sid
§ In algebra: R ⋈ S. Common! Must be carefully optimized. R × S is large; so, R × S followed by a selection is inefficient.
§ Remember, join is associative and commutative.

Equality Joins
§ Assume:
– M pages in R, pR tuples per page, m tuples total
– N pages in S, pS tuples per page, n tuples total
– In our examples, R is Reserves and S is Sailors.
§ We will consider more complex join conditions later.
§ Cost metric: # of I/Os. We will ignore output costs.

Nested loops
§ Algorithm #0: (naive) nested loop (SLOW!)
    for each tuple r of R
        for each tuple s of S
            print, if they match
[Figure: R(A, ...) with M pages, m tuples is the outer relation; S(A, ...) with N pages, n tuples is the inner relation]

Nested loops
§ Algorithm #0: why is it bad?
§ how many disk accesses (‘M’ and ‘N’ are the number of blocks for ‘R’ and ‘S’)? M + m*N

Simple Nested Loops Join
§ Actual number: (pR*M)*N + M = 100*1000*500 + 1000 I/Os.
– At 10 ms/IO, Total: ~6 days (!)
§ What if smaller relation (S) was outer?
– slightly better
§ What assumptions are being made here?
– 1 buffer for each table (and 1 for output)
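
The "~6 days" figure follows directly from the formula. The sketch below redoes the arithmetic for both choices of outer relation, using the page and tuple counts from the example schema; the function name is an illustrative assumption.

```python
def simple_nl_cost(outer_pages, outer_tuples_per_page, inner_pages):
    """(p_outer * M) * N + M: one full scan of the inner per outer tuple."""
    return outer_tuples_per_page * outer_pages * inner_pages + outer_pages

MS_PER_IO = 10
SECONDS_PER_DAY = 86_400

r_outer = simple_nl_cost(1000, 100, 500)   # Reserves as outer
s_outer = simple_nl_cost(500, 80, 1000)    # Sailors as outer

for label, ios in (("R outer", r_outer), ("S outer", s_outer)):
    days = ios * MS_PER_IO / 1000 / SECONDS_PER_DAY
    print(f"{label}: {ios:,} I/Os, about {days:.1f} days")
```

With Sailors as the outer relation the count drops from 50,001,000 to 40,000,500 I/Os, which is the "slightly better" case on the slide.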

Nested loops
[Figure: R(A, ...) with M pages, m tuples; S(A, ...) with N pages, n tuples]
• Algorithm #1: Blocked nested-loop join
    – read in a block of R
        • read in a block of S
            – print matching tuples
• COST = M + M*N

Nested loops
• Which one should be the outer relation?
• A: the smallest (page-wise)

Nested loops
• M=1000, N=500
• Cost = M + M*N = 1000 + 1000*500 = 501,000
• = 5010 sec ~ 1.4h

Nested loops
• M=1000, N=500 - if smaller is outer:
• Cost = N + M*N = 500 + 1000*500 = 500,500
• = 5005 sec ~ 1.4h

Nested loops
• What if we have B buffers available?
• A: give B-2 buffers to outer, 1 to inner, 1 for output

Nested loops
• Algorithm #1: Blocked nested-loop join
    – read in B-2 blocks of R
        • read in a block of S
            – print matching tuples
• COST = M + M/(B-2)*N
• and, actually: Cost = M + ceiling(M/(B-2)) * N

Nested loops
• If smallest (outer) fits in memory (i.e., B = N+2), with COST = N + N/(B-2)*M:
• Cost = N + M (minimum!)

Nested loops - guidelines
§ pick as outer the smallest table (= fewest pages)
§ fit as much of it in memory as possible
§ loop over the inner
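
The block nested-loop algorithm and its cost formula can be sketched together. The code assumes a relation is a list of pages and a page is a list of tuples; that in-memory layout, the function names, and the toy data are illustrative assumptions only.

```python
def bnl_cost(outer_pages, inner_pages, buffers):
    """M + ceiling(M / (B-2)) * N, with B-2 buffers holding the outer block."""
    outer_blocks = -(-outer_pages // (buffers - 2))   # integer ceiling
    return outer_pages + outer_blocks * inner_pages

def block_nested_loops_join(outer, inner, outer_key, inner_key, buffers):
    """Read B-2 pages of the outer, then scan the inner once per outer block."""
    result = []
    block_size = buffers - 2
    for start in range(0, len(outer), block_size):
        block = [t for page in outer[start:start + block_size] for t in page]
        for page in inner:                 # one full pass over the inner per block
            for s in page:
                for r in block:
                    if r[outer_key] == s[inner_key]:
                        result.append(r + s)
    return result

print(bnl_cost(1000, 500, 3))    # 501000: one page of the outer at a time
print(bnl_cost(500, 1000, 3))    # 500500: smaller relation as outer
print(bnl_cost(500, 1000, 502))  # 1500: outer fits in memory, cost N + M
```

The last call shows the guideline in action: once the whole outer fits in B-2 buffers, the inner is scanned exactly once.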

Index NL join
[Figure: R(A, ...) with M pages, m tuples; S(A, ...) with N pages, n tuples]
§ use an existing index, or even build one on the fly
§ cost: M + m*c (c: look-up cost)
§ ‘c’ depends whether the index is clustered or not.
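
A sketch of the index nested-loops idea, with a Python dictionary standing in for an index built on the fly over the inner relation. The look-up cost of 3 I/Os used in the example is a hypothetical value, not a figure from the lecture.

```python
from collections import defaultdict

def index_nl_join(outer, inner, outer_key, inner_key):
    """Probe an index on the inner join column once per outer tuple."""
    index = defaultdict(list)          # stand-in for an index built on the fly
    for s in inner:
        index[s[inner_key]].append(s)
    return [r + s for r in outer for s in index.get(r[outer_key], [])]

def index_nl_cost(outer_pages, outer_tuples, lookup_cost):
    """M + m * c: scan the outer once, one index look-up per outer tuple."""
    return outer_pages + outer_tuples * lookup_cost

reserves = [(28, 103), (31, 101), (99, 104)]
sailors = [(28, "yuppy"), (31, "lubber")]
print(index_nl_join(reserves, sailors, 0, 0))
# [(28, 103, 28, 'yuppy'), (31, 101, 31, 'lubber')]
print(index_nl_cost(1000, 100_000, 3))  # 301000, assuming c = 3 I/Os per look-up
```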

Sort-merge join
[Figure: R(A, ...) with M pages, m tuples; S(A, ...) with N pages, n tuples]
• sort both on joining attribute
• scan each and merge
• Cost, given B buffers?
• ~ 2*M*logM/logB + 2*N*logN/logB + M + N

Sort-Merge Join
§ Useful if
– one or both inputs are already sorted on join attribute(s)
– output is required to be sorted on join attribute(s)
§ “Merge” phase can require some backtracking if duplicate values appear in join column

Example of Sort-Merge Join

Sailors
sid  sname   rating  age
22   dustin  7       45.0
28   yuppy   9       35.0
31   lubber  8       55.5
44   guppy   5       35.0
58   rusty   10      35.0

Reserves
sid  bid  day       rname
28   103  12/4/96   guppy
28   103  11/3/96   yuppy
31   101  10/10/96  dustin
31   102  10/12/96  lubber
31   101  10/11/96  lubber
58   103  11/12/96  dustin
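
The merge phase, including the backtracking needed for duplicate join values, can be sketched on the two example tables. The in-memory `sorted` calls stand in for external sorting, and the function name is an illustrative assumption.

```python
def sort_merge_join(r, s, r_key, s_key):
    """Sort both inputs on the join column, then merge; for each run of equal
    keys in r, rewind to the start of the matching group in s."""
    r = sorted(r, key=lambda t: t[r_key])
    s = sorted(s, key=lambda t: t[s_key])
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][r_key] < s[j][s_key]:
            i += 1
        elif r[i][r_key] > s[j][s_key]:
            j += 1
        else:
            key, group_start = r[i][r_key], j
            while i < len(r) and r[i][r_key] == key:
                j = group_start                 # backtrack to the start of the s group
                while j < len(s) and s[j][s_key] == key:
                    out.append(r[i] + s[j])
                    j += 1
                i += 1
    return out

sailors = [(22, "dustin", 7, 45.0), (28, "yuppy", 9, 35.0), (31, "lubber", 8, 55.5),
           (44, "guppy", 5, 35.0), (58, "rusty", 10, 35.0)]
reserves = [(28, 103, "12/4/96", "guppy"), (28, 103, "11/3/96", "yuppy"),
            (31, 101, "10/10/96", "dustin"), (31, 102, "10/12/96", "lubber"),
            (31, 101, "10/11/96", "lubber"), (58, 103, "11/12/96", "dustin")]
joined = sort_merge_join(sailors, reserves, 0, 0)
print([row[0] for row in joined])  # [28, 28, 31, 31, 31, 58]
```

Sailors 22 and 44 have no reservations, so they drop out of the result.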

Example of Sort-Merge Join
§ With 35, 100 or 300 buffer pages, both Reserves and Sailors can be sorted in 2 passes; total join cost: 7500.
§ (while Block Nested Loop (BNL) cost: 2,500 to 15,000 I/Os)
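
The 7500 figure can be checked by writing the cost as sorting plus merging. This sketch assumes each sort pass reads and writes every page of a relation; the function name is illustrative.

```python
def sort_merge_cost(m_pages, n_pages, sort_passes):
    """Each sort pass reads and writes every page; the merge reads both once."""
    sort_cost = 2 * sort_passes * (m_pages + n_pages)
    merge_cost = m_pages + n_pages
    return sort_cost + merge_cost

print(sort_merge_cost(1000, 500, 2))  # 7500
```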

Sort-merge join
§ Worst case for merging phase?
§ Cost?

Refinements
§ All the refinements of external sorting
§ plus overlapping of the merging of sorting with the merging of joining.

Hash joins
§ hash join: use hashing function h()
– hash ‘R’ into (0, 1, ..., ‘max’) buckets
– hash ‘S’ into buckets (same hash function)
– join each pair of matching buckets
[Figure: R(A, ...) and S(A, ...) are each hashed into buckets 0, 1, ..., max]

Hash join - details
– how to join each pair of partitions Hr-i, Hs-i?
– A: build another hash table for Hs-i, and probe it with each tuple of Hr-i
[Figure: partitions Hr-0 of R(A, ...) and Hs-0 of S(A, ...), among buckets 0, 1, ..., max]

Hash join - details
§ In more detail:
§ Choose the (page-wise) smallest - if it fits in memory, do ~NL
– and, actually, build a hash table (with h2() != h())
– and probe it, with each tuple of the other
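
The two phases can be sketched as follows. Python's built-in `hash` modulo the partition count plays the role of h(), and a dictionary plays the role of the in-memory table built with h2(); the partition count of 4 and the sample tuples are arbitrary illustrative choices.

```python
def hash_join(r, s, r_key, s_key, num_partitions=4):
    """Partition both inputs with the same function h, then build a table on
    each S partition and probe it with the matching R partition."""
    r_parts = [[] for _ in range(num_partitions)]
    s_parts = [[] for _ in range(num_partitions)]
    for t in r:                                    # partitioning phase, h = hash mod k
        r_parts[hash(t[r_key]) % num_partitions].append(t)
    for t in s:
        s_parts[hash(t[s_key]) % num_partitions].append(t)

    out = []
    for hr, hs in zip(r_parts, s_parts):           # matching phase
        table = {}                                 # in-memory hash table (plays h2)
        for t in hs:
            table.setdefault(t[s_key], []).append(t)
        for t in hr:
            for match in table.get(t[r_key], []):
                out.append(t + match)
    return out

reserves = [(28, 103), (31, 101), (31, 102), (58, 103)]
sailors = [(22, "dustin"), (28, "yuppy"), (31, "lubber"), (58, "rusty")]
print(sorted(hash_join(reserves, sailors, 0, 0)))
```

Tuples with equal join values always land in the same pair of partitions, which is why only matching partitions need to be compared.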

Hash join details
§ what if Hs-i is too large to fit in main-memory?
§ A: recursive partitioning
§ more details (overflows, hybrid hash joins): in book
§ cost of hash join? (if we have enough buffers:) 3(M+N) (why? See next slide)

Cost of Hash-Join
§ In partitioning phase, read+write both relns; 2(M+N). In matching phase, read both relns; M+N I/Os.
§ In our running example, this is a total of 4500 I/Os.

Hash join details
§ [cost of hash join? (if we have enough buffers:) 3(M+N)]
§ What is ‘enough’? sqrt(N), or sqrt(M)?
§ A: sqrt(smallest) (why?)
– Because you only need enough memory to hold the hash table partitions of the smaller table in memory, so B > size of smaller/(B-1) ⇒ B ~ sqrt(size-of-smaller)
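
Both the 3(M+N) cost and the "enough buffers" condition are easy to check numerically. The sketch below searches for the smallest B that satisfies B > size/(B-1) for the smaller relation; the function names are illustrative assumptions.

```python
import math

def hash_join_cost(m_pages, n_pages):
    """Partitioning reads and writes both relations; matching reads them once."""
    return 2 * (m_pages + n_pages) + (m_pages + n_pages)

def min_buffers_for_hash_join(smaller_pages):
    """Smallest B with B > smaller_pages / (B - 1), i.e. B * (B - 1) > smaller_pages."""
    b = 2
    while b * (b - 1) <= smaller_pages:
        b += 1
    return b

print(hash_join_cost(1000, 500))       # 4500
print(min_buffers_for_hash_join(500))  # 23
print(math.sqrt(500))                  # about 22.4, close to the exact answer
```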

Sort-Merge Join vs. Hash Join
§ Given a minimum amount of memory both have a cost of 3(M+N) I/Os.
(min. memory for sort-merge = sqrt(larger table) using aggressive refinements, in textbook)
(min. memory for hash = sqrt(smaller table), see previous slides)

Sort-Merge vs Hash join
§ Hash Join Pros:
– Superior if relation sizes differ greatly
– Shown to be highly parallelizable (beyond scope of class)
§ Sort-Merge Join Pros:
– Less sensitive to data skew
– Result is sorted (may help “upstream” operators)
– goes faster if one or both inputs already sorted

General Join Conditions
§ Equalities over several attributes (e.g., R.sid=S.sid AND R.rname=S.sname):
– all previous methods apply, using the composite key

General Join Conditions
§ Inequality conditions (e.g., R.rname < S.sname):
§ which methods still apply?
– NL (probably, the best!)
– index NL (only if clustered index)
– Sort merge (does not apply!) (why?)
– Hash join (does not apply!) (why?)

Set Operations
§ Intersection and cross-product: special cases of join
§ Union (Distinct) and Except: similar; we'll do union:
§ Effectively: concatenate; use sorting or hashing
§ Sorting based approach to union:
– Sort both relations (on combination of all attributes).
– Scan sorted relations and merge them.
– Alternative: Merge runs from Pass 0 for both relations.

Set Operations, cont'd
§ Hash based approach to union:
– Partition R and S using hash function h.
– For each S-partition, build in-memory hash table (using h2), scan corresponding R-partition and add tuples to table while discarding duplicates.
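
The hash-based union can be sketched in the same style as a hash join. Python's `hash` modulo the partition count stands in for h, and a `set` stands in for the in-memory table built with h2; the partition count and sample tuples are illustrative assumptions.

```python
def hash_union(r, s, num_partitions=4):
    """Partition R and S with the same function, then, per partition, load the
    S tuples into an in-memory table and add R tuples, discarding duplicates."""
    r_parts = [[] for _ in range(num_partitions)]
    s_parts = [[] for _ in range(num_partitions)]
    for t in r:
        r_parts[hash(t) % num_partitions].append(t)
    for t in s:
        s_parts[hash(t) % num_partitions].append(t)

    result = []
    for r_part, s_part in zip(r_parts, s_parts):
        table = set()                      # in-memory hash table for this partition
        for t in s_part + r_part:
            if t not in table:             # duplicates can only occur within a partition
                table.add(t)
                result.append(t)
    return result

r = [(1, "a"), (2, "b"), (2, "b")]
s = [(2, "b"), (3, "c")]
print(sorted(hash_union(r, s)))  # [(1, 'a'), (2, 'b'), (3, 'c')]
```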

Aggregate Operations (AVG, MIN, etc.)
§ Without grouping:
– In general, requires scanning the relation.
– Given index whose search key includes all attributes in the SELECT or WHERE clauses, can do index-only scan.

Summary
§ A virtue of relational DBMSs:
– queries are composed of a few basic operators
– The implementation of these operators can be carefully tuned
– Important to do this!
§ Many alternative implementation techniques for each operator
– No universally superior technique for most operators.
“it depends” [Guy Lohman (IBM)]

Summary cont'd
§ Must consider available alternatives for each operation in a query and choose best one based on system statistics, etc.
– Part of the broader task of optimizing a query composed of several ops.