CompSci 516 Database Systems
Lecture 22: Data Warehousing and Data Cube
Instructor: Sudeepa Roy
Duke CS, Fall 2017 CompSci 516: Database Systems
Reading Material
• [RG] – Chapter 25
• Gray-Chaudhuri-Bosworth-Layman-Reichart-Venkatrao-Pellow-Pirahesh, ICDE 1996, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”
• Harinarayan-Rajaraman-Ullman, SIGMOD 1996, “Implementing data cubes efficiently”
Acknowledgement:
• The following slides have been created adapting the instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke.
• Some slides have been prepared by Prof. Shivnath Babu
Data Warehousing
Warehousing
• Growing industry: $8 billion way back in 1998
• Data warehouse vendor like Teradata
  – big “Petabyte scale” customers
  – Apple, Walmart (2008: 2.5 PB), eBay (2013: primary DW 9.2 PB, other big data 40 PB, single table with 1 trillion rows), Verizon, AT&T, Bank of America
  – supports data into and out of Hadoop
• Lots of buzzwords, hype
  – slice & dice, rollup, MOLAP, pivot, ...
Ack: Slide by Prof. Shivnath Babu
https://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen/
Motivating Examples
• Forecasting
• Comparing performance of units
• Monitoring, detecting fraud
• Visualization
Introduction
• Organizations analyze current and historical data
  – to identify useful patterns
  – to support business strategies
• Emphasis is on complex, interactive, exploratory analysis of very large datasets
• Created by integrating data from across all parts of an enterprise
• Data is fairly static
• Relevant once again for the recent “Big Data analysis”
  – to figure out what we can reuse, what we cannot
Three Complementary Trends
• Data Warehousing (DW):
  – Consolidate data from many sources in one large repository
  – Loading, periodic synchronization of replicas
  – Semantic integration
• OLAP:
  – Complex SQL queries and views.
  – Queries based on spreadsheet-style operations and “multidimensional” view of data.
  – Interactive and “online” queries.
• Data Mining:
  – Exploratory search for interesting trends and anomalies
  – Next lecture!
Data Warehousing
• A collection of decision support technologies
• To enable people in industry/organizations to make better decisions
  – Supports OLAP (On-Line Analytical Processing)
• Applications in
  – Manufacturing
  – Retail
  – Finance
  – Transportation
  – Healthcare
  – …
• Typically maintained separately from “Operational Databases”
  – Operational Databases support OLTP (On-Line Transaction Processing)
Why a Warehouse?
• Two Approaches:
  – Query-Driven (Lazy)
  – Warehouse (Eager)
[Figure: two Source boxes and a “?”]
Advantages of Warehousing
• High query performance
• Queries not visible outside warehouse
• Local processing at sources unaffected
• Can operate when sources unavailable
• Can query data not stored in a DBMS
• Extra information at warehouse
  – Modify, summarize (store aggregates)
  – Add historical information
Query-Driven Approach
[Figure: Clients query a Mediator, which reaches each Source through its own Wrapper]
Advantages of Query-Driven
• No need to copy data
  – less storage
  – no need to purchase data
• More up-to-date data
• Query needs can be unknown
• Only query interface needed at sources
• May be less draining on sources
OLTP | Data Warehousing/OLAP
Mostly updates | Mostly reads
Applications: Order entry, sales update, banking transactions | Applications: Decision support in industry/organization
Detailed, up-to-date data | Summarized, historical data (from multiple operational db, grows over time)
Structured, repetitive, short tasks | Query intensive, ad hoc, complex queries
Each transaction reads/updates only a few tuples (tens of) | Each query can access many records, and perform many joins, scans, aggregates
MB-GB data | GB-TB data
Typically clerical users | Decision makers, analysts as users
Important: Consistency, recoverability, maximizing transaction throughput | Important: Query throughput, response times
Data Marts
• smaller data warehouse
• subsets of data on selected subjects
• e.g. Marketing data mart can include customer, product, sales
• Department-focused, no enterprise-wide consensus needed
• But may lead to complex integration problems in the long run
ROLAP and MOLAP
• Relational OLAP (ROLAP)
  – On top of standard relational DBMS
  – Data is stored in relational DBMS
  – Supports extensions to SQL to access multi-dimensional data
• Multidimensional OLAP (MOLAP)
  – Directly stores multidimensional data in special data structures (e.g. arrays)
Data Warehousing to Mining
• Integrated data spanning long time periods, often augmented with summary information
• Several gigabytes to terabytes common
• Interactive response times expected for complex queries; ad-hoc updates uncommon
[Figure: EXTERNAL DATA SOURCES -> EXTRACT, TRANSFORM, LOAD, REFRESH -> DATA WAREHOUSE (with Metadata Repository) -> SUPPORTS -> OLAP, DATA MINING]
Warehousing Issues
• Semantic Integration: When getting data from multiple sources, must eliminate mismatches
  – e.g., different currencies, schemas
• Heterogeneous Sources: Must access data from a variety of source formats and repositories
  – Replication capabilities can be exploited here
• Load, Refresh, Purge: Must load data, periodically refresh it, and purge too-old data
• Metadata Management: Must keep track of source, loading time, and other information for all data in the warehouse
DW Architecture
• Extract data from multiple operational DB and external sources
• Clean/integrate/transform/store
• Refresh periodically
  – update base and derived data
  – admin decides when and how
• Main DW and several data marts (possibly)
• Managed by one or more servers and front end tools
• Additional metadata and monitoring/admin tools
ROLAP: Star Schema
• To reflect multi-dimensional views of data
• Single fact table
• Single table for every dimension
• Each tuple in the fact table consists of
  – pointers (foreign key) to each of the dimensions (multi-dimensional coordinates)
  – numeric value for those coordinates
• Each dimension table contains attributes of that dimension
• No support for attribute hierarchies
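The star schema described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the table names (sales_fact, product_dim, location_dim, time_dim), their columns, and the sample rows are all hypothetical and do not come from the lecture.

```python
import sqlite3

# Hypothetical star schema: one fact table, one table per dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim  (pid INTEGER PRIMARY KEY, pname TEXT, category TEXT);
CREATE TABLE location_dim (locid INTEGER PRIMARY KEY, city TEXT, state TEXT);
CREATE TABLE time_dim     (timeid INTEGER PRIMARY KEY, date TEXT, year INTEGER);
-- Each fact row: foreign keys (the coordinates) plus a numeric measure.
CREATE TABLE sales_fact (
    pid    INTEGER REFERENCES product_dim(pid),
    locid  INTEGER REFERENCES location_dim(locid),
    timeid INTEGER REFERENCES time_dim(timeid),
    sales  INTEGER
);
INSERT INTO product_dim  VALUES (1, 'pen', 'stationery'), (2, 'ink', 'stationery');
INSERT INTO location_dim VALUES (1, 'Durham', 'NC'), (2, 'Madison', 'WI');
INSERT INTO time_dim     VALUES (1, '2017-11-01', 2017), (2, '2017-11-02', 2017);
INSERT INTO sales_fact   VALUES (1, 1, 1, 25), (1, 2, 1, 10), (2, 1, 2, 8), (2, 2, 2, 30);
""")

# A typical OLAP query joins the fact table with a dimension and aggregates.
rows = conn.execute("""
    SELECT l.state, SUM(f.sales)
    FROM sales_fact f JOIN location_dim l ON f.locid = l.locid
    GROUP BY l.state
    ORDER BY l.state
""").fetchall()
print(rows)
```

Note that state is stored directly in location_dim next to city: the hierarchy city -> state is not modeled as its own table, which is the sense in which a star schema has no support for attribute hierarchies.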
Dimension Hierarchies
• For each dimension, the set of values can be organized in a hierarchy:
  PRODUCT: pname -> category
  TIME: date -> week; date -> month -> quarter -> year
  LOCATION: city -> state -> country
ROLAP: Snowflake Schema
• Refines star-schema
• Dimensional hierarchy is explicitly represented
• (+) Dimension tables easier to maintain
  – suppose the “category description” is being changed
• (-) Need additional joins
• Fact Constellations
  – Multiple fact tables share some dimensional tables
  – e.g. Projected and Actual Expenses may share many dimensions
Motivation: OLAP Queries
• Data analysts are interested in exploring trends and anomalies
  – Possibly by visualization (Excel): 2D or 3D plots
  – “Dimensionality Reduction” by summarizing data and computing aggregates
  – Influenced by SQL and by spreadsheets.
  – A common operation is to aggregate a measure over one or more dimensions.
• Find total unit sales for each
  1. Model
  2. Model, broken into years
  3. Year, broken into colors
  4. Year
  5. Model, broken into color, ….
Sales(Model, Year, Color, Units)
OLAP and Data Cube
Histograms
• A tabulated frequency of computed values
SELECT Year, COUNT(Units) as total
FROM Sales
GROUP BY Year
ORDER BY Year
• May require a nested SELECT to compute
Sales(Model, Year, Color, Units)
[Figure: histogram of total by Year]
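Roll-Ups
• Analysis reports start at a coarse level, go to finer levels
• Order of attribute matters
• Not relational data (empty cells, no keys)
Sales(Model, Year, Color, Units)
Units summed at each GROUP BY level (the last three column headers name the GROUP BY attributes; roll-ups move toward the coarser columns on the right, drill-downs toward the finer ones on the left):
Model | Year | Color | Model, Year, Color | Model, Year | Model
Chevy | 1994 | Black | 50 | |
Chevy | 1994 | White | 40 | |
 | | | | 90 |
Chevy | 1995 | Black | 85 | |
Chevy | 1995 | White | 115 | |
 | | | | 200 |
 | | | | | 290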
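Roll-Ups
• Another representation (Chris Date ’96)
• Relational, but
  – long attribute names
  – hard to express in SQL, and repetition
Sales(Model, Year, Color, Units)
Units summed at each GROUP BY level (the last three column headers name the GROUP BY attributes):
Model | Year | Color | Model, Year, Color | Model, Year | Model
Chevy | 1994 | Black | 50 | 90 | 290
Chevy | 1994 | White | 40 | 90 | 290
Chevy | 1995 | Black | 85 | 200 | 290
Chevy | 1995 | White | 115 | 200 | 290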
‘ALL’ Construct
• Easier to visualize roll-up if we allow ALL to fill in the super-aggregates
SELECT Model, Year, Color, SUM(Units)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model, Year, Color
UNION
SELECT Model, Year, 'ALL', SUM(Units)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model, Year
UNION
…
UNION
SELECT 'ALL', 'ALL', 'ALL', SUM(Units)
FROM Sales
WHERE Model = 'Chevy';
Model Year Color Units
Chevy 1994 Black 50
Chevy 1994 White 40
Chevy 1994 ‘ALL’ 90
Chevy 1995 Black 85
Chevy 1995 White 115
Chevy 1995 ‘ALL’ 200
Chevy ‘ALL’ ‘ALL’ 290
Sales(Model,Year,Color,Units)
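The ‘ALL’ roll-up in the table above can also be reproduced in a few lines of Python. This is a minimal sketch, not the lecture's method: the function name rollup_with_all is made up for illustration, and the four input rows are the Chevy rows of the table.

```python
from collections import defaultdict

# Chevy rows of Sales(Model, Year, Color, Units), taken from the table above.
sales = [
    ("Chevy", 1994, "Black", 50),
    ("Chevy", 1994, "White", 40),
    ("Chevy", 1995, "Black", 85),
    ("Chevy", 1995, "White", 115),
]

def rollup_with_all(rows):
    """Sum Units for (Model, Year, Color), (Model, Year) and (Model),
    writing 'ALL' in every rolled-up column."""
    totals = defaultdict(int)
    for model, year, color, units in rows:
        totals[(model, year, color)] += units      # finest level
        totals[(model, year, "ALL")] += units      # Color rolled up
        totals[(model, "ALL", "ALL")] += units     # Year and Color rolled up
    return dict(totals)

result = rollup_with_all(sales)
for key, units in result.items():
    print(*key, units)
```

Only three of the eight possible groupings are produced, always dropping columns from the right. That is the asymmetry of a roll-up.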
Traditional Roll-Up vs. ‘ALL’ Roll-Up
Sales(Model, Year, Color, Units)
Traditional Roll-Up:
Model | Year | Color | Model, Year, Color | Model, Year | Model
Chevy | 1994 | Black | 50 | |
Chevy | 1994 | White | 40 | |
 | | | | 90 |
Chevy | 1995 | Black | 85 | |
Chevy | 1995 | White | 115 | |
 | | | | 200 |
 | | | | | 290
‘ALL’ Roll-Up:
Model | Year | Color | Units
Chevy | 1994 | Black | 50
Chevy | 1994 | White | 40
Chevy | 1994 | ‘ALL’ | 90
Chevy | 1995 | Black | 85
Chevy | 1995 | White | 115
Chevy | 1995 | ‘ALL’ | 200
Chevy | ‘ALL’ | ‘ALL’ | 290
• Roll-ups are asymmetric
Cross Tabulation
• If we made the roll-up symmetric, we would get a cross-tabulation
• Generalizes to higher dimensions
SELECT Model, 'ALL', Color, SUM(Units)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model, Color
Chevy | 1994 | 1995 | Total (ALL)
Black | 50 | 85 | 135
White | 40 | 115 | 155
Total (ALL) | 90 | 200 | 290
Is the problem solved with Cross-Tab and GROUP-BYs with ‘ALL’?
• Requires a lot of GROUP BYs (64 for 6 dimensions)
• Too complex to optimize (64 scans, 64 sort/hash, slow)
Sales(Model, Year, Color, Units)
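As a rough illustration of the symmetric case, the Chevy cross-tab can be built by adding each row to its own cell and to the ‘ALL’ margins on both axes. The helper name cross_tab is hypothetical; the data are the four Chevy rows used throughout these slides.

```python
from collections import defaultdict

sales = [
    ("Chevy", 1994, "Black", 50),
    ("Chevy", 1994, "White", 40),
    ("Chevy", 1995, "Black", 85),
    ("Chevy", 1995, "White", 115),
]

def cross_tab(rows):
    """Return {(color, year): units}, with 'ALL' margins on both axes."""
    table = defaultdict(int)
    for _model, year, color, units in rows:
        for c in (color, "ALL"):
            for y in (year, "ALL"):
                table[(c, y)] += units
    return dict(table)

tab = cross_tab(sales)
for color in ("Black", "White", "ALL"):
    print(color, [tab[(color, y)] for y in (1994, 1995, "ALL")])
```

Unlike the roll-up, both the Color margin and the Year margin are present, so the result is symmetric in the two dimensions.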
Naïve Approach
Sales(Model, Year, Color, Units)
• Run a number of queries
SELECT sum(units)
FROM Sales;
SELECT Color, sum(units)
FROM Sales
GROUP BY Color;
SELECT Year, sum(units)
FROM Sales
GROUP BY Year;
SELECT Model, Year, sum(units)
FROM Sales
GROUP BY Model, Year;
….
• How many sub-queries?
• How many sub-queries for 8 attributes?
• Data cube generalizes Histogram, Roll-Ups, Cross-Tabs
• More complex to do these with GROUP-BY
[Figure: cube with axes Model, Year, Color; cells hold total unit sales]
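One way to see how many queries the naïve approach needs is to enumerate every subset of the grouping attributes. The sketch below does that and prints one query per subset. The helper names all_group_bys and naive_queries are invented for this illustration.

```python
from itertools import combinations

def all_group_bys(attrs):
    """Every subset of attrs, from the empty grouping (grand total) up."""
    return [combo for k in range(len(attrs) + 1)
            for combo in combinations(attrs, k)]

def naive_queries(attrs, table="Sales", measure="units"):
    """One SELECT per subset of the grouping attributes."""
    queries = []
    for g in all_group_bys(attrs):
        cols = ", ".join(g)
        if g:
            queries.append(
                f"SELECT {cols}, sum({measure}) FROM {table} GROUP BY {cols}")
        else:
            queries.append(f"SELECT sum({measure}) FROM {table}")
    return queries

queries = naive_queries(["Model", "Year", "Color"])
for q in queries:
    print(q)
print(len(queries), "queries for 3 attributes")
print(len(all_group_bys(list("ABCDEFGH"))), "queries for 8 attributes")
```

Each attribute is either in a grouping or not, so N attributes give 2^N groupings, and each one is a separate scan in the naïve approach.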
Data Cube: Intuition
Sales(Model, Year, Color, Units)
SELECT 'ALL', 'ALL', 'ALL', sum(units)
FROM Sales
UNION
SELECT 'ALL', 'ALL', Color, sum(units)
FROM Sales
GROUP BY Color
UNION
SELECT 'ALL', Year, 'ALL', sum(units)
FROM Sales
GROUP BY Year
UNION
SELECT Model, Year, 'ALL', sum(units)
FROM Sales
GROUP BY Model, Year
UNION ….
[Figure: cube with axes Model, Year, Color; cells hold total unit sales]
Data Cube
Ack: from slides by Laurel Orr and Jeremy Hyrkas, UW
• Computes the aggregate on all possible combinations of group by columns.
• If there are N attributes, there are $2^N - 1$ super-aggregates.
• If the cardinalities of the N attributes are $C_1, \ldots, C_N$, then there are a total of $(C_1 + 1) \cdots (C_N + 1)$ values in the cube.
• ROLL-UP is similar but just looks at N aggregates
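The definition above can be checked on the small Chevy example. The following is a sketch only: data_cube is a hypothetical helper that sums Units for every subset of the dimension columns, and it is not an efficient cube algorithm.

```python
from itertools import combinations
from collections import defaultdict

sales = [
    ("Chevy", 1994, "Black", 50),
    ("Chevy", 1994, "White", 40),
    ("Chevy", 1995, "Black", 85),
    ("Chevy", 1995, "White", 115),
]

def data_cube(rows, n_attrs):
    """SUM of the measure for every subset of the first n_attrs columns.
    Columns left out of a grouping are written as 'ALL'."""
    subsets = [set(c) for k in range(n_attrs + 1)
               for c in combinations(range(n_attrs), k)]
    cube = defaultdict(int)
    for row in rows:
        dims, units = row[:n_attrs], row[n_attrs]
        for kept in subsets:
            key = tuple(v if i in kept else "ALL" for i, v in enumerate(dims))
            cube[key] += units
    return dict(cube)

cube = data_cube(sales, 3)
print(len(cube), "cells")
print(cube[("ALL", "ALL", "ALL")])
```

Here the cardinalities are 1 (Model), 2 (Year) and 2 (Color), so the cube has (1+1)(2+1)(2+1) = 18 cells. That count is reached because every Year and Color combination occurs in the data; with sparse data the cube would hold fewer cells. The 8 groupings are the core GROUP BY plus 2^3 - 1 = 7 super-aggregates.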
Data Cube Syntax
• SQL Server
SELECT Model, Year, Color, sum(units)
FROM Sales
GROUP BY Model, Year, Color
WITH CUBE
Sales(Model, Year, Color, Units)
Types of Aggregates
• Distributive: input can be partitioned into disjoint sets and aggregated separately
  o COUNT, SUM, MIN
• Algebraic: can be composed of distributive aggregates
  o AVG
• Holistic: aggregate must be computed over the entire input set
  o MEDIAN
• Efficient computation of the CUBE operator depends on the type of aggregate
  – Distributive and Algebraic aggregates motivate optimizations
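A small numeric sketch may make the three classes concrete. The two partitions below are made-up values, chosen only so that the holistic case visibly fails.

```python
from statistics import median

# Two disjoint partitions of a hypothetical input.
part1 = [50, 40, 10]
part2 = [85, 115]
whole = part1 + part2

# Distributive: SUM of the partial SUMs equals SUM over the whole input.
total = sum([sum(part1), sum(part2)])

# Algebraic: AVG is composed of two distributive aggregates, SUM and COUNT.
partials = [(sum(p), len(p)) for p in (part1, part2)]
avg = sum(s for s, _ in partials) / sum(c for _, c in partials)

# Holistic: MEDIAN is not, in general, recoverable from per-partition medians.
median_of_medians = median([median(part1), median(part2)])
true_median = median(whole)

print(total, avg, median_of_medians, true_median)
```

This is why distributive and algebraic aggregates allow a coarser GROUP-BY to be computed from a finer one, while a holistic aggregate has to go back to the full input.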
Implementing Data Cube
Basic Ideas
• Need to compute all group-bys:
  – ABCD, ABC, ABD, ACD, BCD, AB, AC, AD, BC, BD, CD, A, B, C, D
• Compute GROUP-BYs from previously computed GROUP-BYs
  – e.g. first ABCD
  – then ABC or ACD
  – then AB or AC …
• The order in which ABCD is sorted matters for subsequent computations
  – if (ABCD) is the sorted order, ABC is cheap, ACD or BCD is expensive
Notations
• ABCD
  – group-by on attributes A, B, C, D
  – no guarantee on the order of tuples
• (ABCD)
  – sorted according to A -> B -> C -> D
• ABCD and (ABCD) and (BCDA)
  – all contain the same results
  – but in different sorted order
Optimization 1: Smallest Parent
• Compute GROUP-BY from the smallest (size) previously computed GROUP-BY as a parent
  – AB can be computed from ABC, ABD, or ABCD
  – ABC or ABD better than ABCD
  – Even ABC or ABD may have different sizes, try to choose the smaller parent
LATTICE STRUCTURE of data cube:
  ABCD
  ABC ABD ACD BCD
  AB AC AD BC BD CD
  A B C D
  all
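The smallest-parent idea can be sketched as follows, assuming a distributive aggregate (SUM) and measuring size by row count. The helpers group_by and smallest_parent and the four ABCD rows are hypothetical.

```python
from collections import defaultdict

def group_by(rows, attrs, keep):
    """Re-aggregate a computed GROUP-BY (rows: {key tuple: sum}, keyed by
    the attributes in attrs) down to the attributes in keep."""
    idx = [attrs.index(a) for a in keep]
    out = defaultdict(int)
    for key, total in rows.items():
        out[tuple(key[i] for i in idx)] += total
    return dict(out)

def smallest_parent(target, computed):
    """Among computed GROUP-BYs that contain all attributes of target,
    pick the one with the fewest rows."""
    candidates = [name for name in computed
                  if name != target and set(target) <= set(name)]
    return min(candidates, key=lambda name: len(computed[name]))

# Hypothetical ABCD result; values are sums.
abcd = {
    ("a1", "b1", "c1", "d1"): 5,
    ("a1", "b1", "c1", "d2"): 10,
    ("a1", "b2", "c1", "d1"): 8,
    ("a2", "b2", "c1", "d2"): 2,
}
computed = {"ABCD": abcd}
for name in ("ABC", "ABD"):
    computed[name] = group_by(abcd, "ABCD", name)

parent = smallest_parent("AB", computed)
ab = group_by(computed[parent], parent, "AB")
print(parent, ab)
```

In this data ABC has 3 rows while ABD and ABCD have 4, so AB is computed from ABC. The result is the same whichever parent is used; only the amount of input read changes.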
Optimization 2: Cache Results
• Cache result of one GROUP-BY in memory to reduce disk I/O
  – Compute AB from ABC while ABC is still in memory
Optimization 3: Amortize Disk Scans
• Amortize disk reads for multiple GROUP-BYs
  – Suppose the result for ABCD is stored on disk
  – Compute all of ABC, ABD, ACD, BCD simultaneously in one scan of ABCD
Optimization 4, 5 (next)
• 4. Share-sort
  – for sort-based algorithms
  – pipe-sort algorithm
  – covered in class
• 5. Shared-partition
  – for hash-based algorithms
  – pipe-hash algorithm
    • Uses hash tables to compute smaller GROUP-BYs
    • If the hash tables for AB and AC fit in memory, compute both in one scan of ABC
    • Otherwise can partition on A, and can compute HTs of AB and AC in different partitions
  – not covered (see paper)
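The two hash-based cases can be illustrated with plain Python dictionaries standing in for hash tables. This is a sketch of the idea only, not the PipeHash algorithm from the paper; the function names and the ABC rows are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical ABC group-by result: (a, b, c) -> sum.
abc = {
    ("a1", "b1", "c1"): 5,
    ("a1", "b1", "c2"): 10,
    ("a1", "b2", "c3"): 8,
    ("a2", "b2", "c1"): 2,
    ("a2", "b2", "c3"): 11,
}

def ab_and_ac_in_one_scan(abc_rows):
    """Fill one hash table per child GROUP-BY during a single pass."""
    ab, ac = defaultdict(int), defaultdict(int)
    for (a, b, c), total in abc_rows.items():
        ab[(a, b)] += total
        ac[(a, c)] += total
    return dict(ab), dict(ac)

def partitioned_on_a(abc_rows):
    """If the tables do not fit in memory: partition on A, then process
    each partition on its own. Both AB and AC contain A, so no group is
    split across partitions."""
    partitions = defaultdict(dict)
    for key, total in abc_rows.items():
        partitions[key[0]][key] = total
    ab, ac = {}, {}
    for part in partitions.values():
        part_ab, part_ac = ab_and_ac_in_one_scan(part)
        ab.update(part_ab)
        ac.update(part_ac)
    return ab, ac

ab, ac = ab_and_ac_in_one_scan(abc)
print(ab)
print(ac)
```

Both functions return the same tables; the partitioned version only needs the hash tables of one A-partition in memory at a time.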
PipeSort: Idea
• Combines two optimizations: “shared-sorts” and “smallest-parent”
• Also includes “cache-results” and “amortized-scans”
PipeSort: Share-sort optimization
• Data sorted in one order
• Compute all GROUP-BYs prefixed in that order
• Compute one tuple of ABCD, propagate upward in the pipeline by a single scan
• Example:
  – GROUP-BY over attributes ABCD
  – Sort raw data by (ABCD)
  – Compute (ABCD) -> (ABC) -> (AB) -> (A) in pipelined fashion
  – No additional sort needed
• BUT, may have a conflict with “smallest-parent” optimization
  – (ABD) -> (AB) could be a better choice
  – Figure out the best parent choice by running a weighted-matching algorithm layer by layer
[Figure: pipeline (ABCD) -> (ABC) -> (AB) -> (A)]
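The pipelined computation can be sketched as below for a distributive aggregate (SUM). Each level keeps only one open group; when that group closes, its tuple is emitted and pushed to the next level. The function pipesort_pipeline and the three-attribute sample rows are assumptions for illustration, not code from the paper.

```python
def pipesort_pipeline(sorted_rows, n_attrs):
    """One scan over rows sorted on their first n_attrs columns.
    Returns results[k] = list of (prefix of length k, sum), k = n_attrs..1."""
    results = {k: [] for k in range(n_attrs, 0, -1)}
    open_group = {k: None for k in results}   # per level: [prefix, running sum]

    def feed(k, prefix, value):
        """Push a raw row (or a finished tuple of level k+1) into level k."""
        if k == 0:
            return
        key = prefix[:k]
        group = open_group[k]
        if group is not None and group[0] == key:
            group[1] += value
        else:
            flush(k)
            open_group[k] = [key, value]

    def flush(k):
        """Close level k's open group, emit it, and propagate it upward."""
        group = open_group[k]
        if group is not None:
            results[k].append((group[0], group[1]))
            open_group[k] = None
            feed(k - 1, group[0], group[1])

    for row in sorted_rows:
        feed(n_attrs, tuple(row[:n_attrs]), row[n_attrs])
    for k in range(n_attrs, 0, -1):
        flush(k)
    return results

rows = [  # already sorted by (A, B, C); last column is the measure
    ("a1", "b1", "c1", 5),
    ("a1", "b1", "c2", 10),
    ("a1", "b2", "c3", 8),
    ("a2", "b2", "c1", 2),
    ("a2", "b2", "c3", 11),
]
results = pipesort_pipeline(rows, 3)
print(results[2])
print(results[1])
```

Because the input is sorted on (ABC), every group of (AB) and (A) is contiguous, so no level needs a second sort or more than one group in memory.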
Search Lattice
• Directed edge => one attribute less and possible computation
• Level k contains k attributes
  – all = 0 attributes
• Two possible costs for each edge e_ij = i ---> j
• A(e_ij): i is sorted for j
  – e.g. (BCA) -> (BC)
• S(e_ij): i is NOT sorted for j
  – e.g. ABC -> (BCA) -> (BC) or hash
• No parenthesis: order of tuples can be arbitrary
Level 4: ABCD
Level 3: ABC ABD ACD BCD
Level 2: AB AC AD BC BD CD
Level 1: A B C D
Level 0: all
Sorted, (ABC):
A | B | C | sum
a1 | b1 | c1 | 5
a1 | b1 | c2 | 10
a1 | b2 | c3 | 8
a2 | b2 | c1 | 2
a2 | b2 | c3 | 11
Not sorted, ABC:
A | B | C | sum
a2 | b2 | c3 | 11
a1 | b1 | c2 | 10
a2 | b2 | c1 | 2
a1 | b1 | c1 | 5
a1 | b2 | c3 | 8
AB:
A | B | sum
a1 | b1 | 15
a1 | b2 | 8
a2 | b2 | 13
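A short sketch of the lattice itself: the code below generates the levels and edges for ABCD and classifies an edge as A or S by a prefix test on the sort orders. The names search_lattice and edge_kind are hypothetical, and the prefix test is a simplification of the cost model described above.

```python
from itertools import combinations

def search_lattice(attrs):
    """Return (levels, edges): levels[k] lists the GROUP-BYs with k
    attributes, and each edge (parent, child) drops exactly one attribute."""
    levels = {k: ["".join(c) for c in combinations(attrs, k)]
              for k in range(len(attrs) + 1)}
    levels[0] = ["all"]
    edges = []
    for k in range(len(attrs), 0, -1):
        for parent in levels[k]:
            for dropped in parent:
                child = parent.replace(dropped, "") or "all"
                edges.append((parent, child))
    return levels, edges

def edge_kind(parent_order, child_order):
    """'A' if the parent's sort order already serves the child
    (child order is a prefix of it), otherwise 'S' (re-sort or hash)."""
    return "A" if parent_order.startswith(child_order) else "S"

levels, edges = search_lattice("ABCD")
print(levels[2])
print(len(edges), "edges")
print(edge_kind("BCA", "BC"), edge_kind("ABC", "BC"))
```

The two edge_kind calls mirror the slide's examples: (BCA) -> (BC) needs no sort, while ABC must first be re-sorted as (BCA) or hashed.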
PipeSort Output
• Outputs a subgraph O
  – each node has a single parent
  – each node has a sorted order of attributes
• if the node’s sorted order is a prefix of its parent’s sorted order, cost = A(e_ij), else S(e_ij)
  – Mark by A or S
  – At most one A-out-edge
  – Note: for some nodes, there may be no A-out-edge (green in the figure)
Level 4: ACBD
Level 3: ACB ABD ACD BDC
Level 2: AB AC AD BC BD CD
Level 1: A B C D
Level 0: all
[Figure legend: edges are marked Sorted (A) or Not-Sorted (S)]
Goal: Find O with min total cost
Outline: PipeSort Algorithm (1)
• Go from level 0 to N-1
  – here N = 4
• For each level k, find the best way to construct it from level k+1
• uses “min-cost weighted bipartite matching”
• creates k new copies of nodes at level k+1
• edges from original copy
  – cost A(e_ij)
• edges from new copies
  – cost S(e_ij)
Outline: PipeSort Algorithm (2)
• Illustration with a smaller example
• Level k = 1 from level k+1 = 2
  – one new copy (dotted edges)
  – one existing copy (solid edge)
• Assumption for simplicity
  – same cost for all outgoing edges
  – A(e_ij) = A(e_ij’) for all j, j’
  – S(e_ij) = S(e_ij’) for all j, j’
Level 3: ABC
Level 2: AB AC BC, with (A, S) costs AB = (2, 10), AC = (5, 12), BC = (13, 20)
Level 1: A B C
Level 0: all
• After computing the plan, execute all pipelines
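For an example this small, the result of the matching step can be checked by brute force. The sketch below is not a bipartite matching algorithm; it simply tries every child-to-parent assignment under the simplifying assumption above (a parent's first child pays A, each further child pays S, with at most k extra copies). The pairing of the (A, S) costs with AB, AC, BC is read from the slide figure and is an assumption, and best_plan is a hypothetical helper.

```python
from itertools import product

# (A cost, S cost) per level-2 node; the node-to-cost pairing is assumed.
costs = {"AB": (2, 10), "AC": (5, 12), "BC": (13, 20)}
children = ["A", "B", "C"]
k = 1  # level being built: each parent has 1 original copy + k new copies

def best_plan(children, costs, k):
    """Exhaustively search child -> parent assignments for the minimum cost."""
    options = [[p for p in costs if set(c) <= set(p)] for c in children]
    best = None
    for choice in product(*options):
        total, feasible = 0, True
        for parent, (a_cost, s_cost) in costs.items():
            m = choice.count(parent)      # children assigned to this parent
            if m > 1 + k:
                feasible = False          # more children than copies
            elif m >= 1:
                total += a_cost + (m - 1) * s_cost
        if feasible and (best is None or total < best[0]):
            best = (total, dict(zip(children, choice)))
    return best

total_cost, plan = best_plan(children, costs, k)
print(total_cost, plan)
```

Under these costs the cheapest plan computes A and B from AB (one via the sorted copy, one via a re-sorted copy) and C from AC. Which of A and B takes the sorted edge does not change the cost here, because of the equal-cost assumption.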
Outline: PipeSort Algorithm (3)
1. First pipeline is executed by one scan of the data
2. Sort (CBAD) -> (BADC), compute the second pipeline
3. …..
See paper for another PipeHash algorithm
[Figure: plan lattice annotated with A, S costs]
Level 4: ACBD
Level 3: ACB ABD ACD BDC
Level 2: AB AC AD BC BD CD
Level 1: A B C D
Level 0: all