Data Warehousing and Data Cube -...

CompSci 516 Database Systems

Lecture 22: Data Warehousing and Data Cube

Instructor: Sudeepa Roy

Duke CS, Fall 2017 CompSci 516: Database Systems

Reading Material

• [RG] Chapter 25

• Gray, Chaudhuri, Bosworth, Layman, Reichart, Venkatrao, Pellow, Pirahesh, ICDE 1996: “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”

• Harinarayan, Rajaraman, Ullman, SIGMOD 1996: “Implementing data cubes efficiently”

Acknowledgement:
• The following slides have been created adapting the instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke.
• Some slides have been prepared by Prof. Shivnath Babu.

Data Warehousing

Warehousing

• Growing industry: $8 billion way back in 1998
• Data warehouse vendors like Teradata
  – big “petabyte scale” customers
  – Apple, Walmart (2008: 2.5 PB), eBay (2013: primary DW 9.2 PB, other big data 40 PB, single table with 1 trillion rows), Verizon, AT&T, Bank of America
  – supports data into and out of Hadoop
• Lots of buzzwords, hype: slice & dice, rollup, MOLAP, pivot, ...

Ack: Slide by Prof. Shivnath Babu

https://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen/

Motivating Examples

• Forecasting
• Comparing performance of units
• Monitoring, detecting fraud
• Visualization

Introduction

• Organizations analyze current and historical data
  – to identify useful patterns
  – to support business strategies
• Emphasis is on complex, interactive, exploratory analysis of very large datasets
• Created by integrating data from across all parts of an enterprise
• Data is fairly static
• Relevant once again for the recent “Big Data analysis”
  – to figure out what we can reuse, what we cannot


Three Complementary Trends

• Data Warehousing (DW):
  – Consolidate data from many sources in one large repository
  – Loading, periodic synchronization of replicas
  – Semantic integration
• OLAP:
  – Complex SQL queries and views
  – Queries based on spreadsheet-style operations and “multidimensional” view of data
  – Interactive and “online” queries
• Data Mining:
  – Exploratory search for interesting trends and anomalies
  – Next lecture!


Data Warehousing

• A collection of decision support technologies
• To enable people in industry/organizations to make better decisions
  – Supports OLAP (On-Line Analytical Processing)
• Applications in
  – Manufacturing
  – Retail
  – Finance
  – Transportation
  – Healthcare
  – …
• Typically maintained separately from “Operational Databases”
  – Operational Databases support OLTP (On-Line Transaction Processing)

Why a Warehouse?

• Two approaches:
  – Query-Driven (Lazy)
  – Warehouse (Eager)

[Figure: two Sources and a “?”]

Advantages of Warehousing

• High query performance
• Queries not visible outside warehouse
• Local processing at sources unaffected
• Can operate when sources unavailable
• Can query data not stored in a DBMS
• Extra information at warehouse
  – Modify, summarize (store aggregates)
  – Add historical information

Query-Driven Approach

[Figure: Clients, a Mediator, Wrappers, and Sources]

Advantages of Query-Driven

• No need to copy data
  – less storage
  – no need to purchase data
• More up-to-date data
• Query needs can be unknown
• Only query interface needed at sources
• May be less draining on sources

OLTP | Data Warehousing/OLAP
Mostly updates | Mostly reads
Applications: order entry, sales update, banking transactions | Applications: decision support in industry/organization
Detailed, up-to-date data | Summarized, historical data (from multiple operational DBs, grows over time)
Structured, repetitive, short tasks | Query-intensive, ad hoc, complex queries
Each transaction reads/updates only a few tuples (tens of) | Each query can access many records and perform many joins, scans, aggregates
MB-GB data | GB-TB data
Typically clerical users | Decision makers, analysts as users
Important: consistency, recoverability, maximizing transaction throughput | Important: query throughput, response times


Data Marts

• smaller data warehouse
• subsets of data on selected subjects
• e.g. Marketing data mart can include customer, product, sales
• Department-focused, no enterprise-wide consensus needed
• But may lead to complex integration problems in the long run


ROLAP and MOLAP

• Relational OLAP (ROLAP)
  – On top of standard relational DBMS
  – Data is stored in relational DBMS
  – Supports extensions to SQL to access multi-dimensional data
• Multidimensional OLAP (MOLAP)
  – Directly stores multidimensional data in special data structures (e.g. arrays)


Data Warehousing to Mining

• Integrated data spanning long time periods, often augmented with summary information
• Several gigabytes to terabytes common
• Interactive response times expected for complex queries; ad-hoc updates uncommon

[Figure: EXTERNAL DATA SOURCES; EXTRACT, TRANSFORM, LOAD, REFRESH; DATA WAREHOUSE with Metadata Repository; SUPPORTS OLAP and DATA MINING]


Warehousing Issues

• Semantic Integration: When getting data from multiple sources, must eliminate mismatches
  – e.g., different currencies, schemas
• Heterogeneous Sources: Must access data from a variety of source formats and repositories
  – Replication capabilities can be exploited here
• Load, Refresh, Purge: Must load data, periodically refresh it, and purge too-old data
• Metadata Management: Must keep track of source, loading time, and other information for all data in the warehouse


DW Architecture

• Extract data from multiple operational DBs and external sources
• Clean/integrate/transform/store
• Refresh periodically
  – update base and derived data
  – admin decides when and how
• Main DW and several data marts (possibly)
• Managed by one or more servers and front-end tools
• Additional metadata and monitoring/admin tools

ROLAP: Star Schema

• To reflect multi-dimensional views of data
• Single fact table
• Single table for every dimension
• Each tuple in the fact table consists of
  – pointers (foreign keys) to each of the dimensions (multi-dimensional coordinates)
  – numeric value for those coordinates
• Each dimension table contains attributes of that dimension

No support for attribute hierarchies
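A minimal sketch of a star schema, run through Python's built-in sqlite3 module. The table names, columns, and rows (products, locations, times, sales) are illustrative assumptions, not taken from the slides. The fact table stores only foreign keys plus the numeric measure, and an OLAP-style query joins it to the dimension tables.

```python
import sqlite3

# Hypothetical star schema: one fact table (sales) and one table per dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products  (pid INTEGER PRIMARY KEY, pname TEXT, category TEXT);
CREATE TABLE locations (locid INTEGER PRIMARY KEY, city TEXT, state TEXT);
CREATE TABLE times     (timeid INTEGER PRIMARY KEY, date TEXT, year INTEGER);
-- Each fact row holds foreign keys (the coordinates) plus a numeric measure.
CREATE TABLE sales (
    pid    INTEGER REFERENCES products(pid),
    locid  INTEGER REFERENCES locations(locid),
    timeid INTEGER REFERENCES times(timeid),
    units  INTEGER
);
INSERT INTO products  VALUES (1, 'pen', 'stationery'), (2, 'mug', 'kitchen');
INSERT INTO locations VALUES (1, 'Durham', 'NC'), (2, 'Austin', 'TX');
INSERT INTO times     VALUES (1, '2017-01-05', 2017), (2, '2017-02-10', 2017);
INSERT INTO sales VALUES (1, 1, 1, 10), (1, 2, 2, 5), (2, 1, 2, 7);
""")

# Aggregate the measure along two dimensions by joining fact and dimension tables.
rows = conn.execute("""
    SELECT p.category, l.state, SUM(s.units)
    FROM sales s
    JOIN products p  ON s.pid = p.pid
    JOIN locations l ON s.locid = l.locid
    GROUP BY p.category, l.state
    ORDER BY p.category, l.state
""").fetchall()
print(rows)
```

Because category and state live in the dimension tables, changing a dimension attribute touches one small table rather than every fact row.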

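Dimension Hierarchies

• For each dimension, the set of values can be organized in a hierarchy:

[Figure: hierarchy levels per dimension, listed from coarsest to finest]
PRODUCT: category, pname
TIME: year, quarter, week and month, date
LOCATION: country, state, city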


ROLAP: Snowflake Schema

• Refines star schema
• Dimensional hierarchy is explicitly represented
• (+) Dimension tables easier to maintain
  – suppose the “category description” is being changed
• (-) Need additional joins
• Fact Constellations
  – Multiple fact tables share some dimensional tables
  – e.g. Projected and Actual Expenses may share many dimensions

Motivation: OLAP Queries

• Data analysts are interested in exploring trends and anomalies
  – Possibly by visualization (Excel): 2D or 3D plots
  – “Dimensionality Reduction” by summarizing data and computing aggregates
  – Influenced by SQL and by spreadsheets
  – A common operation is to aggregate a measure over one or more dimensions
• Find total unit sales for each
  1. Model
  2. Model, broken into years
  3. Year, broken into colors
  4. Year
  5. Model, broken into color, ….

Sales(Model, Year, Color, Units)
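The queries listed above are each one GROUP BY over Sales. A minimal sketch using Python's built-in sqlite3 module follows; the four sample rows are the Chevy figures used in the roll-up tables later in the lecture, and the helper name total_units is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Model TEXT, Year INTEGER, Color TEXT, Units INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?, ?)",
    [("Chevy", 1994, "Black", 50), ("Chevy", 1994, "White", 40),
     ("Chevy", 1995, "Black", 85), ("Chevy", 1995, "White", 115)],
)

def total_units(*dims):
    """Total unit sales grouped by the given dimension columns."""
    cols = ", ".join(dims)
    sql = f"SELECT {cols}, SUM(Units) FROM Sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()

by_model = total_units("Model")                # 1. Model
by_model_year = total_units("Model", "Year")   # 2. Model, broken into years
by_year_color = total_units("Year", "Color")   # 3. Year, broken into colors
by_year = total_units("Year")                  # 4. Year
```

Every extra breakdown is another query and another pass over Sales, which is the cost the data cube operator tries to avoid.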


OLAP and Data Cube


Histograms

A tabulated frequency of computed values

SELECT Year, COUNT(Units) AS total
FROM Sales
GROUP BY Year
ORDER BY Year

May require a nested SELECT to compute

[Figure: histogram of total per Year]

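Roll-Ups

• Analysis reports start at a coarse level, go to finer levels
• Order of attributes matters
• Not relational data (empty cells, no keys)

Model | Year | Color | Model, Year, Color | Model, Year | Model
Chevy | 1994 | Black | 50 | |
Chevy | 1994 | White | 40 | |
 | | | | 90 |
Chevy | 1995 | Black | 85 | |
Chevy | 1995 | White | 115 | |
 | | | | 200 |
 | | | | | 290

[Figure labels: the last three columns are GROUP BY levels; arrows marked Roll-ups and Drill-downs]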

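Roll-Ups

• Another representation (Chris Date ’96)
• Relational, but
  – long attribute names
  – hard to express in SQL and repetition

Model | Year | Color | Model, Year, Color | Model, Year | Model
Chevy | 1994 | Black | 50 | 90 | 290
Chevy | 1994 | White | 40 | 90 | 290
Chevy | 1995 | Black | 85 | 200 | 290
Chevy | 1995 | White | 115 | 200 | 290

[Figure label: the last three columns are GROUP BY levels]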


‘ALL’ Construct

Easier to visualize roll-up if we allow ALL to fill in the super-aggregates

SELECT Model, Year, Color, SUM(Units)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model, Year, Color
UNION
SELECT Model, Year, 'ALL', SUM(Units)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model, Year
UNION
…
UNION
SELECT 'ALL', 'ALL', 'ALL', SUM(Units)
FROM Sales
WHERE Model = 'Chevy';

‘ALL’ Roll-Up

Model | Year | Color | Units
Chevy | 1994 | Black | 50
Chevy | 1994 | White | 40
Chevy | 1994 | ‘ALL’ | 90
Chevy | 1995 | Black | 85
Chevy | 1995 | White | 115
Chevy | 1995 | ‘ALL’ | 200
Chevy | ‘ALL’ | ‘ALL’ | 290

Traditional Roll-Up

Model | Year | Color | Model, Year, Color | Model, Year | Model
Chevy | 1994 | Black | 50 | |
Chevy | 1994 | White | 40 | |
 | | | | 90 |
Chevy | 1995 | Black | 85 | |
Chevy | 1995 | White | 115 | |
 | | | | 200 |
 | | | | | 290

• Roll-ups are asymmetric
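A minimal sketch that reproduces the ‘ALL’ roll-up with Python's built-in sqlite3 module. The slide elides the middle of the UNION chain; the three-branch query below is one plausible completion for this example, not the slide's exact text, and it keeps Model in the last branch so the grand-total row reads Chevy, ALL, ALL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Model TEXT, Year INTEGER, Color TEXT, Units INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?, ?)",
    [("Chevy", 1994, "Black", 50), ("Chevy", 1994, "White", 40),
     ("Chevy", 1995, "Black", 85), ("Chevy", 1995, "White", 115)],
)

# One GROUP BY per roll-up level, glued together with UNION.
rollup_sql = """
SELECT Model, Year, Color, SUM(Units) FROM Sales
WHERE Model = 'Chevy' GROUP BY Model, Year, Color
UNION
SELECT Model, Year, 'ALL', SUM(Units) FROM Sales
WHERE Model = 'Chevy' GROUP BY Model, Year
UNION
SELECT Model, 'ALL', 'ALL', SUM(Units) FROM Sales
WHERE Model = 'Chevy' GROUP BY Model
"""
rollup = set(conn.execute(rollup_sql).fetchall())
for row in sorted(rollup, key=str):
    print(row)
```

Only Color and then Year are ever replaced by 'ALL', in that order, which is the asymmetry noted above: there is no row for Chevy, ALL, Black.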


Cross Tabulation

• If we made the roll-up symmetric, we would get a cross-tabulation
• Generalizes to higher dimensions

SELECT Model, 'ALL', Color, SUM(Units)
FROM Sales
WHERE Model = 'Chevy'
GROUP BY Model, Color

Chevy | 1994 | 1995 | Total (ALL)
Black | 50 | 85 | 135
White | 40 | 115 | 155
Total (ALL) | 90 | 200 | 290

Is the problem solved with Cross-Tab and GROUP-BYs with ‘ALL’?
• Requires a lot of GROUP BYs (64 for 6 dimensions)
• Too complex to optimize (64 scans, 64 sort/hash, slow)
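The cross-tab above can be sketched in plain Python: each sale contributes to its own cell, to its row total, to its column total, and to the grand total. The function name cross_tab and the dictionary layout are assumptions made for this illustration.

```python
from collections import defaultdict

sales = [("Chevy", 1994, "Black", 50), ("Chevy", 1994, "White", 40),
         ("Chevy", 1995, "Black", 85), ("Chevy", 1995, "White", 115)]

def cross_tab(rows):
    """Sum Units into a Color x Year grid, with 'ALL' as the total row/column."""
    grid = defaultdict(int)
    for _model, year, color, units in rows:
        for c in (color, "ALL"):
            for y in (year, "ALL"):
                grid[(c, y)] += units
    return dict(grid)

tab = cross_tab(sales)
for color in ("Black", "White", "ALL"):
    print(color, [tab[(color, y)] for y in (1994, 1995, "ALL")])
```

Unlike the roll-up, both dimensions get an 'ALL' entry independently, so the result is symmetric in Year and Color.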


Naïve Approach

Run a number of queries

SELECT sum(units)
FROM Sales

SELECT Color, sum(units)
FROM Sales
GROUP BY Color

SELECT Year, sum(units)
FROM Sales
GROUP BY Year

SELECT Model, Year, sum(units)
FROM Sales
GROUP BY Model, Year
….

• Data cube generalizes Histogram, Roll-Ups, Cross-Tabs
• More complex to do these with GROUP-BY
• How many sub-queries?
• How many sub-queries for 8 attributes?
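One way to answer the counting question is to enumerate the GROUP BY column sets directly: every subset of the attributes, including the empty set for the grand total, is one sub-query. The helper name group_by_sets is made up for this sketch.

```python
from itertools import combinations

def group_by_sets(attrs):
    """All subsets of the attributes; each subset is one GROUP BY query."""
    return [subset
            for k in range(len(attrs) + 1)
            for subset in combinations(attrs, k)]

queries = group_by_sets(["Model", "Year", "Color"])
print(len(queries))                          # 2**3 sub-queries for 3 attributes
print(len(group_by_sets(list("ABCDEFGH"))))  # 2**8 sub-queries for 8 attributes
```

The count doubles with each added attribute, which is why running one query per subset does not scale.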


[Figure: cube with axes Model, Year, Color; cells hold total unit sales]


Data Cube: Intuition

SELECT 'ALL', 'ALL', 'ALL', sum(units)
FROM Sales
UNION
SELECT 'ALL', 'ALL', Color, sum(units)
FROM Sales
GROUP BY Color
UNION
SELECT 'ALL', Year, 'ALL', sum(units)
FROM Sales
GROUP BY Year
UNION
SELECT Model, Year, 'ALL', sum(units)
FROM Sales
GROUP BY Model, Year
UNION ….

[Figure: cube with axes Model, Year, Color; cells hold total unit sales]


Data Cube

Ack: from slides by Laurel Orr and Jeremy Hyrkas, UW

Data Cube

• Computes the aggregate on all possible combinations of group-by columns.
• If there are N attributes, there are $2^N - 1$ super-aggregates.
• If the cardinalities of the N attributes are $C_1, \ldots, C_N$, then there are a total of $(C_1 + 1) \cdots (C_N + 1)$ values in the cube.
• ROLL-UP is similar but just looks at N aggregates

Data Cube Syntax

• SQL Server

SELECT Model, Year, Color, sum(units)
FROM Sales
GROUP BY Model, Year, Color
WITH CUBE
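To make the cube's contents concrete, here is a small plain-Python emulation of what a CUBE over Model, Year, Color with SUM would contain. It is a sketch of the semantics only, not how SQL Server evaluates WITH CUBE, and it writes the string 'ALL' for aggregated-away dimensions as in the earlier slides. The function name cube is an assumption.

```python
from collections import defaultdict
from itertools import product

sales = [("Chevy", 1994, "Black", 50), ("Chevy", 1994, "White", 40),
         ("Chevy", 1995, "Black", 85), ("Chevy", 1995, "White", 115)]

def cube(rows, n_dims):
    """SUM the last column over every combination of dimensions,
    writing 'ALL' for each dimension that is aggregated away."""
    cells = defaultdict(int)
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        # Each mask says which dimensions keep their value (True) or become 'ALL'.
        for mask in product([True, False], repeat=n_dims):
            key = tuple(d if keep else "ALL" for d, keep in zip(dims, mask))
            cells[key] += measure
    return dict(cells)

result = cube(sales, 3)
print(len(result), result[("ALL", "ALL", "ALL")])
```

With cardinalities 1, 2, 2 for Model, Year, Color, the cell count matches the formula $(1+1)(2+1)(2+1)$ because this sample has a row for every Year and Color combination.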


Types of Aggregates

• Distributive: input can be partitioned into disjoint sets and aggregated separately
  – COUNT, SUM, MIN
• Algebraic: can be composed of distributive aggregates
  – AVG
• Holistic: aggregate must be computed over the entire input set
  – MEDIAN
• Efficient computation of the CUBE operator depends on the type of aggregate
  – Distributive and Algebraic aggregates motivate optimizations
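A small sketch of the three classes on two disjoint partitions of made-up values: SUM combines directly, AVG combines through the pair (SUM, COUNT), and MEDIAN cannot be recovered from per-partition medians.

```python
from statistics import median

part1, part2 = [50, 40], [85, 115, 10]

# Distributive: SUM of the partial SUMs equals SUM over the whole input.
total = sum([sum(part1), sum(part2)])

# Algebraic: AVG is derived from two distributive aggregates, (SUM, COUNT).
partials = [(sum(p), len(p)) for p in (part1, part2)]
avg = sum(s for s, _ in partials) / sum(c for _, c in partials)

# Holistic: the median of partial medians is generally not the true median.
median_of_medians = median([median(part1), median(part2)])
true_median = median(part1 + part2)

print(total, avg, median_of_medians, true_median)
```

This is why a coarser group-by such as AB can be computed from an already aggregated ABC for SUM or AVG, but in general not for MEDIAN.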

Implementing Data Cube

Basic Ideas

• Need to compute all group-bys:
  – ABCD, ABC, ABD, ACD, BCD, AB, AC, AD, BC, BD, CD, A, B, C, D
• Compute GROUP-BYs from previously computed GROUP-BYs
  – e.g. first ABCD
  – then ABC or ACD
  – then AB or AC …
• The order in which ABCD is sorted matters for subsequent computations
  – if (ABCD) is the sorted order, ABC is cheap, ACD or BCD is expensive

Notations

• ABCD
  – group-by on attributes A, B, C, D
  – no guarantee on the order of tuples
• (ABCD)
  – sorted according to A -> B -> C -> D
• ABCD and (ABCD) and (BCDA)
  – all contain the same results
  – but in different sorted order

Optimization 1: Smallest Parent

• Compute GROUP-BY from the smallest (size) previously computed GROUP-BY as a parent
  – AB can be computed from ABC, ABD, or ABCD
  – ABC or ABD better than ABCD
  – Even ABC or ABD may have different sizes, try to choose the smaller parent

LATTICE STRUCTURE of data cube

ABCD
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D
all
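A minimal sketch of the smallest-parent choice on the lattice above. The result sizes are hypothetical numbers chosen for illustration, and the helpers parents and smallest_parent are not from the slides; only direct parents (one extra attribute) are considered.

```python
def parents(node, attrs="ABCD"):
    """Group-bys with exactly one more attribute than `node`."""
    return ["".join(sorted(node + a)) for a in attrs if a not in node]

def smallest_parent(node, sizes):
    """Pick the already computed parent with the fewest tuples."""
    return min(parents(node), key=lambda p: sizes[p])

# Hypothetical result sizes (number of tuples) for the level-3 group-bys.
sizes = {"ABC": 900, "ABD": 400, "ACD": 700, "BCD": 650}

choice = smallest_parent("AB", sizes)
print(parents("AB"), choice)
```

Scanning the 400-tuple ABD instead of the 900-tuple ABC does the same aggregation with less input.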

Optimization 2: Cache Results

• Cache result of one GROUP-BY in memory to reduce disk I/O
  – Compute AB from ABC while ABC is still in memory


Optimization 3: Amortize Disk Scans

• Amortize disk reads for multiple GROUP-BYs
  – Suppose the result for ABCD is stored on disk
  – Compute all of ABC, ABD, ACD, BCD simultaneously in one scan of ABCD
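A sketch of amortizing one scan across several group-bys, with an in-memory list standing in for the on-disk ABCD result. The rows are made-up values, and SUM is assumed as the aggregate.

```python
from collections import defaultdict

# Hypothetical ABCD group-by result: (a, b, c, d, sum).
abcd = [("a1", "b1", "c1", "d1", 5), ("a1", "b1", "c2", "d1", 10),
        ("a1", "b2", "c1", "d2", 8), ("a2", "b2", "c1", "d1", 2)]

# Positions of the attributes kept by each child group-by.
children = {"ABC": (0, 1, 2), "ABD": (0, 1, 3), "ACD": (0, 2, 3), "BCD": (1, 2, 3)}
results = {name: defaultdict(int) for name in children}

# One scan of ABCD feeds all four children at once.
for row in abcd:
    for name, idx in children.items():
        key = tuple(row[i] for i in idx)
        results[name][key] += row[4]

print(dict(results["ABD"]))
```

Each ABCD tuple is read once and contributes to four hash tables, instead of ABCD being read four times.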


Optimization 4, 5 (next)

• 4. Share-sort
  – for sort-based algorithms
  – pipe-sort algorithm
  – covered in class
• 5. Shared-partition
  – for hash-based algorithms
  – pipe-hash algorithm
    • Uses hash tables to compute smaller GROUP-BYs
    • If the hash tables for AB and AC fit in memory, compute both in one scan of ABC
    • Otherwise can partition on A, and can compute HTs of AB and AC in different partitions
  – not covered (see paper)


PipeSort: Idea

• Combines two optimizations: “shared-sorts” and “smallest-parent”
• Also includes “cache-results” and “amortized-scans”

PipeSort: Share-sort optimization

• Data sorted in one order
• Compute all GROUP-BYs prefixed in that order
• Compute one tuple of ABCD, propagate upward in the pipeline by a single scan
• Example:
  – GROUP-BY over attributes ABCD
  – Sort raw data by (ABCD)
  – Compute (ABCD) -> (ABC) -> (AB) -> (A) in pipelined fashion
  – No additional sort needed
• BUT, may have a conflict with “smallest-parent” optimization
  – (ABD) -> (AB) could be a better choice
  – Figure out the best parent choice by running a weighted-matching algorithm layer by layer

[Figure: pipeline (ABCD) -> (ABC) -> (AB) -> (A)]
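A simplified sketch of the pipeline (ABCD) -> (ABC) -> (AB) -> (A). It assumes the input is already sorted on (A, B, C, D) and uses SUM. Instead of passing finished tuples from level to level, it keeps one running group per prefix length, which has the same effect: a single scan and constant memory per level. The function name and sample rows are made up.

```python
def pipesort_pipeline(sorted_rows, n_dims):
    """One scan over rows sorted by all n_dims attributes. Keeps a single
    running (key, sum) per prefix length and emits a group when its key changes."""
    out = {k: [] for k in range(1, n_dims + 1)}   # prefix length -> finished groups
    current = {k: None for k in out}              # prefix length -> [key, running sum]
    for row in sorted_rows:
        dims, measure = row[:n_dims], row[n_dims]
        for k in out:
            key = dims[:k]
            if current[k] is not None and current[k][0] == key:
                current[k][1] += measure
            else:
                if current[k] is not None:
                    out[k].append(tuple(current[k]))
                current[k] = [key, measure]
    for k in out:                                 # flush the last open groups
        if current[k] is not None:
            out[k].append(tuple(current[k]))
    return out

rows = [("a1", "b1", "c1", "d1", 5), ("a1", "b1", "c2", "d1", 10),
        ("a1", "b2", "c1", "d2", 8), ("a2", "b2", "c1", "d1", 2)]
levels = pipesort_pipeline(rows, 4)
print(levels[2])   # the (AB) group-by
print(levels[1])   # the (A) group-by
```

This only works for prefixes of the sort order; a group-by such as BCD would see its groups interleaved and needs a different sort or a hash table.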

Search Lattice

• Directed edge => one attribute less and possible computation
• Level k contains k attributes
  – all = 0 attributes
• Two possible costs for each edge $e_{ij}$ = i ---> j
  – $A(e_{ij})$: i is sorted for j, e.g. (BCA) -> (BC)
  – $S(e_{ij})$: i is NOT sorted for j, e.g. ABC -> (BCA) -> (BC) or hash

Level 4: ABCD
Level 3: ABC ABD ACD BCD
Level 2: AB AC AD BC BD CD
Level 1: A B C D
Level 0: all

Sorted:

A | B | C | sum
a1 | b1 | c1 | 5
a1 | b1 | c2 | 10
a1 | b2 | c3 | 8
a2 | b2 | c1 | 2
a2 | b2 | c3 | 11

Not Sorted:

A | B | C | sum
a2 | b2 | c3 | 11
a1 | b1 | c2 | 10
a2 | b2 | c1 | 2
a1 | b1 | c1 | 5
a1 | b2 | c3 | 8

Result for AB:

A | B | sum
a1 | b1 | 15
a1 | b2 | 8
a2 | b2 | 13

• No parenthesis: order of tuples can be arbitrary
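The two edge costs can be illustrated on the tables above. When the parent is sorted on (A, B, C), AB falls out of one streaming pass (the A cost). When it is not, a sort has to be paid for first (the S cost). The function names are made up for this sketch, and hashing would be an alternative to the sort.

```python
from itertools import groupby

sorted_abc = [("a1", "b1", "c1", 5), ("a1", "b1", "c2", 10), ("a1", "b2", "c3", 8),
              ("a2", "b2", "c1", 2), ("a2", "b2", "c3", 11)]
unsorted_abc = [("a2", "b2", "c3", 11), ("a1", "b1", "c2", 10), ("a2", "b2", "c1", 2),
                ("a1", "b1", "c1", 5), ("a1", "b2", "c3", 8)]

def ab_from_sorted(rows):
    """A-cost path: input is already sorted on (A, B, ...), so one streaming pass."""
    return [(a, b, sum(r[3] for r in grp))
            for (a, b), grp in groupby(rows, key=lambda r: (r[0], r[1]))]

def ab_from_unsorted(rows):
    """S-cost path: pay for a sort first, then reuse the streaming pass."""
    return ab_from_sorted(sorted(rows))

print(ab_from_sorted(sorted_abc))
print(ab_from_unsorted(unsorted_abc))
```

Both paths produce the same AB table; the difference PipeSort cares about is the extra sort on the second path.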

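PipeSort Output

• Outputs a subgraph O
  – each node has a single parent
  – each node has a sorted order of attributes
    • if the node's order is a prefix of its parent's sorted order, cost = $A(e_{ij})$, else $S(e_{ij})$
  – Mark by A or S
  – At most one A-out-edge
  – Note: for some nodes, there may be no green A-out-edge

[Figure: output subgraph with each node written in its chosen sort order; edges marked Sorted (A) or Not-Sorted (S)]
Level 4: ACBD
Level 3: ACB ABD ACD BDC
Level 2: AB AC AD BC BD CD
Level 1: A B C D
Level 0: all

Goal: Find O with min total cost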

Outline: PipeSort Algorithm (1)

• Go from level 0 to N-1
  – here N = 4
• For each level k, find the best way to construct it from level k+1
• Uses “min-cost weighted bipartite matching”
• Creates k new copies of nodes at level k+1
• Edges from original copy: cost $A(e_{ij})$
• Edges from new copies: cost $S(e_{ij})$

Outline: PipeSort Algorithm (2)

• Illustration with a smaller example
• Level k = 1 from level k+1 = 2
  – one new copy (dotted edges)
  – one existing copy (solid edge)
• Assumption for simplicity
  – same cost for all outgoing edges
  – $A(e_{ij}) = A(e_{ij'})$ for all j, j'
  – $S(e_{ij}) = S(e_{ij'})$ for all j, j'

Level 3: ABC
Level 2: AB AC BC, with A, S costs 2,10  5,12  13,20
Level 1: A B C
Level 0: all
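A sketch of the level-1-from-level-2 step on this small example. It assumes the cost pairs 2,10, 5,12 and 13,20 belong to AB, AC and BC in that order, with the first number the A cost and the second the S cost. A real implementation would use a polynomial-time weighted bipartite matching algorithm; brute force over all assignments is swapped in here because the instance is tiny.

```python
from itertools import permutations

# (A cost, S cost) per level-2 node, read from the slide's figure left to right.
costs = {"AB": (2, 10), "AC": (5, 12), "BC": (13, 20)}
children = ["A", "B", "C"]

# Each parent contributes its original vertex (A cost) and one copy (S cost).
vertices = [(p, kind, c) for p, (a, s) in costs.items()
            for kind, c in (("A", a), ("S", s))]

def best_plan():
    """Brute-force min-cost matching of every child to a distinct parent vertex."""
    best = None
    for combo in permutations(vertices, len(children)):
        # A child can only be computed from a parent that contains its attribute.
        if all(child in parent for child, (parent, _, _) in zip(children, combo)):
            total = sum(c for _, _, c in combo)
            if best is None or total < best[0]:
                best = (total, dict(zip(children, combo)))
    return best

total_cost, plan = best_plan()
print(total_cost, plan)
```

Because each parent has only one original vertex, at most one child can take its cheap A edge; a second child of the same parent has to use the copy and pay the S cost.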

After computing the plan, execute all pipelines

Outline: PipeSort Algorithm (3)

1. First pipeline is executed by one scan of the data
2. Sort (CBAD) -> (BADC), compute the second pipeline
3. …..

See paper for another algorithm, PipeHash

[Figure: the plan over the lattice, nodes written in their chosen sort orders and annotated with A, S costs]
ACBD
ACB ABD ACD BDC
AB AC AD BC BD CD
A B C D
all
