BIG DATA SYSTEMS

CS 564 - Fall 2016

ACKs: Magda Balazinska

Source: pages.cs.wisc.edu/~paris/cs564-f16/lectures/lecture-18.pdf

BIG DATA

Definition from industry:
• high volume
• high variety
• high velocity

CS564 [Fall 2015] - Paris Koutris

VOLUME

• Databases parallelize easily; techniques available from the 80s (GAMMA project)
  – data partitioning
  – parallel query processing

• SQL is embarrassingly parallel

VARIETY

• complex workloads:
  – Machine Learning tasks: e.g. click prediction, topic modeling, SVM, k-means

• various types of data:
  – text data
  – semi-structured data
  – graph data
  – multimedia (video, photos)

VELOCITY

• data is generated very fast and needs to be processed very fast
  – real-time analytics
  – data streaming (each data item can be processed only once!)
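
The "processed only once" constraint can be illustrated with a one-pass computation. The sketch below is only an illustration, not something from the lecture: the function name `running_mean` and the sensor readings are made up. It keeps a constant amount of state and never revisits an item.

```python
from typing import Iterable, Iterator


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all items seen so far, reading each item exactly once."""
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        # Incremental update: past items are neither stored nor revisited.
        mean += (x - mean) / count
        yield mean


# Hypothetical sensor readings arriving one at a time (an iterator can only be read once).
readings = iter([10.0, 20.0, 30.0, 40.0])
means = list(running_mean(readings))
print(means)  # [10.0, 15.0, 20.0, 25.0]
```

Only `count` and `mean` are kept between items, so memory use does not grow with the length of the stream.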

ANOTHER V: VERACITY

The data collected is often uncertain:
• inconsistent data
• incomplete data
• ambiguous data

Example: sensor data

DATA LANDSCAPE

SOME EXAMPLES

• Greenplum: founded in 2003 and acquired by EMC in 2010. A parallel shared-nothing DBMS

• Vertica: founded in 2005 and acquired by HP in 2011. A parallel column-store shared-nothing DBMS

• Aster Data: founded in 2005 and acquired by Teradata in 2011. A parallel, shared-nothing, MapReduce-based data processing system

• Netezza: founded in 2000 and acquired by IBM in 2010. A parallel shared-nothing DBMS

WE WILL SEE 2 APPROACHES

• Parallel databases, started in the 80s
  – OLTP (transaction processing)
  – OLAP (decision support queries)

• MapReduce
  – first developed by Google, published in 2004
  – only for decision support queries
  – ecosystem around it: Hadoop, Pig Latin, Hive, …

PARALLEL DBMS

• The goal is to improve performance by executing multiple operations in parallel (scale-out)

• Terminology to measure performance:
  – Speed-up: using more processors, how much faster does the task run (if problem size is fixed)?
  – Scale-up: using more processors, does performance remain the same as we increase the problem size?
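
The two metrics can be written as simple ratios of running times. The sketch below uses a common formulation and is not taken from the slides; the function names and the timings are hypothetical.

```python
def speed_up(time_one_proc: float, time_p_procs: float) -> float:
    """Fixed problem size: T(1 processor) / T(p processors).

    Linear (ideal) speed-up would equal p.
    """
    return time_one_proc / time_p_procs


def scale_up(time_small: float, time_large: float) -> float:
    """T(1 processor, size N) / T(p processors, size p*N).

    Linear (ideal) scale-up would equal 1.0: the running time stays the same.
    """
    return time_small / time_large


# Hypothetical timings in seconds.
print(speed_up(100.0, 25.0))   # 4.0 with 4 processors would be linear speed-up
print(scale_up(100.0, 125.0))  # 0.8: the 4x larger problem on 4 processors is slower
```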

SCALE-UP VS SCALE-OUT

Scale-up
• using more powerful machines, more processors/RAM per machine

Scale-out
• using a larger number of servers

ARCHITECTURES

• Shared memory
  – nodes share RAM + disk
  – easy to program, expensive to scale

• Shared disk
  – nodes access the same disk, hard to scale

• Shared nothing
  – nodes have their own RAM + disk
  – connected through a fast network

PARALLEL QUERY EVALUATION

• Inter-query parallelism:
  – each query runs on one processor

• Inter-operator parallelism:
  – each query runs on multiple processors
  – an operator runs on one processor

• Intra-operator parallelism:
  – an operator runs on multiple processors

PARALLEL DATA STORAGE

Horizontal data partitioning
• block partitioned
• hash partitioned
• range partitioned

Uniform vs skewed partitioning
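
A minimal sketch of the three partitioning schemes, with nodes simulated as Python lists. Several choices here are assumptions and not from the slides: block partitioning is done round-robin (one way to get equal-sized partitions), the hash function is `zlib.crc32`, and the rows and range boundaries are invented.

```python
import bisect
import zlib
from typing import List, Tuple

Row = Tuple[int, str]  # hypothetical (partitioning key, payload)


def block_partition(rows: List[Row], p: int) -> List[List[Row]]:
    """Ignore the key; deal rows out round-robin so sizes differ by at most one."""
    parts: List[List[Row]] = [[] for _ in range(p)]
    for i, row in enumerate(rows):
        parts[i % p].append(row)
    return parts


def hash_partition(rows: List[Row], p: int) -> List[List[Row]]:
    """Send each row to node hash(key) mod p; equal keys land on the same node."""
    parts: List[List[Row]] = [[] for _ in range(p)]
    for row in rows:
        h = zlib.crc32(str(row[0]).encode("utf-8"))
        parts[h % p].append(row)
    return parts


def range_partition(rows: List[Row], boundaries: List[int]) -> List[List[Row]]:
    """Partition i holds keys k with boundaries[i-1] <= k < boundaries[i]."""
    parts: List[List[Row]] = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect.bisect_right(boundaries, row[0])].append(row)
    return parts


rows = [(k, f"row{k}") for k in [1, 2, 3, 50, 51, 52, 53, 54, 99]]
print([len(part) for part in block_partition(rows, 3)])         # [3, 3, 3]
print([len(part) for part in hash_partition(rows, 3)])
print([len(part) for part in range_partition(rows, [10, 20])])  # [3, 0, 6]
```

The last line shows skew: with badly chosen boundaries, one node holds six of the nine rows while another holds none.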

PARALLEL QUERY EVALUATION

• Parallel Selection

• Parallel Join
  – hash join
  – broadcast join
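
The operators above can be sketched on simulated nodes, where each node is a Python list of rows. This is an illustration under assumptions, not the lecture's implementation: all function names are hypothetical, join keys are integers, and the hash function is simply `key mod p`.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, str]          # hypothetical (join key, payload)
Partitions = List[List[Row]]   # one list of rows per simulated node
Joined = Tuple[int, str, str]


def parallel_select(parts: Partitions, pred: Callable[[Row], bool]) -> Partitions:
    """Each node filters only its own partition; no data moves between nodes."""
    return [[row for row in part if pred(row)] for part in parts]


def local_hash_join(r: List[Row], s: List[Row]) -> List[Joined]:
    """Single-node hash join: build a hash table on r, then probe it with s."""
    table: Dict[int, List[str]] = defaultdict(list)
    for key, payload in r:
        table[key].append(payload)
    return [(key, rp, sp) for key, sp in s for rp in table.get(key, [])]


def reshuffle(parts: Partitions, p: int) -> Partitions:
    """Send every row to node (key mod p), so rows with equal keys meet on one node."""
    out: Partitions = [[] for _ in range(p)]
    for part in parts:
        for row in part:
            out[row[0] % p].append(row)
    return out


def parallel_hash_join(r_parts: Partitions, s_parts: Partitions, p: int) -> List[List[Joined]]:
    """Repartition both relations on the join key, then join locally on each node."""
    r_new, s_new = reshuffle(r_parts, p), reshuffle(s_parts, p)
    return [local_hash_join(r_new[i], s_new[i]) for i in range(p)]


def broadcast_join(small: List[Row], big_parts: Partitions) -> List[List[Joined]]:
    """Copy the small relation to every node; the big relation stays where it is."""
    return [local_hash_join(small, part) for part in big_parts]


r_parts = [[(1, "a"), (2, "b")], [(3, "c")]]
s_parts = [[(3, "x"), (1, "y")], [(2, "z"), (4, "w")]]
hash_result = parallel_hash_join(r_parts, s_parts, 2)
bcast_result = broadcast_join([row for part in r_parts for row in part], s_parts)
print(hash_result)
print(bcast_result)
```

Both joins return the same set of tuples. They differ in what travels over the network: the hash join moves both relations, the broadcast join moves only the small one (once per node).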

MAPREDUCE

• Google [Dean 2004]
• Open source implementation: Hadoop
• MapReduce:
  – high-level programming model and implementation for large-scale parallel data processing
  – designed to simplify task of writing parallel programs

MAPREDUCE

• Hides messy details in MapReduce runtime library
  – automatic parallelization
  – load balancing
  – network and disk transfer optimizations
  – handling of failures
  – robustness

MAPREDUCE PIPELINE

• read the partitioned data (HDFS, GFS)
• Map: extract something you care about from each record
• Shuffle and Sort (done by the system)
• Reduce: aggregate, summarize, filter, transform
• write the results
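
The pipeline steps can be mimicked by a toy, single-process runtime. The sketch below is not how Hadoop is implemented: `run_mapreduce` is a hypothetical helper, the input is an in-memory list rather than a file in HDFS, and the "year,temperature" records are invented.

```python
from itertools import groupby
from operator import itemgetter
from typing import Callable, Iterable, List, Tuple

KV = Tuple[object, object]


def run_mapreduce(
    records: Iterable[KV],
    map_fn: Callable[[object, object], Iterable[KV]],
    reduce_fn: Callable[[object, List[object]], Iterable[KV]],
) -> List[KV]:
    # Map: apply map_fn to every (key, value) record.
    intermediate: List[KV] = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle and Sort: order by intermediate key so equal keys become adjacent.
    intermediate.sort(key=itemgetter(0))
    # Reduce: call reduce_fn once per distinct intermediate key.
    output: List[KV] = []
    for ikey, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(ikey, [v for _, v in group]))
    return output


def map_temperature(offset, line):
    """Extract what we care about from each record: (year, temperature)."""
    year, temp = line.split(",")
    yield (year, int(temp))


def reduce_max(year, temps):
    """Aggregate: keep the highest temperature seen for the year."""
    yield (year, max(temps))


records = [(0, "1950,22"), (1, "1950,31"), (2, "1949,11")]
result = run_mapreduce(records, map_temperature, reduce_max)
print(result)  # [('1949', 11), ('1950', 31)]
```

In a real system the map and reduce calls run on many machines, and the shuffle moves data across the network; only the user-supplied `map_fn` and `reduce_fn` would look similar.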

MAPREDUCE DATAFLOW

source: Hadoop: The Definitive Guide, by Tom White

DATA MODEL

• A file = a bag of (key, value) pairs

• A MapReduce program:
  – Input: a bag of (inputkey, value) pairs
  – Output: a bag of (outputkey, value) pairs

THE MAP FUNCTION

User provides the MAP function:
• Input: (inputkey, value)
• Output: bag of (intermediatekey, value)

The system applies the map function in parallel to all (inputkey, value) pairs in the input file

THE REDUCE FUNCTION

User provides the REDUCE function:
• Input: (intermediatekey, bag of values)
• Output: bag of (outputkey, values)

The system groups all pairs with the same intermediate key, and passes the bag of values to the REDUCE function

EXAMPLE: WORD COUNT

• Count the number of occurrences of each word in a large collection of documents

• Each Document
  – key = document id (did)
  – value = set of words (word)
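
A sketch of the MAP and REDUCE functions for word count, with the grouping step simulated by a dictionary. The names `map_wc`, `reduce_wc` and `word_count` are hypothetical, and the value of each document is assumed to be its text, split on whitespace.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def map_wc(did: str, text: str) -> Iterator[Tuple[str, int]]:
    """MAP: for every word in the document, emit (word, 1)."""
    for word in text.split():
        yield (word, 1)


def reduce_wc(word: str, counts: List[int]) -> Iterator[Tuple[str, int]]:
    """REDUCE: sum all the 1s emitted for this word."""
    yield (word, sum(counts))


def word_count(docs: Dict[str, str]) -> Dict[str, int]:
    # Stand-in for the system's shuffle: group map outputs by intermediate key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for did, text in docs.items():
        for word, one in map_wc(did, text):
            groups[word].append(one)
    result: Dict[str, int] = {}
    for word, counts in groups.items():
        for out_word, total in reduce_wc(word, counts):
            result[out_word] = total
    return result


# Hypothetical document collection: did -> text.
docs = {"d1": "big data big systems", "d2": "big data"}
counts = word_count(docs)
print(counts)  # {'big': 3, 'data': 2, 'systems': 1}
```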

MAPREDUCE JOBS

• A MapReduce job consists of one single “query”
  – e.g. count the words in all docs

• More complex queries may consist of multiple jobs
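
A query that needs two grouping steps can be expressed as two jobs, where the output of the first is the input of the second. The sketch below assumes a toy in-memory helper `run_job` and an invented page-visit log. Job 1 counts visits per URL, and job 2 groups URLs by their visit count.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Toy single-process job: map, group by intermediate key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    out = []
    for ikey in sorted(groups):
        out.extend(reduce_fn(ikey, groups[ikey]))
    return out


# Job 1: number of visits per URL.
def map_visits(line_no, url):
    yield (url, 1)


def reduce_visits(url, ones):
    yield (url, sum(ones))


# Job 2: which URLs share the same number of visits.
def map_by_count(url, count):
    yield (count, url)


def reduce_by_count(count, urls):
    yield (count, sorted(urls))


# Hypothetical log: (line number, visited URL).
log = [(0, "/a"), (1, "/b"), (2, "/a"), (3, "/c"), (4, "/b"), (5, "/a")]
visits = run_job(log, map_visits, reduce_visits)
by_count = run_job(visits, map_by_count, reduce_by_count)
print(visits)    # [('/a', 3), ('/b', 2), ('/c', 1)]
print(by_count)  # [(1, ['/c']), (2, ['/b']), (3, ['/a'])]
```

The second grouping key (the count) only exists after the first job has finished, which is why a single map and reduce pass is not enough here.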

MAPREDUCE ECOSYSTEM

Lots of extensions to address limitations:
• Capabilities to write DAGs of MapReduce jobs
• Declarative languages
• Most companies use both types of engines (MR and DBMS), with increased integration
• Potential replacement for MapReduce: Spark

MAPREDUCE ECOSYSTEM

Pig Latin (Yahoo!)
• New language, like Relational Algebra
• open source

Hive (Facebook)
• SQL-like language
• open source

SQL / Tenzing (Google)
• SQL on MR
• Proprietary
  – morphed into BigQuery

PARALLEL DBMS VS MAPREDUCE

Parallel DBMS:
• Relational data model and schema
• Declarative query language: SQL
• Can easily combine operators into complex queries
• Query optimization, indexing, and physical tuning
• Streams data from one operator to the next without blocking
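
To make the "declarative" point concrete, the sketch below runs a word count as a single SQL query that combines grouping, filtering and ordering. It uses SQLite through Python's standard `sqlite3` module purely for illustration. SQLite is not a parallel DBMS, and the table `words(did, word)` and its rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema first: the data must be loaded into a relation before it can be queried.
conn.execute("CREATE TABLE words (did TEXT, word TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?)",
    [
        ("d1", "big"), ("d1", "data"), ("d1", "big"),
        ("d2", "data"), ("d2", "systems"), ("d2", "big"),
    ],
)

# The query states what is wanted; the system chooses how to evaluate it.
frequent = conn.execute(
    """
    SELECT word, COUNT(*) AS n
    FROM words
    GROUP BY word
    HAVING COUNT(*) >= 2
    ORDER BY n DESC, word
    """
).fetchall()
print(frequent)  # [('big', 3), ('data', 2)]
conn.close()
```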

PARALLEL DBMS VS MAPREDUCE

MapReduce:
• data model is a file with key-value pairs
• no need to “load data” before processing
• easy to write user-defined operators
• can easily add nodes to the cluster
• intra-query fault-tolerance thanks to results on disk
• Arguably more scalable, but also needs more nodes
