source: pages.cs.wisc.edu/~paris/cs564-f16/lectures/lecture-18.pdf

BIG DATA SYSTEMS

CS564 - Fall 2016

ACKs: Magda Balazinska

BIG DATA

Definition from industry:
• high volume
• high variety
• high velocity

CS564 [Fall 2015] - Paris Koutris

VOLUME

• Databases parallelize easily; the techniques have been available since the 1980s (GAMMA project)
  – data partitioning
  – parallel query processing

• SQL is embarrassingly parallel


VARIETY

• complex workloads:
  – Machine Learning tasks: e.g. click prediction, topic modeling, SVM, k-means

• various types of data:
  – text data
  – semi-structured data
  – graph data
  – multimedia (video, photos)


VELOCITY

• data is generated very fast and needs to be processed very fast
  – real-time analytics
  – data streaming (each data item can be processed only once!)
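The "processed only once" constraint can be illustrated with a one-pass aggregate. The sketch below is an assumption-laden illustration, not something from the lecture: the function name `running_mean` and the sample readings are hypothetical. It keeps only a count and the current mean, so no past item is stored or revisited.

```python
from typing import Iterable, Iterator


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all items seen so far, touching each item once."""
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        # Incremental update: uses only the previous mean and the new item.
        mean += (x - mean) / count
        yield mean


# Hypothetical sensor readings; an iterator can be consumed only once,
# which mimics a stream that cannot be replayed.
readings = iter([10.0, 20.0, 30.0])
means = list(running_mean(readings))
print(means)  # [10.0, 15.0, 20.0]
```

The state is constant-size regardless of how many items arrive, which is the property a streaming operator typically needs.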


ANOTHER V: VERACITY

The data collected is often uncertain:
• inconsistent data
• incomplete data
• ambiguous data

Example: sensor data


DATA LANDSCAPE


SOME EXAMPLES

• Greenplum: founded in 2003, acquired by EMC in 2010. A parallel shared-nothing DBMS

• Vertica: founded in 2005, acquired by HP in 2011. A parallel column-store shared-nothing DBMS

• Aster Data: founded in 2005, acquired by Teradata in 2011. A parallel, shared-nothing, MapReduce-based data processing system

• Netezza: founded in 2000, acquired by IBM in 2010. A parallel shared-nothing DBMS


WE WILL SEE 2 APPROACHES

• Parallel databases, started in the 80s
  – OLTP (transaction processing)
  – OLAP (decision support queries)

• MapReduce
  – first developed by Google, published in 2004
  – only for decision support queries
  – ecosystem around it: Hadoop, Pig Latin, Hive, …


PARALLEL DBMS

• The goal is to improve performance by executing multiple operations in parallel (scale-out)

• Terminology to measure performance:
  – Speed-up: using more processors, how much faster does the task run (if the problem size is fixed)?
  – Scale-up: using more processors, does performance remain the same as we increase the problem size?
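Both metrics reduce to a ratio of running times. The sketch below uses the usual textbook formulation, which is an assumption here since the slide gives no formula: speed-up divides the one-processor time by the P-processor time on the same problem, and scale-up divides the time for a small problem on few processors by the time for a proportionally larger problem on proportionally more processors. The function names and timings are hypothetical.

```python
def speed_up(time_one_processor: float, time_p_processors: float) -> float:
    """Fixed problem size: ideal (linear) speed-up equals P."""
    return time_one_processor / time_p_processors


def scale_up(time_small_problem: float, time_scaled_problem: float) -> float:
    """Problem size grows with the processor count: ideal scale-up is 1.0."""
    return time_small_problem / time_scaled_problem


# Hypothetical measurements, in seconds.
# Same data: 120 s on 1 processor, 20 s on 8 processors.
print(speed_up(120.0, 20.0))   # 6.0, below the ideal of 8
# 1x data on 1 processor takes 120 s; 8x data on 8 processors takes 150 s.
print(scale_up(120.0, 150.0))  # 0.8, below the ideal of 1.0
```

In this made-up case both numbers fall short of the ideal, which is what overheads such as start-up cost, interference, and skew tend to cause in practice.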


SCALE-UP VS SCALE-OUT

Scale-up
• using more powerful machines, more processors/RAM per machine

Scale-out
• using a larger number of servers


ARCHITECTURES

• Shared memory
  – nodes share RAM + disk
  – easy to program, expensive to scale

• Shared disk
  – nodes access the same disk, hard to scale

• Shared nothing
  – nodes have their own RAM + disk
  – connected through a fast network


PARALLEL QUERY EVALUATION

• Inter-query parallelism:
  – each query runs on one processor

• Inter-operator parallelism:
  – each query runs on multiple processors
  – an operator runs on one processor

• Intra-operator parallelism:
  – an operator runs on multiple processors


PARALLEL DATA STORAGE

Horizontal data partitioning
• block partitioned
• hash partitioned
• range partitioned

Uniform vs skewed partitioning
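The three schemes differ only in how a row is mapped to a node. The sketch below is a minimal single-process illustration; the function names are hypothetical, round-robin placement is assumed as one way to realize block partitioning, and CRC32 stands in for whatever hash function a real system would use.

```python
import zlib
from bisect import bisect_right
from collections import Counter
from typing import List


def block_partition(row_index: int, num_nodes: int) -> int:
    """Place rows by position (round-robin here), ignoring their contents."""
    return row_index % num_nodes


def hash_partition(key: int, num_nodes: int) -> int:
    """Place a row by a hash of its key; CRC32 is stable across runs."""
    return zlib.crc32(repr(key).encode()) % num_nodes


def range_partition(key: int, boundaries: List[int]) -> int:
    """Node i holds keys in [boundaries[i-1], boundaries[i])."""
    return bisect_right(boundaries, key)


keys = [1, 2, 3, 4, 5, 6, 7, 8, 25]
block_load = Counter(block_partition(i, 3) for i in range(len(keys)))
hash_load = Counter(hash_partition(k, 3) for k in keys)
range_load = Counter(range_partition(k, [10, 20]) for k in keys)

print(block_load)  # three rows on each node: uniform
print(range_load)  # eight rows on node 0, one on node 2: skewed
```

Block partitioning always balances the load but says nothing about where a given key lives. Range partitioning supports range queries but becomes skewed when the boundaries do not match the key distribution, as in this example.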


PARALLEL QUERY EVALUATION

• Parallel Selection

• Parallel Join
  – hash join
  – broadcast join
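The two join strategies can be simulated in one process by treating each list as the data held by one node. This is a sketch under stated assumptions, not the lecture's algorithm: all function names are hypothetical, join keys are integers, and `key % num_nodes` stands in for a real hash function. In the partitioned hash join both relations are reshuffled on the join key; in the broadcast join the large relation stays where it is and the small relation is copied to every node.

```python
from collections import defaultdict
from typing import List, Tuple

Row = Tuple[int, str]  # (join key, payload)


def local_hash_join(r_part: List[Row], s_part: List[Row]) -> List[Tuple[int, str, str]]:
    """Join performed by one node: build a hash table on S, probe with R."""
    table = defaultdict(list)
    for key, s_val in s_part:
        table[key].append(s_val)
    return [(key, r_val, s_val)
            for key, r_val in r_part
            for s_val in table.get(key, [])]


def partitioned_hash_join(r: List[Row], s: List[Row], num_nodes: int):
    """Reshuffle both relations on the join key, then join locally."""
    r_parts = [[] for _ in range(num_nodes)]
    s_parts = [[] for _ in range(num_nodes)]
    for row in r:
        r_parts[row[0] % num_nodes].append(row)
    for row in s:
        s_parts[row[0] % num_nodes].append(row)
    out = []
    for r_part, s_part in zip(r_parts, s_parts):
        out.extend(local_hash_join(r_part, s_part))
    return out


def broadcast_join(r_parts: List[List[Row]], small_s: List[Row]):
    """Leave R partitioned as it is; send all of S to every node."""
    out = []
    for r_part in r_parts:
        out.extend(local_hash_join(r_part, small_s))
    return out


R = [(1, "a"), (2, "b"), (3, "c"), (2, "d")]
S = [(2, "x"), (3, "y"), (4, "z")]

via_hash = sorted(partitioned_hash_join(R, S, num_nodes=3))
via_broadcast = sorted(broadcast_join([R[0::2], R[1::2]], S))
print(via_hash)  # [(2, 'b', 'x'), (2, 'd', 'x'), (3, 'c', 'y')]
```

Both strategies return the same rows. The broadcast variant avoids moving R at all, which tends to pay off when S is small enough to fit in every node's memory.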


MAPREDUCE

• Google [Dean 2004]
• Open source implementation: Hadoop
• MapReduce:
  – high-level programming model and implementation for large-scale parallel data processing
  – designed to simplify the task of writing parallel programs


MAPREDUCE

• Hides messy details in the MapReduce runtime library
  – automatic parallelization
  – load balancing
  – network and disk transfer optimizations
  – handling of failures
  – robustness


MAPREDUCE PIPELINE

• read the partitioned data (HDFS, GFS)
• Map: extract something you care about from each record
• Shuffle and Sort (done by the system)
• Reduce: aggregate, summarize, filter, transform
• write the results


MAPREDUCE DATAFLOW


source: Hadoop: The Definitive Guide, by Tom White

DATA MODEL

• A file = a bag of (key, value) pairs

• A MapReduce program:
  – Input: a bag of (input key, value) pairs
  – Output: a bag of (output key, value) pairs


THE MAP FUNCTION

User provides the MAP function:
• Input: (input key, value)
• Output: bag of (intermediate key, value)

The system applies the map function in parallel to all (input key, value) pairs in the input file


THE REDUCE FUNCTION

User provides the REDUCE function:
• Input: (intermediate key, bag of values)
• Output: bag of (output key, values)

The system groups all pairs with the same intermediate key, and passes the bag of values to the REDUCE function


EXAMPLE: WORD COUNT

• Count the number of occurrences of each word in a large collection of documents

• Each Document
  – key = document id (did)
  – value = set of words (word)
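Word count can be written as one map function and one reduce function. The sketch below runs them in a single process; `run_mapreduce` is a hypothetical stand-in for the runtime, which in a real deployment would read partitioned input, run map tasks in parallel, and shuffle over the network. Document values are assumed to be whitespace-separated text.

```python
from collections import defaultdict
from typing import Iterator, List, Tuple


def map_fn(did: int, text: str) -> Iterator[Tuple[str, int]]:
    """MAP: (document id, text) -> bag of (word, 1)."""
    for word in text.split():
        yield (word, 1)


def reduce_fn(word: str, counts: List[int]) -> Iterator[Tuple[str, int]]:
    """REDUCE: (word, bag of 1s) -> (word, total count)."""
    yield (word, sum(counts))


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the runtime: map, shuffle, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)  # shuffle: group by intermediate key
    output = []
    for ikey in sorted(groups):          # sort, as the system would
        output.extend(reduce_fn(ikey, groups[ikey]))
    return output


docs = [(1, "big data systems"), (2, "big parallel systems")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)  # [('big', 2), ('data', 1), ('parallel', 1), ('systems', 2)]
```

Note that the document id is ignored by the map function: only the words matter for this query, so the intermediate key is the word itself.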


MAPREDUCE JOBS

• A MapReduce job consists of one single "query"
  – e.g. count the words in all docs

• More complex queries may consist of multiple jobs
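A multi-job query feeds the output of one job into the next. As a hedged illustration with hypothetical names throughout, the sketch below chains two jobs in a single process: the first counts words, and the second groups the words by how often they occur. `run_job` is only a stand-in for the runtime; a real system would write the first job's output to the distributed file system before the second job reads it.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Single-process stand-in for one MapReduce job."""
    groups = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return [pair
            for ikey in sorted(groups)
            for pair in reduce_fn(ikey, groups[ikey])]


# Job 1: word count.
def count_map(did, text):
    for word in text.split():
        yield (word, 1)


def count_reduce(word, ones):
    yield (word, sum(ones))


# Job 2: group words by their frequency.
def invert_map(word, n):
    yield (n, word)


def invert_reduce(n, words):
    yield (n, sorted(words))


docs = [(1, "big data systems"), (2, "big parallel systems")]
word_counts = run_job(docs, count_map, count_reduce)
by_frequency = run_job(word_counts, invert_map, invert_reduce)
print(by_frequency)  # [(1, ['data', 'parallel']), (2, ['big', 'systems'])]
```

The second job cannot start until the first has finished, because its map input is the first job's complete output.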


MAPREDUCE ECOSYSTEM

Lots of extensions to address limitations:
• Capabilities to write DAGs of MapReduce jobs
• Declarative languages
• Most companies use both types of engines (MR and DBMS), with increased integration
• Potential replacement for MapReduce: Spark


MAPREDUCE ECOSYSTEM

Pig Latin (Yahoo!)
• New language, like Relational Algebra
• open source

Hive (Facebook)
• SQL-like language
• open source

SQL / Tenzing (Google)
• SQL on MR
• Proprietary – morphed into BigQuery


PARALLEL DBMS VS MAPREDUCE

Parallel DBMS:
• Relational data model and schema
• Declarative query language: SQL
• Can easily combine operators into complex queries
• Query optimization, indexing, and physical tuning
• Streams data from one operator to the next without blocking


PARALLEL DBMS VS MAPREDUCE

MapReduce:
• data model is a file with key-value pairs
• no need to "load data" before processing
• easy to write user-defined operators
• can easily add nodes to the cluster
• intra-query fault-tolerance thanks to results on disk
• Arguably more scalable, but also needs more nodes

