BIG DATA SYSTEMS
pages.cs.wisc.edu/~paris/cs564-f16/lectures/lecture-18.pdf
BIG DATA
Definition from industry:
• high volume
• high variety
• high velocity
CS564 [Fall 2015] - Paris Koutris
VOLUME
• Databases parallelize easily; techniques available from the 80s (GAMMA project)
  – data partitioning
  – parallel query processing
• SQL is embarrassingly parallel
VARIETY
• complex workloads:
  – Machine Learning tasks: e.g. click prediction, topic modeling, SVM, k-means
• various types of data:
  – text data
  – semi-structured data
  – graph data
  – multimedia (video, photos)
VELOCITY
• data is generated very fast and needs to be processed very fast
  – real-time analytics
  – data streaming (each data item can be processed only once!)
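The one-pass constraint of data streaming can be illustrated with a running mean that keeps only a count and a total, never the items themselves. This is a minimal sketch; the function name `running_mean` and the sample readings are made up for illustration.

```python
from typing import Iterable, Iterator


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all items seen so far, reading each item exactly once."""
    count = 0
    total = 0.0
    for x in stream:  # each item is seen once and then discarded
        count += 1
        total += x
        yield total / count


# Hypothetical sensor readings arriving one at a time.
readings = iter([2.0, 4.0, 6.0])
means = list(running_mean(readings))
```

The state is constant-size no matter how long the stream runs, which is what makes a single pass sufficient for this particular aggregate.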
ANOTHER V: VERACITY
The data collected is often uncertain
• inconsistent data
• incomplete data
• ambiguous data
Example: sensor data
SOME EXAMPLES
• Greenplum: founded in 2003 and acquired by EMC in 2010. A parallel shared-nothing DBMS
• Vertica: founded in 2005 and acquired by HP in 2011. A parallel column-store shared-nothing DBMS
• Aster Data: founded in 2005 and acquired by Teradata in 2011. A parallel, shared-nothing, MapReduce-based data processing system
• Netezza: founded in 2000 and acquired by IBM in 2010. A parallel shared-nothing DBMS
WE WILL SEE 2 APPROACHES
• Parallel databases, started in the 80s
  – OLTP (transaction processing)
  – OLAP (decision support queries)
• MapReduce
  – first developed by Google, published in 2004
  – only for decision support queries
  – ecosystem around it: Hadoop, Pig Latin, Hive, …
PARALLEL DBMS
• The goal is to improve performance by executing multiple operations in parallel (scale-out)
• Terminology to measure performance:
  – Speed-up: using more processors, how much faster does the task run (if problem size is fixed)?
  – Scale-up: using more processors, does performance remain the same as we increase the problem size?
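Both measures can be written as ratios of running times. The sketch below assumes the usual formulation (speed-up compares one processor with P processors on the same input; scale-up compares a small run with a run where input and processors grow together). The function names and the timings are hypothetical.

```python
def speed_up(time_one_proc: float, time_p_procs: float) -> float:
    """Fixed problem size: how many times faster the task runs on P processors.

    Ideal (linear) speed-up equals P.
    """
    return time_one_proc / time_p_procs


def scale_up(time_small: float, time_large: float) -> float:
    """Problem size and processor count grow by the same factor.

    Ideal scale-up is 1.0: the larger run takes as long as the small one.
    """
    return time_small / time_large


# Hypothetical measurements, in seconds.
s = speed_up(time_one_proc=120.0, time_p_procs=30.0)   # 4x faster on P processors
u = scale_up(time_small=120.0, time_large=150.0)       # below 1.0: some slowdown
```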
SCALE-UP VS SCALE-OUT
Scale-up
• using more powerful machines, more processors/RAM per machine
Scale-out
• using a larger number of servers
ARCHITECTURES
• Shared memory
  – nodes share RAM + disk
  – easy to program, expensive to scale
• Shared disk
  – nodes access the same disk, hard to scale
• Shared nothing
  – nodes have their own RAM + disk
  – connected through a fast network
PARALLEL QUERY EVALUATION
• Inter-query parallelism:
  – each query runs on one processor
• Inter-operator parallelism:
  – each query runs on multiple processors
  – an operator runs on one processor
• Intra-operator parallelism:
  – an operator runs on multiple processors
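Intra-operator parallelism can be sketched with a selection operator that runs on every partition of a table at once. This is only an illustration: threads stand in for processors (CPython threads do not give true CPU parallelism), and the names `select_partition` and `parallel_select` as well as the sample rows are assumptions, not part of the lecture.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Row = Dict[str, int]


def select_partition(rows: List[Row], pred: Callable[[Row], bool]) -> List[Row]:
    """The selection operator applied to a single partition."""
    return [r for r in rows if pred(r)]


def parallel_select(partitions: List[List[Row]],
                    pred: Callable[[Row], bool],
                    workers: int = 4) -> List[Row]:
    """Run the same selection on all partitions concurrently, then concatenate."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves partition order in its results.
        chunks = pool.map(lambda part: select_partition(part, pred), partitions)
        return [row for chunk in chunks for row in chunk]


# Hypothetical table, already split into three partitions.
partitions = [
    [{"id": 1, "price": 5}, {"id": 2, "price": 50}],
    [{"id": 3, "price": 70}],
    [{"id": 4, "price": 10}],
]
expensive = parallel_select(partitions, lambda r: r["price"] > 20)
```

No data moves between workers here, because a selection can be evaluated on each row independently.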
PARALLEL DATA STORAGE
Horizontal data partitioning
• block partitioned
• hash partitioned
• range partitioned
Uniform vs skewed partitioning
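The three partitioning schemes can be compared on a small list of keys. This is a sketch under assumptions: block partitioning is taken to mean splitting rows into near-equal-size groups regardless of content (done round-robin here), Python's built-in `hash` stands in for the hash function (it is stable across runs for integers only), and the keys and range boundaries are invented.

```python
from bisect import bisect_right
from typing import List


def block_partition(keys: List[int], p: int) -> List[List[int]]:
    """Split into p groups of near-equal size, ignoring the key values."""
    return [keys[i::p] for i in range(p)]


def hash_partition(keys: List[int], p: int) -> List[List[int]]:
    """Send each key to server hash(key) mod p."""
    parts: List[List[int]] = [[] for _ in range(p)]
    for k in keys:
        parts[hash(k) % p].append(k)
    return parts


def range_partition(keys: List[int], boundaries: List[int]) -> List[List[int]]:
    """Server i holds the keys between boundaries[i-1] (inclusive) and boundaries[i]."""
    parts: List[List[int]] = [[] for _ in range(len(boundaries) + 1)]
    for k in keys:
        parts[bisect_right(boundaries, k)].append(k)
    return parts


keys = [1, 2, 3, 4, 5, 25]
block_parts = block_partition(keys, 3)          # always balanced
hash_parts = hash_partition(keys, 3)            # roughly balanced
range_parts = range_partition(keys, [10, 20])   # skewed for this data
```

With these keys the range scheme puts five of the six items on the first server and none on the second, which is the kind of skew the slide contrasts with uniform partitioning.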
PARALLEL QUERY EVALUATION
• Parallel Selection
• Parallel Join
  – hash join
  – broadcast join
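The two join strategies differ in which data moves. In the sketch below, a parallel hash join repartitions both relations on the join key so matching rows meet on the same server, while a broadcast join leaves the large relation where it is and copies the small one everywhere. Servers are simulated by loop iterations; all function names and the sample relations are assumptions for illustration.

```python
from collections import defaultdict
from typing import List, Tuple

Row = Tuple[int, str]  # (join key, payload)


def local_hash_join(r: List[Row], s: List[Row]) -> List[Tuple[int, str, str]]:
    """Single-server hash join: build a hash table on r, probe it with s."""
    table = defaultdict(list)
    for key, payload in r:
        table[key].append(payload)
    return [(key, rp, sp) for key, sp in s for rp in table.get(key, [])]


def repartition(rows: List[Row], p: int) -> List[List[Row]]:
    """Hash-partition rows on the join key across p servers."""
    parts: List[List[Row]] = [[] for _ in range(p)]
    for row in rows:
        parts[hash(row[0]) % p].append(row)
    return parts


def parallel_hash_join(r: List[Row], s: List[Row], p: int):
    """Shuffle both inputs by key, then join each pair of partitions locally."""
    r_parts, s_parts = repartition(r, p), repartition(s, p)
    out = []
    for i in range(p):  # each iteration stands in for one server
        out.extend(local_hash_join(r_parts[i], s_parts[i]))
    return out


def broadcast_join(small: List[Row], large_parts: List[List[Row]]):
    """Every server receives all of `small` and joins it with its own part of the large input."""
    out = []
    for part in large_parts:
        out.extend(local_hash_join(small, part))
    return out


# Hypothetical relations R and S.
R = [(1, "a"), (2, "b"), (3, "c")]
S = [(1, "x"), (3, "y"), (3, "z"), (4, "w")]
hashed = sorted(parallel_hash_join(R, S, p=2))
broadcast = sorted(broadcast_join(R, [S[:2], S[2:]]))
```

Both strategies return the same rows; broadcasting tends to pay off only when one input is small enough to copy to every server.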
MAPREDUCE
• Google [Dean 2004]
• Open source implementation: Hadoop
• MapReduce:
  – high-level programming model and implementation for large-scale parallel data processing
  – designed to simplify the task of writing parallel programs
MAPREDUCE
• Hides messy details in MapReduce runtime library
  – automatic parallelization
  – load balancing
  – network and disk transfer optimizations
  – handling of failures
  – robustness
MAPREDUCE PIPELINE
• read the partitioned data (HDFS, GFS)
• Map: extract something you care about from each record
• Shuffle and Sort (done by the system)
• Reduce: aggregate, summarize, filter, transform
• write the results
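The pipeline stages can be mimicked on a single machine. The sketch below is not Hadoop's API: `run_mapreduce` is a made-up driver that applies a user map function, groups the intermediate pairs by key (the shuffle and sort step), and then applies a user reduce function. The sensor records are invented.

```python
from collections import defaultdict
from typing import Any, Callable, Iterable, List, Tuple

Pair = Tuple[Any, Any]


def run_mapreduce(records: Iterable[Pair],
                  map_fn: Callable[[Any, Any], List[Pair]],
                  reduce_fn: Callable[[Any, List[Any]], List[Pair]]) -> List[Pair]:
    """Toy single-process driver for the map, shuffle/sort, reduce stages."""
    # Map: turn each input record into zero or more intermediate pairs.
    intermediate: List[Pair] = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle and sort: group values by intermediate key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Reduce: one call per distinct key, in sorted key order.
    output: List[Pair] = []
    for k in sorted(groups):
        output.extend(reduce_fn(k, groups[k]))
    return output


def map_reading(offset, record):
    sensor, reading = record
    return [(sensor, reading)]           # extract what we care about


def reduce_max(sensor, readings):
    return [(sensor, max(readings))]     # aggregate per key


# Hypothetical input: (record offset, (sensor id, reading)).
records = [(0, ("s1", 20)), (1, ("s2", 35)), (2, ("s1", 25))]
result = run_mapreduce(records, map_reading, reduce_max)
```

In a real deployment the map and reduce calls run on many servers and the grouping step moves data over the network; only the two user functions would carry over unchanged.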
DATA MODEL
• A file = a bag of (key, value) pairs
• A MapReduce program:
  – Input: a bag of (input key, value) pairs
  – Output: a bag of (output key, value) pairs
THE MAP FUNCTION
User provides the MAP function:
• Input: (input key, value)
• Output: bag of (intermediate key, value)
The system applies the map function in parallel to all (input key, value) pairs in the input file
THE REDUCE FUNCTION
User provides the REDUCE function:
• Input: (intermediate key, bag of values)
• Output: bag of (output key, values)
The system groups all pairs with the same intermediate key, and passes the bag of values to the REDUCE function
EXAMPLE: WORD COUNT
• Count the number of occurrences of each word in a large collection of documents
• Each Document
  – key = document id (did)
  – value = set of words (word)
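Word count can be written as one map function and one reduce function. The sketch below runs them in a single process and uses sorting plus `itertools.groupby` to play the role of the system's shuffle and sort. It assumes each document's value is a list of words, so repeated words are kept; the function names and the two documents are made up.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, List, Tuple


def map_word_count(did: int, words: List[str]) -> List[Tuple[str, int]]:
    """MAP: for one document, emit (word, 1) for every word occurrence."""
    return [(word, 1) for word in words]


def reduce_word_count(word: str, counts: List[int]) -> List[Tuple[str, int]]:
    """REDUCE: sum the 1s collected for a single word."""
    return [(word, sum(counts))]


def word_count(documents: Dict[int, List[str]]) -> List[Tuple[str, int]]:
    # Map phase over every (did, words) pair.
    pairs = [kv for did, words in documents.items()
             for kv in map_word_count(did, words)]
    # Shuffle and sort: bring equal words next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct word.
    result = []
    for word, group in groupby(pairs, key=itemgetter(0)):
        result.extend(reduce_word_count(word, [c for _, c in group]))
    return result


# Hypothetical collection of two documents.
docs = {1: ["big", "data", "big"], 2: ["data", "systems"]}
counts = word_count(docs)
```

The document id is never used by the map function here; it only identifies which input record is being processed.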
MAPREDUCE JOBS
• A MapReduce job consists of one single "query"
  – e.g. count the words in all docs
• More complex queries may consist of multiple jobs
MAPREDUCE ECOSYSTEM
Lots of extensions to address limitations:
• Capabilities to write DAGs of MapReduce jobs
• Declarative languages
• Most companies use both types of engines (MR and DBMS), with increased integration
• Potential replacement for MapReduce: Spark
MAPREDUCE ECOSYSTEM
Pig Latin (Yahoo!)
• New language, like Relational Algebra
• open source
Hive (Facebook)
• SQL-like language
• open source
SQL/Tenzing (Google)
• SQL on MR
• Proprietary – morphed into BigQuery
PARALLEL DBMS VS MAPREDUCE
Parallel DBMS:
• Relational data model and schema
• Declarative query language: SQL
• Can easily combine operators into complex queries
• Query optimization, indexing, and physical tuning
• Streams data from one operator to the next without blocking
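Streaming between operators without blocking can be imitated with Python generators: each operator pulls one row at a time from the operator below it, so the first result appears before the input has been fully scanned. This is a sketch of the idea only, not how any particular DBMS is implemented; the operator names, the `trace` list, and the sample table are assumptions.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, int]


def scan(table: Iterable[Row], trace: List[int]) -> Iterator[Row]:
    """Leaf operator: produce rows one by one, recording which ids were read."""
    for row in table:
        trace.append(row["id"])
        yield row


def select(rows: Iterable[Row], pred: Callable[[Row], bool]) -> Iterator[Row]:
    """Pass through only the rows that satisfy the predicate."""
    for row in rows:
        if pred(row):
            yield row


def project(rows: Iterable[Row], cols: List[str]) -> Iterator[Row]:
    """Keep only the requested columns of each row."""
    for row in rows:
        yield {c: row[c] for c in cols}


# Hypothetical three-row table.
table = [{"id": 1, "price": 5}, {"id": 2, "price": 50}, {"id": 3, "price": 70}]
trace: List[int] = []
pipeline = project(select(scan(table, trace), lambda r: r["price"] > 20), ["id"])

first = next(pipeline)          # pull a single result through all three operators
scanned_so_far = list(trace)    # only the rows needed to produce it were read
```

By contrast, the MapReduce pipeline described earlier has to finish the map and shuffle steps before any reduce output exists.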