CS 378 – Big Data Programmingdfranke/courses/2018fall/Lecture23.pdf · Lecture 23 Closures,...


CS 378 – Big Data Programming

Lecture 23: Closures, Caching, Partitions


Review

•  Assignment 11
   – Create user sessions
   – Order events by timestamp, event type, subtype
   – Order sessions by user ID
   – Partition sessions by referring domain
   – Sample SHOWER sessions (1 in 10)

Big Data Programming · CS 378 - Fall 2018


Distributed Spark Application
(Learning Spark, Figure 7-1)


Distributing a Spark Application

•  Spark Driver runs your main() method
   – Converts Spark program into tasks
   – Creates an execution plan based on DAG
      •  DAG is derived from transformations
   – Performs optimization (like: pipelining map()'s)
•  Tasks are bundled up to be sent to cluster
   – Cluster has multiple task executors
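The pipelining of consecutive map() calls can be illustrated without a cluster. The sketch below is only an analogy built on java.util.stream, not Spark itself: like transformations, stream operations are lazy, and two chained map() steps run element by element in a single pass once a terminal operation (the counterpart of an action) is invoked. The class name PipelineSketch and the log format are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PipelineSketch {
    // Returns the order in which the two map steps touched each element.
    static List<String> trace() {
        List<String> log = new ArrayList<>();
        // Building the pipeline does no work yet (lazy, like transformations).
        Stream<Integer> pipeline = Stream.of(1, 2, 3)
                .map(x -> { log.add("double " + x); return x * 2; })
                .map(x -> { log.add("inc " + x); return x + 1; });
        if (!log.isEmpty()) {
            throw new IllegalStateException("nothing should run before the terminal operation");
        }
        // The terminal operation (like an action) triggers evaluation.
        List<Integer> result = pipeline.collect(Collectors.toList());
        log.add("result " + result);
        return log;
    }

    public static void main(String[] args) {
        trace().forEach(System.out::println);
    }
}
```

The log interleaves the two steps (double 1, inc 2, double 2, inc 4, ...) instead of finishing all of the first map before starting the second, which is the effect pipelining aims for.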


Distributing a Spark Application

•  Scheduling individual tasks
   – Executors register with driver
   – Tasks scheduled based on data location
   – Cached data is tracked (for future task scheduling)
•  Driver exposes data on task status


Distributing a Spark Application

•  With Hadoop the JAR was sent to workers
   – Spark also needs to get the code to workers
•  Hadoop has two tasks: map, reduce
   – Instantiation takes place on the workers
•  Spark sends object instances to workers
   – Individual tasks defined in your Spark code
   – Objects are serialized (via Java serialization)
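To make "objects are serialized" concrete, here is a minimal sketch of a Java serialization round trip using only the JDK. It does not involve Spark; the "driver side" and "worker side" labels are just comments, and LengthFilter is a hypothetical task object invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationSketch {
    // Hypothetical task object: a threshold plus the logic that uses it.
    static class LengthFilter implements Serializable {
        private static final long serialVersionUID = 1L;
        private final int minLength;
        LengthFilter(int minLength) { this.minLength = minLength; }
        boolean keep(String s) { return s.length() >= minLength; }
    }

    // "Driver side": turn the instance into bytes.
    static byte[] toBytes(Object o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    // "Worker side": rebuild an equivalent instance from the bytes.
    static Object fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        LengthFilter onDriver = new LengthFilter(4);
        LengthFilter onWorker = (LengthFilter) fromBytes(toBytes(onDriver));
        // The field value travelled with the object; the instance itself is a new one.
        System.out.println(onWorker.keep("spark"));
        System.out.println(onWorker != onDriver);
    }
}
```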


Closures

•  Functions as first class objects
   – Can be passed to a function as an argument
   – Can be returned from a function
   – Can be assigned to variables
•  Closures contain free variables that are bound in the lexical environment/scope
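The three properties above, plus a free variable bound in the enclosing scope, can be sketched in a few lines of plain Java. The names multiplier and applyTwice are illustrative only.

```java
import java.util.function.Function;

class ClosureSketch {
    // Returns a function. "factor" is a free variable of the lambda body,
    // bound in the lexical scope of this method call.
    static Function<Integer, Integer> multiplier(int factor) {
        return x -> x * factor;
    }

    // Accepts a function as an argument.
    static int applyTwice(Function<Integer, Integer> f, int value) {
        return f.apply(f.apply(value));
    }

    public static void main(String[] args) {
        Function<Integer, Integer> triple = multiplier(3);  // assigned to a variable
        System.out.println(applyTwice(triple, 2));           // 3 * (3 * 2) = 18
    }
}
```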


Closures

•  In Scala, functions as a type are built-in
•  In Java, closures are realized as instances
   – Define an object that implements an interface
   – Interface requires implementation of an abstract method
   – In Spark API, that method is call()
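A sketch of the "closure as an instance" idea follows. So that it compiles without Spark on the classpath, it declares its own stand-in interface, Function1, modeled on the shape of Spark's Java function interfaces (a single abstract call() method on a serializable type). Function1 and the helper names are assumptions for illustration, not the Spark API itself.

```java
import java.io.Serializable;

class CallInterfaceSketch {
    // Local stand-in for a Spark-style function interface (hypothetical name).
    interface Function1<T, R> extends Serializable {
        R call(T value) throws Exception;
    }

    // Anonymous class: an instance that implements call().
    static Function1<String, Integer> lengthAnonymous() {
        return new Function1<String, Integer>() {
            @Override
            public Integer call(String s) {
                return s.length();
            }
        };
    }

    // Java 8 lambda: the compiler supplies the instance for the single abstract method.
    static Function1<String, Integer> lengthLambda() {
        return s -> s.length();
    }

    // Helper that invokes call() and hides the checked exception.
    static <T, R> R invoke(Function1<T, R> f, T value) {
        try {
            return f.call(value);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(invoke(lengthAnonymous(), "spark"));
        System.out.println(invoke(lengthLambda(), "spark"));
    }
}
```

Both forms produce an object whose call() method holds the logic; the lambda is just the shorter spelling.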


Closures

•  Our Java functions are:
   – Instantiated
   – Sent off to the worker tasks (via serialization)
   – Each task gets its own copy (no communication)
•  Non-local references will cause containing object to be serialized as well.
   – Variable value types must be serializable
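The "own copy, no communication" point can be simulated in plain Java by copying a function object through serialization, once per pretend task. This is a toy model under that assumption, not how Spark schedules work; CountingFunction and shipToTask are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

class TaskCopySketch {
    // Hypothetical function object with mutable state.
    static class CountingFunction implements Serializable {
        private static final long serialVersionUID = 1L;
        int calls = 0;
        int call(int x) { calls++; return x + 1; }
    }

    // Simulates shipping the function to a task: serialize, then deserialize a fresh copy.
    static CountingFunction shipToTask(CountingFunction f) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(f);
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (CountingFunction) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns {calls on the driver copy, calls on task A's copy, calls on task B's copy}.
    static int[] run() {
        CountingFunction driverCopy = new CountingFunction();
        CountingFunction taskA = shipToTask(driverCopy);
        CountingFunction taskB = shipToTask(driverCopy);
        taskA.call(1);
        taskA.call(2);
        taskB.call(3);
        return new int[] { driverCopy.calls, taskA.calls, taskB.calls };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run()));
    }
}
```

The driver's counter stays at 0 while each task counts only its own calls, which is why mutable state inside a function object cannot be used to aggregate results across tasks.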


Closures – Issues in Java

•  A function references a method in an enclosing scope
   – Method itself cannot be serialized
   – The entire containing class must be serialized
•  Issues
   – This class is not serializable
   – The associated data might be large
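The first issue can be reproduced with the JDK alone. In this sketch, Job is a hypothetical non-serializable class and SerFunction is a stand-in serializable function interface. A lambda that calls an instance method captures the whole Job, so serialization fails; copying the needed value into a local variable first is one common workaround, shown here as an illustration rather than the only fix.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class EnclosingScopeSketch {
    // Stand-in for a serializable function interface (hypothetical name).
    interface SerFunction<T, R> extends Serializable {
        R call(T value);
    }

    // Deliberately NOT Serializable, like a typical driver class.
    static class Job {
        private final String prefix = "user-";

        String tag(String id) { return prefix + id; }

        // Calls an instance method, so the lambda captures "this" (the whole Job).
        SerFunction<String, String> capturesThis() {
            return id -> tag(id);
        }

        // Copies the needed value into a local variable; only the String is captured.
        SerFunction<String, String> capturesLocal() {
            String p = prefix;
            return id -> p + id;
        }
    }

    // True if Java serialization accepts the object, false if something it holds is not serializable.
    static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Job job = new Job();
        System.out.println("captures this:  " + serializes(job.capturesThis()));
        System.out.println("captures local: " + serializes(job.capturesLocal()));
    }
}
```

The second issue follows from the same mechanism: even if Job were serializable, every field it holds would travel with the function.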


Persistence

•  Recall that RDDs are recomputed as needed
   – An action initiates evaluation
   – Additional action results in another evaluation
•  An RDD can be persisted for efficiency
•  Making an RDD persistent:
   – cache()
   – persist(StorageLevel level)
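The recompute-per-action behavior, and what caching changes, can be mimicked with a toy class in plain Java. LazyDataset below is a made-up stand-in, not Spark's RDD: it recomputes its data on every "action" unless cache() was called, in which case the first result is kept.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

class PersistSketch {
    // Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's RDD).
    static class LazyDataset {
        private final Supplier<List<Integer>> compute;
        private boolean persist = false;
        private List<Integer> cached;  // filled by the first action after cache()
        int evaluations = 0;           // how many times the data was recomputed

        LazyDataset(Supplier<List<Integer>> compute) { this.compute = compute; }

        LazyDataset cache() { persist = true; return this; }

        private List<Integer> evaluate() {
            if (cached != null) return cached;
            evaluations++;
            List<Integer> data = compute.get();
            if (persist) cached = data;
            return data;
        }

        // Two "actions"; each one needs the data.
        long count() { return evaluate().size(); }
        int sum() { return evaluate().stream().mapToInt(Integer::intValue).sum(); }
    }

    // Runs two actions and reports how often the data had to be computed.
    static int evaluationsAfterTwoActions(boolean useCache) {
        LazyDataset ds = new LazyDataset(() -> Arrays.asList(1, 2, 3));
        if (useCache) ds.cache();
        ds.count();
        ds.sum();
        return ds.evaluations;
    }

    public static void main(String[] args) {
        System.out.println("without cache: " + evaluationsAfterTwoActions(false));
        System.out.println("with cache:    " + evaluationsAfterTwoActions(true));
    }
}
```

In Spark itself, cache() on an RDD is shorthand for persist() with the memory-only storage level; persist(StorageLevel) lets you choose other levels.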


Persistence Options
From: http://training.databricks.com/workshop/itas_workshop.pdf


Partitioning

•  Prudent partitioning can greatly reduce the amount of communication (shuffle)
•  If an RDD is scanned only once, no need
•  If an RDD is reused multiple times in key-oriented operations
   – Partitioning can improve performance significantly


Partitioning

•  Partitioning on pair RDDs (key, value)
•  Consider an RDD containing user sessions
   – All users over some time period (day or week)
   – We want to merge in the last hour of events
•  We’ll be joining sessions and events by user ID
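Why partitioning by key helps a join can be seen from how a hash-style partitioner assigns keys. The sketch below is a plain-Java illustration, assuming a partitioner that takes the key's hash code modulo the number of partitions (kept non-negative). The user IDs and partition count are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HashPartitionSketch {
    // Non-negative modulus of the key's hash code picks the partition.
    static int partitionFor(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Groups user IDs into numPartitions buckets.
    static List<List<String>> partition(List<String> userIds, int numPartitions) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String id : userIds) {
            buckets.get(partitionFor(id, numPartitions)).add(id);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> sessionUsers = Arrays.asList("u1", "u2", "u3", "u4");
        List<String> eventUsers = Arrays.asList("u3", "u1");
        // The same user ID lands in the same partition index in both datasets,
        // so matching sessions and events can be joined partition by partition.
        System.out.println(partition(sessionUsers, 2));
        System.out.println(partition(eventUsers, 2));
    }
}
```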


Partitioning
(Figure 4-4, from Learning Spark)


Partitioning
(Figure 4-5, from Learning Spark)


Partitioning

•  Consider an RDD containing user sessions
   – All users over some time period (day or week)
   – We want to merge events, multiple times
•  To set up for this:
   – Create the session RDD
   – Partition (call partitionBy(), a transformation)
   – Persist
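A back-of-the-envelope cost model shows why this setup pays off when events are merged multiple times. The sketch assumes a deliberately simple worst case in which every record that has to be re-hashed is also moved across the network. The record counts in main() are invented, and the method names are hypothetical.

```java
class RepeatedJoinSketch {
    // Neither side has a known partitioner: both datasets are hashed
    // and moved on every join.
    static long shuffledWithoutPartitioning(long sessions, long eventsPerBatch, int batches) {
        return batches * (sessions + eventsPerBatch);
    }

    // Sessions are partitioned once and persisted: one up-front shuffle of
    // the sessions, then only the events move on each join.
    static long shuffledWithPartitioning(long sessions, long eventsPerBatch, int batches) {
        return sessions + batches * eventsPerBatch;
    }

    public static void main(String[] args) {
        long sessions = 1_000_000L;  // assumed size of the session dataset
        long events = 10_000L;       // assumed events per merge
        int batches = 24;            // assumed number of merges
        System.out.println(shuffledWithoutPartitioning(sessions, events, batches));
        System.out.println(shuffledWithPartitioning(sessions, events, batches));
    }
}
```

Under these assumptions the unpartitioned plan moves 24,240,000 records and the partitioned one 1,240,000. The persist step matters because without it the partitioning work would be redone each time the session RDD is evaluated.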
