CS639: Data Management for Data Science
Lecture 11: Spark
Theodoros Rekatsinas

Logistics/Announcements
• Questions on PA3?

Today’s Lecture
1. MapReduce Implementation
2. Spark

1. MapReduce Implementation

Recall: The MapReduce Abstraction for Distributed Algorithms
Figure: Distributed Data Storage → Map (six map tasks) → Shuffle → Reduce (four reduce tasks)
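
As a reminder of the abstraction, here is a minimal single-process sketch in Python. The names `map_reduce`, `wc_map`, and `wc_reduce` are made up for illustration and are not part of any framework; a dictionary stands in for the shuffle.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: gather all values for a key
    return {key: reducer(key, values) for key, values in groups.items()}


def wc_map(line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield word.lower(), 1


def wc_reduce(word, counts):
    # Sum the partial counts for one word.
    return sum(counts)


lines = ["the quick brown fox", "the lazy dog", "The end"]
counts = map_reduce(lines, wc_map, wc_reduce)
print(counts["the"])  # 3
```

The user only supplies the mapper and the reducer; grouping by key is done by the framework.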

MapReduce: what happens in between?

MapReduce: the complete picture

Step 1: Split input files into chunks (shards)
Step 2: Fork processes
Step 3: Run Map Tasks
Step 4: Create intermediate files
Step 4a: Partitioning
Step 5: Reduce Task - sorting
Step 6: Reduce Task - reduce
Step 7: Return to user
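
The steps above can be traced in a small single-process simulation. This is a sketch under stated assumptions, not the actual implementation: tasks run sequentially instead of in forked processes, in-memory lists stand in for intermediate files, and a CRC32 hash modulo the number of reducers is assumed as the partitioning function. All function names are hypothetical.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def split_into_shards(lines, num_shards):
    """Step 1: split the input into shards (round-robin here for simplicity)."""
    return [lines[i::num_shards] for i in range(num_shards)]


def run_map_task(shard, num_reducers):
    """Steps 3, 4, 4a: emit (word, 1) pairs into one bucket per reduce task."""
    buckets = [[] for _ in range(num_reducers)]
    for line in shard:
        for word in line.split():
            # Assumed partitioner: stable hash of the key modulo the reducer count.
            r = zlib.crc32(word.encode("utf-8")) % num_reducers
            buckets[r].append((word, 1))
    return buckets


def run_reduce_task(r, all_map_outputs):
    """Steps 5, 6: gather bucket r from every map task, sort by key, reduce."""
    pairs = [pair for buckets in all_map_outputs for pair in buckets[r]]
    pairs.sort(key=itemgetter(0))  # sorting makes equal keys adjacent
    return {word: sum(value for _, value in group)
            for word, group in groupby(pairs, key=itemgetter(0))}


def word_count(lines, num_shards=3, num_reducers=2):
    shards = split_into_shards(lines, num_shards)
    # Step 2 would fork worker processes; here every task runs in this process.
    map_outputs = [run_map_task(shard, num_reducers) for shard in shards]
    result = {}
    for r in range(num_reducers):
        result.update(run_reduce_task(r, map_outputs))
    return result  # Step 7: return to the user


lines = ["a b a", "b c", "a", "c c d"]
print(word_count(lines))
```

Because every occurrence of a key lands in the same bucket index, each reduce task sees all values for its keys, and the final result does not depend on the number of shards or reducers.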

MapReduce: the complete picture
We need a distributed file system!

2. Spark

Intro to Spark
• Spark is really a different implementation of the MapReduce programming model
• What makes Spark different is that it operates on data in main memory
• Spark: we write programs in terms of operations on resilient distributed datasets (RDDs).
• RDD (simple view): a collection of elements partitioned across the nodes of a cluster that can be operated on in parallel.
• RDD (complex view): an RDD is an interface for data transformation; it refers to data stored either in a persistent store (HDFS), in a cache (memory, memory + disk, disk only), or in another RDD
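
To make the simple view concrete, here is a toy partitioned collection in plain Python. `ToyRDD` and its methods are hypothetical and only mimic the idea; this is not Spark's API. Threads stand in for cluster nodes, and unlike Spark the `map` here runs eagerly.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyRDD:
    """A partitioned, in-memory collection (illustrative only, not Spark)."""

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    @classmethod
    def parallelize(cls, data, num_partitions):
        # Spread the elements across partitions round-robin.
        data = list(data)
        return cls(data[i::num_partitions] for i in range(num_partitions))

    def map(self, fn):
        # Partitions are independent, so each one can be processed in parallel.
        with ThreadPoolExecutor() as pool:
            new_parts = list(pool.map(lambda part: [fn(x) for x in part],
                                      self.partitions))
        return ToyRDD(new_parts)

    def collect(self):
        # Bring all partitions back into one list.
        return [x for part in self.partitions for x in part]


rdd = ToyRDD.parallelize(range(10), num_partitions=3)
squares = rdd.map(lambda x: x * x)
print(sorted(squares.collect()))
```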

RDDs in Spark

MapReduce vs Spark

RDDs
• Partitions are recomputed on failure or cache eviction
• Metadata stored for interface:
  • Partitions – set of data splits associated with this RDD
  • Dependencies – list of parent RDDs involved in computation
  • Compute – function to compute a partition of the RDD given the parent partitions from the Dependencies
  • Preferred Locations – where is the best place to put computations on this partition (data locality)
  • Partitioner – how the data is split into partitions
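
The following sketch shows how three of these fields (Partitions, Dependencies, Compute) are enough to recompute a lost partition. `LineageRDD` is a hypothetical class written for this note; Preferred Locations and Partitioner are left out, and a cleared dictionary simulates cache eviction.

```python
class LineageRDD:
    """Toy RDD that records its lineage (hypothetical names, not Spark's API)."""

    def __init__(self, num_partitions, dependencies=(), compute=None, source=None):
        self.num_partitions = num_partitions    # Partitions: number of data splits
        self.dependencies = list(dependencies)  # Dependencies: parent RDDs
        self._compute = compute                 # Compute: parent partition -> partition
        self._source = source                   # base data for RDDs without parents
        self.cache = {}                         # partition index -> materialized data
        self.compute_calls = 0

    def map(self, fn):
        # The child has the same partitioning and depends only on this RDD.
        return LineageRDD(self.num_partitions, [self],
                          compute=lambda part: [fn(x) for x in part])

    def partition(self, i):
        if i in self.cache:
            return self.cache[i]
        self.compute_calls += 1
        if self.dependencies:
            parent_data = self.dependencies[0].partition(i)
            data = self._compute(parent_data)
        else:
            data = list(self._source[i])
        self.cache[i] = data
        return data


base = LineageRDD(2, source=[[1, 2], [3, 4]])
doubled = base.map(lambda x: 2 * x)

first = doubled.partition(1)   # computed from the parent partition
doubled.cache.clear()          # simulate cache eviction or a lost executor
again = doubled.partition(1)   # recomputed from lineage, same result
print(first, again, doubled.compute_calls)
```

Nothing is replicated here: the lost partition is rebuilt by re-running Compute on the parent partition named in Dependencies.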

RDDs

DAG
• Directed Acyclic Graph – sequence of computations performed on data
  • Node – RDD partition
  • Edge – transformation on top of the data
  • Acyclic – graph cannot return to an older partition
  • Directed – a transformation moves a data partition from one state to another (from A to B)
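
A small sketch of these properties using Python's standard `graphlib` module (Python 3.9+). The node names are hypothetical partition labels chosen for illustration. Each entry maps a node to the nodes it is derived from, so a topological order is a valid order of computation, and adding a back edge is rejected as a cycle.

```python
from graphlib import CycleError, TopologicalSorter

# node -> set of nodes it is computed from (hypothetical partition labels)
lineage = {
    "lines[0]": set(),
    "words[0]": {"lines[0]"},
    "pairs[0]": {"words[0]"},
    "counts[0]": {"pairs[0]"},
}

# Directed: parents are always computed before their children.
order = list(TopologicalSorter(lineage).static_order())
print(order)

# Acyclic: making the first partition depend on the last one is an error.
cyclic = dict(lineage)
cyclic["lines[0]"] = {"counts[0]"}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```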

Example: Word Count

Spark Architecture

Spark Components

Spark Driver
• Entry point of the Spark Shell (Scala, Python, R)
• The place where SparkContext is created
• Translates RDD into the execution graph
• Splits graph into stages
• Schedules tasks and controls their execution
• Stores metadata about all the RDDs and their partitions
• Brings up Spark WebUI with job information
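
The slide does not say where the driver cuts the graph. Spark places stage boundaries at shuffle dependencies, and the sketch below assumes a simplified linear chain of operations where each one is flagged as needing a shuffle or not. `split_into_stages` is a hypothetical helper, and the real scheduler is more involved (for example, part of `reduceByKey` runs before the shuffle).

```python
def split_into_stages(ops):
    """ops: list of (name, needs_shuffle). An op that needs a shuffle starts a new stage."""
    stages, current = [], []
    for name, needs_shuffle in ops:
        if needs_shuffle and current:
            stages.append(current)  # close the stage before the shuffle
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages


# Operation names are real RDD operations; the shuffle flags are set by hand.
job = [("textFile", False), ("flatMap", False), ("map", False),
       ("reduceByKey", True), ("sortByKey", True), ("map", False)]
stages = split_into_stages(job)
for i, stage in enumerate(stages):
    print(i, stage)
```

Operations within one stage can be pipelined on the same partition without moving data between executors.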

Spark Executor
• Stores the data in cache in JVM heap or on HDDs
• Reads data from external sources
• Writes data to external sources
• Performs all the data processing

DAG Scheduler

More RDD Operations

Spark’s secret is really the RDD abstraction