Python API for Spark - Meetup Talk
Python API for Spark
Josh Rosen
UC Berkeley
What is Spark?
Fast and expressive cluster computing system
Compatible with Hadoop-supported file systems and data formats (HDFS, S3, SequenceFile, ...)
Improves efficiency through in-memory computing primitives and general computation graphs
As much as 30x faster
Improves usability through rich APIs in Scala, Python, and Java, and an interactive shell
Often 2-10x less code
RDDs: Resilient Distributed Datasets
Immutable, partitioned collections of objects
Transformations: map, filter, groupBy, join, ...
Actions: count, collect, save, ...
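The split between transformations and actions can be illustrated without Spark. The sketch below is a toy stand-in, not Spark's implementation: `LazyDataset` and `traced_square` are hypothetical names, and the only point is that transformations return a new immutable dataset and record work, while actions run it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable collection, e.g. a list
        self._steps = tuple(steps)   # pending iterator -> iterator functions

    # Transformations: build a new dataset, compute nothing yet.
    def map(self, func):
        step = lambda it: (func(x) for x in it)
        return LazyDataset(self._source, self._steps + (step,))

    def filter(self, pred):
        step = lambda it: (x for x in it if pred(x))
        return LazyDataset(self._source, self._steps + (step,))

    # Actions: apply every recorded step and return a concrete result.
    def _run(self):
        it = iter(self._source)
        for step in self._steps:
            it = step(it)
        return it

    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []

def traced_square(x):
    calls.append(x)   # record that the function really ran
    return x * x

numbers = LazyDataset([1, 2, 3, 4])
squares = numbers.map(traced_square)      # transformation: nothing runs here
calls_before_action = len(calls)
even_squares = squares.filter(lambda x: x % 2 == 0).collect()   # action
```

`traced_square` is only called once `collect()` runs, and `numbers` itself is never modified by `map` or `filter`.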
Example: Log Mining

    val lines = spark.textFile("hdfs://...")
    val errors = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split('\t')(2))

    messages.filter(_.contains("foo")).count
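For readers who think in Python, the same filter / map / filter / count chain can be sketched over a local list. This is plain Python, not Spark; the function name `count_error_messages` and the sample log lines are made up for illustration, and the tab-separated layout with the message in the third field is inferred from `split('\t')(2)` in the log-mining example.

```python
def count_error_messages(lines, needle):
    """Count ERROR lines whose third tab-separated field contains `needle`."""
    errors = (line for line in lines if line.startswith("ERROR"))
    messages = (line.split("\t")[2] for line in errors)
    return sum(1 for message in messages if needle in message)


# Hypothetical sample data: level, date, message.
sample_log = [
    "ERROR\t2013-02-01\tfoo failed to start",
    "INFO\t2013-02-01\tfoo started",
    "ERROR\t2013-02-02\tdisk full",
    "ERROR\t2013-02-03\tretrying foo",
]

matches = count_error_messages(sample_log, "foo")
```

In PySpark the chain would presumably use `filter` and `map` with lambdas on an RDD in the same order, ending in `count()`.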
What is PySpark?
PySpark at a Glance
Write Spark jobs in Python
Run interactive jobs in the shell
Supports C extensions
Previewed at AMP Camp 2012
Available now in 0.7 release
Example: Word Count

    import sys
    from pyspark.context import SparkContext

    sc = SparkContext(...)
    lines = sc.textFile(sys.argv[2], 1)
    # Split lines into words, pair each word with 1, then sum the 1s per word.
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(lambda x, y: x + y)

    for (word, count) in counts.collect():
        print("%s : %i" % (word, count))
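To see what each step of the word count produces, here is a local sketch of the same three operations. The helpers `flat_map` and `reduce_by_key` are hypothetical plain-Python stand-ins that mimic the semantics on an in-memory list; they are not PySpark functions, and the input lines are made up.

```python
def flat_map(func, items):
    """Apply func to each item and flatten the resulting sequences."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Merge the values of equal keys with func, like a per-key reduce."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


lines = ["to be or", "not to be"]
words = flat_map(lambda x: x.split(' '), lines)      # one flat list of words
pairs = [(x, 1) for x in words]                      # (word, 1) per occurrence
counts = reduce_by_key(lambda x, y: x + y, pairs)    # sum the 1s per word
```

The real job differs mainly in that each step runs across partitions on the cluster and `reduceByKey` shuffles data between workers.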
Demo
Implementation
Built on top of Java API
[Stack diagram: PySpark sits on the Java API, which sits on Spark, which runs on Local Mode, Mesos, Standalone, or YARN.]
Process data in Python
and persist / transfer it in Java
Re-uses Spark's scheduling, broadcast, checkpointing, networking, fault recovery, and HDFS access
PySpark has a small codebase: < 2K lines, including comments

| File | blank | comment | code |
|---|---|---|---|
| `python/pyspark/rdd.py` | 115 | 345 | 302 |
| `core/src/main/scala/spark/api/python/PythonRDD.scala` | 33 | 45 | 231 |
| `python/pyspark/context.py` | 32 | 101 | 133 |
| `python/pyspark/tests.py` | 26 | 11 | 84 |
| `python/pyspark/accumulators.py` | 37 | 91 | 70 |
| `python/pyspark/serializers.py` | 21 | 7 | 55 |
| `python/pyspark/join.py` | 15 | 27 | 50 |
| `python/pyspark/worker.py` | 8 | 7 | 44 |
| `core/src/main/scala/spark/api/python/PythonPartitioner.scala` | 5 | 9 | 34 |
| `pyspark` | 9 | 8 | 27 |
| `python/pyspark/java_gateway.py` | 5 | 7 | 26 |
| `python/pyspark/files.py` | 7 | 14 | 17 |
| `python/pyspark/broadcast.py` | 8 | 16 | 15 |
| `python/pyspark/shell.py` | 4 | 6 | 8 |
| `python/pyspark/__init__.py` | 6 | 14 | 7 |
| SUM: | 331 | 708 | 1103 |
Data Flow
[Diagram, built up over six slides. Local side: the Python SparkContext talks to a Spark Context in the JVM through a Py4J socket, with the local FS also linking the two. Cluster side: each Spark Worker is connected by pipes to several Python processes.]
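The worker-to-Python pipe in the diagram can be imitated with a subprocess. The sketch below is an assumption-laden illustration, not PySpark's actual protocol: the length-prefixed framing, the names `WORKER_SOURCE`, `frame`, `unframe`, and `run_in_python_worker`, and the hard-coded `upper()` call are all invented here. In PySpark the function to apply is shipped to the worker as well.

```python
import pickle
import struct
import subprocess
import sys

# Source of a tiny "Python worker": read framed pickles from stdin,
# apply a function, write framed pickles to stdout.
WORKER_SOURCE = r'''
import pickle, struct, sys

def read_frames(stream):
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (size,) = struct.unpack(">I", header)
        yield pickle.loads(stream.read(size))

out = sys.stdout.buffer
for item in read_frames(sys.stdin.buffer):
    payload = pickle.dumps(item.upper(), 2)
    out.write(struct.pack(">I", len(payload)) + payload)
out.flush()
'''


def frame(obj):
    """Pickle obj and prefix it with a 4-byte big-endian length."""
    payload = pickle.dumps(obj, 2)
    return struct.pack(">I", len(payload)) + payload


def unframe(data):
    """Split a byte string of length-prefixed pickles back into objects."""
    items, offset = [], 0
    while offset < len(data):
        (size,) = struct.unpack_from(">I", data, offset)
        offset += 4
        items.append(pickle.loads(data[offset:offset + size]))
        offset += size
    return items


def run_in_python_worker(items):
    """Send items to a Python subprocess over a pipe and collect its output."""
    proc = subprocess.Popen([sys.executable, "-c", WORKER_SOURCE],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    stdout, _ = proc.communicate(b"".join(frame(item) for item in items))
    return unframe(stdout)


result = run_in_python_worker(["spark", "python"])
```

The parent process here plays the role of the Spark Worker: it only moves opaque bytes, while all Python-level work happens in the child.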
Data is stored as Pickled objects in RDD[Array[Byte]]
Storing batches of Python objects in one Scala object reduces overhead
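A small experiment shows why batching helps. This is a sketch using only the standard `pickle` module; the helper names and the batch size of 100 are arbitrary choices for illustration, and each `bytes` value stands in for one `Array[Byte]` element on the JVM side.

```python
import pickle


def pickle_individually(objects):
    """One byte string per Python object."""
    return [pickle.dumps(obj, 2) for obj in objects]


def pickle_in_batches(objects, batch_size):
    """One byte string per batch of Python objects."""
    return [pickle.dumps(objects[i:i + batch_size], 2)
            for i in range(0, len(objects), batch_size)]


def unpickle_batches(blobs):
    """Flatten batched pickles back into the original sequence."""
    return [obj for blob in blobs for obj in pickle.loads(blob)]


records = list(range(1000))
individual = pickle_individually(records)   # 1000 byte strings
batched = pickle_in_batches(records, 100)   # 10 byte strings

individual_bytes = sum(len(blob) for blob in individual)
batched_bytes = sum(len(blob) for blob in batched)
```

Batching cuts the number of stored elements by the batch size, and for small records like these it also shrinks the total byte count, because the per-pickle header and terminator are paid once per batch.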
When possible, RDD transformations are pipelined:

    lines.flatMap(lambda x: x.split(' ')) \
         .map(lambda x: (x, 1))

Before pipelining: MappedRDD with func(x) = x.split(' '), followed by MappedRDD with func(x) = (x, 1)
After pipelining: a single MappedRDD with func(x) = (x.split(' '), 1)
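Pipelining amounts to function composition over iterators. The sketch below shows the idea in plain Python; `flat_map_step`, `map_step`, and `pipeline` are hypothetical helpers, not PySpark internals, and the fused function is applied to a local iterator rather than to a partition.

```python
def flat_map_step(func):
    """Turn func into an iterator -> iterator step that flattens its results."""
    return lambda iterator: (out for item in iterator for out in func(item))


def map_step(func):
    """Turn func into an iterator -> iterator step."""
    return lambda iterator: (func(item) for item in iterator)


def pipeline(*steps):
    """Compose several steps into one function that makes a single pass."""
    def run(iterator):
        for step in steps:
            iterator = step(iterator)
        return iterator
    return run


fused = pipeline(
    flat_map_step(lambda x: x.split(' ')),
    map_step(lambda x: (x, 1)),
)
result = list(fused(iter(["a b", "c"])))
```

Because the steps are generators, each input line flows through both functions before the next line is read, so no intermediate collection is materialized between the two transformations.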
Python functions and closures are serialized using PiCloud’s CloudPickle module
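The need for a special pickler is easy to reproduce: the standard `pickle` module stores functions by reference to an importable name, so it refuses lambdas and closures. The sketch below demonstrates that, then tries today's standalone `cloudpickle` package if it happens to be installed (an assumption; it is a third-party dependency). `make_adder` is a made-up example function.

```python
import pickle


def make_adder(n):
    return lambda x: x + n   # closure capturing n


add_five = make_adder(5)

# The standard pickler cannot serialize the closure.
try:
    pickle.dumps(add_five)
    stdlib_can_pickle = True
except (pickle.PicklingError, AttributeError, TypeError):
    stdlib_can_pickle = False

# cloudpickle serializes the function body and captured variables by value.
try:
    import cloudpickle   # optional third-party package
except ImportError:
    cloudpickle = None

if cloudpickle is not None:
    payload = cloudpickle.dumps(add_five)
    restored = pickle.loads(payload)   # plain pickle can load the result
    roundtrip_result = restored(10)
else:
    roundtrip_result = None
```

Serializing by value is what lets a lambda typed into the shell be shipped to worker processes that never imported the code that defined it.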
Roadmap
Available in Spark 0.7
Thanks!
Bonus Slides
Pickle is a miniature stack language

    >>> import cPickle, pickletools
    >>> x = ["Hello", "World!"]
    >>> pickletools.dis(cPickle.dumps(x, 2))
        0: \x80 PROTO      2
        2: ]    EMPTY_LIST
        3: q    BINPUT     1
        5: (    MARK
        6: U        SHORT_BINSTRING 'Hello'
       13: q        BINPUT     2
       15: U        SHORT_BINSTRING 'World!'
       23: q        BINPUT     3
       25: e        APPENDS    (MARK at 5)
       26: .    STOP
    highest protocol among opcodes = 2
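The opcode stream can also be inspected programmatically. This sketch uses `pickletools.genops` from the standard library on Python 3; `opcode_names` is a hypothetical helper. On Python 3 the two strings should show up as unicode opcodes rather than `SHORT_BINSTRING`, since the listing above comes from Python 2's `cPickle`.

```python
import pickle
import pickletools


def opcode_names(obj, protocol=2):
    """Return the names of the pickle opcodes used to serialize obj."""
    data = pickle.dumps(obj, protocol)
    return [opcode.name for opcode, arg, pos in pickletools.genops(data)]


names = opcode_names(["Hello", "World!"])
```

Reading the names in order gives the stack program: push an empty list, set a mark, push the items, then `APPENDS` pops everything above the mark into the list and `STOP` returns the top of the stack.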
You can do crazy stuff, like converting a collection of pickled objects into a pickled collection.
https://gist.github.com/JoshRosen/3384191
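As a rough idea of how such a conversion can work (this is an independent sketch, not the code from the gist), each protocol-2 pickle is a `PROTO` header, a body that leaves one object on the stack, and a `STOP`. Stripping the header and `STOP` from every pickle and wrapping the concatenated bodies in `EMPTY_LIST MARK ... APPENDS STOP` yields a single pickle of a list. The name `merge_pickles` is hypothetical, and the trick assumes unframed protocol-2 input.

```python
import pickle

PROTO_2 = b"\x80\x02"   # PROTO opcode followed by protocol number 2
EMPTY_LIST, MARK, APPENDS, STOP = b"]", b"(", b"e", b"."


def merge_pickles(pickles):
    """Combine protocol-2 pickles of objects into one pickle of a list."""
    bodies = []
    for blob in pickles:
        if not (blob.startswith(PROTO_2) and blob.endswith(STOP)):
            raise ValueError("expected a protocol-2 pickle")
        # Keep only the opcodes that push the object onto the stack.
        bodies.append(blob[len(PROTO_2):-len(STOP)])
    return PROTO_2 + EMPTY_LIST + MARK + b"".join(bodies) + APPENDS + STOP


objects = [1, "two", {"three": 3}, [4, 5]]
pickles = [pickle.dumps(obj, 2) for obj in objects]
merged = merge_pickles(pickles)
restored = pickle.loads(merged)
```

Memo slots written by one body may be overwritten by the next, which is harmless here because each body only reads memo entries it wrote itself.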
Bulk depickling can be faster even if it involves Pickle opcode manipulation:
https://gist.github.com/JoshRosen/3401373
10000 integers:

| Method | Time |
|---|---|
| Bulk depickle (chunk size = 2) | 0.266709804535 |
| Bulk depickle (chunk size = 10) | 0.0797798633575 |
| Bulk depickle (chunk size = 100) | 0.0388460159302 |
| Bulk depickle (chunk size = 1000) | 0.0333180427551 |
| Individual depickle | 0.0540158748627 |

10000 dicts (dict([ (str(n), n) for n in range(100) ])):

| Method | Time |
|---|---|
| Bulk depickle (chunk size = 2) | 2.70617198944 |
| Bulk depickle (chunk size = 10) | 2.30310201645 |
| Bulk depickle (chunk size = 100) | 2.22087192535 |
| Bulk depickle (chunk size = 1000) | 2.22118020058 |
| Individual depickle | 2.44124102592 |
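A comparison of this kind can be set up with `timeit`. The harness below is a simplified sketch, not the benchmark from the gist: chunks are pickled as lists up front instead of being assembled by opcode manipulation, the repeat count of 5 is arbitrary, and absolute numbers will differ by machine and Python version.

```python
import pickle
import timeit


def make_chunks(objects, chunk_size):
    """Pickle the objects as lists of chunk_size items each."""
    return [pickle.dumps(objects[i:i + chunk_size], 2)
            for i in range(0, len(objects), chunk_size)]


def bulk_depickle(chunks):
    """Load each chunk with one pickle.loads call and flatten the result."""
    return [obj for chunk in chunks for obj in pickle.loads(chunk)]


def individual_depickle(blobs):
    """Load every object with its own pickle.loads call."""
    return [pickle.loads(blob) for blob in blobs]


objects = list(range(10000))
blobs = [pickle.dumps(obj, 2) for obj in objects]

timings = {
    "individual": timeit.timeit(lambda: individual_depickle(blobs), number=5),
}
for chunk_size in (2, 10, 100, 1000):
    chunks = make_chunks(objects, chunk_size)
    timings[chunk_size] = timeit.timeit(lambda: bulk_depickle(chunks), number=5)
```

Printing `timings` gives one total per configuration, which can be laid out like the tables above.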