Basic Spark Programming and Performance Diagnosis
Jinliang Wei, 15-719, Spring 2017
Recitation
Today’s Agenda
• PySpark shell and submitting jobs
• Basic Spark programming – Word Count
• How does Spark execute your program?
• Spark monitoring web UI
• What is shuffle and how does it work?
• Spark programming caveats
• Generally good practices
• Important configuration parameters
• Basic performance diagnosis
PySpark shell and submitting jobs
Launch A Spark + HDFS Cluster on EC2
• First, set the environment variables:
  – AWS_SECRET_ACCESS_KEY
  – AWS_ACCESS_KEY_ID
• Get spark-ec2-setup
• Launch a cluster with 4 slave nodes:
    ./spark-ec2 -k <key-id> -i <identity-file> \
      -t m4.xlarge -s 4 -a ami-6d15ec7b \
      --ebs-vol-size=200 --ebs-vol-num=1 \
      --ebs-vol-type=gp2 \
      --spot-price=<proper-price> \
      launch SparkCluster
• Log in as root.
• Replace launch with destroy to terminate the cluster.
Your Standalone Spark Cluster
[Figure: one Master node managing Worker1 and Worker2]
• The Spark master is the cluster manager (analogous to YARN/Mesos).
• Workers are sometimes referred to as slaves.
• When your application is submitted, worker nodes run executors, which are processes that run computations and store data for your application.
• By default, an executor uses all cores on a worker node.
• Configurable via spark.executor.cores (normally left at the default unless there are too many cores per node).
Standalone Spark Master Web UI
http://[master-node-public-ip]:8080
For an overview of the cluster and the state of each worker.
PySpark Shell
• Spark is installed under /root/spark
• Launch the PySpark shell:
    /root/spark/bin/pyspark
Simple math using PySpark Shell
• Define a list of numbers: a = [1, 3, 7, 4, 2]
• Create an RDD from that list: rdd_a = sc.parallelize(a)
• Double each element: rdd_b = rdd_a.map(lambda x: x * 2)
• Sum the elements up: c = rdd_b.reduce(lambda x, y: x + y)
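Putting those four steps together, here is a minimal sketch of the whole shell session (it relies on the SparkContext sc that the pyspark shell creates for you):

    a = [1, 3, 7, 4, 2]
    rdd_a = sc.parallelize(a)              # distribute the list as an RDD
    rdd_b = rdd_a.map(lambda x: x * 2)     # transformation: double each element
    c = rdd_b.reduce(lambda x, y: x + y)   # action: sum the elements
    print(c)                               # 34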
Submit Applica@ons to Spark
• Suppose you have a Spark program named word_count.py; submit it by running:
    /root/spark/bin/spark-submit \
      [optional arguments to spark-submit] \
      word_count.py \
      [arguments to your program]
What happens when you submit your application?
• Your program (driver program) runs in “client” mode – a client outside of the Spark master.
• Spark launches executors on the worker nodes. • SparkContext sends tasks to the executors to run.
Basic Spark Programming – Word Count
How to implement word count with map-reduce?
• Problem: given a document, count the occurrences of each word
• Map: take in a chunk of the document, output a list of pairs of (word, 1)
• Shuffle: group KV pairs by their key (word), assign each group to a reducer
• Reduce: sum up the values of each group
How to implement it using Spark?

    import pyspark

    if __name__ == "__main__":
        # set up the application and its entry point into Spark
        conf = pyspark.SparkConf().setAppName("WordCount")
        sc = pyspark.SparkContext(conf=conf)

        # read the input, emit (word, 1) pairs, and sum the counts per word
        text_rdd = sc.textFile("/README.md")
        tokens_rdd = text_rdd.flatMap(
            lambda x: [(a, 1) for a in x.split()])
        count_rdd = tokens_rdd.reduceByKey(lambda x, y: x + y)

        # collect() is the action that triggers the computation
        tokens_count = count_rdd.collect()

        sc.stop()

        # print the 10 most frequent words
        tokens_count.sort(key=lambda x: x[1], reverse=True)
        count = 0
        for token_tuple in tokens_count:
            print "(%s, %d)" % token_tuple
            count += 1
            if count >= 10:
                break
Lazy Evaluation
• Two kinds of operations on an RDD:
  – Transformation: RDD_A -> RDD_B, e.g. flatMap
  – Action: RDD_A -> outside Spark, e.g. collect
• Transformations are "lazily evaluated":
  – The dependency information is recorded when the transformation is called.
  – It is evaluated only when necessary.
• An action causes the RDD and the ones it depends on to be computed.
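As a small illustration of laziness (a sketch, again assuming the shell-provided sc):

    rdd = sc.parallelize([1, 2, 3, 4])
    doubled = rdd.map(lambda x: x * 2)   # transformation: nothing runs yet,
                                         # Spark only records the lineage
    result = doubled.collect()           # action: the map is executed now
    # result == [2, 4, 6, 8]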
How does Spark execute your program?
Why should you care? Because you may need to do performance diagnosis and understand the terminology to interpret the Spark monitoring UIs.
The lineage graph is built when transformations are invoked
[Figure: lineage graph for the word count – text_rdd → tokens_rdd → count_rdd, each split into partition1, partition2, partition3; the map side has narrow dependencies, the shuffle introduces a wide dependency]
Pipelined execution: a sequence of transformations applied to each record (partition), executed independently of other records (partitions).
Shuffle: every node reads from every other node; may cause a global barrier.
An action causes the actual evaluation
• Spark calls it a job.
• If the RDD on which the action was invoked exists, then compute the action, else compute the RDD.
• Computing an RDD recursively computes its parent RDDs.
Build a DAG of stages from the lineage graph
• RDDs with narrow dependencies between them are grouped into the same stage.
• Stage boundaries are shuffles. • Each task is scheduled to a core.
[Figure: the word-count DAG – stage 1 (text_rdd → tokens_rdd) and stage 2 (count_rdd); each partition within a stage is a task]
A stage is computed as a set of parallel tasks
• Each partition is a task.
• You may control the number of partitions of an RDD:
  – partitionBy(num_partitions)
  – Some operations allow you to explicitly specify the number of partitions
  – Configuration parameter: spark.default.parallelism
• This is where most of the parallelism comes from
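A few ways this shows up in code, as a hedged sketch (the partition counts are illustrative; tokens_rdd is the pair RDD from the word-count example):

    rdd = sc.parallelize(range(1000), 16)        # ask for 16 partitions explicitly
    counts = tokens_rdd.reduceByKey(lambda x, y: x + y, numPartitions=16)
    by_key = tokens_rdd.partitionBy(16)          # repartition a key-value RDD by key
    print(rdd.getNumPartitions())                # check how many partitions an RDD has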
What's the proper number of partitions for an RDD?
• Want sufficient parallelism and balanced load.
  – Rule of thumb: at least 2 times the number of cores
• Don't want too many tasks, otherwise most of the time will be spent setting up tasks.
  – Rule of thumb: at least hundreds of milliseconds per task
• Make sure each partition can fit in memory.
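For example, on the 4-slave m4.xlarge cluster above (4 cores per worker, so 16 cores in total), the first rule of thumb suggests at least 32 partitions.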
How does Spark run Python code?
• Your Python UDFs are executed in Python processes.
• RDD records need to be transferred between the JVM and Python.
• Serialization can be a performance problem.
PySpark "pipelines" Python functions automatically
• If you apply multiple transformations in a series, Spark "fuses" the Python UDFs to avoid multiple transfers between Python and the JVM.
• Example: rdd_x.map(foo).map(bar)
  – Function foo(x) takes in a record x and outputs a record y
  – Function bar(y) takes in a record y and outputs a record z
  – Spark automatically creates a function foo_bar(x) that takes in a record x and outputs a record z, which is essentially bar(foo(x)).
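As an illustrative sketch (rdd_x, foo, and bar are the hypothetical names from the example above):

    def foo(x):
        return x + 1        # record x -> record y

    def bar(y):
        return y * 10       # record y -> record z

    rdd_z = rdd_x.map(foo).map(bar)
    # behaves as if a single fused function ran per record:
    #   rdd_z = rdd_x.map(lambda x: bar(foo(x)))
    # so each record crosses the JVM <-> Python boundary once per stage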
Spark Monitoring Web UIs
Live Monitoring Web UI
http://[master-node-public-ip]:4040
How is my running application doing?
History Server
http://[master-node-public-ip]:18080
Visualizes the logs of completed applications.
The job view
• Jobs – why is there only one job?
Details for a job
• Stages: what operations are in stages 1 and 2?
Understanding the DAG Visualization
• Dots are RDDs.
• Dots inside the blue box are RDDs in the JVM.
• Text labels are the transformations that generate the RDDs.
  – Problem: PySpark uses some transformations to implement other transformations (e.g. reduceByKey is implemented via partitionBy and mapPartitions), so the labels do not exactly match your code.
  – But if you know the stage boundaries, you can figure out which operations belong to which stage.
Details for a stage
Event Timeline
Recap: Stage DAG
• Each RDD partition corresponds to a task.
• The number of RDD partitions can often be controlled.
[Figure: the word-count stage DAG again – stage 1 (text_rdd → tokens_rdd) and stage 2 (count_rdd)]
What is shuffle write and shuffle read?
What is shuffle and how does it work?
What is shuffle and what is it used for?
• Informally, a mechanism that redistributes the partitioned RDD records.
• Informally, it is needed whenever you need records that satisfy a certain condition (e.g. having the same key) to reside in the same partition.
[Figure: before the shuffle, the partitions hold mixed keys – (“a”, 1), (“b”, 1), (“d”, 1) and (“a”, 1), (“b”, 1), (“c”, 1); after the shuffle, records are grouped by key – (“a”, 1), (“a”, 1), (“c”, 1) and (“b”, 1), (“b”, 1), (“d”, 1)]
Operations that may cause a shuffle
• partitionBy
• reduceByKey
• groupByKey
• …
How is shuffle implemented?
• Two implementations: hash shuffle and sort shuffle
• You don't need to know the details for this project. If curious, read this blog post:
  https://0x0fff.com/spark-architecture-shuffle/
• You need to know:
  – Mappers (sources) serialize RDD records and write them to local disk (shuffle write).
  – Reducers (destinations) read their partition of records from remote disks over the network (shuffle read).
Shuffle is expensive
• Data is serialized, written to local disk, and communicated over the network.
  – Serialization takes time; disk and network are slow.
• Everyone depends on everyone else; if there is a straggler, everyone has to wait.
• Minimize the number of shuffles in your program
Spark Programming Caveats
Understanding Closures
• Informally, a closure is a function together with the surrounding environment captured when the closure is created.
• The driver program sends closures to executors to have them executed.
• RDD operations (closures) that modify variables outside their scope often cause confusion (generally, don't do that).
What’s the behavior of this code?
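The code on the slide is not preserved in this transcript; it is presumably a counter example along these lines (a sketch mirroring the closure example from the Spark programming guide):

    counter = 0
    rdd = sc.parallelize(range(10))

    def increment_counter(x):
        global counter
        counter += x            # modifies a variable defined on the driver

    rdd.foreach(increment_counter)    # the closure runs on the executors

    print("Counter value:", counter)  # prints 0 on a cluster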
• The driver’s counter is captured when the closure is created and then visible to executors.
• The global counter that the executor modifies is the executor's local copy, i.e. writes are not seen by the driver.
Broadcast variable
• broadcastVar.value can be read by any worker anytime after it is created.
• Read-only variable (to avoid dealing with concurrent writes)
• One copy per executor
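A minimal sketch using the standard PySpark broadcast API (the lookup table is illustrative); tasks read broadcastVar.value instead of capturing the table in every closure:

    lookup = {"a": 1, "b": 2, "c": 3}
    broadcastVar = sc.broadcast(lookup)          # shipped once per executor

    rdd = sc.parallelize(["a", "b", "c", "a"])
    mapped = rdd.map(lambda k: broadcastVar.value.get(k, 0))
    print(mapped.collect())                      # [1, 2, 3, 1]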
Ways to communicate values from driver to executors or tasks
• Create RDDs
• Closures
• Broadcast variables
• Question: when should you use each one?
  – Closure: small values that are only useful for this function
– Broadcast variable: more efficient for larger variables and when you want to reuse the values across stages
– RDDs: when the variable is too large
How do executors send values to driver?
• Use RDD actions
• Accumulators
  – Only allow associative and commutative operations
  – Because that way concurrent writes can easily be dealt with
  – Read the Spark programming guide for details
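A minimal accumulator sketch (addition is associative and commutative, so Spark can merge updates from concurrent tasks):

    accum = sc.accumulator(0)
    sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
    print(accum.value)    # 10, readable on the driver only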
RDD Persistence
• Spark is "in-memory" – what does that mean?
• Spark is capable of persisting (or caching) an RDD in memory across actions (jobs).
  – Hadoop can't.
  – Spark may persist RDDs on disk too.
• If RDDs are not persisted, they are recomputed for different actions.
  – RDDs are computed at most once per job.
• But you need to tell Spark which RDDs to persist.
• Spark sometimes persists an RDD automatically, but this is not very well specified.
Persisting an RDD
• persist() options:
  – MEMORY_ONLY: the default; if there is not enough memory, recompute
– MEMORY_AND_DISK: if not enough memory, persist on disk
– DISK_ONLY: persist on disk – A few others
• cache() is persist(MEMORY_ONLY)
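A hedged sketch of persisting tokens_rdd from the word-count example so that several actions can reuse it:

    from pyspark import StorageLevel

    tokens_rdd.persist(StorageLevel.MEMORY_AND_DISK)   # or tokens_rdd.cache() for MEMORY_ONLY

    total = tokens_rdd.count()     # first action computes and persists the RDD
    sample = tokens_rdd.take(10)   # later actions reuse the persisted partitions

    tokens_rdd.unpersist()         # release the storage when done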
Generally Good Practices
• Generally, avoid shuffles if you can.
  – A shuffle might be worth doing if it increases parallelism, e.g. more partitions, better load balancing.
• For a shuffle, pick the right operators – avoid transferring the entire RDD over the network.
  – Some operations do local aggregation before shuffling, e.g. groupByKey() + mapValues() vs. reduceByKey() (see the sketch below).
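For instance, in the word count both lines below produce the same per-word totals, but the second aggregates within each partition before shuffling, so far less data crosses the network (a sketch using tokens_rdd from earlier):

    counts_slow = tokens_rdd.groupByKey().mapValues(sum)        # shuffles every (word, 1) pair
    counts_fast = tokens_rdd.reduceByKey(lambda x, y: x + y)    # shuffles one partial count per word per partition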
Spark Properties – Per-Application Properties
The ones that you should understand
• spark.executor.memory: amount of memory to use per executor process (JVM heap size)
  – Optional reading – Spark memory management: http://spark.apache.org/docs/latest/tuning.html#memory-management-overview
• spark.default.parallelism: default number of partitions in RDDs returned by certain operations, when not set by the user
  – You can explicitly control the number of partitions in most cases
• More details (optional for Project 2): http://spark.apache.org/docs/latest/configuration.html#spark-properties
How to set those properties
• When calling spark-submit, use the option --conf "config.property=value". One property per --conf (examples after this list).
• Can be set programmatically using SparkConf when creating the SparkContext (doesn't work for all properties).
• conf/spark-defaults.conf (don’t do that for Project 2)
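Two hedged examples of setting these properties (the values are illustrative, not recommendations):

    /root/spark/bin/spark-submit \
      --conf "spark.executor.memory=4g" \
      --conf "spark.default.parallelism=32" \
      word_count.py

or programmatically, before the SparkContext is created:

    conf = pyspark.SparkConf().setAppName("WordCount") \
                              .set("spark.default.parallelism", "32")
    sc = pyspark.SparkContext(conf=conf)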
Basic Performance Diagnosis: What do I do if my application is running slow?
Q1: Which job and stage is the bottleneck?
• Check the Spark monitoring UIs
• Identify the bottleneck stage
Possible sources of bottleneck
• CPUs are not fully utilized
  – Network I/O
  – Disk I/O
  – Insufficient parallelism
  – Imbalance
• CPUs are highly utilized
Q1: Are you fully utilizing your CPUs?
• vmstat 2 20
  – One update every 2 seconds, for 20 updates
  – The first line is an average since the machine was booted
  – Good for a quick overview of the machine
Q2: Why are my CPUs not fully utilized?
• Generally you can find answers from the monitoring web UI
• Insufficient parallelism or imbalance?
  – Check the per-stage timeline
• Blocked on network or disk I/O?
  – Shuffle reads and writes
• How to optimize for those problems?
Q2: My CPUs are highly utilized, so?
• Which functions are your CPUs spending their time on? Answer: profile your code.
• Spark Python profiler:
    --conf "spark.python.profile=true"
    --conf "spark.python.profile.dump=/root/spark_profile"
• More details:
  http://spark.apache.org/docs/latest/configuration.html
  https://docs.python.org/2/library/profile.html
• If most time is spent in the JVM, this is not useful and it's beyond your control.
Basic Performance Diagnosis: What do I do if I get Out-Of-Memory (OOM) exceptions?
OOM can manifest as other exceptions.
A Common Pitfall
• Driver and executor memory sizes are configurable, and the defaults are 1g.
• You can configure them:
  – spark.driver.memory
  – spark.executor.memory
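For example, to raise both above the 1g defaults when submitting (the sizes are illustrative):

    /root/spark/bin/spark-submit \
      --conf "spark.driver.memory=4g" \
      --conf "spark.executor.memory=8g" \
      word_count.py

Note that spark.driver.memory must be set at submit time (or in spark-defaults.conf); setting it in SparkConf inside the program is too late, because the driver JVM has already started.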
Size of a partition matters
• Informally, for each task, the executor loads the corresponding partition into memory.
• If the partition cannot fit in memory, you get OOMs.
• Then you want more, smaller partitions.
• RDD partitioning is in units of records; if a single record is huge, then repartitioning won't help.
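A hedged sketch: if a stage OOMs because its partitions are too large, split the RDD into more, smaller partitions before the heavy operation (big_rdd and the factor of 4 are illustrative):

    smaller = big_rdd.repartition(big_rdd.getNumPartitions() * 4)

Keep in mind that repartition() itself triggers a shuffle.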