Operational Tips for Deploying Spark by Miklos Christine
Uploaded by: spark-summit
Category: Data & Analytics
![Page 1: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/1.jpg)
Operational Tips for Deploying Spark
Miklos Christine, Solutions Engineer, Databricks
![Page 2: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/2.jpg)
$ whoami
• Previously @ Cloudera
• Deep Knowledge of Big Data Stack
• Apache Spark Expert
• Solutions Engineer @ Databricks!
![Page 3: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/3.jpg)
Agenda
• Quick Apache Spark Overview
• Configuration Systems
• Pipeline Design Best Practices
• Debugging Techniques
![Page 4: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/4.jpg)
Apache Spark
![Page 5: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/5.jpg)
Spark Configuration
![Page 6: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/6.jpg)
Spark Core Configuration
• Command Line: spark-defaults.conf, spark-env.sh
• Programmatically: SparkConf()
• Hadoop Configs: core-site.xml, hdfs-site.xml
// Print SparkConfig
println(sc.getConf.toDebugString)
// Print Hadoop Config
val hdConf = sc.hadoopConfiguration.iterator()
while (hdConf.hasNext) {
  println(hdConf.next().toString())
}
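When the same property appears in more than one of these places, Spark's documentation gives the order of precedence: values set directly on SparkConf win over spark-submit flags, which win over spark-defaults.conf. The following is a minimal pure-Python sketch of that resolution order. It does not call Spark, the function names and sample values are hypothetical, and the parser handles only the simple whitespace-separated `key value` form.

```python
def parse_defaults(text):
    """Parse spark-defaults.conf style text: one 'key value' pair per line."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # key, then the rest of the line as the value
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


def resolve(defaults_file, submit_flags, spark_conf):
    """Merge the three sources from lowest to highest precedence."""
    merged = dict(defaults_file)
    merged.update(submit_flags)  # --conf flags override the defaults file
    merged.update(spark_conf)    # values set on SparkConf override both
    return merged


# Hypothetical spark-defaults.conf contents.
defaults_text = """
# cluster-wide defaults
spark.executor.memory   4g
spark.serializer        org.apache.spark.serializer.KryoSerializer
"""

effective = resolve(
    parse_defaults(defaults_text),
    {"spark.executor.memory": "8g"},    # hypothetical --conf flag
    {"spark.app.name": "nightly-etl"},  # hypothetical SparkConf setting
)
for key in sorted(effective):
    print(key, "=", effective[key])
```

Comparing a merged view like this against the output of `sc.getConf.toDebugString` is one way to spot a setting that was silently overridden.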
![Page 7: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/7.jpg)
Spark SQL Configuration
• Set SQL Configs Through SQL Interface
SET key=value;
sqlContext.sql("SET spark.sql.shuffle.partitions=10")
• Tools to see current configurations
// View SparkSQL Config Properties
val sqlConf = sqlContext.getAllConfs
sqlConf.foreach(x => println(x._1 + " : " + x._2))
![Page 8: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/8.jpg)
Spark Pipeline Design
• File Formats
• Compression Codecs
• Spark APIs
• Job Profiles
![Page 9: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/9.jpg)
File Formats
• Text File Formats
– CSV
– JSON
• Avro Row Format
• Parquet Columnar Format
![Page 10: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/10.jpg)
Compression Codecs
• Choose and Analyze Compression Codecs
– Snappy, Gzip, LZO
• Configuration Parameters
– io.compression.codecs
– spark.sql.parquet.compression.codec
– spark.io.compression.codec
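"Analyze" here means measuring both compression ratio and CPU time on data that looks like your own. The sketch below shows the shape of such a comparison using only the Python standard library. Snappy and LZO are not in the standard library, so as an assumption it substitutes zlib at level 1 (fast), zlib at level 9 (dense; zlib uses the same DEFLATE algorithm as Gzip), and bz2. The sample rows are hypothetical.

```python
import bz2
import time
import zlib

# Hypothetical sample: repetitive CSV-like rows, similar in spirit to event data.
rows = ["%d,user_%d,click,2016-01-01" % (i, i % 50) for i in range(20000)]
payload = "\n".join(rows).encode("utf-8")

# Stand-in codecs; swap in the codecs you actually plan to deploy.
candidates = {
    "zlib level 1 (fast)": lambda data: zlib.compress(data, 1),
    "zlib level 9 (dense)": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data),
}

results = {}
for name, compress in candidates.items():
    start = time.perf_counter()
    compressed = compress(payload)
    elapsed = time.perf_counter() - start
    # Ratio below 1.0 means the output is smaller than the input.
    results[name] = (len(compressed) / len(payload), elapsed)
    print("%-22s ratio=%.3f time=%.4fs" % (name, results[name][0], elapsed))
```

Timings vary by machine, so the useful result is the relative ordering on your own data, not the absolute numbers.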
![Page 11: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/11.jpg)
Small Files Problem
• Small files problem still exists
• Metadata loading
• Use coalesce()
Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
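One way to pick the argument to coalesce() is to divide the total input size by a target output file size. The sketch below does that arithmetic in plain Python. The 128 MiB target and the helper name are assumptions for illustration, not values from the talk.

```python
import math


def target_partitions(file_sizes_bytes, target_file_bytes=128 * 1024 * 1024):
    """Return how many output partitions keep files near the target size."""
    total = sum(file_sizes_bytes)
    # Always keep at least one partition, even for empty input.
    return max(1, math.ceil(total / target_file_bytes))


# Hypothetical input: 10,000 files of 1 MiB each.
small_files = [1024 * 1024] * 10000
n = target_partitions(small_files)
print(n)  # 10000 MiB / 128 MiB = 78.125, rounded up to 79
# With a real DataFrame, this value would be passed as df.coalesce(n).
```

Note that coalesce() only reduces the partition count. If the requested number is larger than the current one, the DataFrame keeps its current number of partitions.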
![Page 12: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/12.jpg)
Partitioning
• 2 Types of Partitioning
– File level and Spark
# Get Number of Spark Partitions
df.rdd.getNumPartitions()
40
df.write.\
    partitionBy("colName").\
    saveAsTable("tableName")
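File-level partitioning differs from Spark's in-memory partitions: partitionBy() lays the output out in Hive-style directories named `column=value`, and the partition column is encoded in the path instead of being stored in the data files. The plain-Python sketch below imitates that layout for a handful of hypothetical rows. The helper name and sample data are assumptions, and nothing is written to disk.

```python
from collections import defaultdict


def partition_paths(rows, column, base="tableName"):
    """Group rows into Hive-style directories of the form base/column=value."""
    layout = defaultdict(list)
    for row in rows:
        # The partition column lives in the directory name, not in the file.
        remaining = {k: v for k, v in row.items() if k != column}
        layout["%s/%s=%s" % (base, column, row[column])].append(remaining)
    return dict(layout)


# Hypothetical rows keyed by the partition column "colName".
rows = [
    {"colName": "US", "amount": 10},
    {"colName": "DE", "amount": 7},
    {"colName": "US", "amount": 3},
]
layout = partition_paths(rows, "colName")
for path in sorted(layout):
    print(path, layout[path])
```

A column with many distinct values creates many directories, which can reintroduce the small files problem, so low-cardinality columns are usually the safer choice.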
![Page 13: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/13.jpg)
Spark Job Profiles
• Leverage Spark UI
– SQL
– Streaming
![Page 14: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/14.jpg)
Spark Job Profiles
![Page 15: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/15.jpg)
Spark Job Profiles
![Page 16: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/16.jpg)
Job Profiles: Monitoring
• Monitoring & Metrics
– Spark
– Servers
• Toolset
– Ganglia
– Graphite
Ref: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
![Page 17: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/17.jpg)
Debugging Spark
● Analyze the driver's stack trace.
● Analyze the executors' stack traces
– Find the initial executor's failure.
● Review metrics
– Memory
– Disk
– Networking
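Finding the initial executor failure matters because later errors are often only consequences of the first one. A minimal sketch of that step: collect error records from all executors and take the earliest by timestamp. The record layout below is hypothetical and already parsed; it is not Spark's actual log format.

```python
from datetime import datetime

# Hypothetical, already-parsed log records: (timestamp, source, level, message).
records = [
    ("2016-01-01 10:04:12", "executor-7", "ERROR", "task failed: connection reset"),
    ("2016-01-01 10:03:58", "executor-3", "ERROR", "java.lang.OutOfMemoryError: Java heap space"),
    ("2016-01-01 10:03:40", "executor-3", "INFO", "task started"),
    ("2016-01-01 10:04:30", "driver", "ERROR", "job aborted due to stage failure"),
]


def first_failure(records):
    """Return the earliest ERROR record, or None if there are no errors."""
    errors = [r for r in records if r[2] == "ERROR"]
    if not errors:
        return None
    return min(errors, key=lambda r: datetime.strptime(r[0], "%Y-%m-%d %H:%M:%S"))


root = first_failure(records)
print(root[1], "->", root[3])
```

In this made-up sample the driver's error and the failure on executor-7 both come after the out-of-memory error on executor-3, which is where the investigation would start.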
![Page 18: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/18.jpg)
Top Support Issues
● OutOfMemoryErrors
– Driver
– Executors
● Out of Disk Space Issues
● Long GC Pauses
● API Usage
![Page 19: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/19.jpg)
Top Support Issues
● Use built-in functions instead of custom UDFs
– import pyspark.sql.functions
– import org.apache.spark.sql.functions
● Examples:
– to_date()
– get_json_object()
– regexp_extract()
Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
![Page 20: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/20.jpg)
Top Support Issues
● SQL Joins
– df_users.join(df_orders).explain()
– set spark.sql.autoBroadcastJoinThreshold
● Exported Parquet from External Systems
– spark.sql.parquet.binaryAsString
● Tune number of Shuffle Partitions
– spark.sql.shuffle.partitions
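The effect of spark.sql.autoBroadcastJoinThreshold can be reasoned about with a simple rule: a table whose estimated size is at or below the threshold is a candidate for a broadcast join, and setting the threshold to -1 disables broadcasting. The sketch below encodes only that rule in plain Python. The function name is hypothetical, and the real planner also depends on available statistics and the join type, so treat `explain()` output as the source of truth.

```python
# Documented default of spark.sql.autoBroadcastJoinThreshold: 10 MB.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def would_broadcast(estimated_table_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Simplified rule: broadcast when the size estimate fits the threshold."""
    if threshold_bytes < 0:
        return False  # -1 disables automatic broadcast joins
    return estimated_table_bytes <= threshold_bytes


# Hypothetical size estimates for the smaller side of a join.
print(would_broadcast(2 * 1024 * 1024))       # small lookup table
print(would_broadcast(500 * 1024 * 1024))     # large fact table
print(would_broadcast(2 * 1024 * 1024, -1))   # broadcasting disabled
```

If `explain()` shows a shuffle-based join where a broadcast was expected, the size estimate for the smaller table is probably above the threshold or missing.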
![Page 21: Operational Tips for Deploying Spark by Miklos Christine](https://reader031.fdocuments.in/reader031/viewer/2022030310/58f9a94d760da3da068b6d43/html5/thumbnails/21.jpg)
Thank You
[email protected]
www.linkedin.com/in/mrchristine