Spark Streaming Tips for Devs and Ops by Fran Pérez and Federico Fernández
Spark Streaming Tips for
Devs & Ops
WHO ARE WE?
Fede Fernández: Scala Software Engineer at 47 Degrees, Spark Certified Developer (@fede_fdz)
Fran Pérez: Scala Software Engineer at 47 Degrees, Spark Certified Developer (@FPerezP)
Overview
● Spark Streaming
● Spark + Kafka
● groupByKey vs reduceByKey
● Table Joins
● Serializers
● Tuning
Spark Streaming: real-time data processing.
(Diagram: a continuous data flow is discretized into a sequence of RDDs, a DStream, which is processed to produce the output data.)
Spark + Kafka
● Receiver-based Approach
○ At least once (with Write Ahead Logs)
● Direct API
○ Exactly once
groupByKey vs reduceByKey
● groupByKey
○ Groups pairs of data with the same key.
● reduceByKey
○ Groups and combines pairs of data based on a reduce operation.
groupByKey vs reduceByKey
sc.textFile("hdfs://….").flatMap(_.split(" ")).map((_, 1)).groupByKey.map(t => (t._1, t._2.sum))
sc.textFile("hdfs://….").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
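Both pipelines above compute identical counts; only the shuffle volume differs. A plain-Scala sketch of their semantics (the sample line is made up, no Spark cluster needed):

```scala
// Word count over a sample line, mimicking the two RDD pipelines above.
val words = "spark kafka spark streaming spark".split(" ").toList
val pairs = words.map((_, 1))

// groupByKey style: gather every (word, 1) pair, then sum each group.
val grouped = pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

// reduceByKey style: combine values pairwise with _ + _
// (in Spark this combine also runs map-side, before the shuffle).
val reduced = pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).reduce(_ + _)) }
```

Either way `"spark"` maps to 3; in Spark the difference is how many records cross the network.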
groupByKey
Input partitions:
(c, 1)(s, 1)(j, 1)
(s, 1)(c, 1)(c, 1)(j, 1)
(j, 1)(s, 1)(s, 1)
(s, 1)(j, 1)(j, 1)(c, 1)(s, 1)
After the shuffle, every pair travels to the partition that owns its key, and only then is summed:
(j, 1)(j, 1)(j, 1)(j, 1)(j, 1) → (j, 5)
(s, 1)(s, 1)(s, 1)(s, 1)(s, 1)(s, 1) → (s, 6)
(c, 1)(c, 1)(c, 1)(c, 1) → (c, 4)
Every single pair is shuffled across the network before any combining happens.
reduceByKey
Input partitions and their local (map-side) combine:
(s, 1)(j, 1)(j, 1)(c, 1)(s, 1) → (j, 2)(c, 1)(s, 2)
(j, 1)(s, 1)(s, 1) → (j, 1)(s, 2)
(s, 1)(c, 1)(c, 1)(j, 1) → (j, 1)(c, 2)(s, 1)
(c, 1)(s, 1)(j, 1) → (j, 1)(c, 1)(s, 1)
Only the pre-combined pairs are shuffled, then merged per key:
(j, 2)(j, 1)(j, 1)(j, 1) → (j, 5)
(s, 2)(s, 2)(s, 1)(s, 1) → (s, 6)
(c, 1)(c, 2)(c, 1) → (c, 4)
Far fewer records cross the network than with groupByKey.
reduceByKey vs groupByKey
● reduceByKey improves performance by combining values map-side before the shuffle.
● It can't always be used: the operation must be expressible as a reduce over the values.
● groupByKey can cause Out of Memory exceptions, since all values for a key are collected on one executor.
● Related combinators: aggregateByKey, foldByKey, combineByKey.
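aggregateByKey generalizes reduceByKey by letting the accumulated type differ from the value type, via a per-partition `seqOp` and a cross-partition `combOp`. A plain-Scala sketch of those semantics (the partition contents are invented; here the accumulator is a Set of distinct values per key):

```scala
// Two hypothetical partitions of (key, value) pairs.
val partitions: List[List[(String, Int)]] =
  List(List(("s", 1), ("j", 2), ("j", 3)), List(("s", 4), ("c", 5)))

// seqOp: fold one value into the per-partition accumulator (a Set here).
def seqOp(acc: Set[Int], v: Int): Set[Int] = acc + v
// combOp: merge two per-partition accumulators for the same key.
def combOp(a: Set[Int], b: Set[Int]): Set[Int] = a ++ b

// Map side: aggregate each partition independently (like the map-side combine).
val perPartition: List[Map[String, Set[Int]]] =
  partitions.map(_.foldLeft(Map.empty[String, Set[Int]]) { case (m, (k, v)) =>
    m.updated(k, seqOp(m.getOrElse(k, Set.empty[Int]), v))
  })

// Reduce side: merge the per-partition results with combOp.
val result: Map[String, Set[Int]] =
  perPartition.flatten.groupBy(_._1).map { case (k, kvs) =>
    k -> kvs.map(_._2).reduce(combOp)
  }
```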
Table Joins
● Typical operations that can be improved.
● They require analyzing the data beforehand.
● There are no silver bullets.
Table Joins: Medium - Large
Filter the large table before joining: rows that can never match are dropped early, so they are never shuffled.
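The idea can be sketched with plain Scala collections (table contents are invented): restrict the large side to keys that can actually match before joining.

```scala
// Hypothetical tables: a medium lookup table and a large fact table.
val medium = Map(1 -> "EUR", 2 -> "USD")
val large  = List((1, 10.0), (3, 99.0), (2, 5.0), (3, 1.0), (1, 7.0))

// Filter the large side down to joinable keys first; in Spark this
// shrinks the shuffle that the join would otherwise pay for.
val filtered = large.filter { case (k, _) => medium.contains(k) }
val joined   = filtered.map { case (k, v) => (k, (v, medium(k))) }
```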
Table Joins: Small - Large
Shuffled join: sqlContext.sql("explain <select>").collect.mkString("\n")
== Physical Plan ==
Project
+- SortMergeJoin
   :- Sort
   :  +- TungstenExchange hashpartitioning
   :     +- TungstenExchange RoundRobinPartitioning
   :        +- ConvertToUnsafe
   :           +- Scan ExistingRDD
   +- Sort
      +- TungstenExchange hashpartitioning
         +- ConvertToUnsafe
            +- Scan ExistingRDD
Table Joins: Small - Large
Broadcast Hash Join: sqlContext.sql("explain <select>").collect.mkString("\n")
== Physical Plan ==
Project
+- BroadcastHashJoin
   :- TungstenExchange RoundRobinPartitioning
   :  +- ConvertToUnsafe
   :     +- Scan ExistingRDD
   +- Scan ParquetRelation
No shuffle! The small table is broadcast to every executor.
Chosen by default from Spark 1.4 when using the DataFrame API. Prior to Spark 1.4, the small table needed statistics first:
ANALYZE TABLE small_table COMPUTE STATISTICS noscan
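The mechanics of a broadcast hash join can be sketched in plain Scala (data invented): ship the small table to every task as a hash map and probe it locally, so the large side never moves.

```scala
// Hypothetical small dimension table, broadcast as an in-memory map.
val small: Map[Int, String] = Map(1 -> "ES", 2 -> "US")
// Large fact table, kept partitioned where it already lives.
val large: List[(Int, Double)] = List((1, 3.5), (2, 1.0), (1, 2.2))

// Each task probes its local copy of `small`: no shuffle of `large`.
val joined = large.flatMap { case (k, v) => small.get(k).map(c => (k, v, c)) }
```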
Serializers
● Java's ObjectOutputStream framework (the default).
● Custom serializers: extend Serializable & Externalizable.
● KryoSerializer: register your custom classes.
● Know where your code is actually being run (driver vs executors).
● Take special care with JodaTime.
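Enabling Kryo and requiring registration can be done in spark-defaults.conf; a minimal sketch (the registrator class name `com.example.MyRegistrator` is hypothetical, it would extend KryoRegistrator and register your custom classes):

```
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator           com.example.MyRegistrator
spark.kryo.registrationRequired  true
```

With registrationRequired set, serializing an unregistered class fails fast instead of silently falling back to writing full class names.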
Tuning
● Garbage Collector
● blockInterval
● Partitioning
● Storage
Tuning: Garbage Collector
• Matters for applications which rely heavily on memory consumption.
• GC strategies:
  • Concurrent Mark Sweep (CMS) GC
  • ParallelOld GC
  • Garbage-First (G1) GC
• Tuning steps:
  • Review your logic and object management
  • Try Garbage-First
  • Activate and inspect the GC logs
Reference: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
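For the "try Garbage-First, then inspect the logs" steps, a hedged spark-submit fragment (flag values are illustrative and Java 8 era; tune for your own heap):

```
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps" \
  ...
```

The GC output then lands in the executor stdout logs, where pause times and promotion failures can be inspected.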
Tuning: blockInterval
blockInterval = (bi * consumers) / (pf * sc)
● CAT: total cores assigned to the application.
● bi: batch interval time in milliseconds.
● consumers: number of streaming consumers (receivers).
● pf (partitionFactor): number of partitions per core.
● sc (sparkCores): CAT - consumers.
blockInterval: example
● batchIntervalMillis = 600,000
● consumers = 20
● CAT = 120
● sparkCores = 120 - 20 = 100
● partitionFactor = 3
blockInterval = (bi * consumers) / (pf * sc) =
(600,000 * 20) / (3 * 100) =
40,000
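The arithmetic above, as a runnable sketch:

```scala
// blockInterval = (bi * consumers) / (pf * sc), with the example's values.
val bi = 600000                   // batch interval in ms
val consumers = 20                // streaming receivers
val cat = 120                     // total cores assigned to the application
val sparkCores = cat - consumers  // 100
val partitionFactor = 3           // partitions per core
val blockInterval = (bi.toLong * consumers) / (partitionFactor * sparkCores)
// blockInterval == 40000 ms
```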
Tuning: Partitioning
partitions = consumers * bi / blockInterval
● consumers: number of streaming consumers.
● bi: batch interval time in milliseconds.
● blockInterval: time interval at which received data is chunked into blocks before being stored in Spark.
Partitioning: example
● batchIntervalMillis = 600,000
● consumers = 20
● blockInterval = 40,000
partitions = consumers * bi / blockInterval =
20 * 600,000 / 40,000 =
300
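The same check in code; with these inputs the formula yields 300 partitions per batch:

```scala
// partitions = consumers * bi / blockInterval, with the example's values.
val consumers = 20
val bi = 600000            // batch interval in ms
val blockInterval = 40000  // ms
val partitions = consumers * bi / blockInterval
// 20 * 600,000 / 40,000 = 300
```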
Tuning: Storage
• Default: MEMORY_ONLY
• MEMORY_ONLY_SER with a serialization library
• MEMORY_AND_DISK & DISK_ONLY
• Replicated: add the _2 suffix (e.g. MEMORY_ONLY_2)
• OFF_HEAP (Tachyon/Alluxio)
Where to find more information?
● Spark Official Documentation
● Databricks Blog
● Databricks Spark Knowledge Base
● Spark Notebook, by Andy Petrella
● Databricks YouTube Channel