Spark Streaming Tips for Devs and Ops by Fran Pérez and Federico Fernández


Transcript of Spark Streaming Tips for Devs and Ops by Fran Pérez and Federico Fernández

Page 1

Spark Streaming Tips for Devs & Ops

Page 2

Page 3

WHO ARE WE?

Fede Fernández, Scala Software Engineer at 47 Degrees

Spark Certified Developer (@fede_fdz)

Fran Pérez, Scala Software Engineer at 47 Degrees

Spark Certified Developer (@FPerezP)

Page 4

Overview

● Spark Streaming
● Spark + Kafka
● groupByKey vs reduceByKey
● Table Joins
● Serializer
● Tuning

Page 5

Spark Streaming: real-time data processing

[Diagram: a continuous data flow is chopped into a sequence of RDDs; that sequence of RDDs is the DStream, which produces the output data.]
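A minimal sketch of that flow (the app name, host, and port below are hypothetical): a StreamingContext chops the incoming stream into micro-batches, and each batch is processed as an RDD.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-intro")
val ssc = new StreamingContext(conf, Seconds(10)) // one RDD every 10 seconds

val lines = ssc.socketTextStream("localhost", 9999) // DStream[String]
lines.print() // output operation, runs once per batch

ssc.start()
ssc.awaitTermination()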

Page 6

Spark + Kafka

● Receiver-based Approach

○ At least once (with Write Ahead Logs)

● Direct API

○ Exactly once
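A minimal sketch of the Direct API with the Spark 1.x spark-streaming-kafka integration (the broker and topic names are hypothetical, and ssc is a StreamingContext as in the earlier sketch):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("events")

// No receivers and no WAL: Spark tracks the Kafka offsets itself,
// which is what enables the exactly-once semantics above.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)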

Page 7

Spark + Kafka

● Receiver-based Approach

Page 8

Spark + Kafka

● Direct API

Page 9

groupByKey VS reduceByKey

● groupByKey

○ Groups pairs of data with the same key.

● reduceByKey

○ Groups and combines pairs of data based on a reduce operation.

Page 10

groupByKey VS reduceByKey

sc.textFile("hdfs://...").flatMap(_.split(" ")).map((_, 1)).groupByKey.map(t => (t._1, t._2.sum))

sc.textFile("hdfs://...").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

Page 11

groupByKey

[Diagram: (key, 1) pairs for the keys c, s, and j spread across four input partitions.]

Page 12

groupByKey

[Diagram: the same four partitions, just before the shuffle starts.]

Page 13

groupByKey

[Diagram: every single (key, 1) pair is sent across the network during the shuffle.]

Page 14

groupByKey

[Diagram: after the shuffle, each key's pairs land on a single partition: five (j, 1), six (s, 1), and four (c, 1).]

Page 15

groupByKey

[Diagram: the grouped values are summed only after the shuffle: (j, 5), (s, 6), (c, 4).]

Page 16

reduceByKey

[Diagram: the same four input partitions of (key, 1) pairs for the keys c, s, and j.]

Page 17

reduceByKey

[Diagram: each partition combines its pairs locally before the shuffle, e.g. (s, 1)(j, 1)(j, 1)(c, 1)(s, 1) becomes (j, 2)(c, 1)(s, 2).]

Page 18

reduceByKey

[Diagram: the same locally combined partitions, shown again.]

Page 19

reduceByKey

[Diagram: only the pre-combined pairs, such as (j, 2)(j, 1)(j, 1)(j, 1), cross the network in the shuffle.]

Page 20

reduceByKey

[Diagram: the final reduce produces (j, 5), (s, 6), (c, 4): the same result as groupByKey, with far less shuffled data.]

Page 21

reduce VS group

● reduceByKey improves performance, since pairs are combined before the shuffle.

● It can't always be used.

● groupByKey can cause Out of Memory exceptions when a key holds too many values.

● Alternatives with map-side combining: aggregateByKey, foldByKey, combineByKey (see the sketch below).
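As an illustration, a minimal sketch of aggregateByKey with made-up data (sc is the SparkContext, as in the word-count example): like reduceByKey it combines on the map side, but the result type may differ from the value type.

// Per-key (sum, count), later turned into an average.
val pairs = sc.parallelize(Seq(("s", 1), ("j", 2), ("s", 3)))

val sumCount = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1), // seqOp: fold one value into the accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)  // combOp: merge two accumulators
)
val avgByKey = sumCount.mapValues { case (sum, count) => sum.toDouble / count }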

Page 22

Table Joins

● Typical operations that can be improved

● They require prior analysis of the data.

● There are no silver bullets

Page 23

Table Joins: Medium - Large

Page 24

Table Joins: Medium - Large

[Diagram: FILTER before the join → no shuffle.]
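A minimal sketch of the idea with made-up tables keyed by id: dropping irrelevant rows before the join means the shuffle moves far less data.

val medium = sc.parallelize(Seq((1, "active"), (2, "inactive"), (3, "active")))
val large  = sc.parallelize(Seq((1, 10.0), (2, 20.0), (3, 30.0)))

// Filter first, join afterwards: only candidate rows get shuffled.
val relevant = medium.filter { case (_, status) => status == "active" }
val joined   = relevant.join(large)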

Page 25

Table Joins: Small - Large

Shuffled Hash Join

sqlContext.sql("explain <select>").collect.mkString("\n")

== Physical Plan ==
Project
+- SortMergeJoin
   :- Sort
   :  +- TungstenExchange hashpartitioning
   :     +- TungstenExchange RoundRobinPartitioning
   :        +- ConvertToUnsafe
   :           +- Scan ExistingRDD
   +- Sort
      +- TungstenExchange hashpartitioning
         +- ConvertToUnsafe
            +- Scan ExistingRDD

Page 26

Table Joins: Small - Large

Broadcast Hash Join

sqlContext.sql("explain <select>").collect.mkString("\n")

== Physical Plan ==
Project
+- BroadcastHashJoin
   :- TungstenExchange RoundRobinPartitioning
   :  +- ConvertToUnsafe
   :     +- Scan ExistingRDD
   +- Scan ParquetRelation

No shuffle!

Used by default from Spark 1.4 when using the DataFrame API.

Prior to Spark 1.4:

ANALYZE TABLE small_table COMPUTE STATISTICS noscan
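From Spark 1.5 onwards the broadcast can also be requested explicitly on the small side. A minimal sketch, assuming two hypothetical DataFrames largeDF and smallDF that share an id column:

import org.apache.spark.sql.functions.broadcast

// smallDF is shipped to every executor, so largeDF is never shuffled.
val joined = largeDF.join(broadcast(smallDF), "id")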

Page 27

Table Joins: Small - Large

Page 28

Serializers

● Java's ObjectOutputStream framework (the default).

● Custom serializers: extend Serializable & Externalizable.

● KryoSerializer: register your custom classes (see the sketch below).

● Where is our code being run?

● Take special care with JodaTime.
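A minimal sketch of the Kryo setup (the registered case classes are hypothetical stand-ins for your own types):

import org.apache.spark.SparkConf

case class MyEvent(id: Long, payload: String)
case class MyResult(id: Long, score: Double)

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Unregistered classes still work, but Kryo then writes the full class
  // name with every object; registering them avoids that overhead.
  .registerKryoClasses(Array(classOf[MyEvent], classOf[MyResult]))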

Page 29

Tuning

● Garbage Collector
● blockInterval
● Partitioning
● Storage

Page 30

Tuning: Garbage Collector

• Matters most for applications that rely heavily on memory.
• GC strategies:
  • Concurrent Mark Sweep (CMS) GC
  • ParallelOld GC
  • Garbage-First (G1) GC
• Tuning steps:
  • Review your logic and object management.
  • Try Garbage-First.
  • Activate and inspect the GC logs.

Reference: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
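As a minimal sketch, Garbage-First plus GC logging can be switched on through the executors' JVM options (the exact flag set here is just an example):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps")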

Page 31

Tuning: blockInterval

blockInterval = (bi * consumers) / (pf * sc)

● CAT: total cores allocated to the application.

● bi: Batch Interval time in milliseconds.

● consumers: number of streaming consumers (receivers).

● pf (partitionFactor): number of partitions per core.

● sc (sparkCores): CAT - consumers.

Page 32

blockInterval: example

● batchIntervalMillis = 600,000

● consumers = 20

● CAT = 120

● sparkCores = 120 - 20 = 100

● partitionFactor = 3

blockInterval = (bi * consumers) / (pf * sc) =

(600,000 * 20) / (3 * 100) =

40,000
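The computed value is then applied through the spark.streaming.blockInterval setting; a minimal sketch:

import org.apache.spark.SparkConf

// 40,000 ms, as computed above (the default is 200ms).
val conf = new SparkConf().set("spark.streaming.blockInterval", "40000ms")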

Page 33

Tuning: Partitioning

partitions = consumers * bi / blockInterval

● consumers: number of streaming consumers.

● bi: Batch Interval time in milliseconds.

● blockInterval: the interval at which received data is chunked into blocks before being stored in Spark.

Page 34

Partitioning: example

● batchIntervalMillis = 600,000

● consumers = 20

● blockInterval = 40,000

partitions = consumers * bi / blockInterval =

20 * 600,000 / 40,000 =

300

Page 35

Tuning: Storage

• Default (MEMORY_ONLY)

• MEMORY_ONLY_SER with Serialization Library

• MEMORY_AND_DISK & DISK_ONLY

• Replicated storage levels (the _2 suffix)

• OFF_HEAP (Tachyon/Alluxio)
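A minimal sketch on a hypothetical RDD named rdd (note that a storage level can be set only once per RDD):

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_ONLY_SER) // serialized: smaller footprint, more CPU
// Alternatives: MEMORY_AND_DISK, DISK_ONLY, MEMORY_ONLY_2 (replicated), OFF_HEAP.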

Page 37

QUESTIONS

Fede Fernández (@fede_fdz)

[email protected]

Fran Pérez (@FPerezP)

[email protected]

Thanks!