Spark Streaming Tips for Devs and Ops by Fran Pérez and Federico Fernández
Spark Streaming Tips for
Devs & Ops
WHO ARE WE?
Fede Fernández: Scala Software Engineer at 47 Degrees, Spark Certified Developer (@fede_fdz)
Fran Pérez: Scala Software Engineer at 47 Degrees, Spark Certified Developer (@FPerezP)
Overview
● Spark Streaming
● Spark + Kafka
● groupByKey vs reduceByKey
● Table Joins
● Serializers
● Tuning
Spark Streaming: real-time data processing.
(Diagram: a continuous data flow is discretized into a sequence of RDDs, a DStream, which is processed to produce the output data.)
Spark + Kafka
● Receiver-based Approach
○ At least once (with Write Ahead Logs)
● Direct API
○ Exactly once
groupByKey vs reduceByKey
● groupByKey
○ Groups pairs of data with the same key.
● reduceByKey
○ Groups and combines pairs of data based on a reduce operation.
groupByKey vs reduceByKey
sc.textFile("hdfs://….").flatMap(_.split(" ")).map((_, 1)).groupByKey.map(t => (t._1, t._2.sum))
sc.textFile("hdfs://….").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
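Both pipelines above compute identical counts; only the shuffle volume differs. A plain-Scala sketch of their semantics (the sample line is made up, no Spark cluster needed):

```scala
// Word count over a sample line, mimicking the two RDD pipelines above.
val words = "spark kafka spark streaming spark".split(" ").toList
val pairs = words.map((_, 1))

// groupByKey style: gather every (word, 1) pair, then sum each group.
val grouped = pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

// reduceByKey style: combine values pairwise with _ + _
// (in Spark this combine also runs map-side, before the shuffle).
val reduced = pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).reduce(_ + _)) }
```

Either way `"spark"` maps to 3; in Spark the difference is how many records cross the network.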
groupByKey
Input partitions:
(c, 1)(s, 1)(j, 1)
(s, 1)(c, 1)(c, 1)(j, 1)
(j, 1)(s, 1)(s, 1)
(s, 1)(j, 1)(j, 1)(c, 1)(s, 1)
After the shuffle, every pair travels to the partition that owns its key, and only then is summed:
(j, 1)(j, 1)(j, 1)(j, 1)(j, 1) → (j, 5)
(s, 1)(s, 1)(s, 1)(s, 1)(s, 1)(s, 1) → (s, 6)
(c, 1)(c, 1)(c, 1)(c, 1) → (c, 4)
Every single pair is shuffled across the network before any combining happens.
reduceByKey
Input partitions and their local (map-side) combine:
(s, 1)(j, 1)(j, 1)(c, 1)(s, 1) → (j, 2)(c, 1)(s, 2)
(j, 1)(s, 1)(s, 1) → (j, 1)(s, 2)
(s, 1)(c, 1)(c, 1)(j, 1) → (j, 1)(c, 2)(s, 1)
(c, 1)(s, 1)(j, 1) → (j, 1)(c, 1)(s, 1)
Only the pre-combined pairs are shuffled, then merged per key:
(j, 2)(j, 1)(j, 1)(j, 1) → (j, 5)
(s, 2)(s, 2)(s, 1)(s, 1) → (s, 6)
(c, 1)(c, 2)(c, 1) → (c, 4)
Far fewer records cross the network than with groupByKey.
reduceByKey vs groupByKey
● reduceByKey improves performance by combining values map-side before the shuffle.
● It can't always be used: the operation must be expressible as a reduce over the values.
● groupByKey can cause Out of Memory exceptions, since all values for a key are collected on one executor.
● Related combinators: aggregateByKey, foldByKey, combineByKey.
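aggregateByKey generalizes reduceByKey by letting the accumulated type differ from the value type, via a per-partition `seqOp` and a cross-partition `combOp`. A plain-Scala sketch of those semantics (the partition contents are invented; here the accumulator is a Set of distinct values per key):

```scala
// Two hypothetical partitions of (key, value) pairs.
val partitions: List[List[(String, Int)]] =
  List(List(("s", 1), ("j", 2), ("j", 3)), List(("s", 4), ("c", 5)))

// seqOp: fold one value into the per-partition accumulator (a Set here).
def seqOp(acc: Set[Int], v: Int): Set[Int] = acc + v
// combOp: merge two per-partition accumulators for the same key.
def combOp(a: Set[Int], b: Set[Int]): Set[Int] = a ++ b

// Map side: aggregate each partition independently (like the map-side combine).
val perPartition: List[Map[String, Set[Int]]] =
  partitions.map(_.foldLeft(Map.empty[String, Set[Int]]) { case (m, (k, v)) =>
    m.updated(k, seqOp(m.getOrElse(k, Set.empty[Int]), v))
  })

// Reduce side: merge the per-partition results with combOp.
val result: Map[String, Set[Int]] =
  perPartition.flatten.groupBy(_._1).map { case (k, kvs) =>
    k -> kvs.map(_._2).reduce(combOp)
  }
```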
Table Joins
● Typical operations that can be improved.
● They require analyzing the data beforehand.
● There are no silver bullets.
Table Joins: Medium - Large
Filter the large table before joining: rows that can never match are dropped early, so they are never shuffled.
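The idea can be sketched with plain Scala collections (table contents are invented): restrict the large side to keys that can actually match before joining.

```scala
// Hypothetical tables: a medium lookup table and a large fact table.
val medium = Map(1 -> "EUR", 2 -> "USD")
val large  = List((1, 10.0), (3, 99.0), (2, 5.0), (3, 1.0), (1, 7.0))

// Filter the large side down to joinable keys first; in Spark this
// shrinks the shuffle that the join would otherwise pay for.
val filtered = large.filter { case (k, _) => medium.contains(k) }
val joined   = filtered.map { case (k, v) => (k, (v, medium(k))) }
```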
Table Joins: Small - Large
Shuffled join: sqlContext.sql("explain <select>").collect.mkString("\n")
== Physical Plan ==
Project
+- SortMergeJoin
   :- Sort
   :  +- TungstenExchange hashpartitioning
   :     +- TungstenExchange RoundRobinPartitioning
   :        +- ConvertToUnsafe
   :           +- Scan ExistingRDD
   +- Sort
      +- TungstenExchange hashpartitioning
         +- ConvertToUnsafe
            +- Scan ExistingRDD
Table Joins: Small - Large
Broadcast Hash Join: sqlContext.sql("explain <select>").collect.mkString("\n")
== Physical Plan ==
Project
+- BroadcastHashJoin
   :- TungstenExchange RoundRobinPartitioning
   :  +- ConvertToUnsafe
   :     +- Scan ExistingRDD
   +- Scan ParquetRelation
No shuffle! The small table is broadcast to every executor.
Chosen by default from Spark 1.4 when using the DataFrame API. Prior to Spark 1.4, the small table needed statistics first:
ANALYZE TABLE small_table COMPUTE STATISTICS noscan
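The mechanics of a broadcast hash join can be sketched in plain Scala (data invented): ship the small table to every task as a hash map and probe it locally, so the large side never moves.

```scala
// Hypothetical small dimension table, broadcast as an in-memory map.
val small: Map[Int, String] = Map(1 -> "ES", 2 -> "US")
// Large fact table, kept partitioned where it already lives.
val large: List[(Int, Double)] = List((1, 3.5), (2, 1.0), (1, 2.2))

// Each task probes its local copy of `small`: no shuffle of `large`.
val joined = large.flatMap { case (k, v) => small.get(k).map(c => (k, v, c)) }
```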
Serializers
● Java's ObjectOutputStream framework (the default).
● Custom serializers: extend Serializable & Externalizable.
● KryoSerializer: register your custom classes.
● Know where your code is actually being run (driver vs executors).
● Take special care with JodaTime.
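Enabling Kryo and requiring registration can be done in spark-defaults.conf; a minimal sketch (the registrator class name `com.example.MyRegistrator` is hypothetical, it would extend KryoRegistrator and register your custom classes):

```
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator           com.example.MyRegistrator
spark.kryo.registrationRequired  true
```

With registrationRequired set, serializing an unregistered class fails fast instead of silently falling back to writing full class names.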
Tuning
● Garbage Collector
● blockInterval
● Partitioning
● Storage
Tuning: Garbage Collector
• Matters for applications which rely heavily on memory consumption.
• GC strategies:
  • Concurrent Mark Sweep (CMS) GC
  • ParallelOld GC
  • Garbage-First (G1) GC
• Tuning steps:
  • Review your logic and object management
  • Try Garbage-First
  • Activate and inspect the GC logs
Reference: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
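For the "try Garbage-First, then inspect the logs" steps, a hedged spark-submit fragment (flag values are illustrative and Java 8 era; tune for your own heap):

```
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps" \
  ...
```

The GC output then lands in the executor stdout logs, where pause times and promotion failures can be inspected.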
Tuning: blockInterval
blockInterval = (bi * consumers) / (pf * sc)
● CAT: total cores assigned to the application.
● bi: batch interval time in milliseconds.
● consumers: number of streaming consumers (receivers).
● pf (partitionFactor): number of partitions per core.
● sc (sparkCores): CAT - consumers.
blockInterval: example
● batchIntervalMillis = 600,000
● consumers = 20
● CAT = 120
● sparkCores = 120 - 20 = 100
● partitionFactor = 3
blockInterval = (bi * consumers) / (pf * sc) =
(600,000 * 20) / (3 * 100) =
40,000
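The arithmetic above, as a runnable sketch:

```scala
// blockInterval = (bi * consumers) / (pf * sc), with the example's values.
val bi = 600000                   // batch interval in ms
val consumers = 20                // streaming receivers
val cat = 120                     // total cores assigned to the application
val sparkCores = cat - consumers  // 100
val partitionFactor = 3           // partitions per core
val blockInterval = (bi.toLong * consumers) / (partitionFactor * sparkCores)
// blockInterval == 40000 ms
```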
Tuning: Partitioning
partitions = consumers * bi / blockInterval
● consumers: number of streaming consumers.
● bi: batch interval time in milliseconds.
● blockInterval: time interval at which received data is chunked into blocks before being stored in Spark.
Partitioning: example
● batchIntervalMillis = 600,000
● consumers = 20
● blockInterval = 40,000
partitions = consumers * bi / blockInterval =
20 * 600,000 / 40,000 =
300
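The same check in code; with these inputs the formula yields 300 partitions per batch:

```scala
// partitions = consumers * bi / blockInterval, with the example's values.
val consumers = 20
val bi = 600000            // batch interval in ms
val blockInterval = 40000  // ms
val partitions = consumers * bi / blockInterval
// 20 * 600,000 / 40,000 = 300
```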
Tuning: Storage
• Default: MEMORY_ONLY
• MEMORY_ONLY_SER with a serialization library
• MEMORY_AND_DISK & DISK_ONLY
• Replicated: add the _2 suffix (e.g. MEMORY_ONLY_2)
• OFF_HEAP (Tachyon/Alluxio)
Where to find more information?
● Spark Official Documentation
● Databricks Blog
● Databricks Spark Knowledge Base
● Spark Notebook, by Andy Petrella
● Databricks YouTube Channel