Spark Streaming Tips for Devs and Ops, by Fran Pérez and Federico Fernández


Spark Streaming Tips for Devs & Ops

WHO ARE WE?

Fede Fernández, Scala Software Engineer at 47 Degrees
Spark Certified Developer, @fede_fdz

Fran Pérez, Scala Software Engineer at 47 Degrees
Spark Certified Developer, @FPerezP

Overview

● Spark Streaming
● Spark + Kafka
● groupByKey vs reduceByKey
● Table Joins
● Serializer
● Tuning

Spark Streaming: real-time data processing

[Diagram: a continuous data flow is split into a sequence of RDDs, which form a DStream; processing the DStream produces the output data]

Spark + Kafka

● Receiver-based Approach

○ At least once (with Write Ahead Logs)

● Direct API

○ Exactly once (see the sketch below)

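A minimal sketch of the Direct API wiring, assuming Spark 1.x with the spark-streaming-kafka (Kafka 0.8) integration; the broker address and the "events" topic are hypothetical:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("direct-kafka-example")
val ssc = new StreamingContext(conf, Seconds(10))

// No receivers: Spark tracks the Kafka offsets itself, which is the basis
// for the exactly-once semantics mentioned above.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // hypothetical broker
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events")) // hypothetical topic

stream.map(_._2).print() // the stream yields (key, value) pairs
ssc.start()
ssc.awaitTermination()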

groupByKey VS reduceByKey

● groupByKey

○ Groups pairs of data with the same key.

● reduceByKey

○ Groups and combines pairs of data based on a reduce operation.

groupByKey VS reduceByKey

sc.textFile("hdfs://….").flatMap(_.split(" ")).map((_, 1)).groupByKey.map(t => (t._1, t._2.sum))

sc.textFile("hdfs://….").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

groupByKey

Input partitions:

(c, 1)(s, 1)(j, 1)
(s, 1)(c, 1)(c, 1)(j, 1)
(j, 1)(s, 1)(s, 1)
(s, 1)(j, 1)(j, 1)(c, 1)(s, 1)

Shuffle: every single pair crosses the network.

After the shuffle, grouped and summed per key:

(j, 1)(j, 1)(j, 1)(j, 1)(j, 1) => (j, 5)
(s, 1)(s, 1)(s, 1)(s, 1)(s, 1)(s, 1) => (s, 6)
(c, 1)(c, 1)(c, 1)(c, 1) => (c, 4)

reduceByKey

Input partitions:

(s, 1)(j, 1)(j, 1)(c, 1)(s, 1)
(j, 1)(s, 1)(s, 1)
(s, 1)(c, 1)(c, 1)(j, 1)
(c, 1)(s, 1)(j, 1)

Map-side combine within each partition, before the shuffle:

(j, 2)(c, 1)(s, 2)
(j, 1)(s, 2)
(j, 1)(c, 2)(s, 1)
(j, 1)(c, 1)(s, 1)

Shuffle: only the combined pairs cross the network.

After the shuffle, reduced per key:

(j, 2)(j, 1)(j, 1)(j, 1) => (j, 5)
(s, 2)(s, 2)(s, 1)(s, 1) => (s, 6)
(c, 1)(c, 2)(c, 1) => (c, 4)

reduce VS group

● reduceByKey improves performance by combining values map-side before the shuffle.

● It can't always be used: some operations need all of a key's values at once.

● groupByKey can cause Out of Memory exceptions when a single key accumulates too many values.

● Alternatives: aggregateByKey, foldByKey, combineByKey (see the sketch below).
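When the result type differs from the value type, aggregateByKey keeps the map-side combine that reduceByKey offers. A minimal sketch computing a per-key average over hypothetical data:

// (sum, count) accumulator: only one small tuple per key crosses the shuffle
val pairs = sc.parallelize(Seq(("s", 1.0), ("j", 3.0), ("s", 5.0)))
val avgByKey = pairs
  .aggregateByKey((0.0, 0))(
    (acc, v) => (acc._1 + v, acc._2 + 1),   // merge a value within a partition
    (a, b)   => (a._1 + b._1, a._2 + b._2)) // merge accumulators across partitions
  .mapValues { case (sum, count) => sum / count }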

Table Joins

● Typical operations that can be improved.

● They require prior analysis of the data.

● There are no silver bullets.

Table Joins: Medium - Large

● Filter the large table down with the medium table's keys before joining.

● The filter itself needs no shuffle, so far less data moves during the join (see the sketch below).
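A sketch of the filter-first idea, with hypothetical pair RDDs keyed by id: collect the medium table's keys, broadcast them, and filter the large table locally before the join.

val medium = sc.parallelize(Seq((1, "a"), (2, "b")))            // hypothetical
val large  = sc.parallelize(1 to 1000000).map(i => (i, i * 2))  // hypothetical

// The filter runs partition-local (no shuffle); only surviving rows are joined
val mediumKeys = sc.broadcast(medium.keys.collect().toSet)
val joined = large
  .filter { case (id, _) => mediumKeys.value.contains(id) }
  .join(medium)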

Table Joins: Small - Large

Shuffled Hash Join

sqlContext.sql("explain <select>").collect.mkString("\n")

== Physical Plan ==
Project
+- SortMergeJoin
   :- Sort
   :  +- TungstenExchange hashpartitioning
   :     +- TungstenExchange RoundRobinPartitioning
   :        +- ConvertToUnsafe
   :           +- Scan ExistingRDD
   +- Sort
      +- TungstenExchange hashpartitioning
         +- ConvertToUnsafe
            +- Scan ExistingRDD

Table Joins: Small - Large

Broadcast Hash Join

sqlContext.sql("explain <select>").collect.mkString("\n")

== Physical Plan ==
Project
+- BroadcastHashJoin
   :- TungstenExchange RoundRobinPartitioning
   :  +- ConvertToUnsafe
   :     +- Scan ExistingRDD
   +- Scan ParquetRelation

No shuffle!

Used by default from Spark 1.4 when using the DataFrame API; a sketch with an explicit hint follows below.

Prior to Spark 1.4:

ANALYZE TABLE small_table COMPUTE STATISTICS noscan

[Diagram: Small - Large join, with the small table broadcast to every node]
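With the DataFrame API the optimizer broadcasts automatically when the small side fits under spark.sql.autoBroadcastJoinThreshold; from Spark 1.5 the hint can also be given explicitly. A sketch assuming hypothetical DataFrames small and large that share an id column:

import org.apache.spark.sql.functions.broadcast

// Forces a Broadcast Hash Join: small is shipped once to each executor,
// so the large side is never shuffled.
val joined = large.join(broadcast(small), "id")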

Serializers

● Java's ObjectOutputStream framework (the default).

● Custom serializers: implement Serializable or Externalizable.

● KryoSerializer: faster; register your custom classes (see the sketch below).

● Where is our code actually being run, on the driver or on the executors?

● Take special care with JodaTime objects.
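A minimal sketch of switching to Kryo and registering an application class (MyEvent is hypothetical):

import org.apache.spark.SparkConf

case class MyEvent(id: Long, payload: String) // hypothetical application class

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyEvent])) // unregistered classes fall back to writing full class names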

Tuning

● Garbage Collector
● blockInterval
● Partitioning
● Storage

Tuning: Garbage Collector

• For applications that rely heavily on memory consumption.
• GC strategies:
  • Concurrent Mark Sweep (CMS) GC
  • ParallelOld GC
  • Garbage-First (G1) GC
• Tuning steps:
  • Review your logic and object management.
  • Try Garbage-First.
  • Activate and inspect the GC logs (see the sketch below).

Reference: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
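A sketch of activating G1 and GC logging on the executors via standard JVM flags:

import org.apache.spark.SparkConf

// Switch executors to G1 and log each collection for later inspection
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps")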

Tuning: blockInterval

blockInterval = (bi * consumers) / (pf * sc)

● CAT: total number of available cores.

● bi: Batch Interval time in milliseconds.

● consumers: number of streaming consumers.

● pf (partitionFactor): number of partitions per core.

● sc (sparkCores): CAT - consumers.

blockInterval: example

● batchIntervalMillis = 600,000

● consumers = 20

● CAT = 120

● sparkCores = 120 - 20 = 100

● partitionFactor = 3

blockInterval = (bi * consumers) / (pf * sc) =

(600,000 * 20) / (3 * 100) =

40,000
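The derived value plugs into the standard spark.streaming.blockInterval setting; a sketch using the example's result:

import org.apache.spark.SparkConf

// 40,000 ms, as derived above
val conf = new SparkConf().set("spark.streaming.blockInterval", "40000ms")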

Tuning: Partitioning

partitions = consumers * bi / blockInterval

● consumers: number of streaming consumers.

● bi: Batch Interval time in milliseconds.

● blockInterval: the time interval at which received data is chunked into blocks before being stored in Spark.

Partitioning: example

● batchIntervalMillis = 600,000

● consumers = 20

● blockInterval = 40,000

partitions = consumers * bi / blockInterval =

20 * 600,000 / 40,000 =

300
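A quick sanity check of the two formulas together, using the example's figures; note that the partition count equals partitionFactor * sparkCores:

val batchIntervalMillis = 600000L
val consumers = 20
val totalCores = 120                    // CAT
val partitionFactor = 3
val sparkCores = totalCores - consumers // 100

val blockInterval = (batchIntervalMillis * consumers) / (partitionFactor * sparkCores)
val partitions = consumers * batchIntervalMillis / blockInterval
// blockInterval = 40000, partitions = 300 (= 3 * 100)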

Tuning: Storage

• Default (MEMORY_ONLY).

• MEMORY_ONLY_SER with a serialization library.

• MEMORY_AND_DISK & DISK_ONLY (see the sketch below).

• Replicated: levels with the _2 suffix.

• OFF_HEAP (Tachyon/Alluxio).
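A sketch of overriding the storage level on a DStream (stream is hypothetical):

import org.apache.spark.storage.StorageLevel

// Keep blocks serialized in memory, spilling to disk instead of recomputing
stream.persist(StorageLevel.MEMORY_AND_DISK_SER)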

QUESTIONS

Fede Fernández, @fede_fdz

fede.f@47deg.com

Fran Pérez, @FPerezP

fran.p@47deg.com

Thanks!