Spark Streaming Tips for Devs and Ops, by Fran Pérez and Federico Fernández


Spark Streaming Tips for Devs & Ops

WHO ARE WE?

Fede Fernández, Scala Software Engineer at 47 Degrees
Spark Certified Developer, @fede_fdz

Fran Pérez, Scala Software Engineer at 47 Degrees
Spark Certified Developer, @FPerezP

Overview

● Spark Streaming
● Spark + Kafka
● groupByKey vs reduceByKey
● Table Joins
● Serializer
● Tuning

Spark Streaming: real-time data processing

[Diagram: a continuous data flow is split into a sequence of RDDs, which form a DStream; processing the DStream produces the output data]

Spark + Kafka

● Receiver-based Approach

○ At least once (with Write Ahead Logs)

● Direct API

○ Exactly once (see the sketch below)

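A minimal sketch of the Direct API wiring, assuming Spark 1.x with the spark-streaming-kafka (Kafka 0.8) integration; the broker address and the "events" topic are hypothetical:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("direct-kafka-example")
val ssc = new StreamingContext(conf, Seconds(10))

// No receivers: Spark tracks the Kafka offsets itself, which is the basis
// for the exactly-once semantics mentioned above.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // hypothetical broker
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events")) // hypothetical topic

stream.map(_._2).print() // the stream yields (key, value) pairs
ssc.start()
ssc.awaitTermination()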

groupByKey VS reduceByKey

● groupByKey

○ Groups pairs of data with the same key.

● reduceByKey

○ Groups and combines pairs of data based on a reduce operation.

groupByKey VS reduceByKey

sc.textFile("hdfs://….").flatMap(_.split(" ")).map((_, 1)).groupByKey.map(t => (t._1, t._2.sum))

sc.textFile("hdfs://….").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

groupByKey

Input partitions:

(c, 1)(s, 1)(j, 1)
(s, 1)(c, 1)(c, 1)(j, 1)
(j, 1)(s, 1)(s, 1)
(s, 1)(j, 1)(j, 1)(c, 1)(s, 1)

Shuffle: every single pair crosses the network.

After the shuffle, grouped and summed per key:

(j, 1)(j, 1)(j, 1)(j, 1)(j, 1) => (j, 5)
(s, 1)(s, 1)(s, 1)(s, 1)(s, 1)(s, 1) => (s, 6)
(c, 1)(c, 1)(c, 1)(c, 1) => (c, 4)

reduceByKey

Input partitions:

(s, 1)(j, 1)(j, 1)(c, 1)(s, 1)
(j, 1)(s, 1)(s, 1)
(s, 1)(c, 1)(c, 1)(j, 1)
(c, 1)(s, 1)(j, 1)

Map-side combine within each partition, before the shuffle:

(j, 2)(c, 1)(s, 2)
(j, 1)(s, 2)
(j, 1)(c, 2)(s, 1)
(j, 1)(c, 1)(s, 1)

Shuffle: only the combined pairs cross the network.

After the shuffle, reduced per key:

(j, 2)(j, 1)(j, 1)(j, 1) => (j, 5)
(s, 2)(s, 2)(s, 1)(s, 1) => (s, 6)
(c, 1)(c, 2)(c, 1) => (c, 4)

reduce VS group

● reduceByKey improves performance by combining values map-side before the shuffle.

● It can't always be used: some operations need all of a key's values at once.

● groupByKey can cause Out of Memory exceptions when a single key accumulates too many values.

● Alternatives: aggregateByKey, foldByKey, combineByKey (see the sketch below).
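When the result type differs from the value type, aggregateByKey keeps the map-side combine that reduceByKey offers. A minimal sketch computing a per-key average over hypothetical data:

// (sum, count) accumulator: only one small tuple per key crosses the shuffle
val pairs = sc.parallelize(Seq(("s", 1.0), ("j", 3.0), ("s", 5.0)))
val avgByKey = pairs
  .aggregateByKey((0.0, 0))(
    (acc, v) => (acc._1 + v, acc._2 + 1),   // merge a value within a partition
    (a, b)   => (a._1 + b._1, a._2 + b._2)) // merge accumulators across partitions
  .mapValues { case (sum, count) => sum / count }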

Table Joins

● Typical operations that can be improved.

● They require prior analysis of the data.

● There are no silver bullets.

Table Joins: Medium - Large

● Filter the large table down with the medium table's keys before joining.

● The filter itself needs no shuffle, so far less data moves during the join (see the sketch below).
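A sketch of the filter-first idea, with hypothetical pair RDDs keyed by id: collect the medium table's keys, broadcast them, and filter the large table locally before the join.

val medium = sc.parallelize(Seq((1, "a"), (2, "b")))            // hypothetical
val large  = sc.parallelize(1 to 1000000).map(i => (i, i * 2))  // hypothetical

// The filter runs partition-local (no shuffle); only surviving rows are joined
val mediumKeys = sc.broadcast(medium.keys.collect().toSet)
val joined = large
  .filter { case (id, _) => mediumKeys.value.contains(id) }
  .join(medium)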

Table Joins: Small - Large

Shuffled Hash Join

sqlContext.sql("explain <select>").collect.mkString("\n")

== Physical Plan ==
Project
+- SortMergeJoin
   :- Sort
   :  +- TungstenExchange hashpartitioning
   :     +- TungstenExchange RoundRobinPartitioning
   :        +- ConvertToUnsafe
   :           +- Scan ExistingRDD
   +- Sort
      +- TungstenExchange hashpartitioning
         +- ConvertToUnsafe
            +- Scan ExistingRDD

Table Joins: Small - Large

Broadcast Hash Join

sqlContext.sql("explain <select>").collect.mkString("\n")

== Physical Plan ==
Project
+- BroadcastHashJoin
   :- TungstenExchange RoundRobinPartitioning
   :  +- ConvertToUnsafe
   :     +- Scan ExistingRDD
   +- Scan ParquetRelation

No shuffle!

Used by default from Spark 1.4 when using the DataFrame API; a sketch with an explicit hint follows below.

Prior to Spark 1.4:

ANALYZE TABLE small_table COMPUTE STATISTICS noscan

[Diagram: Small - Large join, with the small table broadcast to every node]
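With the DataFrame API the optimizer broadcasts automatically when the small side fits under spark.sql.autoBroadcastJoinThreshold; from Spark 1.5 the hint can also be given explicitly. A sketch assuming hypothetical DataFrames small and large that share an id column:

import org.apache.spark.sql.functions.broadcast

// Forces a Broadcast Hash Join: small is shipped once to each executor,
// so the large side is never shuffled.
val joined = large.join(broadcast(small), "id")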

Serializers

● Java's ObjectOutputStream framework (the default).

● Custom serializers: implement Serializable or Externalizable.

● KryoSerializer: faster; register your custom classes (see the sketch below).

● Where is our code actually being run, on the driver or on the executors?

● Take special care with JodaTime objects.
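A minimal sketch of switching to Kryo and registering an application class (MyEvent is hypothetical):

import org.apache.spark.SparkConf

case class MyEvent(id: Long, payload: String) // hypothetical application class

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyEvent])) // unregistered classes fall back to writing full class names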

Tuning

● Garbage Collector
● blockInterval
● Partitioning
● Storage

Tuning: Garbage Collector

• For applications that rely heavily on memory consumption.
• GC strategies:
  • Concurrent Mark Sweep (CMS) GC
  • ParallelOld GC
  • Garbage-First (G1) GC
• Tuning steps:
  • Review your logic and object management.
  • Try Garbage-First.
  • Activate and inspect the GC logs (see the sketch below).

Reference: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
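A sketch of activating G1 and GC logging on the executors via standard JVM flags:

import org.apache.spark.SparkConf

// Switch executors to G1 and log each collection for later inspection
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps")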

Tuning: blockInterval

blockInterval = (bi * consumers) / (pf * sc)

● CAT: total number of available cores.

● bi: Batch Interval time in milliseconds.

● consumers: number of streaming consumers.

● pf (partitionFactor): number of partitions per core.

● sc (sparkCores): CAT - consumers.

blockInterval: example

● batchIntervalMillis = 600,000

● consumers = 20

● CAT = 120

● sparkCores = 120 - 20 = 100

● partitionFactor = 3

blockInterval = (bi * consumers) / (pf * sc) =

(600,000 * 20) / (3 * 100) =

40,000
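The derived value plugs into the standard spark.streaming.blockInterval setting; a sketch using the example's result:

import org.apache.spark.SparkConf

// 40,000 ms, as derived above
val conf = new SparkConf().set("spark.streaming.blockInterval", "40000ms")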

Tuning: Partitioning

partitions = consumers * bi / blockInterval

● consumers: number of streaming consumers.

● bi: Batch Interval time in milliseconds.

● blockInterval: the time interval at which received data is chunked into blocks before being stored in Spark.

Partitioning: example

● batchIntervalMillis = 600,000

● consumers = 20

● blockInterval = 40,000

partitions = consumers * bi / blockInterval =

20 * 600,000 / 40,000 =

300
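A quick sanity check of the two formulas together, using the example's figures; note that the partition count equals partitionFactor * sparkCores:

val batchIntervalMillis = 600000L
val consumers = 20
val totalCores = 120                    // CAT
val partitionFactor = 3
val sparkCores = totalCores - consumers // 100

val blockInterval = (batchIntervalMillis * consumers) / (partitionFactor * sparkCores)
val partitions = consumers * batchIntervalMillis / blockInterval
// blockInterval = 40000, partitions = 300 (= 3 * 100)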

Tuning: Storage

• Default (MEMORY_ONLY).

• MEMORY_ONLY_SER with a serialization library.

• MEMORY_AND_DISK & DISK_ONLY (see the sketch below).

• Replicated: levels with the _2 suffix.

• OFF_HEAP (Tachyon/Alluxio).
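A sketch of overriding the storage level on a DStream (stream is hypothetical):

import org.apache.spark.storage.StorageLevel

// Keep blocks serialized in memory, spilling to disk instead of recomputing
stream.persist(StorageLevel.MEMORY_AND_DISK_SER)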

QUESTIONS

Fede Fernández, @fede_fdz

fede.f@47deg.com

Fran Pérez, @FPerezP

fran.p@47deg.com

Thanks!