Spark Streaming + Kafka Best Practices
Brandon O’Brien | @hakczar | Expedia, Inc
Or: “A Case Study in Operationalizing Spark Streaming”
Context/Disclaimer
Our use case: build a resilient, scalable data pipeline with streaming reference-data lookups, a 24hr stream self-join, and some aggregation. Values accuracy over speed.
Spark Streaming 1.5-1.6, Kafka 0.9
Standalone cluster (not YARN or Mesos)
No Hadoop
Message velocity: thousands/sec. Batch window: 10s
Data sources: Kafka (primary), Redis (joins + ref data) & S3 (ref data)
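A minimal setup sketch for the stack described above (Spark Streaming 1.5/1.6, standalone cluster, 10-second batch window). The app name and master URL are placeholders, not from the original deck:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PipelineApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("kafka-streaming-pipeline")   // placeholder app name
      .setMaster("spark://master-host:7077")    // standalone master URL (placeholder host)
    val ssc = new StreamingContext(sparkConf, Seconds(10))  // 10s batch window, per the use case
    // ... define the Kafka DStream and transformations here ...
    ssc.start()
    ssc.awaitTermination()
  }
}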
Demo: Spark in Action
Game & Scoreboard Architecture
Outline
Spark Streaming & Standalone Cluster Overview
Design Patterns for Performance
Guaranteed Message Processing & Direct Kafka Integration
Operational Monitoring & Alerting
Spark Cluster & App Resilience
Spark Streaming & Standalone Cluster Overview
RDD: partitioned, replicated collection of data objects
Driver: the JVM that creates the Spark program and negotiates for resources. It schedules tasks but does not do the heavy lifting, so it can become a bottleneck.
Executor: slave to the driver; executes tasks on RDD partitions. Functions are serialized and shipped to it.
Lazy execution: transformations & actions (sketched below)
Cluster types: Standalone, YARN, Mesos
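To make the lazy-execution point concrete, here is a tiny sketch (assuming an existing SparkContext sc; the S3 path is illustrative). Transformations only record lineage on the driver; the executors do the work once an action runs:

val lines    = sc.textFile("s3://some-bucket/ref-data/") // source (illustrative path)
val parsed   = lines.map(_.split(","))                   // transformation: lazy, nothing runs yet
val nonEmpty = parsed.filter(_.nonEmpty)                 // transformation: still lazy
val count    = nonEmpty.count()                          // action: triggers the job on the executors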
Spark Streaming & Standalone Cluster Overview
[Diagram: standalone cluster. Each node runs Master, Worker, Executor, and Driver processes, coordinated by a ZooKeeper cluster.]
Outline Spark Streaming & Standalone Cluster Overview
Design Patterns for Performance
Guaranteed Message Processing & Direct Kafka Integration
Operational Monitoring & Alerting
Spark Cluster & App Resilience
Design Patterns for Performance
Delegate all IO/CPU to the executors
Avoid unnecessary shuffles (join, groupBy, repartition)
Externalize streaming joins & reference-data lookups when the ref data set is large or volatile. Options: JVM static hashmap, external cache (e.g. Redis), static LRU cache (amortizes lookups), RocksDB. See the lookup-cache sketch below.
Hygienic function closures
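A sketch of the externalized-lookup pattern under stated assumptions: a Redis (Jedis) connection held as a per-executor singleton, fronted by a small LRU cache to amortize repeated lookups. Object, host, and field names are illustrative, and stream is assumed to be an existing DStream[String]:

import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}
import redis.clients.jedis.Jedis

object RefDataClient {
  // A Scala object is a per-JVM singleton: one connection per executor,
  // and nothing here is serialized into the task closure (hygienic closure)
  lazy val jedis = new Jedis("redis-host", 6379)  // placeholder host/port

  private val MaxEntries = 10000
  // Minimal access-order LRU cache (illustration only; not thread-safe)
  lazy val lru = new JLinkedHashMap[String, String](MaxEntries, 0.75f, true) {
    override def removeEldestEntry(eldest: JMap.Entry[String, String]): Boolean =
      size() > MaxEntries
  }

  def lookup(key: String): String = {
    val cached = lru.get(key)
    if (cached != null) cached
    else {
      val value = jedis.get(key)  // external lookup, runs on the executor
      if (value != null) lru.put(key, value)
      value
    }
  }
}

// Usage: the closure references only the object, so no driver-side state is captured
val enriched = stream.map(record => (record, RefDataClient.lookup(record)))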
We’re done, right?
Just need to QA the data…
70% missing data
Outline Spark Streaming & Standalone Cluster Overview
Design Patterns for Performance
Guaranteed Message Processing & Direct Kafka Integration
Operational Monitoring & Alerting
Spark Cluster & App Resilience
Guaranteed Message Processing & Direct Kafka Integration
Guaranteed message processing = at-least-once processing + idempotence
Kafka Receiver: consumes messages faster than Spark can process them, checkpoints before processing is finished, and uses CPU inefficiently
Direct Kafka Integration: control over checkpointing & transactionality, better distribution of resource consumption, 1:1 mapping of Kafka topic-partitions to Spark RDD-partitions, and Kafka itself serving as the WAL (see the sketch below)
Statelessness, fail-fast
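A hedged sketch of direct Kafka integration for this Spark/Kafka generation (the spark-streaming-kafka artifact for Spark 1.5/1.6). Broker list and topic name are placeholders; ssc is the StreamingContext from earlier:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")  // placeholder brokers
val topics = Set("events")                                                    // placeholder topic

// No receiver: 1:1 Kafka topic-partition to Spark RDD-partition, Kafka itself is the WAL
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  // The exact offsets this batch covers, available per partition
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the batch idempotently ...

  // Persist offsetRanges transactionally with the output, only after processing
  // succeeds: at-least-once delivery + idempotence = guaranteed processing
}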
Outline Spark Streaming & Standalone Cluster Overview
Design Patterns for Performance
Guaranteed Message Processing & Direct Kafka Integration
Operational Monitoring & Alerting
Spark Cluster & App Resilience
Operational Monitoring & Alerting
Driver “heartbeat”
Batch processing time
Message count
Kafka lag (latest offsets vs. last processed)
Driver start events
StatsD + Graphite + Seyren
App metrics endpoint: http://localhost:4040/metrics/json/
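One way to emit the batch-level metrics listed above is Spark’s StreamingListener hook. The sketch below hand-rolls the plain-text StatsD wire format over UDP rather than assuming a particular client library; metric names and the StatsD host are illustrative:

import java.net.{DatagramPacket, DatagramSocket, InetAddress}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class StatsDReporter(host: String, port: Int) extends StreamingListener {
  private val socket = new DatagramSocket()
  private val addr = InetAddress.getByName(host)

  // StatsD gauge: "name:value|g" sent as a UDP datagram
  private def gauge(name: String, value: Long): Unit = {
    val payload = s"$name:$value|g".getBytes("UTF-8")
    socket.send(new DatagramPacket(payload, payload.length, addr, port))
  }

  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    gauge("pipeline.batch.records", batch.batchInfo.numRecords)  // message count
    batch.batchInfo.processingDelay.foreach(ms =>
      gauge("pipeline.batch.processing_ms", ms))                 // batch processing time
  }
}

// Register before ssc.start()
ssc.addStreamingListener(new StatsDReporter("statsd-host", 8125))  // placeholder host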
Data loss fixed
So we’re done, right?
Cluster & app continuously crashing
Outline Spark Streaming & Standalone Cluster Overview
Design Patterns for Performance
Guaranteed Message Processing & Direct Kafka Integration
Operational Monitoring & Alerting
Spark Cluster & App Resilience
Spark Cluster & App Stability
[Chart: Spark slave memory utilization]
Slave memory overhead triggers the OOM killer
Crashes + Kafka Receiver = missing data
Supervised driver: pass “--supervise” to spark-submit
Driver restart logging
Cluster resource overprovisioning
Standby masters for failover
Auto-cleanup of work directories: spark.worker.cleanup.enabled=true
(Configuration sketch below.)
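A sketch of those settings. --supervise is a spark-submit flag (shown as a comment; it requires cluster deploy mode), standby masters are configured on the master processes themselves, and the cleanup keys mirror the conf line in the links section. Values are illustrative:

import org.apache.spark.SparkConf

// Supervised driver (restarts the driver on failure):
//   spark-submit --deploy-mode cluster --supervise ... pipeline.jar
//
// Standby masters via ZooKeeper, set on each master process, e.g.:
//   SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
//     -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181"

val sparkConf = new SparkConf()
  .set("spark.worker.cleanup.enabled", "true")   // auto-clean old work directories
  .set("spark.worker.cleanup.interval", "1800")  // seconds between cleanup runs (Spark's default)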
We’re done, right?
Finally, yes
Party Time
TL;DR
1. Use Direct Kafka Integration + transactionality
2. Cache reference data for speed
3. Avoid shuffles & driver bottlenecks
4. Supervised driver
5. Clean up the worker temp directory
6. Beware of function closures
7. Cluster resource over-provisioning
8. Spark slave memory headroom
9. Monitoring on driver heartbeat & Kafka lag
10. Standby masters
Spark Streaming + Kafka Best Practices
Brandon O’Brien | @hakczar | Expedia, Inc
Thanks!
Links
Operationalizing Spark Streaming: https://techblog.expedia.com/2016/12/29/operationalizing-spark-streaming-part-1/
Direct Kafka Integration: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
App metrics: http://localhost:4040/metrics/json/
MetricsSystem: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
sparkConf.set("spark.worker.cleanup.enabled", "true")