Spark Streaming + Kafka Best Practices
Brandon O’Brien | @hakczar | Expedia, Inc
Or: “A Case Study in Operationalizing Spark Streaming”
Context/Disclaimer
Our use case: build a resilient, scalable data pipeline with streaming reference-data lookups, a 24hr stream self-join, and some aggregation. Values accuracy over speed.
Spark Streaming 1.5-1.6, Kafka 0.9
Standalone cluster (not YARN or Mesos)
No Hadoop
Message velocity: thousands/sec. Batch window: 10s
Data sources: Kafka (primary), Redis (joins + ref data) & S3 (ref data)
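A minimal setup sketch for the stack described above (Spark Streaming 1.5/1.6, standalone cluster, 10-second batch window). The app name and master URL are placeholders, not from the original deck:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PipelineApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("kafka-streaming-pipeline")   // placeholder app name
      .setMaster("spark://master-host:7077")    // standalone master URL (placeholder host)
    val ssc = new StreamingContext(sparkConf, Seconds(10))  // 10s batch window, per the use case
    // ... define the Kafka DStream and transformations here ...
    ssc.start()
    ssc.awaitTermination()
  }
}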
Demo: Spark in Action
Game & Scoreboard Architecture
Outline
Spark Streaming & Standalone Cluster Overview
Design Patterns for Performance
Guaranteed Message Processing & Direct Kafka Integration
Operational Monitoring & Alerting
Spark Cluster & App Resilience
Spark Streaming & Standalone Cluster Overview
RDD: partitioned, replicated collection of data objects
Driver: the JVM that creates the Spark program and negotiates for resources. It schedules tasks but does not do the heavy lifting, so it can become a bottleneck.
Executor: slave to the driver; executes tasks on RDD partitions. Functions are serialized and shipped to it.
Lazy execution: transformations & actions (sketched below)
Cluster types: Standalone, YARN, Mesos
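To make the lazy-execution point concrete, here is a tiny sketch (assuming an existing SparkContext sc; the S3 path is illustrative). Transformations only record lineage on the driver; the executors do the work once an action runs:

val lines    = sc.textFile("s3://some-bucket/ref-data/") // source (illustrative path)
val parsed   = lines.map(_.split(","))                   // transformation: lazy, nothing runs yet
val nonEmpty = parsed.filter(_.nonEmpty)                 // transformation: still lazy
val count    = nonEmpty.count()                          // action: triggers the job on the executors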
Spark Streaming & Standalone Cluster Overview
[Diagram: standalone cluster. Each node runs Master, Worker, Executor, and Driver processes, coordinated by a ZooKeeper cluster.]
Outline Spark Streaming & Standalone Cluster Overview
Design Patterns for Performance
Guaranteed Message Processing & Direct Kafka Integration
Operational Monitoring & Alerting
Spark Cluster & App Resilience
Design Patterns for Performance
Delegate all IO/CPU to the executors
Avoid unnecessary shuffles (join, groupBy, repartition)
Externalize streaming joins & reference-data lookups when the ref data set is large or volatile. Options: JVM static hashmap, external cache (e.g. Redis), static LRU cache (amortizes lookups), RocksDB. See the lookup-cache sketch below.
Hygienic function closures
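A sketch of the externalized-lookup pattern under stated assumptions: a Redis (Jedis) connection held as a per-executor singleton, fronted by a small LRU cache to amortize repeated lookups. Object, host, and field names are illustrative, and stream is assumed to be an existing DStream[String]:

import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}
import redis.clients.jedis.Jedis

object RefDataClient {
  // A Scala object is a per-JVM singleton: one connection per executor,
  // and nothing here is serialized into the task closure (hygienic closure)
  lazy val jedis = new Jedis("redis-host", 6379)  // placeholder host/port

  private val MaxEntries = 10000
  // Minimal access-order LRU cache (illustration only; not thread-safe)
  lazy val lru = new JLinkedHashMap[String, String](MaxEntries, 0.75f, true) {
    override def removeEldestEntry(eldest: JMap.Entry[String, String]): Boolean =
      size() > MaxEntries
  }

  def lookup(key: String): String = {
    val cached = lru.get(key)
    if (cached != null) cached
    else {
      val value = jedis.get(key)  // external lookup, runs on the executor
      if (value != null) lru.put(key, value)
      value
    }
  }
}

// Usage: the closure references only the object, so no driver-side state is captured
val enriched = stream.map(record => (record, RefDataClient.lookup(record)))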
We’re done, right?
Just need to QA the data…
70% missing data
Outline Spark Streaming & Standalone Cluster Overview
Design Patterns for Performance
Guaranteed Message Processing & Direct Kafka Integration
Operational Monitoring & Alerting
Spark Cluster & App Resilience
Guaranteed Message Processing & Direct Kafka Integration
Guaranteed message processing = at-least-once processing + idempotence
Kafka Receiver: consumes messages faster than Spark can process them, checkpoints before processing is finished, and uses CPU inefficiently
Direct Kafka Integration: control over checkpointing & transactionality, better distribution of resource consumption, 1:1 mapping of Kafka topic-partitions to Spark RDD-partitions, and Kafka itself serving as the WAL (see the sketch below)
Statelessness, fail-fast
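A hedged sketch of direct Kafka integration for this Spark/Kafka generation (the spark-streaming-kafka artifact for Spark 1.5/1.6). Broker list and topic name are placeholders; ssc is the StreamingContext from earlier:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")  // placeholder brokers
val topics = Set("events")                                                    // placeholder topic

// No receiver: 1:1 Kafka topic-partition to Spark RDD-partition, Kafka itself is the WAL
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  // The exact offsets this batch covers, available per partition
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the batch idempotently ...

  // Persist offsetRanges transactionally with the output, only after processing
  // succeeds: at-least-once delivery + idempotence = guaranteed processing
}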
Outline Spark Streaming & Standalone Cluster Overview
Design Patterns for Performance
Guaranteed Message Processing & Direct Kafka Integration
Operational Monitoring & Alerting
Spark Cluster & App Resilience
Operational Monitoring & Alerting
Driver “heartbeat”
Batch processing time
Message count
Kafka lag (latest offsets vs. last processed)
Driver start events
StatsD + Graphite + Seyren
App metrics endpoint: http://localhost:4040/metrics/json/
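One way to emit the batch-level metrics listed above is Spark’s StreamingListener hook. The sketch below hand-rolls the plain-text StatsD wire format over UDP rather than assuming a particular client library; metric names and the StatsD host are illustrative:

import java.net.{DatagramPacket, DatagramSocket, InetAddress}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class StatsDReporter(host: String, port: Int) extends StreamingListener {
  private val socket = new DatagramSocket()
  private val addr = InetAddress.getByName(host)

  // StatsD gauge: "name:value|g" sent as a UDP datagram
  private def gauge(name: String, value: Long): Unit = {
    val payload = s"$name:$value|g".getBytes("UTF-8")
    socket.send(new DatagramPacket(payload, payload.length, addr, port))
  }

  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    gauge("pipeline.batch.records", batch.batchInfo.numRecords)  // message count
    batch.batchInfo.processingDelay.foreach(ms =>
      gauge("pipeline.batch.processing_ms", ms))                 // batch processing time
  }
}

// Register before ssc.start()
ssc.addStreamingListener(new StatsDReporter("statsd-host", 8125))  // placeholder host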
Data loss fixed
So we’re done, right?
Cluster & app continuously crashing
Outline Spark Streaming & Standalone Cluster Overview
Design Patterns for Performance
Guaranteed Message Processing & Direct Kafka Integration
Operational Monitoring & Alerting
Spark Cluster & App Resilience
Spark Cluster & App Stability
[Chart: Spark slave memory utilization]
Slave memory overhead triggers the OOM killer
Crashes + Kafka Receiver = missing data
Supervised driver: pass “--supervise” to spark-submit
Driver restart logging
Cluster resource overprovisioning
Standby masters for failover
Auto-cleanup of work directories: spark.worker.cleanup.enabled=true
(Configuration sketch below.)
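A sketch of those settings. --supervise is a spark-submit flag (shown as a comment; it requires cluster deploy mode), standby masters are configured on the master processes themselves, and the cleanup keys mirror the conf line in the links section. Values are illustrative:

import org.apache.spark.SparkConf

// Supervised driver (restarts the driver on failure):
//   spark-submit --deploy-mode cluster --supervise ... pipeline.jar
//
// Standby masters via ZooKeeper, set on each master process, e.g.:
//   SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
//     -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181"

val sparkConf = new SparkConf()
  .set("spark.worker.cleanup.enabled", "true")   // auto-clean old work directories
  .set("spark.worker.cleanup.interval", "1800")  // seconds between cleanup runs (Spark's default)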
We’re done, right?
Finally, yes
Party Time
TL;DR
1. Use Direct Kafka Integration + transactionality
2. Cache reference data for speed
3. Avoid shuffles & driver bottlenecks
4. Supervised driver
5. Clean up the worker temp directory
6. Beware of function closures
7. Cluster resource over-provisioning
8. Spark slave memory headroom
9. Monitoring on driver heartbeat & Kafka lag
10. Standby masters
Spark Streaming + Kafka Best Practices
Brandon O’Brien | @hakczar | Expedia, Inc
Thanks!
Links
Operationalizing Spark Streaming: https://techblog.expedia.com/2016/12/29/operationalizing-spark-streaming-part-1/
Direct Kafka Integration: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
App metrics: http://localhost:4040/metrics/json/
MetricsSystem: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
sparkConf.set("spark.worker.cleanup.enabled", "true")