Spark & Storm: When & Where?
www.mammothdata.com | @mammothdataco
The Leader in Big Data Consulting
● BI/Data Strategy
  ○ Development of a business intelligence / data architecture strategy.
● Installation
  ○ Installation of Hadoop or other relevant technology.
● Data Consolidation
  ○ Load data from diverse sources into a single scalable repository.
● Streaming
  ○ Mammoth will write ingestion and/or analytics code that operates on the data as it arrives, and design dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions.
● Visualization Tools
  ○ Mammoth will set up visualization tools (e.g. Tableau, Pentaho), create initial reports, and provide training to the employees who will analyze the data.
Mammoth Data, based in downtown Durham (right above Toast)
● Lead Consultant on all things DevOps and Spark
● @carsondial on Twitter
Me!
● Quick overview of Spark Streaming
● Reasons why Spark Streaming can be tricky in practice
● Performance and tuning tips we’ve learnt over the past two years
● …and when to pack it all in and use Storm instead
What This Talk Is About
This IS WEB SCALE!
● I kid, Rails!
● (mostly)
Beyond Web Scale
● Spark & Storm - millions of requests / second on commodity hardware
● Different problems at different scales!
Beyond Web Scale
● Directed Acyclic Graph Data Processing Engine
● Based around the Resilient Distributed Dataset (RDD) primitive
Spark
Spark Streaming — Overview
Spark Streaming — In Production?
● Yes!
● (Alibaba, AutoTrader, Cisco, Netflix, etc.)
● Streaming by running batches very quickly!
● Batch length: can be as low as 0.5s / batch
● Every X seconds, get Y records (DStream/RDDs)
Spark Streaming — Overview
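A minimal plain-Scala sketch of the micro-batch idea on this slide, not Spark's actual implementation: timestamped records are bucketed into fixed batch intervals, so each interval (the "X seconds") yields one batch of however many records arrived (the "Y records"). The name `microBatches` and the millisecond timestamps are illustrative assumptions, and timestamps are assumed to be non-negative.

```scala
// Hypothetical helper, not part of Spark: buckets (timestampMs, value) pairs
// into fixed-length batch windows, loosely mimicking how a DStream is a
// series of per-interval batches. Assumes non-negative timestamps.
def microBatches[A](records: Seq[(Long, A)], batchMs: Long): Seq[(Long, Seq[A])] = {
  require(batchMs > 0, "batch interval must be positive")
  records
    .groupBy { case (ts, _) => ts / batchMs }        // batch index for each record
    .toSeq
    .sortBy { case (batchIndex, _) => batchIndex }   // batches in time order
    .map { case (batchIndex, recs) =>
      // batch start time -> values ordered by timestamp
      (batchIndex * batchMs, recs.sortBy(_._1).map(_._2))
    }
}

// A 0.5s batch interval, the lower bound mentioned on the slide
val batches = microBatches(
  Seq((0L, "a"), (499L, "b"), (500L, "c"), (1200L, "d")),
  batchMs = 500L
)
```

Unlike this sketch, which simply skips intervals that received no records, a real streaming job still runs a (possibly empty) batch every interval.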
● Using same implementation (mostly) for batch and stream processing (Lambda Architecture hipster points ahoy!)
● Access to the rest of Spark - DataFrames, MLlib, GraphX, etc.
Spark Streaming — Good Things
● What happens if you can’t process Y records in X seconds?
● What happens if you require sub-second latency?
Spark Streaming — Bad Things!
Spark Streaming — I’m so sorry.
● What happens if you can’t process Y records in X seconds?
● Data builds up in executors
● Executors run out of memory…
Spark Streaming — Bad Things!
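The failure mode above can be put into numbers with a toy model. This is a sketch in plain Scala under simplifying assumptions (constant arrival rate, constant processing capacity, nothing is dropped), not Spark code; `backlogPerBatch` and its parameters are hypothetical names.

```scala
// Hypothetical model, not Spark code: records left waiting in executors after
// each batch when `arrivalsPerBatch` records arrive per interval but only
// `capacityPerBatch` can be processed in that interval.
def backlogPerBatch(batches: Int, arrivalsPerBatch: Long, capacityPerBatch: Long): Seq[Long] =
  Iterator
    .iterate(0L)(backlog => math.max(0L, backlog + arrivalsPerBatch - capacityPerBatch))
    .drop(1)          // skip the initial empty backlog
    .take(batches)
    .toList

// 1200 records arrive per batch, but only 1000 can be processed: backlog grows every batch
val fallingBehind = backlogPerBatch(batches = 4, arrivalsPerBatch = 1200L, capacityPerBatch = 1000L)

// 800 records arrive per batch with the same capacity: backlog stays at zero
val keepingUp = backlogPerBatch(batches = 4, arrivalsPerBatch = 800L, capacityPerBatch = 1000L)
```

In the first case the backlog grows linearly and never recovers, which is why the executors eventually run out of memory.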
● “Hey, we forgot to tell you Ops people that we have a major new client adding stuff into the firehose sometime today. That’s fine, right?”
Spark Streaming — Bad Things!
Spark Streaming — It Will Be Okay
● As a former Ops person:
● WE WILL REMEMBER.
Spark Streaming — Bad Things!
● Do you need low latency?
● If so, a 10-minute nap is advisable!
● Everybody else, let’s dive in…
Spark Streaming — Tuning
Spark Streaming — Down In The Hole
● Easiest method — alter the batch window until it’s all fine!
● Tiny batches provide tight execution times!
Spark Streaming — Down In The Hole
● Use Kafka.
● Data source with the most love (e.g. exactly-once semantics without Write Ahead Logs and receiver-less operation in 1.3+)
● (other sources get the features…eventually)
Spark Streaming — Tuning
● Use Scala.
● CPython = slower in execution
● PyPy is much faster…but…
● New features always come to Scala first.
Spark Streaming — Tuning
● (or Java if you really must)
Spark Streaming — Tuning
● Spark Streaming = data receivers + Spark
● spark.cores.max = x * number of receivers
● For Great Data Locality and Parallelism!
Spark Streaming — Cores
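As a configuration sketch of the rule of thumb above: each receiver occupies a core for as long as the job runs, so the multiplier "x" needs to be greater than 1 to leave cores for the actual processing. The receiver count, the multiplier, and the application name below are illustrative assumptions, not values from the talk. Note that `spark.cores.max` applies to standalone and coarse-grained Mesos deployments.

```scala
import org.apache.spark.SparkConf

// Illustrative values only; neither number comes from the talk.
val numReceivers = 4       // one long-running receiver per input stream
val coresPerReceiver = 3   // the "x" in the rule of thumb; > 1 leaves cores for processing

val conf = new SparkConf()
  .setAppName("streaming-cores-sketch")   // hypothetical application name
  .set("spark.cores.max", (coresPerReceiver * numReceivers).toString)
```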
● Are you using a foreachRDD loop?
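dstream.foreachRDD { rdd =>
  rdd.cache()      // keep this batch in memory while it is reused
  …
  rdd.unpersist()  // free the cached batch once the work is done
}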
Spark Streaming — Caching
● If you are routing to multiple stores or iterating over an RDD multiple times, using cache() is a quick win
● It really shouldn’t work so well…
Spark Streaming — Caching
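Why caching helps can be illustrated with a plain-Scala analogy; this is not Spark code. An uncached RDD is recomputed from its lineage for every action, which is modelled here by a function that rebuilds an iterator on each call, while `cache()` is modelled by materialising the result once into a `Vector`. The name `evaluationCounts` and the doubling step are hypothetical stand-ins for real per-record work.

```scala
// Plain-Scala analogy, not Spark: counts how often the per-record work runs
// when two "actions" (sum and max) recompute the data versus reuse a
// materialised copy. Requires n >= 1.
def evaluationCounts(n: Int): (Int, Int) = {
  var evaluations = 0
  def expensive(x: Int): Int = { evaluations += 1; x * 2 }

  // Rebuilt from scratch on every call, like an uncached RDD
  def recompute(): Iterator[Int] = (1 to n).iterator.map(expensive)

  val uncachedSum = recompute().sum   // first action: n evaluations
  val uncachedMax = recompute().max   // second action: n more evaluations
  val uncachedCost = evaluations

  evaluations = 0
  val cached = recompute().toVector   // "cache": n evaluations, once
  val cachedSum = cached.sum          // reuses the materialised values
  val cachedMax = cached.max          // reuses the materialised values
  (uncachedCost, evaluations)
}

val (uncachedCost, cachedCost) = evaluationCounts(3)
```

With two actions the uncached version does the work twice; every additional store or pass over the batch widens the gap.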
● Hurrah for Spark 1.5!
● spark.streaming.backpressure.enabled = true
● Spark dynamically alters incoming data rates (keeping the data in Kafka rather than in the executors)
● Works for all data sources (for once!)
Spark Streaming — Backpressure
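A minimal sketch of switching the setting on in code, assuming Spark 1.5 or later; the application name is a hypothetical placeholder.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("streaming-backpressure-sketch")          // hypothetical application name
  .set("spark.streaming.backpressure.enabled", "true")  // available from Spark 1.5
```

The same property can also be supplied at submission time with `spark-submit --conf spark.streaming.backpressure.enabled=true`, which avoids hard-coding it in the job.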
● I really need that low-latency response!
Storm
● Directed Acyclic Graph Data Processing Engine
Storm
Spark
“Very Good, Sir”
Storm
“Here you go!”
● Stream of tuples
● Bolts
● Spouts
● Topologies
Storm Concepts
● Unbounded stream of tuples
● Tuples are defined via schema (usual base types plus custom serializers)
Storm — Streams
● Sources of tuples in a topology
● Read from external sources (e.g. Kafka) and emit tuples into the topology
● Can emit multiple streams from a spout!
Storm — Spouts
● Where your processing happens
● Roll your own aggregations / filtering / windowing
● Bolts can feed into other bolts
● Potentially easier to test than Spark Streaming
● Many Bolt connectors for external sources (e.g. Cassandra, Redis, Hive, etc.)
Storm — Bolts
● The DAG of the spouts and bolts
● Built programmatically in code and submitted to the Storm cluster
● Flux - Do It In YAML (and then complain about whitespace)
Storm — Topologies
● Each bolt or spout runs 'tasks' across the cluster
● How parallelism works in Storm
● Set in topology submission
Storm — Tasks
● Where the topology runs
● 1 worker = 1 JVM
● Tasks run as threads on a worker
● Storm distributes tasks evenly across cluster
Storm — Workers
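To make "distributes tasks evenly" concrete, here is a deliberately simplified plain-Scala model that spreads task ids round-robin over workers. It is not Storm's scheduler and ignores details such as available slots; `assignTasks` and the integer ids are hypothetical.

```scala
// Simplified model, not Storm's scheduler: spread task ids round-robin over
// worker JVMs so that no worker holds more than one task above any other.
def assignTasks(numTasks: Int, numWorkers: Int): Map[Int, Seq[Int]] = {
  require(numWorkers > 0, "need at least one worker")
  (0 until numTasks)
    .groupBy(taskId => taskId % numWorkers)   // worker index -> task ids
    .map { case (worker, taskIds) => worker -> taskIds.toList }
}

// Five tasks over two workers
val layout = assignTasks(numTasks = 5, numWorkers = 2)
```

In this model raising the task count for a spout or bolt adds threads across the existing JVMs, while raising the worker count adds JVMs for those threads to be spread over.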
● True Streaming
● Tuples processed as they enter topology - low latency
● Scales far beyond Spark Streaming (currently)
Storm — Good Things
● Battle-tested at Twitter & Yahoo!
● Yahoo! has 300-node clusters and is working to support 1000+ nodes
● Single node clocked at over 1.5m tuples / second at Twitter
Storm — Good Things
● Very DIY (bring your own aggregations, ML, etc)
● Your DAG construction may not be optimal
● Operationally more complex (and Storm WebUI is more primitive)
● Where’s Me REPL?
Storm — Bad Things
Spark or Storm?
● SLA on latency?
Spark or Storm?
● Storm!
● (though just because it’s possible doesn’t mean you’ll get it!)
Spark or Storm?
● Insane data needs (e.g. ~100m records/second?)
Spark or Storm?
● Storm!
● (though, again, it’s not a magic bullet!)
Spark or Storm?
● For almost anything else? Spark.
● High-level vs. Low-level
● Each new version of Spark delivers improvements!
Spark or Storm?
● Other frameworks that show promise:
  ○ Flink
  ○ Apex
  ○ Samza
  ○ Heron (Twitter’s not-public Storm replacement)
Other Listing Magazines Are Available
Questions?