All Things Open - Spark & Storm - Where & When?

Spark & Storm: When & Where?

www.mammothdata.com | @mammothdataco

The Leader in Big Data Consulting

● BI/Data Strategy
  ○ Development of a business intelligence / data architecture strategy.

● Installation
  ○ Installation of Hadoop or relevant technology.

● Data Consolidation
  ○ Load data from diverse sources into a single scalable repository.

● Streaming
  ○ Mammoth will write ingestion and/or analytics that operate on the data as it comes in, and design dashboards, feeds or computer-driven decision-making processes to derive insights and make decisions.

● Visualization Tools
  ○ Mammoth will set up a visualization tool (e.g. Tableau, Pentaho) and will also create initial reports and provide training to the employees who will analyze the data.

Mammoth Data, based in downtown Durham (right above Toast)

http://www.mammothdata.com

https://twitter.com/mammothdataco



● Lead Consultant on all things DevOps and Spark

● @carsondial on Twitter

Me!





● Quick overview of Spark Streaming

● Reasons why Spark Streaming can be tricky in practice

● Performance and tuning tips we’ve learnt over the past two years

● …and when to pack it all in and use Storm instead

What This Talk Is About





This IS WEB SCALE!





● I kid, Rails!

● (mostly)

Beyond Web Scale





● Spark & Storm - millions of requests / second on commodity hardware

● Different problems at different scales!

Beyond Web Scale





● Directed Acyclic Graph Data Processing Engine

● Based around the Resilient Distributed Dataset (RDD) primitive

Spark





Spark Streaming — Overview





Spark Streaming — In Production?

● Yes!

● (Alibaba, AutoTrader, Cisco, Netflix, etc.)





● Streaming by running batches very quickly!

● Batch length: can be as low as 0.5s / batch

● Every X seconds, get Y records (DStream/RDDs)

Spark Streaming — Overview
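A toy way to picture the "every X seconds, get Y records" model is to bucket timestamped records by batch interval. The sketch below is plain Scala, not Spark code; `Record`, `toBatches` and the interval used are names and values assumed purely for illustration. Unlike Spark Streaming, which produces a (possibly empty) batch for every interval, this version simply skips intervals with no records.

```scala
// Toy model of micro-batching: NOT Spark code, only the bucketing idea.
object MicroBatchSketch {
  // A record tagged with its arrival time in milliseconds (hypothetical type).
  final case class Record(arrivalMs: Long, payload: String)

  // Group records into consecutive batches of `batchMs` milliseconds.
  // Returns (batch index, records in that batch), ordered by batch index.
  // Intervals that received no records do not appear in the result.
  def toBatches(records: Seq[Record], batchMs: Long): Seq[(Long, Seq[Record])] = {
    require(batchMs > 0, "batch interval must be positive")
    records
      .groupBy(r => r.arrivalMs / batchMs) // integer division picks the batch
      .toSeq
      .sortBy(_._1)
  }
}
```

With a 500 ms interval, records arriving at 100 ms and 400 ms land in batch 0 and a record arriving at 600 ms lands in batch 1.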





● Using same implementation (mostly) for batch and stream processing (Lambda Architecture hipster points ahoy!)

● Access to the rest of Spark - DataFrames, MLlib, GraphX, etc.

Spark Streaming — Good Things





● What happens if you can’t process Y records in X seconds?

● What happens if you require sub-second latency?

Spark Streaming — Bad Things!





Spark Streaming — I’m so sorry.





● What happens if you can’t process Y records in X seconds?

● Data builds up in executors

● Executors run out of memory…
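The build-up is simple arithmetic. A minimal sketch in plain Scala (not Spark code), assuming a constant arrival rate and a constant processing capacity per batch interval; `backlogPerBatch` and its parameters are hypothetical names chosen for illustration.

```scala
// Toy arithmetic for a stream that falls behind: NOT Spark code.
object BacklogSketch {
  // Number of unprocessed records left over after each batch interval,
  // given how many arrive per interval and how many can be processed per interval.
  def backlogPerBatch(arrivalsPerBatch: Long, capacityPerBatch: Long, batches: Int): Seq[Long] =
    (1 to batches)
      .scanLeft(0L) { (backlog, _) =>
        // Leftover = old backlog + new arrivals - what we managed to process.
        math.max(0L, backlog + arrivalsPerBatch - capacityPerBatch)
      }
      .tail // drop the initial "no backlog yet" state
}
```

With 12,000 records arriving per interval and capacity for 10,000, the backlog grows by 2,000 every interval and never drains; that growing pile is what ends up sitting in executor memory.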






● “Hey, we forgot to tell you Ops people that we have a major new client adding stuff into the firehose sometime today. That’s fine, right?”






Spark Streaming — It Will Be Okay





● As a former Ops person:

● WE WILL REMEMBER.






● Do you need low-latency?

● If so, a 10-minute nap is advisable!

● Everybody else, let’s dive in…

Spark Streaming — Tuning





Spark Streaming — Down In The Hole





● Easiest method — alter the batch window until it’s all fine!

● Tiny batches provide tight execution times!

Spark Streaming — Down In The Hole





● Use Kafka.

● Data source with the most love (e.g. exactly-once semantics without Write Ahead Logs and receiver-less operation in 1.3+)

● (other sources get the features…eventually)
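For reference, a minimal sketch of the receiver-less (direct) Kafka stream. It assumes Spark 1.3 or later with the Kafka 0.8 integration artifact (`spark-streaming-kafka`) on the classpath; the broker address, topic name, application name and batch interval are all placeholders.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafkaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-kafka-sketch") // placeholder name
    val ssc  = new StreamingContext(conf, Seconds(2))            // placeholder batch interval

    // Assumed broker list and topic; replace with your own.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val topics      = Set("events")

    // Direct stream: no receivers, each batch reads a range of Kafka offsets.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Each element is a (key, value) pair; count the values in every batch.
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```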






● Use Scala.

● CPython = slower in execution

● PyPy is much faster…but…

● New features always come to Scala first.






● (or Java if you really must)






● Spark Streaming = data receivers + Spark

● spark.cores.max = x * number of receivers

● For Great Data Locality and Parallelism!

Spark Streaming — Cores
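The sizing rule as a tiny helper, in plain Scala. The multiplier `x` is the slide's own placeholder and picking it is workload-specific; the one firm constraint encoded here is that each receiver permanently occupies a core, so `x` must be greater than 1 to leave anything for processing. `coresMax` and `asSetting` are hypothetical names.

```scala
// Helper for the "spark.cores.max = x * number of receivers" rule of thumb.
object CoresSketch {
  // Total cores to request for `receivers` receivers with multiplier `x`.
  def coresMax(receivers: Int, x: Int): Int = {
    require(receivers > 0, "need at least one receiver")
    require(x > 1, "x must exceed 1: every receiver pins one core for itself")
    x * receivers
  }

  // The same value as a (key, value) pair ready to hand to a Spark configuration.
  def asSetting(receivers: Int, x: Int): (String, String) =
    "spark.cores.max" -> coresMax(receivers, x).toString
}
```

For example, 3 receivers with x = 4 gives `spark.cores.max = 12`: 3 cores held by receivers and 9 left for processing.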





● Are you using a foreachRDD loop?

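// dstream: your input DStream
dstream.foreachRDD { rdd =>
  rdd.cache()      // keep this batch's RDD in memory while it is reused
  …                // work that reads rdd more than once
  rdd.unpersist()  // release the cached blocks when the batch is done
}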

Spark Streaming — Caching





● If routing to multiple stores or iterating over an RDD multiple times, using cache() is a quick win

● It really shouldn’t work so well…

Spark Streaming — Caching





● Hurrah for Spark 1.5!

● spark.streaming.backpressure.enabled = true

● Spark dynamically alters incoming data rates (keeping the data in Kafka rather than in the executors)

● Works for all data sources (for once!)

Spark Streaming — Backpressure
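A configuration sketch for switching backpressure on, assuming Spark 1.5 or later on the classpath. The application name and the two rate ceilings are placeholder values; the ceilings are optional and only there to show that static limits can sit alongside the dynamic rate control.

```scala
import org.apache.spark.SparkConf

object BackpressureConfSketch {
  val conf: SparkConf = new SparkConf()
    .setAppName("backpressure-sketch") // placeholder name
    // Let Spark adjust the ingestion rate based on how batches are performing.
    .set("spark.streaming.backpressure.enabled", "true")
    // Optional hard ceiling for receiver-based sources (records/second per receiver).
    .set("spark.streaming.receiver.maxRate", "10000")
    // Optional hard ceiling for the direct Kafka stream (records/second per partition).
    .set("spark.streaming.kafka.maxRatePerPartition", "1000")
}
```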





● I really need that low-latency response!

Storm





● Directed Acyclic Graph Data Processing Engine

Storm





Spark

“Very Good, Sir”





Storm

“Here you go!”





● Stream of tuples

● Bolts

● Spouts

● Topologies

Storm Concepts





● Unbounded stream of tuples

● Tuples are defined via schema (usual base types plus custom serializers)

Storm — Streams





● Sources of tuples in a topology

● Read tuples from external sources (e.g. Kafka) and emit them into the topology

● Can emit multiple streams from a spout!

Storm — Spouts





● Where your processing happens

● Roll your own aggregations / filtering / windowing

● Bolts can feed into other bolts

● Potentially easier to test than Spark Streaming

● Many Bolt connectors for external sources (e.g. Cassandra, Redis, Hive, etc)

Storm — Bolts





● The DAG of the spouts and bolts

● Built programmatically in code and submitted to the Storm cluster

● Flux - Do It In YAML (and then complain about whitespace)

Storm — Topologies





● Each bolt or spout runs 'tasks' across the cluster

● How parallelism works in Storm

● Set in topology submission

Storm — Tasks





● Where the topology runs

● 1 worker = 1 JVM

● Tasks run as threads on a worker

● Storm distributes tasks evenly across cluster

Storm — Workers
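Pulling the last few slides together (spouts, bolts, topologies, tasks, workers), here is a minimal sketch of building and submitting a topology from Scala. It assumes Storm 1.0 or later on the classpath, where the packages are `org.apache.storm` (earlier releases used `backtype.storm`). `UpperCaseBolt`, the component ids, the topology name and all the parallelism numbers are hypothetical; `TestWordSpout` is a sample spout that ships with Storm and emits a single field named "word".

```scala
import org.apache.storm.{Config, StormSubmitter}
import org.apache.storm.testing.TestWordSpout
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.tuple.{Fields, Tuple, Values}

// Hypothetical bolt: upper-cases the word it receives and emits it again.
class UpperCaseBolt extends BaseBasicBolt {
  override def execute(input: Tuple, collector: BasicOutputCollector): Unit =
    collector.emit(new Values(input.getString(0).toUpperCase))

  // Schema of the tuples this bolt emits: one field called "word".
  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}

object TopologySketch {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder

    // Spout with a parallelism hint of 2.
    builder.setSpout("words", new TestWordSpout, Int.box(2))

    // Bolt with a parallelism hint of 4, fed by the spout with random distribution.
    builder.setBolt("upper", new UpperCaseBolt, Int.box(4)).shuffleGrouping("words")

    val conf = new Config
    conf.setNumWorkers(2) // two worker JVMs for the whole topology

    // Parallelism is fixed here, at submission time.
    StormSubmitter.submitTopology("upper-case-sketch", conf, builder.createTopology())
  }
}
```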





● True Streaming

● Tuples processed as they enter topology - low latency

● Scales far beyond Spark Streaming (currently)

Storm — Good Things





● Battle-tested at Twitter & Yahoo!

● Yahoo! has 300-node clusters and is working to support 1000+ nodes

● Single node clocked at over 1.5m tuples / second at Twitter

Storm — Good Things





● Very DIY (bring your own aggregations, ML, etc)

● Your DAG construction may not be optimal

● Operationally more complex (and Storm WebUI is more primitive)

● Where’s Me REPL?

Storm — Bad Things





Spark or Storm?





● SLA on latency?

Spark or Storm?





● Storm!

● (though just because it’s possible doesn’t mean you’ll get it!)

Spark or Storm?





● Insane data needs (e.g. ~100m records/second?)

Spark or Storm?





● Storm!

● (though, again, it’s not a magic bullet!)

Spark or Storm?





● For almost anything else? Spark.

● High-level vs. Low-level

● Each new version of Spark delivers improvements!

Spark or Storm?
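The rule of thumb from these slides, written down as a small function. This is a sketch of the talk's heuristic only, not a benchmark-backed rule: `choose` and its parameters are hypothetical names, and the 100 million records/second threshold is simply the example figure from the slide.

```scala
// The deck's "Spark or Storm?" heuristic as code.
object EngineChoiceSketch {
  sealed trait Engine
  case object Spark extends Engine
  case object Storm extends Engine

  // Example threshold taken from the slide ("~100m records/second").
  val InsaneRecordsPerSecond: Long = 100000000L

  // Storm if there is a latency SLA or the volume is extreme; Spark otherwise.
  def choose(hasLatencySla: Boolean, recordsPerSecond: Long): Engine =
    if (hasLatencySla || recordsPerSecond >= InsaneRecordsPerSecond) Storm
    else Spark
}
```

So `choose(hasLatencySla = false, recordsPerSecond = 1000000L)` picks Spark, while either a latency SLA or a firehose at the slide's scale tips it to Storm.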





● Other frameworks that show promise:
  ○ Flink
  ○ Apex
  ○ Samza
  ○ Heron (Twitter’s not-public Storm replacement)

Other Listing Magazines Are Available





Questions?