Bigdata roundtable-storm


description

Andre Sprenger's presentation on the Twitter Storm framework at the first BigData Roundtable in Hamburg

Transcript of Bigdata roundtable-storm

Page 1: Bigdata roundtable-storm

Storm - pipes and filters on steroids

Andre Sprenger

BigData Roundtable

Hamburg 30. Nov 2011

Page 2

My background

• [email protected]

• Studied Computer Science and Economics

• Background: banking, ecommerce, online advertising

• Freelancer

• Java, Scala, Ruby, Rails

• Hadoop, Pig, Hive, Cassandra

Page 3

“Next click” problem

Raymie Stata (CTO, Yahoo):

“With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes. … [I]t will never be true real-time. It will never be what we call “next click,” where I click and by the time the page loads, the semantic implication of my decision is reflected in the page.”

Page 4

“Next click” problem

[Diagram: a web server handles an HTTP request/response; a real-time layer collects and processes the data with a maximum latency of 80 ms, so the real-time (or near-real-time) response is already reflected when the next HTTP request arrives.]

Page 5

Example problems

• Realtime statistics - counting, trends, moving averages

• Read Twitter stream and output images that are trending in the last 10 minutes

• CTR calculation - read ad clicks/ad impressions and calculate the new click-through rate

• ETL - transform format, filter duplicates / bot traffic, enrich from static data, persist

• Search advertising
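The "realtime statistics" case above can be sketched in a few lines of plain Python (an illustrative stand-in, not Storm code - the function name and sample data are hypothetical):

```python
from collections import deque

def moving_average(stream, window=3):
    """Yield the average of the last `window` values seen so far."""
    buf = deque(maxlen=window)  # old values fall out automatically
    for value in stream:
        buf.append(value)
        yield sum(buf) / len(buf)

# e.g. click counts arriving per time slice
clicks = [4, 8, 6, 2]
averages = list(moving_average(clicks))
print(averages)  # [4.0, 6.0, 6.0, 5.33...]
```

In a streaming system the same logic would live inside a processing node and emit one updated average per incoming value, instead of materializing a list.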

Page 6

Pick your framework...

• S4 - Yahoo, “real time map reduce”, actor model

• Storm - Twitter

• MapReduce Online - Yahoo

• Cloud Map Reduce - Accenture

• HStreaming - Startup, based on Hadoop

• Brisk - DataStax, Cassandra

Page 7

System requirements

• Fault tolerance - the system keeps running when a node fails

• Horizontal scalability - should be easy, just add a node

• Low latency

• Reliable - does not lose data

• High availability - well, if it’s down for an hour, it’s not realtime

Page 8

Storm in a nutshell

• Written by BackType (acquired by Twitter)

• Open Source, Github

• Runs on JVM

• Clojure, Python, Zookeeper, ZeroMQ

• Currently used by Twitter for real time statistics

Page 9

Programming model

• Tuple - named list of values

• Stream - unbounded sequence of Tuples

• Spout - source of Streams

• Bolt - consumer / producer of Streams

• Topology - network of Streams, Spouts and Bolts
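The five concepts above can be sketched in plain Python (illustrative stand-ins only, not Storm's actual API - Storm topologies are written against its Java/Clojure API, and all names below are hypothetical):

```python
def sentence_spout():
    """Spout: a source of a stream (an unbounded sequence of tuples)."""
    for sentence in ["a b", "c a"]:
        yield {"sentence": sentence}  # Tuple: named list of values

def split_bolt(stream):
    """Bolt: consumes one stream and produces another."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word}

# Topology: the wiring of spouts and bolts into a network of streams
words = [t["word"] for t in split_bolt(sentence_spout())]
print(words)  # ['a', 'b', 'c', 'a']
```

In real Storm the wiring is declared up front and the framework runs spouts and bolts as distributed, parallel tasks; the generator chaining here only mimics the dataflow.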

Page 10

Spout

[Diagram: a Spout emitting streams of tuples]

Page 11

Bolt

[Diagram: a Bolt consuming streams of tuples and emitting a new stream]

Processes streams and generates new streams.

Page 12

Bolt

• filtering

• transformation

• split / aggregate streams

• counting, statistics

• read from / write to database
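One of the bolt duties listed above - filtering duplicates - can be sketched in plain Python (a hypothetical stand-in, not Storm's bolt API):

```python
def dedup_bolt(stream, key="id"):
    """Filtering bolt sketch: drop tuples whose key value was seen before."""
    seen = set()
    for tup in stream:
        if tup[key] not in seen:
            seen.add(tup[key])
            yield tup  # pass the tuple through to the next bolt

events = [{"id": 1}, {"id": 2}, {"id": 1}]
unique = list(dedup_bolt(events))
print(unique)  # [{'id': 1}, {'id': 2}]
```

On an unbounded stream the `seen` set would grow forever, so a real deduplicating bolt would bound it (e.g. time-windowed or approximate membership).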

Page 13

Topology

Network of Streams, Spouts and Bolts

[Diagram: two Spouts feeding a network of five Bolts]

Page 14

Task

Parallel processors inside Spouts and Bolts.

Each Spout / Bolt has a fixed number of Tasks.

[Diagram: Tasks inside a Spout sending tuples to Tasks inside a Bolt]

Page 15

Stream grouping

Which Task does a Tuple go to?

• shuffle grouping - distribute randomly

• field grouping - partition by field value

• all grouping - send to all Tasks

• custom grouping - implement your own logic
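The two most common groupings above can be sketched as routing functions in plain Python (hypothetical helpers for illustration, not Storm's implementation):

```python
import random

def shuffle_grouping(num_tasks):
    """Shuffle grouping: send the tuple to a randomly chosen task."""
    return random.randrange(num_tasks)

def fields_grouping(tup, field, num_tasks):
    """Field grouping: hash one field's value, so all tuples with the
    same value land on the same task (needed e.g. for per-word counts)."""
    return hash(tup[field]) % num_tasks
```

The invariant that matters for field grouping is stability: the same value always routes to the same task, so per-key state (like a counter) stays on one task. Python's string hash is randomized per process, so the concrete task index differs across runs, but the invariant holds within a run.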

Page 16

Word count example

Spout → SentenceSplitterBolt → WordCountBolt

The Spout emits (“a b c a b d”), the SentenceSplitterBolt emits (“a”) (“b”) (“c”) (“a”) (“b”) (“d”), and the WordCountBolt emits (“a”, 2) (“b”, 2) (“c”, 1) (“d”, 1).
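The same two processing steps can be sketched in plain Python (illustrative functions, not Storm bolts):

```python
from collections import Counter

def split_sentences(sentences):
    """SentenceSplitter step: one word tuple per word."""
    for sentence in sentences:
        for word in sentence.split():
            yield word

def count_words(words):
    """WordCount step: running per-word counts."""
    counts = Counter()
    for word in words:
        counts[word] += 1
    return counts

counts = count_words(split_sentences(["a b c a b d"]))
print(dict(counts))  # {'a': 2, 'b': 2, 'c': 1, 'd': 1}
```

In the Storm version the counting state is partitioned across WordCountBolt tasks by a field grouping on the word, so each task holds the counts for its share of the words.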

Page 17

Guaranteed processing

[Diagram: the tuple tree for the word count example - the spout tuple (“a b c a b d”) is the root; the split tuples (“a”) (“b”) (“c”) (“a”) (“b”) (“d”) and the count tuples (“a”, 2) (“b”, 2) (“c”, 1) (“d”, 1) are its descendants]

Topology has a timeout for processing of the tuple tree
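Storm tracks whether a tuple tree is fully processed with an XOR trick (described in the Storm wiki): the acker keeps one running value per spout tuple and XORs in each tuple id twice - once when the tuple is anchored into the tree, once when it is acked. The value returns to zero exactly when every tuple has been acked. A minimal sketch:

```python
import random

ack_val = 0
tree = [random.getrandbits(64) for _ in range(3)]  # ids of tuples in the tree

for tid in tree:      # each tuple is anchored (created)...
    ack_val ^= tid
for tid in tree:      # ...and later acked by the bolt that processed it
    ack_val ^= tid

fully_processed = (ack_val == 0)
print(fully_processed)  # True
```

This is why the acker needs only constant memory per spout tuple, regardless of how large the tree grows; if the value has not reached zero when the topology's timeout expires, the spout tuple is replayed.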

Page 18

Runtime view

Page 19

Reliability

• Nimbus / Supervisor are SPOFs (single points of failure)

• both are stateless, easy to restart without data loss

• Failure of master node (?)

• Running Topologies should not be affected!

• Failed Workers are restarted

• Guaranteed message processing

Page 20

Administration

• Nimbus / Supervisor / Zookeeper need monitoring and supervision (e.g. Monit)

• Cluster nodes can be added at runtime

• But: existing Topologies are not rebalanced (there is a ticket)

• Administration web GUI

Page 21

Community

• Source is on Github - https://github.com/nathanmarz/storm.git

• Wiki - https://github.com/nathanmarz/storm/wiki

• Nice documentation

• Google Group

• People start to build add-ons: JRuby integration, adapters for JMS, AMQP

Page 22

Storm summary

• Nice programming model

• Easy to deploy new topologies

• Horizontal scalability

• Low latency

• Fault tolerance

• Easy to set up on EC2

Page 23

Questions?