Bigdata roundtable-storm
Storm - pipes and filters on steroids
Andre Sprenger
BigData Roundtable
Hamburg 30. Nov 2011
My background
• [email protected]
• Studied Computer Science and Economics
• Background: banking, ecommerce, online advertising
• Freelancer
• Java, Scala, Ruby, Rails
• Hadoop, Pig, Hive, Cassandra
“Next click” problem
Raymie Stata (CTO, Yahoo):
“With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes. … [I]t will never be true real-time. It will never be what we call “next click,” where I click and by the time the page loads, the semantic implication of my decision is reflected in the page.”
“Next click” problem
[Diagram: timeline of a web server and a real time layer. Labels: collect data, process data, HTTP Request / HTTP Response (max latency 80 ms), (next) HTTP Request / HTTP Response (max latency 80 ms), realtime response, near realtime response, time.]
Example problems
• Realtime statistics - counting, trends, moving average
• Read Twitter stream and output images that are trending in the last 10 minutes
• CTR calculation - read ad clicks/ad impressions and calculate new click through rate
• ETL - transform format, filter duplicates / bot traffic, enrich from static data, persist
• Search advertising
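The "trending in the last 10 minutes" problem above boils down to counting over a sliding time window. The following is a minimal sketch in plain Python, not Storm code; the class name `SlidingWindowCounter`, the image names and the timestamps are made up for illustration, and it assumes events arrive in timestamp order.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Counts items seen within the last `window_seconds` seconds.

    Hypothetical helper for illustration; assumes timestamps never decrease.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()   # (timestamp, item), oldest first
        self.counts = Counter()

    def add(self, item, timestamp):
        self._expire(timestamp)
        self.events.append((timestamp, item))
        self.counts[item] += 1

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_item = self.events.popleft()
            self.counts[old_item] -= 1
            if self.counts[old_item] == 0:
                del self.counts[old_item]

    def top(self, n, now):
        self._expire(now)
        return self.counts.most_common(n)


window = SlidingWindowCounter(window_seconds=600)  # 10 minutes
window.add("cat.jpg", timestamp=0)
window.add("dog.jpg", timestamp=100)
window.add("cat.jpg", timestamp=200)
window.add("dog.jpg", timestamp=650)   # the cat.jpg event from t=0 expires here
trending = window.top(1, now=650)
```

Keeping the raw events in a queue costs memory proportional to the window size; a real deployment would more likely count in coarse time buckets.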
Pick your framework...
• S4 - Yahoo, “real time map reduce”, actor model
• Storm - Twitter
• MapReduce Online - Yahoo
• Cloud Map Reduce - Accenture
• HStreaming - Startup, based on Hadoop
• Brisk - DataStax, Cassandra
System requirements
• Fault tolerance - system keeps running when a node fails
• Horizontal scalability - should be easy, just add a node
• Low latency
• Reliable - does not lose data
• High availability - well, if it’s down for an hour it’s not realtime
Storm in a nutshell
• Written by BackType (acquired by Twitter)
• Open Source, Github
• Runs on JVM
• Clojure, Python, Zookeeper, ZeroMQ
• Currently used by Twitter for real time statistics
Programming model
• Tuple - name/value list
• Stream - unbounded sequence of Tuples
• Spout - source of Streams
• Bolt - consumer / producer of Streams
• Topology - network of Streams, Spouts and Bolts
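To make the vocabulary above concrete, here is a rough analogy using plain Python generators. This is not the Storm API: the function names (`number_spout`, `even_filter_bolt`, `square_bolt`) are invented for this sketch, and everything runs in a single process with no distribution or fault tolerance.

```python
from itertools import count, islice


def number_spout():
    """Spout: source of an unbounded stream of tuples."""
    for n in count(1):
        yield {"number": n}   # a tuple is a name/value list


def even_filter_bolt(stream):
    """Bolt: consumes a stream and emits only tuples with even numbers."""
    for tup in stream:
        if tup["number"] % 2 == 0:
            yield tup


def square_bolt(stream):
    """Bolt: consumes a stream and emits a transformed stream."""
    for tup in stream:
        yield {"number": tup["number"], "square": tup["number"] ** 2}


# Topology: spout -> filter bolt -> transform bolt
topology = square_bolt(even_filter_bolt(number_spout()))

# The stream is unbounded, so take only the first three results.
first_three = list(islice(topology, 3))
```

The generators are lazy, which mirrors the idea that a stream never "finishes": each bolt handles one tuple at a time as it arrives.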
Spout
[Diagram: a Spout emitting streams of tuples]
Bolt
[Diagram: a Bolt with streams of tuples flowing in and out]
Processes streams and generates new streams.
• filtering
• transformation
• split / aggregate streams
• counting, statistics
• read from / write to database
Topology
Network of Streams, Spouts and Bolts
[Diagram: two Spouts feeding a network of five Bolts]
Task
Parallel processor inside Spouts and Bolts.
Each Spout / Bolt has a fixed number of Tasks.
[Diagram: a Spout and a Bolt, each containing several Tasks]
Stream grouping
Which Task does a Tuple go to?
• shuffle grouping - distribute randomly
• fields grouping - partition by field value
• all grouping - send to all Tasks
• custom grouping - implement your own logic
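The grouping strategies above can be thought of as functions that map a tuple to one or more task indices. The sketch below is plain Python rather than Storm's API; the function names, the tuple layout and the choice of CRC32 as the hash are assumptions made for illustration.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Distribute randomly: pick any task."""
    return [rng.randrange(num_tasks)]


def partition_by_field(tup, num_tasks, field):
    """Partition by field value: equal values always reach the same task."""
    key = str(tup[field]).encode("utf-8")
    # CRC32 is stable across processes, unlike Python's built-in hash() for str.
    return [zlib.crc32(key) % num_tasks]


def all_grouping(tup, num_tasks):
    """Send to all tasks."""
    return list(range(num_tasks))


def first_letter_grouping(tup, num_tasks):
    """Custom logic (hypothetical): route by the first letter of the word."""
    return [ord(tup["word"][0]) % num_tasks]


tasks_a = partition_by_field({"word": "storm", "count": 1}, num_tasks=4, field="word")
tasks_b = partition_by_field({"word": "storm", "count": 7}, num_tasks=4, field="word")
same_task = tasks_a == tasks_b          # same word, same task
broadcast = all_grouping({"word": "storm"}, num_tasks=4)
```

Partitioning by field value is what makes per-key state such as a word counter possible, since every tuple for a given key lands on the same task.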
Word count example
Spout -> SentenceSplitterBolt -> WordCountBolt
Spout emits: (“a b c a b d”)
SentenceSplitterBolt emits: (“a”) (“b”) (“c”) (“a”) (“b”) (“d”)
WordCountBolt emits: (“a”, 2) (“b”, 2) (“c”, 1) (“d”, 1)
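The word count flow can be traced with a small single-process sketch in plain Python. It is not written against the Storm API: the function names are invented here, and the counting bolt is assumed to emit a running count for every incoming word, so the totals shown on the slide are the last values emitted per word.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: emits one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def sentence_splitter_bolt(stream):
    """Bolt: splits each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def word_count_bolt(stream):
    """Bolt: keeps a running count per word and emits (word, count)."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


emitted = list(
    word_count_bolt(sentence_splitter_bolt(sentence_spout(["a b c a b d"])))
)
final_counts = dict(emitted)   # the last emitted count per word wins
```

With several counting tasks in parallel, the words would need to be partitioned by value so that each word is always counted by the same task.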
Guaranteed processing
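[Diagram: tuple tree. The Spout tuple (“a b c a b d”) branches into the word tuples (“a”), (“b”), (“c”), (“a”), (“b”), (“d”), which lead to the counts (“a”, 2), (“b”, 2), (“c”, 1), (“d”, 1).]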
Topology has a timeout for processing of the tuple tree
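One way to picture the timeout on a tuple tree: the root tuple counts as processed only once every tuple derived from it has been acknowledged, and a tree that is still open when the deadline passes is reported as failed so the source can replay it. The sketch below is a deliberately simplified illustration in plain Python and does not reproduce Storm's internal bookkeeping; `TupleTreeTracker`, its methods and the ids are hypothetical.

```python
class TupleTreeTracker:
    """Tracks open tuples per root tuple (hypothetical, simplified)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.pending = {}   # root_id -> {"deadline": ..., "open": set of ids}

    def emit_root(self, root_id, now):
        self.pending[root_id] = {
            "deadline": now + self.timeout_seconds,
            "open": {root_id},
        }

    def anchor(self, root_id, child_id):
        # A new tuple derived from the root joins the tree.
        self.pending[root_id]["open"].add(child_id)

    def ack(self, root_id, tuple_id):
        """Mark one tuple as done; return True if the whole tree is done."""
        tree = self.pending.get(root_id)
        if tree is None:
            return False
        tree["open"].discard(tuple_id)
        if not tree["open"]:
            del self.pending[root_id]
            return True
        return False

    def expire(self, now):
        """Return root ids whose tree timed out; these should be replayed."""
        failed = [r for r, t in self.pending.items() if now >= t["deadline"]]
        for root_id in failed:
            del self.pending[root_id]
        return failed


tracker = TupleTreeTracker(timeout_seconds=30)

# Tree 1: every tuple is acked in time.
tracker.emit_root("s1", now=0)
tracker.anchor("s1", "s1-w1")
tracker.anchor("s1", "s1-w2")
tracker.ack("s1", "s1")
tracker.ack("s1", "s1-w1")
tree1_done = tracker.ack("s1", "s1-w2")

# Tree 2: one word tuple is never acked, so the tree times out.
tracker.emit_root("s2", now=0)
tracker.anchor("s2", "s2-w1")
tracker.ack("s2", "s2")
timed_out = tracker.expire(now=31)
```

Children have to be anchored before their parent is acked, otherwise the tree would look complete too early.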
Runtime view
Reliability
• Nimbus / Supervisor are SPOF
• both are stateless, easy to restart without data loss
• Failure of master node (?)
• Running Topologies should not be affected!
• Failed Workers are restarted
• Guaranteed message processing
Administration
• Nimbus / Supervisor / Zookeeper need monitoring and supervision (e.g. Monit)
• Cluster nodes can be added at runtime
• But: existing Topologies are not rebalanced (there is a ticket)
• Administration web GUI
Community
• Source is on Github - https://github.com/nathanmarz/storm.git
• Wiki - https://github.com/nathanmarz/storm/wiki
• Nice documentation
• Google Group
• People start to build add-ons: JRuby integration, adapters for JMS, AMQP
Storm summary
• Nice programming model
• Easy to deploy new topologies
• Horizontal scalability
• Low latency
• Fault tolerance
• Easy to set up on EC2
Questions?