Introduction to Storm
Chandler@PyHug
previa [at] gmail.com
Storm
Distributed and fault-tolerant realtime computation system
Outline
• Background
• Why Storm
• Components
• Topology
• Storm & DRPC
• Multilang Protocol
• Experience
Background
• Created by Nathan Marz @ BackType/Twitter
  – Analyzes tweets, links, and users on Twitter
• Open-sourced in Sep 2011
  – Eclipse Public License 1.0
  – Storm 0.5.2
  – 16k Java and 7k Clojure LOC
  – Current stable release 0.8.2
    • 0.9.0 brings major core improvements
• Active user group
  – https://groups.google.com/group/storm-user
  – https://github.com/nathanmarz/storm
  – Most-watched Java repo on GitHub (>4k watchers)
  – Used by over 30 companies
    • Twitter, Groupon, Alibaba, GumGum, …
Why Storm?
Before Storm
Problems
• Scaling is painful
• Poor fault-tolerance
  – Hadoop is stateful
• Coding is tedious
• Batch processing
  – Long latency
  – No realtime
Storm
• Scalable and robust
  – No persistent layer
• Guarantees no data loss
• Fault-tolerant
• Programming language agnostic
• Use cases
  – Stream processing
  – Distributed RPC
  – Continuous computation
Components
Based on
• Apache ZooKeeper
  – Distributed coordination service, used to store metadata
• ØMQ
  – Asynchronous message transport layer
• Apache Thrift
  – Cross-language bridge, RPC
• LMAX Disruptor
  – High-performance queue shared by threads
• Kryo
  – Serialization framework
System architecture
• Nimbus
  – Like the JobTracker in Hadoop
• Supervisor
  – Manages workers
• ZooKeeper
  – Stores metadata
• UI
  – Web UI
Topology
• Tuples
  – Ordered list of elements
  – (“user”, “link”, “event”, “10/3/12 17:50”)
• Streams
  – Unbounded sequence of tuples
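The two definitions above can be pictured with a small Python sketch: a tuple is an ordered list of values and a stream is a generator that never ends. The function name click_stream and the generated field values are made up for illustration; this is not Storm code.

```python
import itertools
from typing import Iterator, Tuple


def click_stream() -> Iterator[Tuple[str, str, str, str]]:
    """Hypothetical unbounded stream: yields one tuple after another, forever."""
    for n in itertools.count():
        # Each tuple is an ordered list of elements: (user, link, event, timestamp).
        yield (f"user{n % 3}", f"http://example.com/{n}", "click", "10/3/12 17:50")


# A consumer can only ever look at a finite prefix of the stream.
first_two = list(itertools.islice(click_stream(), 2))
```

Because the stream is unbounded, any consumer has to process tuples one at a time or in windows; it can never wait for "the end" of the input.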
Spouts
• Source of streams
• Example
  – Read from logs, API calls, event data, queues, …
Spout
• Interface ISpout
  – BaseRichSpout, ClojureSpout, DRPCSpout, FeederSpout, FixedTupleSpout, MasterBatchCoordinator, NoOpSpout, RichShellSpout, RichSpoutBatchTriggerer, ShellSpout, SpoutTracker, TestPlannerSpout, TestWordSpout, TransactionalSpoutCoordinator
Topology
• Bolts
  – Process input streams and produce new streams
  – Example
    • Stream joins, DBs, APIs, filters, aggregation, …
Bolts
• Interface IBolt
  – BaseRichBolt, BasicBoltExecutor, BatchBoltExecutor, BoltTracker, ClojureBolt, CoordinatedBolt, JoinResult, KeyedFairBolt, NonRichBoltTracker, ReturnResults, BaseShellBolt, ShellBolt, TestAggregatesCounter, TestGlobalCount, TestPlannerBolt, TransactionalSpoutBatchExecutor, TridentBoltExecutor, TupleCaptureBolt
Topology
• Topology
  – A directed graph of Spouts and Bolts
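To make the "directed graph" idea concrete, here is a minimal Python sketch that models a topology as named components plus subscription edges. The Topology class and the component names (sentences, split, count, report) are hypothetical; the sketch only mirrors the shape of the idea and is not Storm's TopologyBuilder API.

```python
from collections import defaultdict


class Topology:
    """Toy model: spouts and bolts are nodes, subscriptions are directed edges."""

    def __init__(self):
        self.components = {}            # name -> "spout" or "bolt"
        self.edges = defaultdict(list)  # source name -> names of subscribing bolts

    def add_spout(self, name):
        self.components[name] = "spout"

    def add_bolt(self, name, inputs):
        self.components[name] = "bolt"
        for source in inputs:
            if source not in self.components:
                raise ValueError(f"unknown source component: {source}")
            self.edges[source].append(name)

    def downstream(self, name):
        """All components reachable from `name`, in breadth-first order."""
        seen, queue = [], list(self.edges[name])
        while queue:
            current = queue.pop(0)
            if current not in seen:
                seen.append(current)
                queue.extend(self.edges[current])
        return seen


topology = Topology()
topology.add_spout("sentences")
topology.add_bolt("split", inputs=["sentences"])
topology.add_bolt("count", inputs=["split"])
topology.add_bolt("report", inputs=["count", "split"])
```

Calling topology.downstream("sentences") walks every bolt that can eventually receive data originating at that spout, which is the set of components a single spout tuple may touch.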
Tasks
• Instances of Spouts and Bolts
• Managed by Supervisor
  – http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
Stream grouping
• All grouping
  – Send to all tasks
• Global grouping
  – Pick task with lowest id
• Shuffle grouping
  – Pick a random task
• Fields grouping
  – Consistent hashing on a subset of tuple fields
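The four groupings above can be sketched as plain Python functions that take the list of target task ids and a tuple, and return the tasks that should receive it. The function names are illustrative only, and fields_grouping uses a simple CRC32 hash modulo the task count as a stand-in; it is an assumption made for the sketch, not Storm's actual hashing scheme.

```python
import random
import zlib


def all_grouping(tasks, tup):
    """Send the tuple to every task."""
    return list(tasks)


def global_grouping(tasks, tup):
    """Send the tuple to the task with the lowest id."""
    return [min(tasks)]


def shuffle_grouping(tasks, tup, rng=random):
    """Send the tuple to one randomly chosen task."""
    return [rng.choice(tasks)]


def fields_grouping(tasks, tup, field_indexes):
    """Send tuples with equal values in the chosen fields to the same task.

    Assumption: a deterministic hash (CRC32) modulo the number of tasks.
    """
    key = repr([tup[i] for i in field_indexes]).encode("utf-8")
    return [tasks[zlib.crc32(key) % len(tasks)]]


tasks = [4, 5, 6, 7]
a = ("alice", "http://example.com/a", "click")
b = ("alice", "http://example.com/b", "share")
```

With fields_grouping(tasks, a, [0]) and fields_grouping(tasks, b, [0]) both tuples land on the same task because they share the user field, which is what makes per-key counting in a bolt possible.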
Storm fault-tolerance
• Reliability API
  – Spout tuple creation
    • collector.emit(values, msgID);
  – Child tuple creation (Bolts)
    • collector.emit(parentTuples, values);
  – Tuple end of processing
    • collector.ack(tuple);
  – Tuple failed to process
    • collector.fail(tuple);
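How the four calls above fit together can be illustrated with a toy tracker in Python. Storm's acker keeps one XOR checksum of tuple ids per spout tuple; the AckTracker class below is a simplified model of that idea with hypothetical method names (spout_emit, bolt_emit, ack, fail), not the real acker and not the Java collector API.

```python
import random


class AckTracker:
    """Toy model: one XOR checksum per spout tuple, keyed by its msg_id."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self.pending = {}    # msg_id -> XOR of tuple ids emitted but not yet acked
        self.completed = []  # msg_ids whose whole tuple tree has been acked
        self.failed = []     # msg_ids the spout would have to replay

    def _new_id(self):
        # Random 64-bit id; zero is avoided so an id always changes the checksum.
        return self._rng.getrandbits(64) or 1

    def spout_emit(self, msg_id):
        """Models collector.emit(values, msgID) in a spout."""
        tuple_id = self._new_id()
        self.pending[msg_id] = tuple_id
        return tuple_id

    def bolt_emit(self, msg_id):
        """Models collector.emit(parentTuples, values): an anchored child tuple."""
        child_id = self._new_id()
        self.pending[msg_id] ^= child_id
        return child_id

    def ack(self, msg_id, tuple_id):
        """Models collector.ack(tuple): XOR the id out of the checksum."""
        self.pending[msg_id] ^= tuple_id
        if self.pending[msg_id] == 0:  # every emitted id has also been acked
            del self.pending[msg_id]
            self.completed.append(msg_id)

    def fail(self, msg_id):
        """Models collector.fail(tuple): the whole tree is marked failed."""
        self.pending.pop(msg_id, None)
        self.failed.append(msg_id)


tracker = AckTracker(seed=42)
root = tracker.spout_emit("msg-1")
child_a = tracker.bolt_emit("msg-1")
child_b = tracker.bolt_emit("msg-1")
tracker.ack("msg-1", root)
tracker.ack("msg-1", child_a)
still_pending = "msg-1" in tracker.pending  # child_b has not been acked yet
tracker.ack("msg-1", child_b)
```

The point of the XOR trick is that the tracker needs constant memory per spout tuple no matter how large the tuple tree grows: each id is XORed in once on emit and once on ack, so the checksum returns to zero only when every tuple in the tree has been acked.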
Storm fault-tolerance
• Disable reliability API
  – Globally
    • Config.TOPOLOGY_ACKER_EXECUTORS = 0
  – On topology level
    • collector.emit(values); (emit from the spout without a msgID)
  – For a single tuple
    • collector.emit(values); (emit from the bolt without anchoring to parentTuples)
Storm & DRPC
Distributed RPC
Multilang Protocol
• Uses ShellSpout/ShellBolt
• Processes communicate over standard input/output
• Messages are encoded as JSON or lines of plain text
Three steps
• Initiate a handshake
  – Keep track of the process by its process ID
  – Storm sends a JSON object to the process's standard input at startup
  – Contains
    • Storm configuration, topology context, PID directory
• Start looping
  – storm_sync would expect storm_ack
• Read or write tuples
  – Follow the defined structure
  – Implement read_msg(), storm_emit(), …
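The three steps can be sketched end to end in Python. This is a minimal sketch under stated assumptions: each message is a JSON object followed by a line containing only "end"; the handshake message carries conf, context and pidDir; and a bolt replies with emit and ack commands. The helpers read_msg, send_msg, handshake and run_split_bolt are written for this example and are not the storm.py module that ships with Storm. Storm may also send back the ids of the tasks that received an emitted tuple; that reply is not handled here.

```python
import io
import json
import os
import tempfile


def read_msg(stream):
    """Read lines up to a line containing only 'end' and parse them as JSON."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            return json.loads("\n".join(lines))
        lines.append(line)
    return None  # stream closed before a complete message arrived


def send_msg(stream, msg):
    """Write one JSON message followed by the 'end' terminator line."""
    stream.write(json.dumps(msg) + "\nend\n")
    stream.flush()


def handshake(stdin, stdout):
    """Step 1: read the setup message, drop a PID file, report the PID."""
    setup = read_msg(stdin)
    pid = os.getpid()
    open(os.path.join(setup["pidDir"], str(pid)), "w").close()
    send_msg(stdout, {"pid": pid})
    return setup["conf"], setup["context"]


def run_split_bolt(stdin, stdout):
    """Steps 2 and 3: loop over incoming tuples, emit anchored words, ack."""
    handshake(stdin, stdout)
    while True:
        tup = read_msg(stdin)
        if tup is None:
            break
        for word in tup["tuple"][0].split():
            send_msg(stdout, {"command": "emit", "anchors": [tup["id"]], "tuple": [word]})
        send_msg(stdout, {"command": "ack", "id": tup["id"]})


# Drive the bolt with in-memory streams instead of a real Storm worker.
with tempfile.TemporaryDirectory() as pid_dir:
    setup = {"conf": {}, "context": {}, "pidDir": pid_dir}
    incoming = {"id": "t1", "comp": "sentences", "stream": "default",
                "task": 1, "tuple": ["hello storm"]}
    fake_stdin = io.StringIO(json.dumps(setup) + "\nend\n" + json.dumps(incoming) + "\nend\n")
    fake_stdout = io.StringIO()
    run_split_bolt(fake_stdin, fake_stdout)
    pid_file_written = os.path.exists(os.path.join(pid_dir, str(os.getpid())))

fake_stdout.seek(0)
replies = []
while True:
    reply = read_msg(fake_stdout)
    if reply is None:
        break
    replies.append(reply)
```

In a real deployment the same run_split_bolt would be called with sys.stdin and sys.stdout, and ShellBolt on the JVM side would start the script as a subprocess.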
Experience
• Not hard to set up, but
  – Beware of certain versions of ZooKeeper
  – Wait a while after a topology is deployed
• Fast
  – Better to use Fabric
• Stable
  – But beware of memory leaks
Q/A
Thanks