Introduction to Storm

Chandler@PyHug previa [at] gmail.com Storm Distributed and fault-tolerant realtime computation system

Transcript of Introduction to Storm

Page 1: Introduction to Storm

Chandler@PyHug previa [at] gmail.com

Storm: Distributed and fault-tolerant realtime computation system

Page 2: Introduction to Storm

Outline

• Background
• Why Storm
• Components
• Topology
• Storm & DRPC
• Multilang Protocol
• Experience

Page 3: Introduction to Storm

Background

Page 4: Introduction to Storm

Background

• Created by Nathan Marz @ BackType/Twitter
  – Analyzes tweets, links, and users on Twitter

• Open-sourced in September 2011
  – Eclipse Public License 1.0
  – Storm 0.5.2
  – 16k Java and 7k Clojure LoC
  – Current stable release: 0.8.2
    • 0.9.0: major core improvements

Page 5: Introduction to Storm

Background

• Active user group
  – https://groups.google.com/group/storm-user
  – https://github.com/nathanmarz/storm
  – Most-watched Java repo on GitHub (>4k watchers)
  – Used by over 30 companies
    • Twitter, Groupon, Alibaba, GumGum, ...

Page 6: Introduction to Storm

Why Storm?

Page 7: Introduction to Storm

Before Storm

Page 8: Introduction to Storm

Problems

• Scale is painful
• Poor fault-tolerance
  – Hadoop is stateful
• Coding is tedious
• Batch processing
  – Long latency
  – No realtime

Page 9: Introduction to Storm

Storm

• Scalable and robust
  – No persistent layer
• Guarantees no data loss
• Fault-tolerant
• Programming language agnostic
• Use cases
  – Stream processing
  – Distributed RPC
  – Continuous computation

Page 10: Introduction to Storm

Components

Page 11: Introduction to Storm

Based on

• Apache Zookeeper
  – Distributed system, used to store metadata
• ØMQ
  – Asynchronous message transport layer
• Apache Thrift
  – Cross-language bridge, RPC
• LMAX Disruptor
  – High-performance queue shared by threads
• Kryo
  – Serialization framework

Page 12: Introduction to Storm

System architecture

Page 13: Introduction to Storm

System architecture

• Nimbus
  – Like JobTracker in Hadoop
• Supervisor
  – Manages workers
• Zookeeper
  – Stores metadata
• UI
  – Web UI

Page 14: Introduction to Storm

Topology

Page 15: Introduction to Storm

Topology

• Tuples: ordered list of elements
  – (“user”, “link”, “event”, “10/3/12 17:50”)
• Streams: unbounded sequence of tuples
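A minimal Python sketch of these two ideas, not Storm's actual API: a tuple is modelled as an ordered list of values whose field names are declared separately, and a stream as a generator that can keep yielding tuples indefinitely. The names FIELDS, click_stream and as_record, and the sample data, are made up for illustration.

```python
import itertools

# Field names are declared once; each tuple is just an ordered list of values.
FIELDS = ("user", "link", "event", "timestamp")

def click_stream():
    """An unbounded stream: yields tuples forever (made-up sample data)."""
    for n in itertools.count():
        yield (f"user{n % 3}", f"http://example.com/{n}", "click", "10/3/12 17:50")

def as_record(values):
    """Pair an ordered tuple with the declared field names."""
    return dict(zip(FIELDS, values))

# A consumer takes as many tuples as it wants; the stream itself never ends.
first_two = list(itertools.islice(click_stream(), 2))
```

Keeping field names apart from the values mirrors how a component declares its output fields once and then emits plain value lists.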

Page 16: Introduction to Storm

Spouts

• Source of streams
• Example
  – Read from logs, API calls, event data, queues, …

Page 18: Introduction to Storm

Topology

• Bolts
  – Process input streams and produce new streams
  – Example
    • Stream joins, DBs, APIs, filters, aggregation, …

Page 19: Introduction to Storm

Bolts

• Interface IBolt
  – BaseRichBolt, BasicBoltExecutor, BatchBoltExecutor, BoltTracker, ClojureBolt, CoordinatedBolt, JoinResult, KeyedFairBolt, NonRichBoltTracker, ReturnResults, BaseShellBolt, ShellBolt, TestAggregatesCounter, TestGlobalCount, TestPlannerBolt, TransactionalSpoutBatchExecutor, TridentBoltExecutor, TupleCaptureBolt

Page 20: Introduction to Storm

Topology

• Topology
  – A directed graph of Spouts and Bolts
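The directed-graph idea can be sketched in plain Python as a single-process toy, assuming a word-count pipeline that is not taken from the slides. Every name here (sentence_spout, split_bolt, CountBolt, run) is hypothetical; a real topology runs its spouts and bolts as parallel tasks across a cluster.

```python
from collections import Counter, deque

def sentence_spout():
    """Spout: source of the stream (a finite sample here, for illustration)."""
    yield ("the quick fox",)
    yield ("the lazy dog",)

def split_bolt(tup):
    """Bolt: consumes one tuple and emits zero or more new tuples."""
    for word in tup[0].split():
        yield (word,)

class CountBolt:
    """Bolt with local state: counts the words it has seen."""
    def __init__(self):
        self.counts = Counter()

    def __call__(self, tup):
        self.counts[tup[0]] += 1
        return ()  # emits nothing downstream

def run(spout, bolts, edges):
    """Push every spout tuple through the directed graph described by edges."""
    queue = deque(("spout", t) for t in spout())
    while queue:
        source, tup = queue.popleft()
        for target in edges.get(source, ()):
            for out in bolts[target](tup):
                queue.append((target, out))

counter = CountBolt()
bolts = {"split": split_bolt, "count": counter}
edges = {"spout": ["split"], "split": ["count"]}  # spout -> split -> count
run(sentence_spout, bolts, edges)
```

The edges dictionary is the topology: changing it rewires the data flow without touching the spout or bolt code.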

Page 21: Introduction to Storm

Tasks

• Instances of Spouts and Bolts
• Managed by Supervisor
  – http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/

Page 22: Introduction to Storm

Stream grouping

• All grouping
  – Send to all tasks
• Global grouping
  – Pick task with lowest id
• Shuffle grouping
  – Pick a random task
• Fields grouping
  – Consistent hashing on a subset of tuple fields
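The four groupings can be sketched as functions that map a tuple to the task ids that should receive it. This is an illustrative Python model, not Storm code: the function names and task ids are hypothetical, and fields grouping is shown as a plain hash modulo the number of tasks, a simpler stand-in for the hashing scheme named on the slide.

```python
import random
import zlib

def all_grouping(tup, tasks):
    """Send the tuple to every task."""
    return list(tasks)

def global_grouping(tup, tasks):
    """Send the tuple to the single task with the lowest id."""
    return [min(tasks)]

def shuffle_grouping(tup, tasks, rng=random):
    """Send the tuple to one randomly chosen task."""
    return [rng.choice(tasks)]

def fields_grouping(tup, tasks, field_indexes):
    """Hash the selected fields so equal values always reach the same task."""
    key = "|".join(str(tup[i]) for i in field_indexes)
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return [tasks[zlib.crc32(key.encode("utf-8")) % len(tasks)]]

tasks = [3, 4, 5, 6]  # hypothetical task ids of the receiving bolt
tup = ("alice", "http://example.com", "click")
same_user = ("alice", "http://example.org", "view")
```

Grouping on field 0 sends both tup and same_user to the same task, which is what makes per-key state such as counters possible in a bolt.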

Page 23: Introduction to Storm

Storm fault-tolerance

• Reliability API
  – Spout tuple creation
    • collector.emit(values, msgID);
  – Child tuple creation (Bolts)
    • collector.emit(parentTuples, values);
  – Tuple end of processing
    • collector.ack(tuple);
  – Tuple failed to process
    • collector.fail(tuple);
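How these four calls fit together can be modelled with a small Python tracker. It is a deliberately simplified sketch with a hypothetical class and method names: it keeps a set of in-flight tuple ids per spout message, whereas Storm's own acker tasks track tuple trees far more compactly. A spout message counts as fully processed only once every tuple anchored to it has been acked, and a single fail marks the whole tree for replay.

```python
class TupleTreeTracker:
    """Tracks, per spout msgID, which tuples are still unacknowledged."""

    def __init__(self):
        self.pending = {}    # msg_id -> set of tuple ids still in flight
        self.root_of = {}    # tuple id -> msg_id of the spout tuple it descends from
        self.completed = []  # msg_ids whose whole tree was acked
        self.failed = []     # msg_ids the spout would have to replay
        self._next_id = 0

    def _new_tuple(self, msg_id):
        self._next_id += 1
        self.root_of[self._next_id] = msg_id
        if msg_id in self.pending:  # ignore trees that already failed
            self.pending[msg_id].add(self._next_id)
        return self._next_id

    def emit_from_spout(self, msg_id):
        """Like collector.emit(values, msgID): start a new tuple tree."""
        self.pending[msg_id] = set()
        return self._new_tuple(msg_id)

    def emit_anchored(self, parent_id):
        """Like collector.emit(parentTuples, values): the child joins the parent's tree."""
        return self._new_tuple(self.root_of[parent_id])

    def ack(self, tuple_id):
        """Like collector.ack(tuple): the tree completes when nothing is pending."""
        msg_id = self.root_of.pop(tuple_id)
        if msg_id in self.pending:
            self.pending[msg_id].discard(tuple_id)
            if not self.pending[msg_id]:
                del self.pending[msg_id]
                self.completed.append(msg_id)

    def fail(self, tuple_id):
        """Like collector.fail(tuple): the whole tree fails at once."""
        msg_id = self.root_of.pop(tuple_id)
        if self.pending.pop(msg_id, None) is not None:
            self.failed.append(msg_id)

tracker = TupleTreeTracker()
root = tracker.emit_from_spout("msg-1")
child_a = tracker.emit_anchored(root)  # a bolt emits children before acking its input
child_b = tracker.emit_anchored(root)
tracker.ack(root)
tracker.ack(child_a)
done_early = list(tracker.completed)   # still empty: child_b is in flight
tracker.ack(child_b)
```

As in a bolt, children are emitted before the parent is acked; otherwise the tree could look complete while work is still outstanding.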

Page 24: Introduction to Storm

Storm fault-tolerance

• Disable reliability API
  – Globally
    • Config.TOPOLOGY_ACKER_EXECUTORS = 0
  – On topology level
    • collector.emit(values); (the spout emits without a msgID)
  – For a single tuple
    • collector.emit(values); (the bolt emits without parentTuples, i.e. unanchored)

Page 25: Introduction to Storm

Storm & DRPC

Page 26: Introduction to Storm

Distributed RPC

Page 27: Introduction to Storm

Multilang Protocol

Page 28: Introduction to Storm

Multilang protocol

• Uses ShellSpout/ShellBolt
• Processes communicate over standard input/output
• Messages are encoded as JSON or lines of plain text

Page 29: Introduction to Storm

Three steps

• Initiate a handshake
  – Keeps track of the process ID
  – A JSON object is sent to standard input at startup
  – Contains
    • Storm configuration, topology context, PID directory
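The handshake can be sketched from the child process's side in Python. This assumes the framing used by Storm's multilang protocol, where each JSON message is followed by a line containing only "end", and a setup message with the keys "conf", "context" and "pidDir". The helper names are hypothetical, and the parent side is simulated with in-memory streams so the sketch runs on its own.

```python
import io
import json
import os
import tempfile

def read_message(stream):
    """Read lines up to the 'end' marker and decode them as one JSON message."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            break
        lines.append(line)
    return json.loads("\n".join(lines))

def send_message(stream, message):
    """Write one JSON message followed by the 'end' marker line."""
    stream.write(json.dumps(message) + "\nend\n")
    stream.flush()

def handshake(stdin, stdout):
    """Receive the setup message, record our PID, and report it back."""
    setup = read_message(stdin)
    pid = os.getpid()
    # An empty file named after the PID lets the parent keep track of this process.
    open(os.path.join(setup["pidDir"], str(pid)), "w").close()
    send_message(stdout, {"pid": pid})
    return setup["conf"], setup["context"]

# Simulate the parent side with in-memory streams and a temporary PID directory.
pid_dir = tempfile.mkdtemp()
fake_stdin = io.StringIO(
    json.dumps({"conf": {}, "context": {}, "pidDir": pid_dir}) + "\nend\n"
)
fake_stdout = io.StringIO()
conf, context = handshake(fake_stdin, fake_stdout)
```

In a real component the two streams would be sys.stdin and sys.stdout, and nothing else may be printed to standard output, since it is reserved for protocol messages.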

Page 30: Introduction to Storm

Three steps

• Start looping
  – storm_sync expects storm_ack
• Read or write tuples
  – Follow the defined structure
  – Implement read_msg(), storm_emit(), …
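A bolt's read/emit/ack loop might look like the following Python sketch. It assumes the multilang message shapes (an incoming tuple carrying "id" and "tuple", and outgoing "emit" and "ack" commands, each terminated by an "end" line). The function names are hypothetical and do not correspond to the read_msg() or storm_emit() helpers named above. For brevity it ignores the task-id replies that Storm may send back after an emit.

```python
import io
import json

def read_message(stream):
    """Read one JSON message terminated by a line containing only 'end'."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            return json.loads("\n".join(lines))
        lines.append(line)
    return None  # input closed

def send_message(stream, message):
    """Write one JSON message followed by the 'end' marker line."""
    stream.write(json.dumps(message) + "\nend\n")

def run_split_bolt(stdin, stdout):
    """Hypothetical bolt: splits a sentence into words, then acks the input."""
    while True:
        msg = read_message(stdin)
        if msg is None:
            break
        for word in msg["tuple"][0].split():
            # Anchor each new tuple to the input tuple so failures can be traced.
            send_message(
                stdout,
                {"command": "emit", "anchors": [msg["id"]], "tuple": [word]},
            )
        send_message(stdout, {"command": "ack", "id": msg["id"]})

# Simulate one incoming tuple with in-memory streams (sample values are made up).
incoming = {"id": "t1", "comp": "1", "stream": "default", "task": 9,
            "tuple": ["hello storm"]}
fake_stdin = io.StringIO(json.dumps(incoming) + "\nend\n")
fake_stdout = io.StringIO()
run_split_bolt(fake_stdin, fake_stdout)
```

Emitting with anchors and acking afterwards ties the loop back to the reliability API: the input tuple is only acknowledged once its children have been sent.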

Page 31: Introduction to Storm

Experience

Page 32: Introduction to Storm

Experience

• Not hard to set up, but
  – Beware of certain versions of Zookeeper
  – Wait a while after the topology is deployed
• Fast
  – Better to use Fabric
• Stable
  – But beware of memory leaks

Page 33: Introduction to Storm

Reference

Page 34: Introduction to Storm

Reference

• “Getting Started with Storm”, O’Reilly
• Twitter Storm
  – Sergey Lukjanov@slideshare
  – http://www.slideshare.net/lukjanovsv/twitter-storm
• Storm
  – nathanmarz@slideshare
  – http://www.slideshare.net/nathanmarz/storm-11164672
• Realtime Analytics with Storm and Hadoop
  – Hadoop_Summit@slideshare
  – http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm

Page 35: Introduction to Storm

Q/A

Page 36: Introduction to Storm

Thanks