Storm - NodeBB · Distributed realtime computation system Originated at BackType/Twitter, open...
Storm
Hui Li
08/15/2016
► Storm introduction and features
► Storm core concepts
► Storm system architecture
► Using Storm
► Storm application development
Storm Outline
► Distributed realtime computation system
► Originated at BackType/Twitter, open sourced in late 2011
► Implemented in Clojure, some Java
► Apache top-level project, ~141 contributors
Storm Introduction
► Reliable, guaranteed data processing
  ► At-most-once, at-least-once, and exactly-once (Trident)
► Scalable
  ► Thousands of workers per cluster
► Fault-tolerant
  ► Failure is expected, and embraced
► Fast
  ► Clocked at 1M+ messages per second per node
Storm Features
► Realtime analytics
► Online machine learning
► Continuous computation
► Distributed RPC
► ETL
Storm Use Cases
► Twitter: personalization, search, revenue optimization, …
  ► 200 nodes, 30 topologies, 50B msg/day, avg latency <50 ms, Jun 2013
► Yahoo: user events, content feeds, and application logs
  ► 320 nodes (YARN), 130k msg/s, June 2013
► Spotify: recommendation, ads, monitoring, …
  ► v0.8.0, 22 nodes, 15+ topologies, 200k msg/s, Mar 2014
► Alibaba, Cisco, Flickr, PARC, WeatherChannel, …
  ► Netflix is looking at Storm and Samza, too.
Storm Adoptions
► Storm introduction and features
► Storm core concepts
► Storm system architecture
► Using Storm
► Storm application development
Storm Outline
► Core Unit of Data
► Immutable Set of Key/Value Pairs
Tuple
► Unbounded Sequence of Tuples
Streams
► Source of Streams
► Wraps a streaming data source and emits Tuples
► E.g., read from Kafka or Redis
Spouts
► Processes input streams and produces new streams
► Core functions of a streaming computation
Bolts
► Functions
► Filters
► Aggregation
► Joins
► Talk to databases
Bolts
► Network(DAG) of spouts and bolts
► Data Flow Representation
Topology
► Spouts and bolts execute as many tasks across the cluster
Tasks
► Determine how Storm routes Tuples between tasks in a topology
Stream Grouping
► Shuffle grouping
  ► Randomized round-robin
► Local or shuffle grouping
  ► Randomized round-robin
  ► With a preference for intra-worker tasks
Stream Grouping
► Fields grouping
  ► Mod hashing on a subset of tuple fields
  ► Ensures all tuples with the same field values are always routed to the same task.
► Partial Key grouping
  ► Like fields grouping, but tuples are load-balanced between two downstream bolts
  ► Provides better utilization of resources for skewed incoming data
Stream Grouping
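The two hash-based groupings above can be sketched in plain Java. This is a minimal illustration of the idea only, not Storm's internal code: the class and method names are hypothetical, the hash functions are arbitrary stand-ins, and the load measure (tuples sent so far per task) is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: mimics the idea behind two groupings, not Storm's implementation. */
final class GroupingSketch {

    /** Fields grouping: hash of the grouping-field values, modulo the task count. */
    static int fieldsGroupingTask(List<Object> groupingValues, int numTasks) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    /** Partial key grouping: two candidate tasks per key; the less-loaded one wins. */
    static final class PartialKeyChooser {
        private final long[] sentPerTask; // tuples sent to each task so far

        PartialKeyChooser(int numTasks) {
            this.sentPerTask = new long[numTasks];
        }

        int chooseTask(List<Object> groupingValues) {
            int n = sentPerTask.length;
            int h = groupingValues.hashCode();
            int first = Math.floorMod(h, n);
            // Second candidate from a different mix of the same hash (arbitrary constants).
            int second = Math.floorMod(h * 31 + 17, n);
            int chosen = sentPerTask[first] <= sentPerTask[second] ? first : second;
            sentPerTask[chosen]++;
            return chosen;
        }
    }

    public static void main(String[] args) {
        int numTasks = 6; // hypothetical parallelism of the consuming bolt
        PartialKeyChooser chooser = new PartialKeyChooser(numTasks);
        for (String word : Arrays.asList("the", "snow", "the", "seven", "the")) {
            List<Object> key = Arrays.<Object>asList(word);
            System.out.println(word
                    + " -> fields task " + fieldsGroupingTask(key, numTasks)
                    + ", partial-key task " + chooser.chooseTask(key));
        }
    }
}
```

With fields grouping a given word always lands on one task, so a per-task counter is complete. With partial key grouping a hot word may be spread over its two candidate tasks, so their partial counts have to be merged downstream.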
► All grouping
  ► Send to all tasks
► Global grouping
  ► Pick the task with the lowest id
► None grouping
  ► Currently equivalent to shuffle grouping
Stream Grouping
► Direct grouping
  ► The producer of the tuple decides which task of the consumer will receive this tuple.
  ► Direct groupings can only be declared on streams that have been declared as direct streams.
Stream Grouping
► Storm introduction and features
► Storm core concepts
► Storm system architecture
► Using Storm
► Storm application development
Storm Outline
Storm Architecture
[Diagram: Topology → Nimbus → ZooKeeper (×3) → Supervisor (×3) → Workers (×3)]
► Storm introduction and features
► Storm core concepts
► Storm system architecture
► Using Storm
► Storm application development
Storm Outline
► Storm Deployment
► Command Line Client
► REST API
► Storm UI
Using Storm
► 1. Set up a Zookeeper cluster
► For demo: storm dev-zookeeper
► 2. Install dependencies on Nimbus and worker machines
► Java 7
► Python 2.6.6
► Optional
► Configure PATH & JAVA_HOME environment
Storm Deployment
► 3. Download and extract a Storm release to Nimbus and worker machines
► sudo tar -zxvf apache-storm-1.0.2.tar.gz -C /opt/
► 4. Fill in mandatory configurations into storm.yaml
► storm.zookeeper.servers
► nimbus.seeds
► supervisor.slots.ports
► storm.local.dir
Storm Deployment
► 5. Launch daemons under supervision using the "storm" script and a supervisor of your choice
► storm nimbus
► storm supervisor
► Optional
► storm ui
► storm drpc
► storm logviewer
Storm Deployment
► storm jar topology-jar-path class ...
► storm list
► storm deactivate topology-name
► storm activate topology-name
► storm rebalance topology-name [-w wait-time-secs] [-n new-num-workers] [-e component=parallelism]*
► storm get-errors topology-name
► storm kill topology-name [-w wait-time-secs]
Command Line Client -- Topology Related
► storm nimbus
► storm supervisor
► storm ui
► storm drpc
► storm logviewer
► storm pacemaker
Command Line Client -- Daemon Related
► storm classpath
► storm localconfvalue conf-name
► ~/.storm/storm.yaml + defaults.yaml
► storm remoteconfvalue conf-name
► $STORM-PATH/conf/storm.yaml + defaults.yaml
Command Line Client -- Config Related
► storm monitor topology-name [-i interval-secs] [-m component-id] [-s stream-id] [-w [emitted | transferred]]
► storm set_log_level -l [logger name]=[log level][:optional timeout] -r [logger name] topology-name
► storm shell resourcesdir command args
► storm blobstore cmd
  ► storm blobstore create mytopo:data.tgz -f data.tgz -a u:alice:rwa,u:bob:rw,o::r
► storm sql sql-file topology-name
Command Line Client -- Advanced
► storm help
► storm version
► storm dev-zookeeper
► storm kill_workers
► run on a supervisor node
Command Line Client -- Misc
► Function
► retrieving metrics data
► retrieving configuration information
► management operations
► Supports JSONP
REST API
► Request URL Format
► http://<ui-host>:<ui-port>/api/v1/...
► Default Port: 8080
► Response Format: JSON
REST API
► /api/v1/cluster/configuration (GET)
► /api/v1/cluster/summary (GET)
► /api/v1/nimbus/summary (GET)
► /api/v1/supervisor/summary (GET)
► /api/v1/topology/summary (GET)
► /api/v1/topology/:id (GET)
REST API - GET
► /api/v1/topology/:id/activate (POST)
► /api/v1/topology/:id/deactivate (POST)
► /api/v1/topology/:id/rebalance/:wait-time (POST)
► /api/v1/topology/:id/kill/:wait-time (POST)
REST API - POST
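As a rough illustration of the URL format and the GET endpoints listed above, the following sketch builds a request URL and fetches the JSON body with the JDK's `HttpURLConnection`. The class and method names are hypothetical, `localhost` is a placeholder for the Storm UI host, and the timeouts are arbitrary. It assumes a Storm UI is reachable on the default port 8080.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

/** Illustrative client for the Storm UI REST API; names here are hypothetical. */
final class StormRestSketch {

    /** Builds a URL of the form http://<ui-host>:<ui-port>/api/v1/<path>. */
    static String apiUrl(String uiHost, int uiPort, String path) {
        return "http://" + uiHost + ":" + uiPort + "/api/v1/" + path;
    }

    /** Issues a GET request and returns the raw JSON response body. */
    static String get(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(3000); // arbitrary timeouts for the example
        conn.setReadTimeout(3000);
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name())) {
            scanner.useDelimiter("\\A"); // read the whole stream as one token
            return scanner.hasNext() ? scanner.next() : "";
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // "localhost" is a placeholder for the Storm UI host.
        String url = apiUrl("localhost", 8080, "cluster/summary");
        System.out.println("GET " + url);
        System.out.println(get(url));
    }
}
```

The POST endpoints (activate, deactivate, rebalance, kill) follow the same URL pattern with the topology id and, where listed, the wait time in the path.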
Storm UI
► Storm introduction and features
► Storm core concepts
► Storm system architecture
► Using Storm
► Storm application development
Storm Outline
► API
► WordCount Example
► Parallelism
► Reliability API
► DRPC
► Trident
► WordCount (Trident version) Example
Storm Application Development
public interface ISpout extends Serializable {
    // Lifecycle API
    void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
    void close();
    void activate();
    void deactivate();

    // Core API
    void nextTuple();

    // Reliability API
    void ack(Object msgId);
    void fail(Object msgId);
}
API -- Spout
• Common sub-interface: IRichSpout
• Common implementations: BaseRichSpout, DRPCSpout, RandomSentenceSpout, KafkaSpout
public interface IBolt extends Serializable {
    // Lifecycle API
    void prepare(Map stormConf, TopologyContext context, OutputCollector collector);
    void cleanup();

    // Core API
    void execute(Tuple input);
}
API -- Bolt
• Common sub-interface: IRichBolt
• Common implementations: BaseRichBolt, ShellBolt, RedisStoreBolt, KafkaBolt, HdfsBolt
public interface IOutputCollector extends IErrorReporter {
    // Core API
    List<Integer> emit(String streamId, Collection<Tuple> anchors, List<Object> tuple);
    void emitDirect(int taskId, String streamId, Collection<Tuple> anchors, List<Object> tuple);

    // Reliability API
    void ack(Tuple input);
    void fail(Tuple input);
    void resetTimeout(Tuple input);
}
API -- Bolt Output
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 2);
builder.setBolt("split", new SplitSentence(), 2).shuffleGrouping("spout").setNumTasks(4);
builder.setBolt("count", new WordCount(), 6).fieldsGrouping("split", new Fields("word"));
API -- Topology
spout → split → count
Common configuration
• Config.TOPOLOGY_WORKERS
• Config.TOPOLOGY_ACKER_EXECUTORS
• Config.TOPOLOGY_MAX_SPOUT_PENDING
• Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
• Config.TOPOLOGY_SERIALIZATIONS
API -- Topology Configuration
Config conf = new Config();
conf.setNumWorkers(20);
conf.setMaxSpoutPending(5000);
► Local Mode
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("word-count", conf, builder.createTopology());
...
cluster.shutdown();
► Remote Mode
StormSubmitter.submitTopologyWithProgressBar("word-count", conf, builder.createTopology());
API -- Topology Submission
WordCount Example
[Diagram: the spout emits "snow white and the seven dwarfs"; via shuffle grouping the split bolt emits snow, white, and, the, seven, dwarfs; via fields grouping the count bolt reports seven: 11, snow: 11, and: 23, dwarfs: 11, the: 19, white: 11]
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 2);
builder.setBolt("split", new SplitSentence(), 2).shuffleGrouping("spout").setNumTasks(4);
builder.setBolt("count", new WordCount(), 6).fieldsGrouping("split", new Fields("word"));
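import java.util.Map;
import java.util.Random;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class RandomSentenceSpout extends BaseRichSpout {
    SpoutOutputCollector _collector;
    Random _rand;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
        _rand = new Random();
    }

    @Override
    public void nextTuple() {
        Utils.sleep(100); // throttle: emit about one sentence every 100 ms
        String[] sentences = new String[]{
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away",
            "four score and seven years ago",
            "snow white and the seven dwarfs",
            "i am at two with nature" };
        String sentence = sentences[_rand.nextInt(sentences.length)];
        _collector.emit(new Values(sentence)); // no message id, so not tracked
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}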
WordCount Example -- Spout
public static class SplitSentence extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String sentence = tuple.getString(0);
        // Emit one tuple per space-separated word
        for (String word : sentence.split(" ")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
WordCount Example -- Split Bolt
public static class WordCount extends BaseBasicBolt {
    // Running count per word, local to this task
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) count = 0;
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
        System.out.println(String.format("== %s, %d ==", word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
WordCount Example -- Count Bolt
builder.setSpout("spout", new RandomSentenceSpout(), 2);
builder.setBolt("split", new SplitSentence(), 2).shuffleGrouping("spout").setNumTasks(4);
builder.setBolt("count", new WordCount(), 6).fieldsGrouping("split", new Fields("word"));
conf.setNumWorkers(2);
Parallelism
How do the parallelism hint, the task number, and the worker number relate?
► Worker processes
► Executors (threads)
► Tasks
Parallelism
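The arithmetic behind the three levels can be sketched for the word-count topology shown earlier in the deck (spout: hint 2; split: hint 2 with 4 tasks; count: hint 6; 2 workers). The sketch assumes that the parallelism hint is the number of executors, that the task count defaults to the executor count, and that tasks and executors are spread evenly. The class is hypothetical and acker executors are left out.

```java
/** Illustrative arithmetic only; the class and field names are hypothetical. */
final class ParallelismSketch {

    /** One topology component: executors (parallelism hint) and tasks. */
    static final class Component {
        final String id;
        final int executors;
        final int tasks;

        Component(String id, int executors, int tasks) {
            this.id = id;
            this.executors = executors;
            this.tasks = tasks;
        }

        /** Tasks run by each executor thread, assuming an even split. */
        int tasksPerExecutor() {
            return tasks / executors;
        }
    }

    /** Sum of executor threads over all components (ackers not counted). */
    static int totalExecutors(Component[] components) {
        int total = 0;
        for (Component c : components) {
            total += c.executors;
        }
        return total;
    }

    public static void main(String[] args) {
        Component[] topology = {
            new Component("spout", 2, 2), // task count defaults to the hint
            new Component("split", 2, 4), // setNumTasks(4)
            new Component("count", 6, 6),
        };
        int workers = 2; // conf.setNumWorkers(2)
        for (Component c : topology) {
            System.out.println(c.id + ": " + c.executors + " executors, "
                    + c.tasksPerExecutor() + " task(s) per executor");
        }
        int executors = totalExecutors(topology);
        System.out.println("total executors = " + executors
                + ", per worker = " + executors / workers);
    }
}
```

Under these assumptions the topology has 10 executors, 5 per worker process, and each split executor runs 2 tasks. The extra split tasks matter for rebalancing, since the task count stays fixed for the topology's lifetime while the executor count can change.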
Parallelism
► Rebalance
  ► storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10
Reliability API -- "fully processed"
► To create a new link in the tree of tuples
Reliability API -- "anchoring"
List<Tuple> anchors = new ArrayList<Tuple>();
anchors.add(A);
_collector.emit(anchors, new Values(B));
Reliability API -- Acknowledgment
[Diagram: ack and fail messages flowing between the bolts and the ACK bolts]
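public interface ISpout extends Serializable {
    // ... other methods omitted
    void ack(Object msgId);
    void fail(Object msgId);
}
public interface IOutputCollector extends IErrorReporter {
    // ... other methods omitted
    void ack(Tuple input);
    void fail(Tuple input);
}
► BaseBasicBolt: anchors emitted tuples to the input tuple and acks it automatically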
► Use a single 64-bit integer
► XOR magic
  long a, b = Random.nextLong();
  a != 0
  a ^ a ^ b != 0
  a ^ a ^ b ^ b == 0
► Question
  ► What will happen if a tuple isn't acked because the task died?
Reliability API -- Track Tuple Tree
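The XOR trick above can be tried out in a few lines of plain Java. This is a simplified sketch, not Storm's acker code: the class name is hypothetical, and it applies one XOR per event, whereas a real acker may receive several ids combined into one message. Each tuple id is XORed into the value once when the tuple is created and once when it is acked, so the value returns to zero only when every tuple has been both created and acked.

```java
import java.util.Random;

/** Illustrative ack tracker for one tuple tree; not Storm's acker implementation. */
final class AckValSketch {
    private long ackVal = 0L; // the single 64-bit value kept per tuple tree

    /** Called once when a tuple is created and once when it is acked. */
    void xor(long tupleId) {
        ackVal ^= tupleId;
    }

    /** True when every id has been XORed in an even number of times. */
    boolean fullyProcessed() {
        return ackVal == 0L;
    }

    public static void main(String[] args) {
        Random random = new Random();
        long a = random.nextLong(); // id of the tuple emitted by the spout
        long b = random.nextLong(); // id of a tuple anchored to a

        AckValSketch tree = new AckValSketch();
        tree.xor(a); // spout emits a
        tree.xor(b); // a bolt emits b, anchored to a
        tree.xor(a); // the same bolt acks a
        System.out.println("after acking a: " + tree.fullyProcessed());
        tree.xor(b); // the downstream bolt acks b
        System.out.println("after acking b: " + tree.fullyProcessed());
    }
}
```

Because the ids are random 64-bit values, the chance of the value reaching zero before the tree is complete is negligible, and the memory needed per tree stays constant regardless of its size.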
DRPC
► DRPC Server
storm drpc
► DRPC Client
DRPCClient client = new DRPCClient(conf, host, 3772);
String result = client.execute("wc", word);
► DRPC Topology
LinearDRPCTopologyBuilder
DRPC
► Provides consistent, exactly-once semantics
► Micro-Batch Oriented
► Fluent, Stream-Oriented API► Functions
► Filters
► Groupings
► Aggregations
► Merges and Joins
► Stateful, incremental processing on top of any persistence store
Trident
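One way micro-batches make exactly-once state updates possible is to store a transaction id next to each value and skip a batch that has already been applied. The sketch below illustrates that idea in plain Java along the lines of Trident's transactional state. It is not Trident code: the class and method names are hypothetical, and it assumes that batches are committed in txid order and that a replayed batch carries the same tuples as the original.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative txid-guarded counter; a stand-in for a persistent store. */
final class TxCountStateSketch {

    /** Stored value plus the id of the last batch that updated it. */
    private static final class Entry {
        long txid;
        long count;

        Entry(long txid, long count) {
            this.txid = txid;
            this.count = count;
        }
    }

    private final Map<String, Entry> store = new HashMap<>();

    /** Adds a batch's partial count for a key unless that batch was already applied. */
    void applyBatch(long txid, String key, long delta) {
        Entry entry = store.get(key);
        if (entry == null) {
            store.put(key, new Entry(txid, delta));
        } else if (entry.txid != txid) {
            entry.count += delta;
            entry.txid = txid;
        }
        // Same txid: this batch is a replay of one already applied, so skip it.
    }

    long count(String key) {
        Entry entry = store.get(key);
        return entry == null ? 0L : entry.count;
    }

    public static void main(String[] args) {
        TxCountStateSketch state = new TxCountStateSketch();
        state.applyBatch(1, "storm", 3); // batch #1 saw "storm" 3 times
        state.applyBatch(2, "storm", 2); // batch #2 saw "storm" 2 times
        state.applyBatch(2, "storm", 2); // batch #2 replayed after a failure
        System.out.println(state.count("storm")); // prints 5
    }
}
```

The replay of batch #2 leaves the count unchanged, which is what turns at-least-once delivery of batches into exactly-once updates of the stored state.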
Trident
[Diagram: a stream divided into Batch #1 and Batch #2]
TridentTopology topology = new TridentTopology();
TridentState wordCounts = topology.newStream("spout1", spout)
    .parallelismHint(16)
    .each(new Fields("sentence"), new Split(), new Fields("word"))
    .groupBy(new Fields("word"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
    .parallelismHint(16);

topology.newDRPCStream("words", drpc)
    .each(new Fields("args"), new Split(), new Fields("word"))
    .groupBy(new Fields("word"))
    .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
    .each(new Fields("count"), new FilterNull())
    .aggregate(new Fields("count"), new Sum(), new Fields("sum"));
WordCount(Trident version) Example
► Storm basic concepts, system architecture, basic usage, and an introduction to application development
► Advanced
  ► State Management & Stateful Bolts
  ► Native Streaming Window API
  ► Distributed Cache API
  ► Scheduler & Resource Aware Scheduler
  ► Worker Execution Model
  ► ...
Summary
Follow us
QingCloud-IaaS
青云QingCloud
www.qingcloud.com
Thank you