Streams processing with Storm


Transcript of Streams processing with Storm

Page 1: Streams processing with Storm

Data streams processing with

STORM

Mariusz Gil

Page 2: Streams processing with Storm

data expires fast. very fast

Page 3: Streams processing with Storm
Page 4: Streams processing with Storm

realtime processing?

Page 5: Streams processing with Storm

Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.

Page 6: Streams processing with Storm

Storm is fast, a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Page 7: Streams processing with Storm

conceptual architecture

Page 8: Streams processing with Storm

Stream: unbounded sequence of tuples

(val1, val2) (val3, val4) (val5, val6) ...
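In Storm's Java API a tuple is simply a named list of values. A minimal sketch of the schema/tuple pairing (field names and values here are illustrative):

import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class TupleSketch {
    public static void main(String[] args) {
        // The stream schema names each position in a tuple.
        Fields schema = new Fields("val1", "val2");
        // A tuple is an ordered list of values matching that schema.
        Values tuple = new Values("a", 42);
        System.out.println(schema.fieldIndex("val2")); // prints 1
        System.out.println(tuple.get(0));              // prints a
    }
}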

Page 9: Streams processing with Storm

Spouts: source of streams

Page 10: Streams processing with Storm

Reliable and unreliable Spouts: replay or forget about tuples
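A minimal sketch of that difference in code: emitting with a message ID makes the tuple reliable (Storm will call ack or fail for it, and fail can trigger a replay), while emitting without one tells Storm to forget the tuple immediately. Class and field names here are illustrative.

import java.util.Map;
import java.util.UUID;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class ReliableSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector _collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void nextTuple() {
        String sentence = "the cow jumped over the moon";
        // Reliable emit: the message ID lets Storm track the tuple tree.
        _collector.emit(new Values(sentence), UUID.randomUUID().toString());
        // Unreliable alternative: no ID, no tracking, no replay.
        // _collector.emit(new Values(sentence));
    }

    @Override
    public void ack(Object msgId) {
        // Tuple fully processed: safe to remove it from the source.
    }

    @Override
    public void fail(Object msgId) {
        // Tuple failed or timed out: re-emit it under the same message ID.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}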

Page 11: Streams processing with Storm

Spouts: source of streams

Storm-Kafka

Page 12: Streams processing with Storm

Spouts: source of streams

Storm-Kestrel

Page 13: Streams processing with Storm

Spouts: source of streams

Storm-AMQP-Spout

Page 14: Streams processing with Storm

Spouts: source of streams

Storm-JMS

Page 15: Streams processing with Storm

Spouts: source of streams

Storm-PubSub*

Page 16: Streams processing with Storm

Spouts: source of streams

Storm-Beanstalkd-Spout
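Ready-made spouts like these plug into a topology the same way as any hand-written one. A hedged sketch using the storm-kafka integration (constructor arguments follow the 0.9-era SpoutConfig API; hosts and names are illustrative):

import backtype.storm.topology.TopologyBuilder;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.ZkHosts;

public class KafkaSpoutSketch {
    public static void main(String[] args) {
        // Kafka brokers are discovered through ZooKeeper.
        ZkHosts hosts = new ZkHosts("zookeeper.example.com:2181");
        // Topic to read, ZK root for storing consumer offsets, and a spout id.
        SpoutConfig config = new SpoutConfig(hosts, "sentences", "/kafka-spout", "sentence-reader");

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka", new KafkaSpout(config), 4);
        // ...bolts would subscribe to the "kafka" stream from here.
    }
}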

Page 17: Streams processing with Storm

Bolts: process input streams and produce new streams

Page 18: Streams processing with Storm

Bolts: process input streams and produce new streams

Page 19: Streams processing with Storm

Topologies: network of spouts and bolts

TextSpout -> [sentence] -> SplitSentenceBolt -> [word] -> WordCountBolt -> [word, count]

Page 20: Streams processing with Storm

Topologies: network of spouts and bolts

TextSpout -> [sentence] -> SplitSentenceBolt -> [word] -> WordCountBolt -> [word, count]
TextSpout -> [sentence] -> xyzBolt
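A hedged sketch of wiring the branching variant with TopologyBuilder; TextSpout, SplitSentenceBolt, WordCountBolt and XyzBolt stand in for the classes named on the slide:

import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class BranchingTopologySketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("text", new TextSpout());

        // Branch 1: split each [sentence], then count per [word].
        builder.setBolt("split", new SplitSentenceBolt()).shuffleGrouping("text");
        builder.setBolt("count", new WordCountBolt()).fieldsGrouping("split", new Fields("word"));

        // Branch 2: a second bolt subscribing to the same [sentence] stream.
        builder.setBolt("xyz", new XyzBolt()).shuffleGrouping("text");
    }
}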

Page 21: Streams processing with Storm

server architecture

Page 22: Streams processing with Storm

Nimbus: process responsible for distributing processing across the cluster

Page 23: Streams processing with Storm

Supervisors: worker processes responsible for executing a subset of a topology

Page 24: Streams processing with Storm

Zookeepers: coordination layer between Nimbus and Supervisors
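Both daemons find each other through that coordination layer. A minimal storm.yaml sketch (hostnames and paths are illustrative; nimbus.host is the pre-1.0 setting):

storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
nimbus.host: "nimbus.example.com"
storm.local.dir: "/var/storm"
supervisor.slots.ports:
  - 6700
  - 6701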

Page 25: Streams processing with Storm

fail fast: cluster state is stored locally or in Zookeepers

Page 26: Streams processing with Storm

sample code

Page 27: Streams processing with Storm

Spouts

public class RandomSentenceSpout extends BaseRichSpout {
    SpoutOutputCollector _collector;
    Random _rand;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
        _rand = new Random();
    }

    @Override
    public void nextTuple() {
        Utils.sleep(100);
        String[] sentences = new String[] {
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away",
            "four score and seven years ago",
            "snow white and the seven dwarfs",
            "i am at two with nature"
        };
        // Emit a random sentence as a one-field tuple.
        String sentence = sentences[_rand.nextInt(sentences.length)];
        _collector.emit(new Values(sentence));
    }

    @Override
    public void ack(Object id) { }

    @Override
    public void fail(Object id) { }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

Page 28: Streams processing with Storm

Bolts

public static class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null)
            count = 0;
        count++;
        counts.put(word, count);
        // Emit the updated running count for this word.
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

Page 29: Streams processing with Storm

Bolts

public static class ExclamationBolt implements IRichBolt {
    OutputCollector _collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    public void execute(Tuple tuple) {
        // Anchor the new tuple to the input tuple, then ack the input.
        _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
        _collector.ack(tuple);
    }

    public void cleanup() { }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    public Map getComponentConfiguration() {
        return null;
    }
}

Page 30: Streams processing with Storm

Topology

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("spout", new RandomSentenceSpout(), 5);
        builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
        builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setDebug(true);

        if (args != null && args.length > 0) {
            // Submit to a real cluster with 3 worker processes.
            conf.setNumWorkers(3);
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            // Run in-process for local testing.
            conf.setMaxTaskParallelism(3);
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", conf, builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }
}

Page 31: Streams processing with Storm

Bolts

public static class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
        // Delegate processing to a Python subprocess via the multilang protocol.
        super("python", "splitsentence.py");
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

import storm

# splitsentence.py, shipped in the topology jar's resources/ directory
class SplitSentenceBolt(storm.BasicBolt):
    def process(self, tup):
        words = tup.values[0].split(" ")
        for word in words:
            storm.emit([word])

SplitSentenceBolt().run()

Page 32: Streams processing with Storm

github.com/nathanmarz/storm-starter

Page 33: Streams processing with Storm

streams grouping

Page 34: Streams processing with Storm

Topology

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("spout", new RandomSentenceSpout(), 5);
        builder.setBolt("split", new SplitSentence(), 8)
               .shuffleGrouping("spout");                    // tuples distributed randomly
        builder.setBolt("count", new WordCount(), 12)
               .fieldsGrouping("split", new Fields("word")); // same word -> same task

        Config conf = new Config();
        conf.setDebug(true);

        if (args != null && args.length > 0) {
            conf.setNumWorkers(3);
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            conf.setMaxTaskParallelism(3);
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", conf, builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }
}

Page 35: Streams processing with Storm

Groupings: shuffle, fields, all, global, none, direct, local or shuffle
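A hedged sketch of how the groupings are declared on a bolt's input (AllBolt, GlobalBolt and LocalBolt are illustrative placeholders):

import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class GroupingsSketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new RandomSentenceSpout(), 5);

        builder.setBolt("split", new SplitSentence(), 8)
               .shuffleGrouping("spout");                    // random, evenly balanced
        builder.setBolt("count", new WordCount(), 12)
               .fieldsGrouping("split", new Fields("word")); // same field value -> same task
        builder.setBolt("all", new AllBolt())
               .allGrouping("split");                        // every task gets every tuple
        builder.setBolt("global", new GlobalBolt())
               .globalGrouping("split");                     // all tuples go to a single task
        builder.setBolt("local", new LocalBolt())
               .localOrShuffleGrouping("split");             // prefer tasks in the same worker
    }
}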

Page 36: Streams processing with Storm

distributed rpc

Page 37: Streams processing with Storm

Distributed RPC

arguments -> [request-id, arguments] -> [request-id, results] -> results

Page 38: Streams processing with Storm

Distributed RPC

public static class ExclaimBolt extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // Field 0 carries the DRPC request id, field 1 the arguments.
        String input = tuple.getString(1);
        collector.emit(new Values(tuple.getValue(0), input + "!"));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "result"));
    }
}

public static void main(String[] args) throws Exception {
    LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclamation");
    builder.addBolt(new ExclaimBolt(), 3);

    Config conf = new Config();
    LocalDRPC drpc = new LocalDRPC();
    LocalCluster cluster = new LocalCluster();

    cluster.submitTopology("drpc-demo", conf, builder.createLocalTopology(drpc));

    System.out.println("Results for 'hello': " + drpc.execute("exclamation", "hello"));

    cluster.shutdown();
    drpc.shutdown();
}
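Against a real cluster the caller would use a DRPC client instead of LocalDRPC. A minimal sketch, assuming a DRPC server listening on the default Thrift port 3772 (hostname illustrative):

import backtype.storm.utils.DRPCClient;

public class DrpcCallSketch {
    public static void main(String[] args) throws Exception {
        DRPCClient client = new DRPCClient("drpc.example.com", 3772);
        // Blocks until the "exclamation" topology returns the result.
        System.out.println(client.execute("exclamation", "hello")); // prints hello!
    }
}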

Page 39: Streams processing with Storm

realtime analytics, personalization, search, revenue optimization, monitoring

Page 40: Streams processing with Storm

content search, realtime analytics, generating feeds; integrated with ElasticSearch, HBase, Hadoop and HDFS

Page 41: Streams processing with Storm

realtime scoring, moments generation; integrated with Kafka queues and HDFS storage

Page 42: Streams processing with Storm

Storm-YARN enables Storm applications to utilize the computational resources in a Hadoop cluster along with accessing Hadoop storage resources such as HBase and HDFS.

Page 43: Streams processing with Storm

thanks!
mail: [email protected]
twitter: @mariuszgil