storm at twitter

Post on 10-May-2015

description

Talk given at facebook's analytics@webscale conference. Covers storm basics, system overview, architecture at twitter and current use-cases.

Transcript of storm at twitter

storm

stream processing @twitter

Krishna Gade

Twitter

@krishnagade

Sunday, June 16, 13

what is storm?

storm is a platform for doing analysis on streams of data as they come in, so you can react to data as it happens.

storm v hadoop

storm & hadoop are complementary!

hadoop => big batch processing

storm => fast, reactive, real time processing

origins

• originated at backtype, acquired by twitter in 2011.

• built to vastly simplify dealing with queues & workers.

queue-worker model

queues workers

a a a a a

typical workflow

queues queues

workers workers

datastore

problems

• scaling is painful - queue partitioning & worker deploy.

• operational overhead - worker failures & queue backups.

• no guarantees on data processing.

storm

what does storm provide?

• at least once message processing.

• horizontal scalability.

• no intermediate queues.

• less operational overhead.

• “just works”.
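
"at least once" means a message is redelivered until it is acknowledged, so a handler can see the same message twice. the sketch below is a simplified plain-java model of that guarantee, not storm's actual acking code; `AtLeastOnceSketch`, `deliverAll` and the handler are hypothetical names introduced for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical model of at-least-once delivery; not Storm's implementation.
class AtLeastOnceSketch {
    // Delivers each message until the handler acknowledges it (returns true).
    // Returns the log of every delivery attempt, including replays.
    static List<String> deliverAll(List<String> messages, Predicate<String> handler) {
        Deque<String> pending = new ArrayDeque<>(messages);
        List<String> deliveries = new ArrayList<>();
        while (!pending.isEmpty()) {
            String msg = pending.poll();
            deliveries.add(msg);
            if (!handler.test(msg)) {
                pending.add(msg); // not acknowledged: queue it for replay
            }
        }
        return deliveries;
    }

    public static void main(String[] args) {
        Set<String> failedOnce = new HashSet<>();
        // The handler fails the first attempt for "b" and succeeds on the replay.
        Predicate<String> handler = msg -> !(msg.equals("b") && failedOnce.add(msg));
        System.out.println(deliverAll(Arrays.asList("a", "b", "c"), handler));
    }
}
```

in this run "b" is delivered twice, which is why downstream processing should tolerate duplicates.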

storm primitives

• streams

• spouts

• bolts

• topologies

streams

unbounded sequence of tuples

T T T T T T T T T T T T T T T

spouts

source of streams

A A A A A A A A A A A A

B B B B B B B B B B B B

typical spouts

• read from a kestrel/kafka queue. {tuples = events}

• read from an http server log. {tuples = http requests}

• read from twitter streaming api. {tuples = tweets}

bolts

process input stream - A

produce output stream - B

A A A A A A A A B B B B B B B B

bolts

• filtering tuples in a stream.

• aggregation of tuples.

• joining multiple streams.

• arbitrary functions on streams.

• communication with external caches/dbs.

topology

directed-acyclic-graph of spouts and bolts.

s1

s2

b1

b2

b3

b4

b5

storm cluster

nimbus

supervisor

w1 w2 w3 w4

supervisor

w1 w2 w3 w4

ZK

topology map

sync code

topology submission

master node

slave nodes

nimbus

• master node.

• manages the topologies.

• analogous to the job tracker in hadoop.

$ storm jar myapp.jar com.twitter.MyTopology demo

supervisor

• runs on slave nodes.

• co-ordinates with zookeeper.

• manages workers.

worker

jvm process

executor

task task

task

task

executor executor

recap

• worker - process that executes a subset of a topology.

• executor - a thread spawned by a worker.

• task - performs the actual data processing.
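
to make the worker / executor / task relationship concrete, here is a minimal sketch that spreads tasks over executor threads as evenly as possible. it is only an illustration: `WorkerLayoutSketch` and `assignTasks` are hypothetical names, and the round-robin split is an assumption, not storm's actual assignment code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical layout helper; Storm's real scheduler is more involved.
class WorkerLayoutSketch {
    // Spreads task ids 0..numTasks-1 over executors in round-robin order.
    static List<List<Integer>> assignTasks(int numTasks, int numExecutors) {
        List<List<Integer>> executors = new ArrayList<>();
        for (int e = 0; e < numExecutors; e++) {
            executors.add(new ArrayList<>());
        }
        for (int t = 0; t < numTasks; t++) {
            executors.get(t % numExecutors).add(t);
        }
        return executors;
    }

    public static void main(String[] args) {
        // 4 tasks on 3 executors: one executor runs two tasks, the others one each.
        System.out.println(assignTasks(4, 3)); // prints [[0, 3], [1], [2]]
    }
}
```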

stream grouping

• shuffle grouping - random distribution of tuples.

• fields grouping - groups tuples by a field, so equal values go to the same task.

• all grouping - replicates to all tasks.

• global grouping - sends the entire stream to one task.
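
the groupings above decide which downstream task receives each tuple. a minimal plain-java sketch of that routing decision follows; `GroupingSketch` and its methods are hypothetical, hashing modulo the task count is a simplified stand-in for a fields grouping, and using task 0 for the global grouping is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical routing helpers; these are not Storm's internal classes.
class GroupingSketch {
    // Shuffle grouping: any downstream task, chosen at random.
    static int shuffleTask(Random rand, int numTasks) {
        return rand.nextInt(numTasks);
    }

    // Fields grouping: hash the grouping field so equal values share a task.
    static int fieldsTask(String fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // All grouping: every downstream task gets a copy of the tuple.
    static List<Integer> allTasks(int numTasks) {
        List<Integer> tasks = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            tasks.add(i);
        }
        return tasks;
    }

    // Global grouping: the whole stream goes to one task (assumed to be task 0).
    static int globalTask() {
        return 0;
    }

    public static void main(String[] args) {
        int numTasks = 4;
        Random rand = new Random();
        for (String word : new String[] {"storm", "hadoop", "storm"}) {
            System.out.println(word + " shuffle->" + shuffleTask(rand, numTasks)
                + " fields->" + fieldsTask(word, numTasks));
        }
        System.out.println("all->" + allTasks(numTasks) + " global->" + globalTask());
    }
}
```

both occurrences of "storm" get the same fields-grouping task, which is what lets a counting bolt keep a correct per-word total.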

streaming word-count

TopologyBuilder builder = new TopologyBuilder();

// the last argument of setSpout/setBolt is the parallelism hint
builder.setSpout("tweet_spout", new RandomTweetSpout(), 5);

builder.setBolt("parse_bolt", new ParseTweetBolt(), 8)
       .shuffleGrouping("tweet_spout")
       .setNumTasks(2);

builder.setBolt("count_bolt", new WordCountBolt(), 12)
       .fieldsGrouping("parse_bolt", new Fields("word"));

Config config = new Config();
config.setNumWorkers(3);
StormSubmitter.submitTopology("demo", config, builder.createTopology());

tweet spout

class RandomTweetSpout extends BaseRichSpout {
    SpoutOutputCollector collector;
    Random rand;
    String[] tweets = new String[] {
        "@jkrums: There’s a plane in the Hudson. I’m on the ferry to pick up people. Crazy",
        "@barackobama: Four more years. pic.twitter.com/bAJE6Vom",
        ...
    };

    ....

    @Override
    public void nextTuple() {
        Utils.sleep(100);
        // emit a randomly chosen tweet as a one-field tuple
        String tweet = tweets[rand.nextInt(tweets.length)];
        collector.emit(new Values(tweet));
    }
}

parse bolt

class ParseTweetBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String tweet = tuple.getString(0);
        // emit one tuple per space-separated word
        for (String word : tweet.split(" ")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

word count bolt

class WordCountBolt extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        // increment the running count for this word and emit the new total
        Integer count = counts.get(word);
        count = (count == null) ? 1 : count + 1;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

word-count topology

RandomTweetSpout --(shuffle grouping)--> ParseTweetBolt --(fields grouping)--> WordCountBolt

how do we run storm @twitter ?

storm on mesos

node node node node

mesos

storm (production)

storm (dev)

we run multiple instances of storm on the same cluster via mesos.

mesos provides efficient resource isolation and sharing across distributed frameworks such as storm.

topology isolation

the isolation scheduler solves the problem of multi-tenancy (avoiding resource contention between topologies) by providing full isolation between topologies.

topology isolation

• shared pool - multiple topologies can run on the same host.

• isolated pool - dedicated set of hosts to run a single topology.
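
the two pools can be modelled as a queue of free hosts plus a set of dedicated hosts per topology. the sketch below is one reading of the slides, not the isolation scheduler's real code: `IsolationSketch`, its methods and the host names are hypothetical, and replacing a failed dedicated host from the shared pool is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of shared and isolated pools; not Storm's scheduler.
class IsolationSketch {
    final Deque<String> sharedPool = new ArrayDeque<>();
    final Map<String, Set<String>> isolatedPools = new LinkedHashMap<>();

    IsolationSketch(String... hosts) {
        for (String host : hosts) {
            sharedPool.add(host);
        }
    }

    // Moves `count` hosts out of the shared pool and dedicates them to one topology.
    void isolate(String topology, int count) {
        if (count > sharedPool.size()) {
            throw new IllegalStateException("not enough free hosts");
        }
        Set<String> pool = isolatedPools.computeIfAbsent(topology, k -> new LinkedHashSet<>());
        for (int i = 0; i < count; i++) {
            pool.add(sharedPool.poll());
        }
    }

    // A dedicated host failed: swap in a host taken from the shared pool (assumed policy).
    void replaceFailedHost(String topology, String failedHost) {
        Set<String> pool = isolatedPools.get(topology);
        if (pool == null || !pool.contains(failedHost)) {
            throw new IllegalArgumentException("host is not dedicated to " + topology);
        }
        if (sharedPool.isEmpty()) {
            throw new IllegalStateException("no spare host in the shared pool");
        }
        pool.remove(failedHost);
        pool.add(sharedPool.poll());
    }

    // A repaired host rejoins the cluster through the shared pool.
    void addToSharedPool(String host) {
        sharedPool.add(host);
    }

    public static void main(String[] args) {
        IsolationSketch cluster = new IsolationSketch("h1", "h2", "h3", "h4", "h5", "h6");
        cluster.isolate("joe", 2);
        cluster.isolate("jane", 2);
        cluster.replaceFailedHost("jane", "h3");
        cluster.addToSharedPool("h3");
        System.out.println(cluster.isolatedPools + " shared=" + cluster.sharedPool);
    }
}
```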

topology isolation

shared pool

storm cluster

topology isolation

shared pool

storm cluster

joe’s topology

isolated pools

topology isolation

shared pool

storm cluster

joe’s topology

isolated pools

jane’s topology

topology isolation

shared pool

storm cluster

joe’s topology

isolated pools

jane’s topology

dave’s topology

topology isolation

X

shared pool

storm cluster

joe’s topology

isolated pools

jane’s topology

dave’s topology

host failure

topology isolation

shared pool

storm cluster

joe’s topology

isolated pools

jane’s topology

dave’s topology

repair host

add host

topology isolation

shared pool

storm cluster

joe’s topology

isolated pools

jane’s topology

dave’s topology

add to shared pool

numbers

• benchmarked at a million tuples processed per second per node.

• running 30 topologies in a 200 node cluster.

• processing 50 billion messages a day with an average complete latency under 50 ms.
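
for a sense of scale, the daily volume can be converted to an average rate. this is back-of-the-envelope arithmetic only: it assumes the 50 billion messages are spread evenly over the day and over the 200 nodes, which the slide does not state, and `ThroughputSketch` is a hypothetical name.

```java
import java.util.Locale;

// Back-of-the-envelope conversion; even load is an assumption, not a measured figure.
class ThroughputSketch {
    static final double SECONDS_PER_DAY = 24 * 60 * 60;

    // Average messages per second for a given daily volume.
    static double perSecond(double messagesPerDay) {
        return messagesPerDay / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        double overall = perSecond(50e9);   // 50 billion messages a day
        double perNode = overall / 200;     // assumed even spread over 200 nodes
        System.out.println(String.format(Locale.ROOT,
            "%.0f msgs/s overall, %.0f msgs/s per node", overall, perNode));
        // prints "578704 msgs/s overall, 2894 msgs/s per node"
    }
}
```

under those assumptions the average per-node rate is far below the benchmarked million tuples per second per node.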

storm use-cases @twitter

stream processing applications

tweets

favorites, retweets

impressions

twitter streams

storm

spout

bolt

bolt

$$$$

realtime dashboards

new features

current use-cases

• discovery of emerging topics/stories.

• online learning of tweet features for search result ranking.

• realtime analytics for ads.

• internal log processing.

tweet scoring pipeline

tweets

data streams

impressions

interactions

storm topology

graphstore

metadatastore

join: tweets, impressions

join: tweets, interactions

last 7 days of: tweet -> feature_val, feature_type, timestamp

persistent store: tweet -> feature_val, feature_type, timestamp

thrift service

cassandra

twemcache

input: tweet id

output: score

write tweet features

road ahead

• auto scaling.

• persistent bolts.

• better grouping schemes.

• replicated computation.

• higher-level abstractions.

companies using storm
