Storm
Real-time computation made easy
Michael Vogiatzis
What's Storm?
Distributed real-time computation system
Fault tolerant
Fast
Scalable
Guaranteed message processing
Open source
Multilang capabilities
Purpose
Ok but why?
Motivation
Queues and workers paradigm
Scaling is hard
System is not robust
Coding is not fun!
No abstraction
Low level message passing
Intermediate message brokers
Use cases
Stream processing
Consume stream, update db, etc
Distributed RPC
Intense function computed on top of Storm
Ongoing computation
Computing music trends on Twitter
Architecture
Elements
Streams
Set of tuples
Unbounded sequence of data
Spout
Source of streams
Bolts
Application logic
Functions
Streaming aggregations, joins, DB ops
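To make the elements above concrete, here is a minimal in-process sketch of tuples flowing from a spout through a chain of bolts. It does not use the Storm API: `MiniTopologySketch`, `run` and `demo` are hypothetical names, the "spout" is a plain iterator over a bounded list (a real stream is unbounded), and each "bolt" is a function from one tuple to zero or more tuples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-ins for Storm's concepts; this is not the Storm API.
class MiniTopologySketch {

    // Pulls tuples from the spout one at a time and pushes each through the bolts in order.
    static List<String> run(Iterator<String> spout,
                            List<Function<String, List<String>>> bolts) {
        List<String> output = new ArrayList<>();
        while (spout.hasNext()) {
            List<String> inFlight = new ArrayList<>();
            inFlight.add(spout.next());
            for (Function<String, List<String>> bolt : bolts) {
                List<String> next = new ArrayList<>();
                for (String tuple : inFlight) {
                    next.addAll(bolt.apply(tuple)); // a bolt may emit 0..n tuples
                }
                inFlight = next;
            }
            output.addAll(inFlight);
        }
        return output;
    }

    // Example wiring: one sentence in, split into words, then upper-cased.
    static List<String> demo() {
        Function<String, List<String>> splitWords = s -> Arrays.asList(s.split(" "));
        Function<String, List<String>> upperCase = s -> Arrays.asList(s.toUpperCase());
        List<Function<String, List<String>>> bolts = Arrays.asList(splitWords, upperCase);
        return run(Arrays.asList("storm is fun").iterator(), bolts);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In Storm itself the spout and the bolts run as parallel tasks spread over a cluster; the sketch only shows the direction of the data flow.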
Topology
Storm UI
Demo
Unshorten URLs
Evil Shorteners
Demo
Trident
Higher level of abstraction on top of Storm
Batch processing
Keeps state using your persistence store e.g. DBs, Memcached, etc.
Exactly-once semantics
Tuples can be replayed!
Similar API to Pig / Cascading
Trident operations
Operation
Input fields Function fields
Trident operations
Joins
Aggregations
Grouping
Functions
Filtering
Sorting
Trident State
Solid API for reading / writing to stateful sources
State updates are idempotent
Different kinds of fault tolerance, depending on the Spout implementation
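One common way to make a state update idempotent is to store the id of the last applied batch next to the value, so that a replayed batch is recognised and skipped. The sketch below illustrates that idea in plain Java. It is not the Trident State API: `TransactionalCounterSketch` and its methods are hypothetical, and it assumes batches are applied in order and that a replayed batch carries exactly the same tuples as the first attempt.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory store; not the Trident State API.
class TransactionalCounterSketch {

    // A count stored together with the id of the last batch applied to it.
    static final class Versioned {
        final long txid;
        final long count;

        Versioned(long txid, long count) {
            this.txid = txid;
            this.count = count;
        }
    }

    private final Map<String, Versioned> store = new HashMap<>();

    // Adds a batch's partial count to the key; replaying the same batch id is a no-op.
    void applyBatch(long txid, String key, long delta) {
        Versioned current = store.get(key);
        if (current != null && current.txid == txid) {
            return; // this batch was already applied
        }
        long base = (current == null) ? 0L : current.count;
        store.put(key, new Versioned(txid, base + delta));
    }

    long get(String key) {
        Versioned v = store.get(key);
        return (v == null) ? 0L : v.count;
    }

    public static void main(String[] args) {
        TransactionalCounterSketch counter = new TransactionalCounterSketch();
        counter.applyBatch(1, "Female", 2);
        counter.applyBatch(1, "Female", 2); // replay of batch 1, ignored
        counter.applyBatch(2, "Female", 3);
        System.out.println(counter.get("Female"));
    }
}
```

If the spout cannot guarantee identical batches on replay, this check alone is not enough, which is why the level of fault tolerance depends on the spout.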
Learn by example
Compute male and female counts for a particular topic on Twitter over time
Trident Gender
Stream of incoming tweets
Filter out tweets that are not relevant to the topic
Check gender by checking first name
Update either male or female counter
Input (Spout impl.)
Receives public stream (~1% of tweets) and emits them into the system
List tweets;

public void emitBatch(long batchId, TridentCollector collector) {
    for (Object o : tweets)
        collector.emit(new Values(o));
}
Filter
Implement a Filter class called FilterWords
.each(new Fields("status"), new FilterWords(interestingWords))

String[] words = {"instagram", "flickr", "pinterest", "picasa"};

public boolean isKeep(TridentTuple tuple) {
    Tweet t = (Tweet) tuple.getValue(0);
    // Is the tweet an interesting one?
    for (String word : words)
        if (t.getText().toLowerCase().contains(word))
            return true;
    return false;
}
Function
Implement a Function class
.each(new Fields("status"), new ExpandName(), new Fields("name"))
Tuple before:[{fullname: Iris HappyWorker, text:Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6}]
Tuple after: [{fullname: Iris HappyWorker, text:Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6}, Iris]
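The slides do not show the body of ExpandName, so the following is only a guess at its core logic: take the full name from the tweet and keep the first whitespace-separated token as the "name" field. `FirstNameSketch` and `firstName` are hypothetical names, and the Trident plumbing (reading the tuple, emitting the new field) is left out.

```java
// Hypothetical helper; the actual ExpandName implementation is not shown in the slides.
class FirstNameSketch {

    // Returns the first whitespace-separated token of a full name, or "" when there is none.
    static String firstName(String fullName) {
        if (fullName == null) {
            return "";
        }
        String trimmed = fullName.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        return trimmed.split("\\s+")[0];
    }

    public static void main(String[] args) {
        System.out.println(firstName("Iris HappyWorker"));
    }
}
```

Inside a Trident function, the returned value would be emitted as the single new field appended to the input tuple, which is what the "Tuple after" line shows.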
State Query
Implement a QueryFunction to query the persistence storage.
.stateQuery(genderDB, new Fields("name"), new QueryGender(), new Fields("gender"))

public List<String> batchRetrieve(GenderDB state, List<TridentTuple> tuples) {
    // Collect all names first so the store is queried once per batch
    List<String> batchToQuery = new ArrayList<String>();
    for (TridentTuple t : tuples) {
        String name = t.getStringByField("name");
        batchToQuery.add(name);
    }
    return state.getGenders(batchToQuery);
}
State Query
Tuple before: [{fullname: Iris HappyWorker, text:Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6}, Iris]
Tuple after: [{fullname: Iris HappyWorker, text:Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6}, Iris, Female]
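The state query works on a whole batch of names at once and is expected to return one result per input tuple, in the same order. Below is a minimal sketch of such a batch lookup against an in-memory map. `GenderLookupSketch`, its sample entries and the "Unknown" fallback are assumptions for illustration; the real GenderDB behind the demo is not shown in the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the GenderDB state queried in the slides.
class GenderLookupSketch {

    private final Map<String, String> genderByFirstName = new HashMap<>();

    GenderLookupSketch() {
        // Sample data only; a real store would hold a large name dictionary.
        genderByFirstName.put("Iris", "Female");
        genderByFirstName.put("Lena", "Female");
        genderByFirstName.put("Michael", "Male");
    }

    // Returns one gender per input name, in input order; "Unknown" when the name is not found.
    List<String> getGenders(List<String> names) {
        List<String> genders = new ArrayList<>(names.size());
        for (String name : names) {
            genders.add(genderByFirstName.getOrDefault(name, "Unknown"));
        }
        return genders;
    }

    public static void main(String[] args) {
        GenderLookupSketch db = new GenderLookupSketch();
        System.out.println(db.getGenders(Arrays.asList("Iris", "Zed", "Michael")));
    }
}
```

Querying per batch rather than per tuple matters most when the store is remote, since it turns many round trips into one.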
Grouping
.groupBy(new Fields("gender"))
Groups the tuples containing the same gender value together
Re-partitions the stream
Tuples are sent over the network
Grouping
Tuples before:
1st Partition: [{TweetJson1}, Iris, Female]
1st Partition: [{TweetJson2}, Michael, Male]
2nd Partition: [{TweetJson3}, Lena, Female]
Group By Gender
Tuples after:
New 1st Partition: [{TweetJson1}, Iris, Female]
New 1st Partition: [{TweetJson3}, Lena, Female]
New 2nd Partition: [{TweetJson2}, Michael, Male]
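The repartitioning step can be pictured as hash partitioning on the grouping field: tuples with the same gender value always land in the same partition. The sketch below shows that rule in plain Java. `GroupByPartitionSketch` and its methods are hypothetical, tuples are reduced to `{name, gender}` pairs, and the exact partition numbers depend on the hash, so they will not necessarily match the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of hash partitioning; not Trident's implementation.
class GroupByPartitionSketch {

    // Equal keys always map to the same partition index.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Each tuple is {name, gender}; returns partition index -> names routed to it.
    static Map<Integer, List<String>> repartition(List<String[]> tuples, int numPartitions) {
        Map<Integer, List<String>> partitions = new TreeMap<>();
        for (String[] tuple : tuples) {
            int target = partitionFor(tuple[1], numPartitions);
            partitions.computeIfAbsent(target, k -> new ArrayList<>()).add(tuple[0]);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String[]> tuples = new ArrayList<>();
        tuples.add(new String[] {"Iris", "Female"});
        tuples.add(new String[] {"Michael", "Male"});
        tuples.add(new String[] {"Lena", "Female"});
        System.out.println(repartition(tuples, 2));
    }
}
```

With few partitions, two different keys can share a partition; the guarantee is only that one key is never split across partitions.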
Aggregators (general case)
Run the init() function before processing the batch
Aggregate through a number of tuples (usually grouped-by before) and emit one or more results based on the aggregate method.
public interface Aggregator<T> extends Operation {
    T init(Object batchId, TridentCollector collector);
    void aggregate(T state, TridentTuple tuple, TridentCollector collector);
    void complete(T state, TridentCollector collector);
}
Combiner Aggregator
Run init(TridentTuple t) on every tuple
Apply combine() to the values until no tuples are left, then return a single value.
public class Count implements CombinerAggregator<Long> {
    public Long init(TridentTuple tuple) { return 1L; }
    public Long combine(Long val1, Long val2) { return val1 + val2; }
    public Long zero() { return 0L; }
}
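To see why the combiner shape is useful, the sketch below runs the same init/combine/zero logic on plain strings and shows that partial results from two partitions can be combined afterwards. The `Combiner` interface is a simplified stand-in for the one on the slide (it takes a `String` instead of a `TridentTuple`), and seeding the fold with `zero()` is a simplification chosen for this sketch.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for CombinerAggregator; not the Trident interface itself.
class CombinerSketch {

    interface Combiner<T> {
        T init(String tuple);
        T combine(T val1, T val2);
        T zero();
    }

    // Same logic as the Count aggregator on the slide.
    static final Combiner<Long> COUNT = new Combiner<Long>() {
        public Long init(String tuple) { return 1L; }
        public Long combine(Long val1, Long val2) { return val1 + val2; }
        public Long zero() { return 0L; }
    };

    // Folds a list of tuples into one value; an empty list yields zero().
    static <T> T aggregate(Combiner<T> combiner, List<String> tuples) {
        T result = combiner.zero();
        for (String tuple : tuples) {
            result = combiner.combine(result, combiner.init(tuple));
        }
        return result;
    }

    public static void main(String[] args) {
        // Partial counts computed separately, then merged with combine().
        Long left = aggregate(COUNT, Arrays.asList("Iris", "Lena"));
        Long right = aggregate(COUNT, Arrays.asList("Michael"));
        System.out.println(COUNT.combine(left, right));
    }
}
```

Because combine() only needs two partial values, each partition can pre-aggregate locally and send a single number over the network.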
Reducer Aggregator
Run init() to get an initial value
Iterate over the tuples, reducing them into a single result
public interface ReducerAggregator<T> extends Serializable {
    T init();
    T reduce(T curr, TridentTuple tuple);
}
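A reducer is a plain left fold: start from init() and feed every tuple into reduce(). The sketch below shows that loop with a counting reducer. The `Reducer` interface here is a simplified stand-in that takes a `String` instead of a `TridentTuple`; `ReducerSketch` and `run` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for ReducerAggregator; not the Trident interface itself.
class ReducerSketch {

    interface Reducer<T> {
        T init();
        T reduce(T curr, String tuple);
    }

    // Counts tuples by adding one to the running value for each tuple seen.
    static final Reducer<Long> COUNT = new Reducer<Long>() {
        public Long init() { return 0L; }
        public Long reduce(Long curr, String tuple) { return curr + 1; }
    };

    // Left fold over the tuples, starting from the reducer's initial value.
    static <T> T run(Reducer<T> reducer, List<String> tuples) {
        T curr = reducer.init();
        for (String tuple : tuples) {
            curr = reducer.reduce(curr, tuple);
        }
        return curr;
    }

    public static void main(String[] args) {
        System.out.println(run(COUNT, Arrays.asList("Iris", "Lena", "Michael")));
    }
}
```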
Back to the example
For each gender batch run Count() aggregator
Not only aggregate, but also store the value to memory
Why?
Over time count
persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
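The reason for persisting the aggregate is that each batch only yields a partial count, while the example needs a running total over time. A minimal sketch of that merge step, using a plain in-memory map in place of MemoryMapState, could look as follows. `RunningCountSketch` and its methods are hypothetical, and replay handling is deliberately left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory running count; not the MemoryMapState implementation.
class RunningCountSketch {

    private final Map<String, Long> totals = new HashMap<>();

    // Counts one batch per gender, then merges the partial counts into the stored totals.
    void applyBatch(List<String> gendersInBatch) {
        Map<String, Long> partial = new HashMap<>();
        for (String gender : gendersInBatch) {
            partial.merge(gender, 1L, Long::sum);
        }
        for (Map.Entry<String, Long> entry : partial.entrySet()) {
            totals.merge(entry.getKey(), entry.getValue(), Long::sum);
        }
    }

    long get(String gender) {
        return totals.getOrDefault(gender, 0L);
    }

    public static void main(String[] args) {
        RunningCountSketch counts = new RunningCountSketch();
        counts.applyBatch(Arrays.asList("Female", "Male", "Female"));
        counts.applyBatch(Arrays.asList("Male"));
        System.out.println(counts.get("Female") + " female, " + counts.get("Male") + " male");
    }
}
```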
Putting it all together
TridentState genderDB = topology.newStaticState(new GenderDBFactory());

Stream gender = topology.newStream("spout", spout)
    .each(new Fields("status"), new Filter(topicWords))
    .each(new Fields("status"), new ExpandName(), new Fields("name"))
    .parallelismHint(4)
    .stateQuery(genderDB, new Fields("name"), new QueryGender(), new Fields("gender"))
    .parallelismHint(10)
    .groupBy(new Fields("gender"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
    .newValuesStream();
Demo
Gender count
Some downsides
Hard debugging
pseudo-distributed mode but still..
Object serialization
When using 3rd party libraries
Register your own serializers for better performance e.g. Kryo
I didn't tackle
Reliability
Guaranteed message processing
Distributed RPC example
Storm-deploy companion
One-click automated Storm cluster deployment, e.g. on EC2
Contributions
Overall
Express your realtime needs naturally
Growing community
System rapidly improving
Not a Hadoop/MR competitor
Fun to use
Resources
Storm Unshortening example https://github.com/mvogiatzis/storm-unshortening
Understanding the Storm Parallelism http://bit.ly/RCx4Ln
http://storm-project.net/
https://github.com/nathanmarz/storm
The End
Michael Vogiatzis
Follow me @mvogiatzis
Q & A