Real time stream processing presentation at General Assemb.ly
![Page 1: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/1.jpg)
Real Time Stream Processing
Varun Vijayaraghavan
![Page 2: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/2.jpg)
Why real time?
We live in a world of continuous data.
Server
Page Views
Social Media Events (Images, Comments, etc.)
Sensor Data
Instant Messaging
Market Data
![Page 3: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/3.jpg)
Why real time?
● What’s true right now may not have been true in the past.
● Competition is fierce. We need to act before others to win.
● In some cases, “too late” can lead to losses you would rather avoid. :(
● Need to make decisions NOW!
![Page 4: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/4.jpg)
What does such a system look like?
Continuous Data Source (messages m1, m2 ...) → Data Ingestion → Real Time Stream Processors → Instant Visualization / Insights / Alerts
![Page 5: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/5.jpg)
Properties of real time stream processing systems
● Fast! It needs to keep up.
● Scalable. Expand your cluster as data and computation needs grow larger.
● Fault tolerant. It should be able to recover from failure.
Important practical concerns:
● Easy to set up and write applications with.
● Needs to be battle tested. (Any distributed system is extremely hard to get right!)
● Needs to have excellent monitoring and tooling capability.
![Page 6: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/6.jpg)
Practical use cases
● “Real time” predictive analytics for news media (what I do :)
○ Analyze clicks, page views, user behavior, social streams, video views etc. to provide instant insights.
● Cloud based home security systems
○ Analyze sensor data like temperature, video streams, motion sensors etc. to determine threats
● Trading shares
○ Use real time market data (and other sources) to make instant decisions (buy, sell, long, short etc.)
○ Check out reactivetrader.com
![Page 7: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/7.jpg)
A small aside on parallelism
From Wikipedia: “Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently.”
Simple example: Think of crawling / scraping thousands of web pages one after another vs. doing them in parallel.
Two types of parallelism: Task Parallelism (Apache Storm) and Data Parallelism (Hadoop, Apache Spark). Most distributed stream processing systems are a combination of both, though.
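The crawling example can be sketched in plain Java. This is only an illustration: `CrawlSketch`, `fetchLength` and the 50 ms delay are made up for this aside, and the "fetch" is simulated rather than a real HTTP request. Every URL gets the same treatment, so this is the data parallel flavour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: fetchLength() stands in for a real page download.
final class CrawlSketch {

    // Simulates fetching one page; a real crawler would do network I/O here.
    static int fetchLength(String url) {
        try {
            Thread.sleep(50); // pretend the request takes 50 ms
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return url.length(); // pretend the page size equals the URL length
    }

    // One page after another: total time grows with the number of pages.
    static List<Integer> crawlSequentially(List<String> urls) {
        List<Integer> sizes = new ArrayList<>();
        for (String url : urls) {
            sizes.add(fetchLength(url));
        }
        return sizes;
    }

    // The same work spread over a fixed pool of worker threads.
    static List<Integer> crawlInParallel(List<String> urls, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<CompletableFuture<Integer>> pending = new ArrayList<>();
            for (String url : urls) {
                pending.add(CompletableFuture.supplyAsync(() -> fetchLength(url), pool));
            }
            List<Integer> sizes = new ArrayList<>();
            for (CompletableFuture<Integer> result : pending) {
                sizes.add(result.join()); // collected in submission order
            }
            return sizes;
        } finally {
            pool.shutdown();
        }
    }
}
```

Both methods return the same sizes in the same order; only the elapsed time differs, since the parallel version overlaps the waits.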
![Page 8: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/8.jpg)
Task Parallelism
Streaming Data Source
Task 1 Task 2
Task 3
Task 4
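A rough plain-Java sketch of task parallelism: each task is a different step running in its own thread, and messages flow from one task to the next through queues. The class name, the two example tasks and the empty-`Optional` end marker are all assumptions made for illustration; this is not how Storm is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Hypothetical sketch: two different tasks run concurrently on a flowing stream.
final class TaskPipelineSketch {

    // Starts a thread that applies `work` to every message and forwards the result.
    // An empty Optional marks the end of the stream and is passed downstream.
    static Thread startTask(BlockingQueue<Optional<String>> in,
                            BlockingQueue<Optional<String>> out,
                            Function<String, String> work) {
        Thread thread = new Thread(() -> {
            try {
                while (true) {
                    Optional<String> message = in.take();
                    out.put(message.map(work));
                    if (!message.isPresent()) {
                        return; // end of stream reached
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        thread.start();
        return thread;
    }

    static List<String> run(List<String> source) {
        BlockingQueue<Optional<String>> sourceToTask1 = new LinkedBlockingQueue<>();
        BlockingQueue<Optional<String>> task1ToTask2 = new LinkedBlockingQueue<>();
        BlockingQueue<Optional<String>> results = new LinkedBlockingQueue<>();

        startTask(sourceToTask1, task1ToTask2, s -> s.trim().toLowerCase()); // Task 1: normalize
        startTask(task1ToTask2, results, s -> s + " (" + s.length() + ")");  // Task 2: annotate

        for (String message : source) {
            sourceToTask1.add(Optional.of(message));
        }
        sourceToTask1.add(Optional.empty()); // signal end of stream

        List<String> collected = new ArrayList<>();
        try {
            for (Optional<String> m = results.take(); m.isPresent(); m = results.take()) {
                collected.add(m.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return collected;
    }
}
```

While Task 2 annotates one message, Task 1 can already be normalizing the next one.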
![Page 9: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/9.jpg)
Data Parallelism
Streaming Data Source
Data 1 Data 2 Data n
Task Task Task
![Page 10: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/10.jpg)
Apache Storm
● Apache Storm is a distributed, real time stream processing system. It uses task based parallelism.
● From the Storm wiki: “Storm exposes a set of primitives for doing realtime computation. Like how MapReduce greatly eases the writing of parallel batch processing, Storm's primitives greatly ease the writing of parallel realtime computation”.
● It’s fast (runs on the JVM), scalable, robust and fault tolerant. It is also pretty easy to set up and has great tooling.
![Page 11: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/11.jpg)
Components of Storm
Spout: The data source (Twitter posts, page view data etc).
Bolt: A processing task (word-counter, category-classifier etc).
Topology: This is a graph that consists of one or more spouts (data sources) and one or more bolts. Each bolt and spout can have several “workers” executing them in parallel.
Tuple: The “message” or “data” that’s passed between bolts and spouts. For instance: (“social-network”: “twitter”, “type”: “retweet”, “post”: “Hi Storm!”, “user_id”: “asdf123”, “post_id”: “1234”, “posted_at”: “2012-07-14T01:00:00”)
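To make the vocabulary concrete, here is a toy, single-threaded model of these four pieces in plain Java. It does not use the Storm API: `MiniTopologySketch`, its `Bolt` interface and the two example bolts are invented for illustration, and the "topology" is simplified to a straight chain rather than a general graph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical in-process model of spout, bolt, topology and tuple (not Storm itself).
final class MiniTopologySketch {

    // A bolt receives one tuple and may emit zero or more tuples downstream.
    interface Bolt {
        void execute(Map<String, Object> tuple, Consumer<Map<String, Object>> emit);
    }

    // Builds a tuple from alternating field names and values.
    static Map<String, Object> tuple(Object... namesAndValues) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (int i = 0; i < namesAndValues.length; i += 2) {
            fields.put((String) namesAndValues[i], namesAndValues[i + 1]);
        }
        return fields;
    }

    // Pushes one tuple through the bolts starting at `index`; survivors land in `sink`.
    static void process(Map<String, Object> tuple, List<Bolt> bolts, int index,
                        List<Map<String, Object>> sink) {
        if (index == bolts.size()) {
            sink.add(tuple);
            return;
        }
        bolts.get(index).execute(tuple, emitted -> process(emitted, bolts, index + 1, sink));
    }

    // Runs a linear "topology": every tuple from the spout flows through the bolts in order.
    static List<Map<String, Object>> run(Iterable<Map<String, Object>> spout, List<Bolt> bolts) {
        List<Map<String, Object>> sink = new ArrayList<>();
        for (Map<String, Object> t : spout) {
            process(t, bolts, 0, sink);
        }
        return sink;
    }

    static List<Map<String, Object>> demo() {
        // The "spout" here is just a fixed list of two tuples.
        List<Map<String, Object>> spout = Arrays.asList(
                tuple("type", "retweet", "post", "Hi Storm!", "user_id", "asdf123"),
                tuple("type", "like", "post", "Nice", "user_id", "qwer456"));
        // Bolt 1 keeps only retweets; bolt 2 adds a "length" field to each tuple.
        Bolt onlyRetweets = (t, emit) -> {
            if ("retweet".equals(t.get("type"))) {
                emit.accept(t);
            }
        };
        Bolt addLength = (t, emit) -> {
            Map<String, Object> out = new LinkedHashMap<>(t);
            out.put("length", ((String) t.get("post")).length());
            emit.accept(out);
        };
        return run(spout, Arrays.asList(onlyRetweets, addLength));
    }
}
```

In a real cluster the spouts and bolts run as many parallel instances on different machines; the model above only shows how tuples move between them.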
![Page 12: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/12.jpg)
Storm Architecture
Shuffle Grouping:
“Tuples” from the previous bolt or spout can be processed by any instance of the current bolt.
Fields Grouping:
“Tuples” from the previous spout or bolt that share the same value for the grouping field will always go to the same instance of the current bolt.
Example: If you group by user_id - all posts from that user will always be processed by the same instance.
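The difference between the two groupings comes down to how the target instance is picked. The sketch below shows the idea in plain Java; the class and method names are made up, and hashing the field value modulo the instance count is an assumed simplification rather than Storm's exact routing code.

```java
import java.util.Objects;
import java.util.Random;

// Hypothetical sketch of how a tuple could be routed to one of N bolt instances.
final class GroupingSketch {

    // Shuffle grouping: any instance may receive the tuple, so pick one at random.
    static int shuffleGrouping(Random random, int instances) {
        return random.nextInt(instances);
    }

    // Grouping by a field: the instance is derived from the field's value,
    // so equal values (e.g. the same user_id) always map to the same instance.
    static int fieldsGrouping(Object fieldValue, int instances) {
        return Math.floorMod(Objects.hashCode(fieldValue), instances);
    }
}
```

This is why a counting bolt is grouped by the counted field: all tuples for one key reach the instance that holds that key's running count.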
![Page 13: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/13.jpg)
Example: Word counts topology
RandomSentenceSpout
SplitSentenceBolt
WordCountBolt
PrinterBolt
Topology that takes a stream of sentences and keeps printing out the number of occurrences of each word.
![Page 14: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/14.jpg)
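Example: RandomSentence Spout

    public class RandomSentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private Random rand;

        // Called once when the spout starts; keeps the collector for later emits
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            this.rand = new Random();
        }

        // Called repeatedly by Storm; emits one random sentence per call
        public void nextTuple() {
            Utils.sleep(100);
            String[] sentences = new String[]{
                "the cow jumped over the moon",
                "an apple a day keeps the doctor away",
                "four score and seven years ago",
                "snow white and the seven dwarfs",
                "i am at two with nature"
            };
            String sentence = sentences[rand.nextInt(sentences.length)];
            collector.emit(new Values(sentence)); // Sends the sentence to the next bolt
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence")); // Declares what this spout emits
        }
    }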
![Page 15: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/15.jpg)
Example: SplitSentence Bolt

    public class SplitSentence extends BaseBasicBolt {
        // Splits each incoming sentence and emits one tuple per word
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String sentence = tuple.getStringByField("sentence");
            String[] words = sentence.split(" ");
            for (String word : words) {
                collector.emit(new Values(word));
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }
![Page 16: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/16.jpg)
Example: WordCount Bolt

    public class WordCount extends BaseBasicBolt {
        Map<String, Integer> counts = new HashMap<String, Integer>();

        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            Integer count = counts.get(word);
            if (count == null) {
                count = 0;
            }
            count = count + 1;
            counts.put(word, count);
            collector.emit(new Values(word, count));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }
![Page 17: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/17.jpg)
Example: Printer Bolt

    public class PrinterBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getStringByField("word");
            Integer count = tuple.getIntegerByField("count");
            System.out.println("Word count for " + word + ": " + count);
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Last bolt in the topology: it emits nothing
        }
    }
![Page 18: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/18.jpg)
Example: WordCount topology

    public class WordCountTopology {
        public static void main(String[] args) throws Exception {
            //...
            TopologyBuilder builder = new TopologyBuilder();
            // The last argument is the number of parallel instances of each component
            builder.setSpout("spout", new RandomSentenceSpout(), 5);
            builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
            builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
            builder.setBolt("print", new PrinterBolt(), 8).shuffleGrouping("count");
            // ...
        }
    }
![Page 19: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/19.jpg)
Let’s run this!
![Page 20: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/20.jpg)
…
![Page 21: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/21.jpg)
Lambda Architecture
● Architecture proposed by the creator of Storm, Nathan Marz (nathanmarz)
● Two processing layers:
○ Speed layer
■ This is the real time processing layer, performed by a framework like Storm.
■ This helps you give real time insights, but is complex and can potentially drop data.
○ Batch layer
■ This is the “after the fact” processing layer.
■ This will give you “delayed” insights, but it can be made as reliable as possible.
![Page 22: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/22.jpg)
Lambda Architecture
Continuous Data Source
Speed Layer (eg Storm)
Permanent Archive
Batch Layer (eg Hadoop)
Database
Instant / Delayed visualization and alerts
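One common way to combine the two layers is to answer queries from a batch view plus whatever the speed layer has seen since the last batch run. The plain-Java sketch below assumes exactly that, and further assumes that a finished batch run covers every event seen so far; `LambdaViewSketch` and its methods are hypothetical names, not part of Storm or Hadoop.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: per-key event counts served from a batch view plus a speed view.
final class LambdaViewSketch {
    private final Map<String, Long> batchView = new HashMap<>(); // recomputed from the archive
    private final Map<String, Long> speedView = new HashMap<>(); // updated as events arrive

    // Speed layer: count the event immediately.
    void onEvent(String key) {
        speedView.merge(key, 1L, Long::sum);
    }

    // Batch layer finished a run: adopt its counts and drop the speed-layer
    // counts, assuming the batch run covered every event seen so far.
    void onBatchComplete(Map<String, Long> recomputedCounts) {
        batchView.clear();
        batchView.putAll(recomputedCounts);
        speedView.clear();
    }

    // Query: reliable but delayed batch count plus the recent real time count.
    long query(String key) {
        return batchView.getOrDefault(key, 0L) + speedView.getOrDefault(key, 0L);
    }
}
```

If the speed layer drops or miscounts an event, the error only lasts until the next batch run replaces it.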
![Page 23: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/23.jpg)
Apache Spark
● Apache Spark is a fast and general-purpose cluster computing system. It has MapReduce-like functionality.
● It is a batch based, data parallel system, where the data is distributed to the workers in the cluster and operations can be applied on them (similar to Hadoop).
● Spark Streaming uses microbatches (~1s of data)
● Spark Streaming + Spark = Lambda architecture with the same codebase.
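The microbatch idea can be illustrated without Spark: cut the stream into fixed time slices and run an ordinary batch computation on each slice. The plain-Java sketch below does this for word counts; `MicroBatchSketch`, `Event` and the timestamps are assumptions for illustration and say nothing about how Spark Streaming is implemented internally.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: word counts computed separately for each time slice.
final class MicroBatchSketch {

    // One line of text with the time it arrived.
    static final class Event {
        final long timestampMillis;
        final String line;

        Event(long timestampMillis, String line) {
            this.timestampMillis = timestampMillis;
            this.line = line;
        }
    }

    // Maps each batch start time to the word counts of the events in that batch.
    static SortedMap<Long, Map<String, Integer>> countPerBatch(List<Event> events, long batchMillis) {
        SortedMap<Long, Map<String, Integer>> batches = new TreeMap<>();
        for (Event event : events) {
            // Round the timestamp down to the start of its batch interval.
            long batchStart = (event.timestampMillis / batchMillis) * batchMillis;
            Map<String, Integer> counts = batches.computeIfAbsent(batchStart, k -> new TreeMap<>());
            for (String word : event.line.split(" ")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return batches;
    }
}
```

Note that the counts restart with every batch; keeping a running total across batches would need extra state.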
![Page 24: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/24.jpg)
Spark Streaming: Simple Example

    // Initialize connections and streams
    ...
    // Read the lines
    val lines = ssc.socketTextStream("localhost", 9999)
    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Count each word in each batch
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    // Print the first ten elements of each RDD
    wordCounts.print()
    // Execute and close streams....
![Page 25: Real time stream processing presentation at General Assemb.ly](https://reader035.fdocuments.in/reader035/viewer/2022070323/55a205041a28abda648b45fb/html5/thumbnails/25.jpg)
Questions?