Storm: a distributed, fault-tolerant, real-time computation
STORM: Distributed and Fault-Tolerant Real-Time Computation
By: Nitin Guleria
Storm :Distributed Fault Tolerant Real Time Computation
Rationale
• Hadoop scales, but it does not process data in real time.
• Batch processing works on stale data.
• Before Storm, real-time processing meant hand-built networks of message queues and workers, which were:
  1. Tedious
  2. Brittle
  3. Hard to scale
Why Storm
• Real-time
• Fault tolerant
• Extremely robust
• Scalable (processed 1,000,000 messages per second on a 10-node cluster)
Storm Cluster: Comparison to Hadoop
[Figure: Storm cluster diagram with a comparison to Hadoop. Surviving labels: Job Tracker, Task Tracker, "Coordinates everything".]
Key Concepts
• Topology
• Tasks
• Tuple
• Stream
• Spout
• Bolt
A topology is a graph of computation.
Tasks are the processes that execute the spouts or bolts.
A simple topology
[Figure: a simple topology. Labels: Spout, Bolt, Stream, Tuple.]
Key Concepts: Tuples and Streams
• Tuple: an ordered list of elements
• Stream: an unbounded sequence of tuples
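The two definitions above can be sketched in plain Python. This is not Storm's API: the field names, the sample words and the `word_stream` generator are assumptions made for illustration. A tuple is modelled as an ordered Python tuple, and a stream as a generator that never ends, so a consumer can only ever take a finite prefix of it.

```python
from itertools import count, islice

# Hypothetical field names for the tuples in this stream.
FIELDS = ("word", "seq")

def word_stream():
    """An unbounded stream: yields (word, seq) tuples forever."""
    words = ["nathan", "mike", "jackson"]  # illustrative sample data
    for seq in count():
        yield (words[seq % len(words)], seq)

# A stream has no end, so take only the first four tuples.
first_four = list(islice(word_stream(), 4))
for tup in first_four:
    print(dict(zip(FIELDS, tup)))
```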
Key Concepts: Spouts and Bolts
• Spout: the source of a stream. Typical sources:
  • Queues
  • Web logs
  • API calls
  • Event data
• Bolt: processes input streams and creates new streams.
  • Applies functions/transforms: filtering, aggregation, streaming joins, etc.
  • Can produce multiple streams
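As a rough illustration of the spout and bolt roles, the sketch below chains a source, a filter and an aggregation using Python generators. It does not use Storm; the function names, the log format and the sample lines are all hypothetical.

```python
from collections import Counter

def log_spout(lines):
    """Spout stand-in: turns an external source (here a list) into a stream of 1-tuples."""
    for line in lines:
        yield (line,)

def error_filter_bolt(stream):
    """Filter bolt: passes on only the tuples whose line contains 'ERROR'."""
    for (line,) in stream:
        if "ERROR" in line:
            yield (line,)

def count_by_host_bolt(stream):
    """Aggregation bolt: emits a running (host, count) tuple for every input tuple."""
    counts = Counter()
    for (line,) in stream:
        host = line.split()[0]  # assumes the host name is the first token
        counts[host] += 1
        yield (host, counts[host])

logs = [
    "web1 ERROR timeout",
    "web2 INFO ok",
    "web1 ERROR disk",
    "web2 ERROR timeout",
]
result = list(count_by_host_bolt(error_filter_bolt(log_spout(logs))))
print(result)
```

Each bolt consumes one stream and produces a new one, which is what lets bolts be chained.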
Key Concepts: Stream groupings
• A stream grouping defines how a stream is partitioned among a bolt's tasks.
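Two partitioning strategies can be sketched as follows. This is a simplified model in plain Python, not Storm code: the function names are hypothetical, the shuffle is done round-robin so the output is reproducible (Storm's shuffle grouping distributes tuples randomly but evenly), and the hash used for the field-based variant is an assumption.

```python
import zlib
from itertools import cycle

def shuffle_grouping(stream, num_tasks):
    """Spread tuples evenly over the tasks (round-robin stand-in for a random shuffle)."""
    tasks = [[] for _ in range(num_tasks)]
    for task_id, tup in zip(cycle(range(num_tasks)), stream):
        tasks[task_id].append(tup)
    return tasks

def fields_grouping(stream, num_tasks, field_index):
    """Send tuples with the same value in one field to the same task."""
    tasks = [[] for _ in range(num_tasks)]
    for tup in stream:
        key = str(tup[field_index]).encode("utf-8")
        task_id = zlib.crc32(key) % num_tasks  # stable hash, chosen for illustration
        tasks[task_id].append(tup)
    return tasks

words = [("mike",), ("nathan",), ("mike",), ("golda",)]
print(shuffle_grouping(words, 2))
print(fields_grouping(words, 2, 0))
```

The shuffle variant balances load; the field-based variant is what an aggregation such as a per-word count needs, because every tuple for a given word must reach the same task.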
A simple topology
[Figure: words (spout) -> exclaim1 (bolt) -> exclaim2 (bolt), each connection using a shuffle grouping. Example: mike -> mike!!! -> mike!!!!!!]
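The data flow of this topology can be imitated in a few lines of Python. The sketch below is not how Storm builds or runs topologies: the dictionary layout, the `run` helper and the `exclaim` function are hypothetical. Only the component names and the "!!!" suffix come from the example above.

```python
def exclaim(word):
    """What each exclaim bolt does to a word."""
    return word + "!!!"

# The topology as a graph: each component lists the components it feeds.
topology = {
    "words": ["exclaim1"],
    "exclaim1": ["exclaim2"],
    "exclaim2": [],
}
bolts = {"exclaim1": exclaim, "exclaim2": exclaim}

def run(topology, bolts, source, value):
    """Push one value from `source` through the graph and collect what the leaf components hold."""
    outputs = []
    pending = [(source, value)]
    while pending:
        name, val = pending.pop()
        downstream = topology[name]
        if not downstream:
            outputs.append(val)
        for nxt in downstream:
            pending.append((nxt, bolts[nxt](val)))
    return outputs

print(run(topology, bolts, "words", "mike"))  # ['mike!!!!!!']
```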
Implementation of Spout
• The spout object implements the IRichSpout interface.
• The nextTuple() method, as implemented in TestWordSpout, emits the next tuple into the stream.
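The shape of such a spout can be sketched in Python. The real interface is Java and has more methods than shown here; `WordSpout`, `ListCollector`, the `seed` parameter and the word list are assumptions for illustration, with method names mirroring the Java ones in snake_case.

```python
import random

class ListCollector:
    """Hypothetical collector that simply records every emitted tuple."""
    def __init__(self):
        self.emitted = []

    def emit(self, tup):
        self.emitted.append(tup)

class WordSpout:
    """Python stand-in for a word-emitting spout."""
    WORDS = ["nathan", "mike", "jackson", "golda", "bertels"]  # sample words

    def open(self, collector, seed=None):
        # Called once before any tuples are requested; keeps the collector for later.
        self.collector = collector
        self.rng = random.Random(seed)

    def next_tuple(self):
        # Called repeatedly by the framework; emits one randomly chosen word.
        self.collector.emit((self.rng.choice(self.WORDS),))

collector = ListCollector()
spout = WordSpout()
spout.open(collector, seed=0)
for _ in range(3):
    spout.next_tuple()
print(collector.emitted)
```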
Implementation of Bolt
• Implements the IRichBolt interface.
• The prepare method saves the OutputCollector in a variable.
• The execute method receives a tuple and appends exclamation marks to it.
• The cleanup method prevents resource leaks when the bolt shuts down.
• The declareOutputFields method declares that the bolt emits a tuple with a field named 'word'.
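The four methods listed above can be mirrored in a small Python class. This is a sketch of the lifecycle only, not the Java interface: `ExclamationBolt`, `ListCollector` and the simplified method signatures are assumptions.

```python
class ListCollector:
    """Hypothetical collector that simply records every emitted tuple."""
    def __init__(self):
        self.emitted = []

    def emit(self, tup):
        self.emitted.append(tup)

class ExclamationBolt:
    """Python stand-in for the exclamation bolt."""

    def prepare(self, collector):
        # Save the collector so that execute() can emit through it.
        self.collector = collector

    def execute(self, tup):
        # Receive one tuple, append the exclamation marks, emit the result.
        (word,) = tup
        self.collector.emit((word + "!!!",))

    def cleanup(self):
        # Drop held references on shutdown.
        self.collector = None

    def declare_output_fields(self):
        # The bolt emits 1-tuples with a single field named "word".
        return ("word",)

collector = ListCollector()
bolt = ExclamationBolt()
bolt.prepare(collector)
bolt.execute(("mike",))
print(bolt.declare_output_fields(), collector.emitted)
bolt.cleanup()
```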
Conclusion
• Storm is a promising tool.
• It has a clean and elegant design.
• Its documentation is excellent for a young open-source tool.
• It is a great replacement for Hadoop for real-time computation.
Thank You
Sources
• Storm: The Real-Time Layer, GlueCon 2012. Dan Lynn.
• Storm tutorial. Nathan Marz. http://storm.incubator.apache.org/documentation/Tutorial.html
• Streams processing with Storm. Mariusz Gil.
Questions
• What are the major issues with processing real-time streams, and how can they be solved? Specify algorithms or techniques.
• Are there any query languages for real-time stream processing?
Answers
• One strategy for dealing with streams is to maintain summaries of the streams, sufficient to answer the expected queries about the data, and to use sampling and filtering to extract a subset of the data.
• A second approach is to maintain a sliding window of the most recently arrived data.
• SQL stream.
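The first two answers can be made concrete with a short Python sketch. The slides do not name specific algorithms, so both choices here are assumptions: a fixed-size window that reports a running mean, and reservoir sampling (Algorithm R) as one common way to keep a uniform sample of a stream whose length is not known in advance. The class and function names are hypothetical.

```python
import random
from collections import deque

class SlidingWindowMean:
    """Keeps only the `size` most recent values and reports their mean."""
    def __init__(self, size):
        self.window = deque(maxlen=size)  # old values fall out automatically

    def add(self, x):
        self.window.append(x)
        return sum(self.window) / len(self.window)

def reservoir_sample(stream, k, seed=None):
    """Uniform random sample of k items from a stream, using O(k) memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < k:
                sample[j] = item      # replace with probability k / (i + 1)
    return sample

window = SlidingWindowMean(3)
means = [window.add(x) for x in [1, 2, 3, 4]]
print(means)  # [1.0, 1.5, 2.0, 3.0]
print(reservoir_sample(range(100), 5, seed=0))
```

Both structures use memory that is independent of the stream's length, which is the point of summarising an unbounded stream rather than storing it.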