Storm: distributed, fault-tolerant, real-time computation

Page 1

STORM: Distributed and Fault-Tolerant Real-Time Computation

By: Nitin Guleria

[email protected]

Page 2

Rationale

• Hadoop scales, but it offers no real-time data processing.
• Batch processing works on stale data.
• Before Storm, real-time processing meant hand-wiring message queues and workers, which was:
  1. Tedious
  2. Brittle
  3. Hard to scale

Page 3

Why Storm

• Real-time
• Fault tolerant
• Extremely robust
• Scalable (processed 1,000,000 messages per second on a 10-node cluster)

Page 4

Storm Cluster: Comparison to Hadoop

[Figure: Storm cluster diagram annotated with Hadoop analogies. Labels: "Job Tracker", "Task Tracker", "Coordinates everything".]

Page 5

Key Concepts

• Topology
• Tasks
• Tuple
• Stream
• Spout
• Bolt

A topology is a graph of computation.

Tasks are the processes that execute the spouts or bolts.

[Figure: A simple topology. Labels: Spout, Bolt, Stream, Tuple.]

Page 6

Key Concepts: Tuples and Streams

• Tuple: an ordered list of elements
• Stream: an unbounded sequence of tuples
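
As a rough illustration of these two definitions, a tuple can be modelled as an ordered list of values and a stream as a generator that never terminates. This is plain Python, not Storm's API; the names `word_stream` and `take` and the sample words are assumptions made for the sketch.

```python
import itertools
from typing import Iterator, List


def word_stream() -> Iterator[List[str]]:
    """An unbounded stream: yields one-element tuples forever."""
    words = ["nathan", "mike", "jackson"]  # sample values, chosen for illustration
    for word in itertools.cycle(words):
        yield [word]  # a tuple is an ordered list of elements


def take(stream: Iterator[List[str]], n: int) -> List[List[str]]:
    """Read only the first n tuples; the stream itself never ends."""
    return list(itertools.islice(stream, n))


first_three = take(word_stream(), 3)
```

Because the stream is unbounded, a consumer can only ever look at a finite prefix or a window of it, which is why `take` is needed here.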

Page 7

Key Concepts: Spouts and Bolts

• Spout: the source of a stream. Typical sources:
  • Queues
  • Web logs
  • API calls
  • Event data
• Bolt: processes input streams and creates new streams.
  • Applies functions and transforms: filters, aggregations, streaming joins, etc.
  • Can produce multiple streams

Page 8

Key Concepts: Stream Groupings

• A stream grouping defines how a stream is partitioned among a bolt's tasks.
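
To make the idea of partitioning concrete, here is a minimal sketch of two ways tuples could be assigned to a bolt's tasks: a shuffle grouping (random task, roughly even spread) and a fields grouping (equal field values always reach the same task). The fields grouping is included only for contrast. The helper names `shuffle_grouping` and `fields_grouping` are hypothetical, and the code models the concept in plain Python rather than reproducing Storm's implementation.

```python
import random
import zlib
from typing import List


def shuffle_grouping(num_tasks: int, rng: random.Random) -> int:
    """Pick a task at random so tuples spread roughly evenly."""
    return rng.randrange(num_tasks)


def fields_grouping(value: str, num_tasks: int) -> int:
    """Hash a field value so equal values always reach the same task."""
    return zlib.crc32(value.encode("utf-8")) % num_tasks


rng = random.Random(0)  # seeded only to make the sketch repeatable
words: List[str] = ["mike", "nathan", "mike", "golda", "mike"]
shuffled = [shuffle_grouping(4, rng) for _ in words]
by_field = [fields_grouping(w, 4) for w in words]
```

With the fields grouping every occurrence of "mike" maps to the same task index, which matters for stateful work such as per-word counts; the shuffle grouping gives no such guarantee.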

Page 9

A simple topology

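words --(shuffle)--> exclaim1 --(shuffle)--> exclaim2

Example: the tuple "mike" emitted by words becomes "mike!!!" after exclaim1 and "mike!!!!!!" after exclaim2.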
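
The data flow of this topology can be sketched in a single process by chaining generators. This is only a model of the flow: it ignores the shuffle groupings and parallelism, and the function names `words` and `exclaim` are stand-ins chosen for the sketch, not Storm classes.

```python
from typing import Iterable, Iterator


def words() -> Iterator[str]:
    """Stand-in for the 'words' spout; emits a small fixed sample."""
    yield from ["mike", "nathan"]


def exclaim(stream: Iterable[str]) -> Iterator[str]:
    """Stand-in for an exclamation bolt: appends '!!!' to each word."""
    for word in stream:
        yield word + "!!!"


# words -> exclaim1 -> exclaim2
results = list(exclaim(exclaim(words())))
```

Each word passes through two exclamation stages, so `results` holds "mike!!!!!!" and "nathan!!!!!!".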

Page 10

Implementation of Spout

• The spout object implements the IRichSpout interface.
• The nextTuple() method of TestWordSpout is where the spout emits new tuples.
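
The shape of such a spout can be sketched in plain Python. This is not Storm's Java API: `ListCollector` is a hypothetical collector that only records what was emitted, `WordSpout` is loosely modelled on the idea of a spout that emits one random word per call, and the `seed` parameter exists only to keep the sketch repeatable.

```python
import random
from typing import List


class ListCollector:
    """Hypothetical collector that just records emitted tuples."""

    def __init__(self) -> None:
        self.emitted: List[List[str]] = []

    def emit(self, values: List[str]) -> None:
        self.emitted.append(values)


class WordSpout:
    """Emits one randomly chosen word per next_tuple() call."""

    WORDS = ["nathan", "mike", "jackson", "golda", "bertels"]

    def open(self, collector: ListCollector, seed: int = 0) -> None:
        # Called once, before any tuples are requested.
        self._collector = collector
        self._rng = random.Random(seed)

    def next_tuple(self) -> None:
        # Called repeatedly by the runtime; each call emits one tuple.
        self._collector.emit([self._rng.choice(self.WORDS)])


collector = ListCollector()
spout = WordSpout()
spout.open(collector)
for _ in range(3):
    spout.next_tuple()
```

The loop at the end plays the role of the runtime, which keeps asking the spout for the next tuple.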

Page 11

Implementation of Bolt

• The bolt implements the IRichBolt interface.
• The prepare method saves the OutputCollector in an instance variable.
• The execute method receives a tuple and appends exclamation marks to its word.
• The cleanup method releases resources when the bolt shuts down, preventing resource leaks.
• The declareOutputFields method declares that the bolt emits tuples with a single field named 'word'.
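
The four methods above can be mirrored in a short plain-Python sketch. It is a model of the bolt's life cycle, not Storm's Java interface: `ListCollector` is a hypothetical stand-in for the output collector, and the snake_case method names are adaptations for Python.

```python
from typing import List, Optional


class ListCollector:
    """Hypothetical collector that just records emitted tuples."""

    def __init__(self) -> None:
        self.emitted: List[List[str]] = []

    def emit(self, values: List[str]) -> None:
        self.emitted.append(values)


class ExclamationBolt:
    def prepare(self, collector: ListCollector) -> None:
        # Save the collector so execute() can emit through it later.
        self._collector: Optional[ListCollector] = collector

    def execute(self, tup: List[str]) -> None:
        # Take the first field of the input tuple and append "!!!".
        assert self._collector is not None, "prepare() must be called first"
        self._collector.emit([tup[0] + "!!!"])

    def cleanup(self) -> None:
        # Drop held references when the bolt shuts down.
        self._collector = None

    def declare_output_fields(self) -> List[str]:
        # The bolt emits tuples with a single field named "word".
        return ["word"]


collector = ListCollector()
bolt = ExclamationBolt()
bolt.prepare(collector)
bolt.execute(["mike"])
bolt.cleanup()
```

After this runs, `collector.emitted` contains the single tuple `["mike!!!"]`.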

Page 12

Conclusion

• Storm is a promising tool.
• It has a clean and elegant design.
• It has excellent documentation for a young open source tool.
• It is a great alternative to Hadoop for real-time computation.

Page 13

Thank You

Page 14

Sources

• Storm: The Real-Time Layer, GlueCon 2012. Dan Lynn ([email protected])
• http://storm.incubator.apache.org/documentation/Tutorial.html. Nathan Marz
• Streams processing with Storm. Mariusz Gil

Page 15

Questions

• What are the major issues with processing real-time streams, and how can they be solved? Specify algorithms or techniques.
• Are there any query languages for real-time stream processing?

Page 16

Answers

• One strategy for dealing with streams is to maintain summaries of the streams that are sufficient to answer the expected queries about the data, and to use sampling and filtering to extract a subset of the data.
• A second approach is to maintain a sliding window of the most recently arrived data.

• SQL stream.
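
The sliding-window approach can be sketched as follows: keep only the most recent values and answer queries (here, a running mean) from that window alone. The function name `sliding_mean` and the choice of a mean as the query are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def sliding_mean(stream: Iterable[float], size: int) -> Iterator[float]:
    """Yield the mean of the last `size` values after each new arrival."""
    window: Deque[float] = deque(maxlen=size)
    total = 0.0
    for value in stream:
        if len(window) == size:
            total -= window[0]  # this value is about to be evicted
        window.append(value)    # deque drops the oldest item when full
        total += value
        yield total / len(window)


means = list(sliding_mean([1, 2, 3, 4, 5], size=3))
```

Memory use stays bounded by the window size no matter how long the stream runs, which is the point of the technique.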
