Storm distributed processing

Storm distributed processing BarCamp Saigon 2012 Duc Quoc


Page 1: Storm distributed processing

Storm distributed processing

BarCamp Saigon 2012

Duc Quoc

Page 2: Storm distributed processing

Hello! I’m Duc

• Senior Software Engineer

– KMS Technology

• Open source advocate

– www.ducquoc.vn – [email protected] – @ducquoc_vn

Page 3: Storm distributed processing

Agenda

• Why Storm was created

• Basic concepts

• Some use cases

• Q&A

Page 4: Storm distributed processing

Agenda

• Why Storm was created

• Basic concepts

• Some use cases

• Q&A

Page 5: Storm distributed processing

Storm?

• Twitter’s stream processing framework

Page 6: Storm distributed processing

Storm

• Originally from BackType for analyzing tweets

– (More than 2000 watchers on GitHub)

• “the realtime Hadoop”

– continuous computation system (open source)

• distributed, reliable, fault-tolerant

– suitable for big data processing

Page 7: Storm distributed processing

Big Data challenges

• Scalability

– vertical, horizontal

• (high) Availability

• Stability (fault-tolerance)

caching, replication, partitioning/sharding, load-balancing, …

Page 9: Storm distributed processing

Apache Hadoop

• MapReduce, HDFS, HBase

– later on: Hive, Pig, Mahout, ZooKeeper, …

Diagram: one JobTracker, three ZooKeeper nodes, five TaskTrackers

Page 10: Storm distributed processing

Hadoop limits

• Batch processing with jobs -> not realtime

• Stateful nodes, SPOF – JobTracker/NameNode

• Cumbersome API

Diagram: timeline (t) ending at “now”, with regions labelled “Fully processed”, “Latest full period” and “Unprocessed Data”, and the caption “Hadoop job takes this long for this data”

Page 11: Storm distributed processing

Agenda

• Why Storm was created

• Basic concepts

• Some use cases

• Q&A

Page 12: Storm distributed processing

Cluster

• Nimbus: daemon on the master node

• Supervisor: daemon on each worker node

• Coordination via ZooKeeper

Diagram: Nimbus and UI, three ZooKeeper nodes, five Supervisors

Page 13: Storm distributed processing

Tuple

• Ordered list of elements

– (“user-1234”, “email:[email protected]”)
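To make "ordered list of elements" concrete, here is a minimal plain-Java sketch of a tuple whose values can be read by position or by field name. It does not use the Storm library; the class name `SimpleTuple`, the field names `user` and `contact`, and the second sample value are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a tuple (not a Storm class):
// an ordered list of values with one field name per position.
final class SimpleTuple {
    private final List<String> fields;
    private final List<Object> values;

    SimpleTuple(List<String> fields, List<Object> values) {
        if (fields.size() != values.size()) {
            throw new IllegalArgumentException("one field name per value is required");
        }
        this.fields = fields;
        this.values = values;
    }

    // Positional access: the order of the elements is part of the tuple.
    Object getValue(int index) {
        return values.get(index);
    }

    // Named access: find the position of the field, then read the value there.
    Object getValueByField(String field) {
        int index = fields.indexOf(field);
        if (index < 0) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return values.get(index);
    }

    int size() {
        return values.size();
    }

    public static void main(String[] args) {
        SimpleTuple tuple = new SimpleTuple(
                Arrays.asList("user", "contact"),
                Arrays.<Object>asList("user-1234", "email:placeholder")); // placeholder value
        System.out.println(tuple.getValue(0));
        System.out.println(tuple.getValueByField("contact"));
    }
}
```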

Page 14: Storm distributed processing

Stream

• Unbounded sequence of tuples

Page 15: Storm distributed processing

Spout

• Source of a stream, emitting tuples

• Talks with queues, logs, API calls, event data

Page 16: Storm distributed processing

Bolt

• Processes tuples, may emit new streams

• Applies functions and transforms, accesses DBs & APIs

– filter, aggregate, join, …

Page 17: Storm distributed processing

Topology

• A directed graph of Spouts and Bolts

Page 18: Storm distributed processing

Task

• Thread which executes a Spout or Bolt

• Deploy a topology:

$ storm jar myCode.jar com.example.MyTopology arg1 arg2

• Kill a topology:

$ storm kill topologyName

Page 19: Storm distributed processing

Sample code

Source code of this sample: https://ducquoc.googlecode.com/svn/trunk/storm/

Create stream called “word”

Run 10 tasks

Create stream called “first-…”

Run 3 tasks

Subscribes to stream “word”, using shuffle grouping
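The annotations above describe wiring a topology: a source named “word” with 10 tasks, and a second component with 3 tasks that subscribes to “word”. The following plain-Java sketch models only that wiring as a small directed graph. It is not the Storm API and not the code from the linked sample; the class `TopologySketch`, its methods, and the bolt ID `question` (the slide's own ID is truncated) are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of topology wiring (not a Storm class).
final class TopologySketch {
    // One node of the graph: its task count and the component IDs it subscribes to.
    static final class Component {
        final int tasks;
        final List<String> sources = new ArrayList<>();

        Component(int tasks) {
            this.tasks = tasks;
        }
    }

    private final Map<String, Component> components = new LinkedHashMap<>();

    // Registers a spout: a source with no upstream subscriptions.
    void setSpout(String id, int tasks) {
        components.put(id, new Component(tasks));
    }

    // Registers a bolt that subscribes to an already declared component.
    void setBolt(String id, int tasks, String sourceId) {
        if (!components.containsKey(sourceId)) {
            throw new IllegalArgumentException("unknown source: " + sourceId);
        }
        Component bolt = new Component(tasks);
        bolt.sources.add(sourceId);
        components.put(id, bolt);
    }

    int tasksOf(String id) {
        return components.get(id).tasks;
    }

    List<String> sourcesOf(String id) {
        return components.get(id).sources;
    }

    public static void main(String[] args) {
        TopologySketch topology = new TopologySketch();
        topology.setSpout("word", 10);            // source "word", 10 tasks
        topology.setBolt("question", 3, "word");  // hypothetical ID, 3 tasks, reads "word"
        System.out.println(topology.sourcesOf("question"));
    }
}
```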

Page 20: Storm distributed processing

Sample code (2/3)

• RandomWordSpout

emits a random string from the array words, every 100 milliseconds
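The behaviour described for RandomWordSpout can be sketched in plain Java without the Storm library. This is an illustration, not the code from the linked sample: the class `RandomWordSketch`, the contents of `WORDS`, and the choice to stop after five words are assumptions.

```java
import java.util.Random;

// Hypothetical stand-in for the spout's core logic (not a Storm class).
final class RandomWordSketch {
    // Assumed word list; the array used in the original sample is not shown in the slides.
    static final String[] WORDS = {"storm", "spout", "bolt", "tuple", "stream"};

    // Returns one word chosen uniformly at random from WORDS.
    static String nextWord(Random random) {
        return WORDS[random.nextInt(WORDS.length)];
    }

    public static void main(String[] args) throws InterruptedException {
        Random random = new Random();
        // Print five words, pausing 100 ms between them.
        // A real spout would keep emitting for as long as the topology runs.
        for (int i = 0; i < 5; i++) {
            System.out.println(nextWord(random));
            Thread.sleep(100);
        }
    }
}
```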

Page 21: Storm distributed processing

Sample code (3/3)

• InterrogativeBolt

appends a question mark to the first field of the Tuple, then emits it
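The transformation described for InterrogativeBolt is small enough to show in plain Java. The sketch below models a tuple as a `List<Object>` and leaves out everything Storm-specific (collectors, acking, output field declaration); the class and method names are hypothetical, and whether the original bolt forwards the remaining fields is not stated in the slides, so copying them is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the bolt's core logic (not a Storm class).
final class InterrogativeSketch {
    // Returns a new tuple whose first field has "?" appended.
    // Other fields are copied unchanged (an assumption); the input is not modified.
    // An empty input tuple would throw IndexOutOfBoundsException.
    static List<Object> process(List<Object> tuple) {
        List<Object> out = new ArrayList<>(tuple);
        out.set(0, tuple.get(0) + "?");
        return out;
    }

    public static void main(String[] args) {
        List<Object> input = Arrays.<Object>asList("storm");
        System.out.println(process(input)); // prints [storm?]
    }
}
```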

Page 22: Storm distributed processing

Stream grouping

• Decides which task of the bolt a tuple is sent to

• ShuffleGrouping: randomly

• FieldsGrouping: groups tuples by named fields

• Global grouping, All grouping, None grouping, Direct grouping
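A rough way to picture the groupings is as functions from a tuple to one or more task indexes. The sketch below is a plain-Java illustration under that assumption, not Storm's implementation: the class `GroupingSketch` and its methods are hypothetical, and hashing with `hashCode()` is just one simple way to make equal field values land on the same task.

```java
import java.util.Random;

// Hypothetical model of three groupings (not Storm classes).
final class GroupingSketch {
    // Shuffle grouping: any task, chosen at random, so load spreads evenly on average.
    static int shuffleTask(Random random, int numTasks) {
        return random.nextInt(numTasks);
    }

    // Fields grouping: equal field values always map to the same task.
    // floorMod keeps the index non-negative even when hashCode() is negative.
    static int fieldsTask(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // All grouping: every task receives a copy of the tuple.
    static int[] allTasks(int numTasks) {
        int[] tasks = new int[numTasks];
        for (int i = 0; i < numTasks; i++) {
            tasks[i] = i;
        }
        return tasks;
    }

    public static void main(String[] args) {
        Random random = new Random();
        System.out.println("shuffle -> task " + shuffleTask(random, 3));
        System.out.println("fields  -> task " + fieldsTask("user-1234", 3));
        System.out.println("all     -> " + allTasks(3).length + " tasks");
    }
}
```

Fields grouping is the one to reach for when a bolt keeps per-key state (for example a count per word), because all tuples for a key then arrive at the same task.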

Page 23: Storm distributed processing

Local/distributed mode

Page 24: Storm distributed processing

More abstractions

• Distributed RPC server

• Transactional/Batch

• Trident

• https://github.com/nathanmarz/storm/wiki

– http://groups.google.com/group/storm-user

Page 25: Storm distributed processing

Agenda

• Why Storm was created

• Basic concepts

• Some use cases

• Q&A

Page 26: Storm distributed processing

Popular use cases

• Continuous/realtime query with low latency

– analyzing, monitoring, statistics, classifying, …

• Back-end processing for streaming data

– automated scoring, log processing/auditing, …

• Distributed, high-volume data processing

– ETL, realtime integration/synchronization, …

Page 27: Storm distributed processing

Storm integration

• Data to Storm

– storm-jms, storm-kafka, storm-redis-pubsub, storm-scribe, storm-contrib-sqs, …

• Storm to databases

– storm-cassandra, storm-hbase, storm-contrib-mongo, storm-state, storm-rdbms, …

• Polyglotism (language agnostic)

– Clojure, Java, Python, Ruby, PHP, Perl, JRuby, …

Page 28: Storm distributed processing

Storm dependencies

• Java 5+, Clojure

• ZeroMQ 2.1.7-, JZMQ, Python 2.6+

• Thrift, ZooKeeper, Kryo, Jetty, …

– slf4j, joda, snakeyaml, guava, …

Page 29: Storm distributed processing

Storm UI

Page 30: Storm distributed processing

In production

• https://github.com/nathanmarz/storm/wiki/Powered-By

Page 31: Storm distributed processing

Agenda

• Why Storm was created

• Basic concepts

• Some use cases

• Q&A

Page 32: Storm distributed processing

Q&A

Thank you!

Page 33: Storm distributed processing

Bonus

• I want to know how many queries I get

– Per second, minute, day, week

• Results should be available

– within <2 seconds 99.8+% of the time

– within 50 seconds almost always

• History should last >2 years

• Should work for 0.01 q/s up to 50,000 q/s

• Failure tolerant, yadda, yadda
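The counting part of this problem comes down to bucketing each query by time window. The sketch below shows only that bucketing idea in plain Java, for seconds and minutes; it is a single-process, in-memory illustration and does not address the latency, retention, throughput, or fault-tolerance requirements listed above. The class `QueryCounterSketch` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory counter (not a Storm class): one bucket per second and per minute.
final class QueryCounterSketch {
    private final Map<Long, Long> perSecond = new HashMap<>();
    private final Map<Long, Long> perMinute = new HashMap<>();

    // Records one query observed at the given time, in milliseconds since the epoch.
    void record(long epochMillis) {
        long second = epochMillis / 1000;
        perSecond.merge(second, 1L, Long::sum);
        perMinute.merge(second / 60, 1L, Long::sum);
    }

    // Number of queries recorded in the given second (seconds since the epoch).
    long countInSecond(long epochSecond) {
        return perSecond.getOrDefault(epochSecond, 0L);
    }

    // Number of queries recorded in the given minute (minutes since the epoch).
    long countInMinute(long epochMinute) {
        return perMinute.getOrDefault(epochMinute, 0L);
    }

    public static void main(String[] args) {
        QueryCounterSketch counter = new QueryCounterSketch();
        long now = System.currentTimeMillis();
        counter.record(now);
        counter.record(now);
        System.out.println(counter.countInSecond(now / 1000)); // prints 2
    }
}
```

Day and week buckets would follow the same pattern with larger divisors.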

Page 34: Storm distributed processing

Real-time and Long-time together

Diagram: timeline (t) up to “now”; “Hadoop works great back here” on the older part, “Storm works here” near “now”, combined into a blended view