Storm distributed processing

BarCamp Saigon 2012
Duc Quoc

Hello! I’m Duc

• Senior Software Engineer
  – KMS Technology

• Open source advocate
  – www.ducquoc.vn
  – [email protected]
  – @ducquoc_vn

Agenda

• Why Storm was created

• Basic concepts

• Some use cases

• Q&A

Storm?

• Twitter’s stream processing framework

Storm

• Originally from BackType, for analyzing tweets
  – (more than 2000 watchers on GitHub)

• “the realtime Hadoop”
  – continuous computation system (open source)

• distributed, reliable, fault-tolerant
  – suitable for big data processing

Big Data challenges

• Scalability
  – vertical, horizontal

• (High) availability

• Stability (fault-tolerance)

caching, replication, partitioning/sharding, load-balancing, …

Google!

• Published papers on MapReduce, Google File System (GFS), BigTable

http://research.google.com/archive/mapreduce.html

http://research.google.com/archive/gfs.html

http://research.google.com/archive/bigtable.html

Apache Hadoop

• MapReduce, HDFS, HBase
  – later on: Hive, Pig, Mahout, ZooKeeper, …

[Diagram: one JobTracker, three ZooKeeper nodes, five TaskTrackers]

Hadoop limits

• Batch processing with jobs -> not realtime
• Stateful nodes, SPOF (single point of failure)
  – JobTracker/NameNode
• Cumbersome API

[Figure: timeline t ending at “now”, divided into “Fully processed”, “Latest full period”, and “Unprocessed data”; caption: “Hadoop job takes this long for this data”]

Cluster

• Nimbus: daemon on the master node
• Supervisor: daemon on each worker node
• Coordination via ZooKeeper

[Diagram: one Nimbus, three ZooKeeper nodes, five Supervisors, and the UI]

Tuple

• Ordered list of elements
  – (“user-1234”, “email:[email protected]”)

Stream

• Unbounded sequence of tuples

Spout

• Source of a stream, emitting tuples
• Talks with queues, logs, API calls, event data

Bolt

• Processes tuples, may emit new streams

• Applies functions and transforms, accesses DBs & APIs
  – filter, aggregate, join, …

Topology

• A directed graph of spouts and bolts

Task

• Thread which executes a Spout or Bolt

• Deploy a topology:
    $ storm jar myCode.jar com.example.MyTopology arg1 arg2

• Kill a topology:
    $ storm kill topologyName

Sample code

Source code of this sample: https://ducquoc.googlecode.com/svn/trunk/storm/

• Create stream called “word”
  – Run 10 tasks

• Create stream called “first-…”
  – Run 3 tasks
  – Subscribes to stream “word”, using shuffle grouping

Sample code (2/3)

• RandomWordSpout

emits a random string from the array words every 100 milliseconds
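
The emit loop of such a spout can be modelled in plain Java, without the Storm API. This is only a sketch: the class name RandomWordSpoutSketch, the method nextTuple, and the contents of WORDS are assumptions made for illustration and are not taken from the sample source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Plain-Java model of a word-emitting spout; it does NOT use the Storm API.
class RandomWordSpoutSketch {
    // Hypothetical word list; the actual sample may use different words.
    static final String[] WORDS = {"storm", "hadoop", "bolt", "spout", "tuple"};

    private final Random random;

    RandomWordSpoutSketch(long seed) {
        this.random = new Random(seed);
    }

    // Picks one word at random and wraps it in a one-field tuple.
    List<Object> nextTuple() {
        String word = WORDS[random.nextInt(WORDS.length)];
        List<Object> tuple = new ArrayList<>();
        tuple.add(word);
        return tuple;
    }

    public static void main(String[] args) throws InterruptedException {
        RandomWordSpoutSketch spout = new RandomWordSpoutSketch(42L);
        // Emit five tuples, pausing 100 milliseconds between emissions.
        for (int i = 0; i < 5; i++) {
            System.out.println(spout.nextTuple());
            Thread.sleep(100);
        }
    }
}
```

A real spout would hand each tuple to the framework's collector instead of printing it, and would keep emitting for as long as the topology runs.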

Sample code (3/3)

• InterrogativeBolt

appends a question mark to the first field of the tuple, then emits it
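
The transformation itself is small enough to show in plain Java. The sketch below does not use the Storm API; the class name InterrogativeBoltSketch and the method execute are illustrative, and it assumes the emitted tuple carries only the modified first field.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the bolt's transformation; it does NOT use the Storm API.
class InterrogativeBoltSketch {
    // Returns a new one-field tuple: the first field of the input plus "?".
    static List<Object> execute(List<Object> tuple) {
        List<Object> out = new ArrayList<>();
        out.add(tuple.get(0) + "?");
        return out;
    }

    public static void main(String[] args) {
        List<Object> input = new ArrayList<>();
        input.add("storm");
        System.out.println(execute(input)); // prints [storm?]
    }
}
```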

Stream grouping

• Decides which task of the bolt a tuple is sent to

• ShuffleGrouping: randomly
• FieldsGrouping: groups tuples by named fields (equal field values go to the same task)
• Global grouping, All grouping, None grouping, Direct grouping
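
The difference between the first two groupings can be sketched in plain Java. This is a model, not the Storm API: the class GroupingSketch and its methods are hypothetical, and hashing the field value modulo the task count is an assumption about how such a partition could be computed.

```java
import java.util.Random;

// Plain-Java model of two stream groupings; it does NOT use the Storm API.
class GroupingSketch {
    // Shuffle grouping: any task may receive the tuple.
    static int shuffleTask(Random random, int numTasks) {
        return random.nextInt(numTasks);
    }

    // Fields grouping: equal field values always map to the same task.
    static int fieldsTask(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 3;
        Random random = new Random();
        for (String word : new String[] {"storm", "hadoop", "storm"}) {
            System.out.println(word
                + " shuffle->task " + shuffleTask(random, numTasks)
                + " fields->task " + fieldsTask(word, numTasks));
        }
    }
}
```

Both occurrences of “storm” get the same fields-grouping task, which is what makes per-key aggregation (for example, counting words) possible inside a single task.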

Local/distributed mode

More abstractions

• Distributed RPC server

• Transactional/Batch

• Trident

• https://github.com/nathanmarz/storm/wiki
  – http://groups.google.com/group/storm-user

Popular use cases

• Continuous/realtime query with low latency
  – analyzing, monitoring, statistics, classifying, …

• Back-end processing for streaming data
  – automated scoring, log processing/auditing, …

• Distributed, high-volume data processing
  – ETL, realtime integration/synchronization, …

Storm integration

• Data to Storm
  – storm-jms, storm-kafka, storm-redis-pubsub, storm-scribe, storm-contrib-sqs, …

• Storm to databases
  – storm-cassandra, storm-hbase, storm-contrib-mongo, storm-state, storm-rdbms, …

• Polyglotism (language agnostic)
  – Clojure, Java, Python, Ruby, PHP, Perl, JRuby, …

Storm dependencies

• Java 5+, Clojure

• ZeroMQ 2.1.7-, JZMQ, Python 2.6+

• Thrift, ZooKeeper, Kryo, Jetty, …
  – slf4j, joda, snakeyaml, guava, …

Storm UI

In production

• https://github.com/nathanmarz/storm/wiki/Powered-By

Q&A

Thank you!

Bonus

• I wanna know how many queries I get
  – per second, minute, day, week

• Results should be available
  – within <2 seconds 99.8+% of the time
  – within 50 seconds almost always

• History should last >2 years
• Should work for 0.01 q/s up to 50,000 q/s
• Failure tolerant, yadda, yadda
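
The counting part of these requirements can be sketched with per-second buckets that are summed into larger periods. The sketch below is plain Java with hypothetical names (QueryCounterSketch, record, countForSecond, countForMinute); it keeps everything in memory, so it does not address the history, throughput, or failure-tolerance requirements.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of per-period query counting; names are illustrative only.
class QueryCounterSketch {
    // Count of queries keyed by the epoch second in which they arrived.
    private final Map<Long, Long> perSecond = new HashMap<>();

    // Records one query observed at the given time (milliseconds since epoch).
    void record(long timestampMillis) {
        perSecond.merge(timestampMillis / 1000, 1L, Long::sum);
    }

    // Number of queries in one specific second.
    long countForSecond(long epochSecond) {
        return perSecond.getOrDefault(epochSecond, 0L);
    }

    // Number of queries in the minute containing the given second.
    long countForMinute(long epochSecond) {
        long start = (epochSecond / 60) * 60;
        long total = 0;
        for (long s = start; s < start + 60; s++) {
            total += countForSecond(s);
        }
        return total;
    }

    public static void main(String[] args) {
        QueryCounterSketch counter = new QueryCounterSketch();
        counter.record(1_000L);   // second 1
        counter.record(1_500L);   // second 1
        counter.record(59_000L);  // second 59
        counter.record(61_000L);  // second 61, i.e. the next minute
        System.out.println(counter.countForSecond(1));  // prints 2
        System.out.println(counter.countForMinute(0));  // prints 3
        System.out.println(counter.countForMinute(60)); // prints 1
    }
}
```

Day and week totals follow the same pattern with larger bucket sizes.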

[Figure: timeline t up to “now”; labels: “Hadoop works great back here”, “Storm works here”]

Real-time and Long-time together

Blended view
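
One way to read the blended view is as a query that prefers batch results where they exist and falls back to realtime counts for periods the batch job has not reached yet. The following plain-Java sketch rests on that assumption; the class BlendedViewSketch, the method blendedCount, and the hourly bucketing are all hypothetical and not taken from the slides.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of a blended read; all names here are illustrative only.
class BlendedViewSketch {
    // Sums counts over [fromHour, toHour], preferring the batch result for an
    // hour when one exists and falling back to the realtime count otherwise.
    static long blendedCount(Map<Long, Long> batch, Map<Long, Long> realtime,
                             long fromHour, long toHour) {
        long total = 0;
        for (long hour = fromHour; hour <= toHour; hour++) {
            Long batchCount = batch.get(hour);
            total += (batchCount != null)
                ? batchCount
                : realtime.getOrDefault(hour, 0L);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Long, Long> batch = new HashMap<>();
        batch.put(0L, 100L);
        batch.put(1L, 120L);
        Map<Long, Long> realtime = new HashMap<>();
        realtime.put(1L, 118L); // superseded by the batch result for hour 1
        realtime.put(2L, 40L);  // hour 2 has not been batch-processed yet
        System.out.println(blendedCount(batch, realtime, 0, 2)); // prints 260
    }
}
```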
