Real-Time Big Data at In-Memory Speed, Using Storm

Real Time Big Data With Storm, Cassandra, and In-Memory Computing
Nati Shalom @natishalom, DeWayne Filppi @dfilppi

description

Storm, a popular framework from Twitter, is used for real-time event processing. The challenge is how to manage the state of your real-time data processing at all times. In addition, you need Storm to integrate with your batch processing system (such as Hadoop) in a consistent manner. This session will demonstrate how to integrate Storm with an in-memory database/grid, and explore various strategies for seamlessly integrating the data grid with Hadoop and Cassandra. With smooth integration and consistent management, you will be able to manage all the tiers of your Big Data stack in a consistent and effective way. See more at: http://nosql2013.dataversity.net/sessionPop.cfm?confid=74&proposalid=5526

Transcript of Real-Time Big Data at In-Memory Speed, Using Storm

Page 1: Real-Time Big Data at In-Memory Speed, Using Storm

Real Time Big Data With Storm, Cassandra, and In-Memory Computing

Nati Shalom @natishalom

DeWayne Filppi @dfilppi

Page 2: Introduction to Real Time Analytics

Homeland Security

Real Time Search

Social

eCommerce

User Tracking & Engagement

Financial Services

® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

Page 3: The Two Vs of Big Data

Velocity

Volume

® Copyright 2013 Gigaspaces Ltd. All Rights Reserved

Page 4: The Flavors of Big Data Analytics

Counting

Correlating

Research

Page 5: It’s All about Timing

• Event driven / stream processing
  • High resolution: every tweet gets counted

• Ad-hoc querying
  • Medium resolution (aggregations)

• Long running batch jobs (ETL, map/reduce)
  • Low resolution (trends & patterns)

This is what we’re here to discuss

Page 6: Facebook & Twitter Real Time Analytics

Page 7: FACEBOOK REAL-TIME ANALYTICS SYSTEM (LOGGING CENTRIC APPROACH)

Page 8: The actual analytics..

Like button analytics

Comments box analytics

Page 9: Facebook architecture..

[Diagram. Components: Facebook logs (three sources), Scribe, HDFS, PTail, Puma, HBase. Labels: Real Time, Long Term, Batch, 1.5 Sec.]

10,000 write/sec per server

Page 10: TWITTER REAL-TIME ANALYTICS SYSTEM (EVENT DRIVEN APPROACH)

Page 11: URL Mentions – Here’s One Use Case

Page 12: Twitter Real Time Analytics based on Storm

Page 13: Comparing the two approaches..

Facebook
• Relies on Hadoop for real time and batch
• RT = 10’s of seconds
• Suited to simple processing
• Low parallelization

Twitter
• Uses Hadoop for batch and Storm for real time
• RT = msec, sec
• Suited to complex processing
• Extremely parallel

This is what we’re here to discuss

Page 14: Introduction to Storm

Page 15: Storm Background

• Popular open source, real time, in-memory, streaming computation platform.
• Includes distributed runtime and intuitive API for defining distributed processing flows.
• Scalable and fault tolerant.
• Developed at BackType, and open sourced by Twitter.

Page 16: Storm Cluster

Page 17: Storm Concepts

• Streams: unbounded sequence of tuples
• Spouts: source of streams (queues)
• Bolts: functions, filters, joins, aggregations
• Topologies

[Diagram labels: Spouts, Bolt, Topologies]
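The four concepts above can be mimicked in a few lines of plain Python. This is only a sketch of the ideas, not Storm's API: the function names (sentence_spout, filter_bolt, function_bolt, build_topology) and the sample sentences are hypothetical, and everything runs in one process rather than on a cluster.

```python
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical names throughout; this mimics Storm's concepts, not its API.
StreamTuple = Tuple[str, ...]


def sentence_spout(lines: Iterable[str]) -> Iterator[StreamTuple]:
    """Spout: turns an external source (here a list) into a stream of tuples."""
    for line in lines:
        yield (line,)


def filter_bolt(stream: Iterable[StreamTuple],
                keep: Callable[[StreamTuple], bool]) -> Iterator[StreamTuple]:
    """Bolt acting as a filter: passes through only the tuples `keep` accepts."""
    return (t for t in stream if keep(t))


def function_bolt(stream: Iterable[StreamTuple],
                  fn: Callable[[StreamTuple], StreamTuple]) -> Iterator[StreamTuple]:
    """Bolt acting as a function: emits one transformed tuple per input tuple."""
    return (fn(t) for t in stream)


def build_topology(lines: Iterable[str]) -> Iterator[StreamTuple]:
    """Topology: a spout wired to a chain of bolts."""
    stream = sentence_spout(lines)
    stream = filter_bolt(stream, lambda t: "storm" in t[0].lower())
    stream = function_bolt(stream, lambda t: (t[0].upper(),))
    return stream


result = list(build_topology(["Storm is fast", "hello world", "I like storm"]))
```

Because each stage is a generator, tuples flow through one at a time, which is the closest single-process analogue to an unbounded stream.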

Page 18: Challenge – Word Count

Tweets

Count?

Word:Count

• Hottest topics
• URL mentions
• etc.

Page 19: Streaming word count with Storm
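The slide's diagram did not survive extraction. A streaming word count is usually built as a splitting step followed by a counting step, with words routed so that the same word always reaches the same counting task (what Storm calls a fields grouping). The sketch below imitates that routing in plain Python; NUM_COUNT_TASKS, the function names, and the sample sentences are assumptions for illustration, not the code shown in the talk.

```python
from collections import Counter
import zlib

NUM_COUNT_TASKS = 3  # assumed parallelism of the counting step


def split_bolt(sentence):
    """Emit one lower-cased word per whitespace-separated token."""
    return sentence.lower().split()


def task_for(word):
    """Fields-grouping stand-in: the same word always routes to the same task."""
    return zlib.crc32(word.encode("utf-8")) % NUM_COUNT_TASKS


def run_word_count(sentences):
    # Each counting task keeps its own private partial state.
    tasks = [Counter() for _ in range(NUM_COUNT_TASKS)]
    for sentence in sentences:  # the spout's stream
        for word in split_bolt(sentence):
            tasks[task_for(word)][word] += 1
    return tasks


tasks = run_word_count(["the cow jumped over the moon", "the moon is big"])
totals = sum(tasks, Counter())  # merge the per-task partial counts
```

Routing by a stable hash means no two tasks ever hold a count for the same word, so the per-task counters can be merged, or queried individually, without reconciliation.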

Page 20: Computing Reach with Event Streams
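The slide's diagram is missing from the transcript. Reach is commonly defined as the number of unique people exposed to a URL on Twitter: take everyone who tweeted the URL, collect their followers, and count the distinct set. A minimal sketch under that definition follows; the TWEETERS and FOLLOWERS tables are made-up sample data standing in for database lookups.

```python
# Hypothetical sample data; in practice these lookups would hit a database.
TWEETERS = {"http://example.com/a": ["alice", "bob"]}
FOLLOWERS = {"alice": ["carol", "dave"], "bob": ["dave", "erin"]}


def reach(url):
    """Number of distinct followers of everyone who tweeted `url`."""
    exposed = set()
    for tweeter in TWEETERS.get(url, []):
        # A follower shared by several tweeters is counted only once.
        exposed.update(FOLLOWERS.get(tweeter, []))
    return len(exposed)


example_reach = reach("http://example.com/a")
```

In a streaming system the follower lookups for different tweeters can run in parallel, with the de-duplication and final count done as later steps.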

Page 21: But where is my Big Data?

Page 22: The Big Picture …

[Diagram. Inputs: Twitter feeds and web activity, delivered by data feeds (Kafka, Twitter, ..). Processing: Storm (spout, bolts). Outputs: analytics data, research data, counters, and reference data, stored in Cassandra, MongoDB, HBase, .. Label: End to End Latency.]

Page 23

• Storm performance and reliability
  • Assumes success is normal
  • Uses batching and pipelining for performance

• Storm plug-ins have a significant effect on performance and reliability
  • Spout must be able to replay tuples on demand in case of error.

• Storm uses topology semantics for ensuring consistency through event ordering
  • Can be tedious for handling counters
  • Doesn’t ensure the state of the counters

You are as strong as your weakest link
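Replay is what makes counters awkward: if a batch is replayed after a failure, a naive counter adds it twice. One well-known remedy, used in the style of Storm's Trident transactional state, is to store the id of the last batch applied next to each count and skip a batch whose id is already recorded. The sketch below assumes that a replayed batch carries the same id and the same contents as the original, and that batches are applied in order; TxCounterStore is a hypothetical name.

```python
class TxCounterStore:
    """Counter state that remembers which batch last updated each key."""

    def __init__(self):
        self._state = {}  # key -> (last_txid, count)

    def apply_batch(self, txid, increments):
        for key, delta in increments.items():
            last_txid, count = self._state.get(key, (None, 0))
            if last_txid == txid:
                continue  # replay of a batch already applied to this key
            self._state[key] = (txid, count + delta)

    def get(self, key):
        return self._state.get(key, (None, 0))[1]


store = TxCounterStore()
store.apply_batch(1, {"storm": 2, "kafka": 1})
store.apply_batch(1, {"storm": 2, "kafka": 1})  # replayed after a failure
store.apply_batch(2, {"storm": 1})
```

The check is per key, so a batch that failed halfway through can be replayed safely: keys already updated are skipped and the rest are applied.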

Page 24: Typical user experience…

Now, Kafka is *fast*. When running the Kafka Spout by itself, I easily reproduced Kafka's claim that you can consume "hundreds of thousands of messages per second".

When I first fired up the topology, things went well for the first minute, but then quickly crashed as the Kafka spout emitted too fast for the Cassandra Bolt to keep up. Even though Cassandra is fast as well, it is still orders of magnitude slower than Kafka

Source: A Big Data Trifecta: Storm, Kafka and Cassandra. Brian O'Neill's blog
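The failure described in the quote is a fast producer overrunning a slow consumer. One common remedy, not mentioned on the slide, is to cap how many unacknowledged tuples a spout may have in flight; Storm exposes a setting for this, topology.max.spout.pending. The following tick-based simulation is only an illustration of that idea, with hypothetical names and made-up rates.

```python
from collections import deque


def run(messages, max_pending, bolt_rate):
    """Simulate ticks: the spout emits until `max_pending` tuples are unacked,
    then the bolt acks at most `bolt_rate` tuples per tick.
    Returns the peak number of in-flight tuples and the processed tuples."""
    source = deque(messages)
    in_flight = deque()
    peak = 0
    processed = []
    while source or in_flight:
        # Spout side: stop emitting once the cap is reached.
        while source and len(in_flight) < max_pending:
            in_flight.append(source.popleft())
        peak = max(peak, len(in_flight))
        # Bolt side: a slow sink drains only a few tuples per tick.
        for _ in range(min(bolt_rate, len(in_flight))):
            processed.append(in_flight.popleft())
    return peak, processed


capped_peak, capped_out = run(range(1000), max_pending=50, bolt_rate=10)
uncapped_peak, _ = run(range(1000), max_pending=1000, bolt_rate=10)
```

With the cap, the backlog never exceeds 50 tuples no matter how fast the source is; without it, the whole input piles up in front of the slow bolt.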

Page 25: An Alternative Approach

What if we could put everything In Memory?

Page 26: Did you know?

• Facebook keeps 80% of its data in Memory (Stanford research)
• RAM is 100-1000x faster than Disk (Random seek)
  • Disk: 5-10 ms
  • RAM: ~0.001 ms

Page 27: In Memory Data Grid Review

• RAM is the new disk
• Data partitioned across a cluster
• Large “virtual” memory space
• Transactional
• Highly available
• Code with Data
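Two of the data grid properties listed on this slide, partitioning and "code with data", can be illustrated with a toy model. In this sketch each partition is a plain dict in one process, where a real grid would place partitions on separate machines; transactions and high availability are not modeled. The Grid class and the increment function are hypothetical names.

```python
import zlib


class Grid:
    """Toy in-memory grid: a fixed set of partitions, each a plain dict."""

    def __init__(self, partitions):
        self._partitions = [dict() for _ in range(partitions)]

    def _partition(self, key):
        # Route by a stable hash so a key always lands on the same partition.
        index = zlib.crc32(key.encode("utf-8")) % len(self._partitions)
        return self._partitions[index]

    def put(self, key, value):
        self._partition(key)[key] = value

    def get(self, key, default=None):
        return self._partition(key).get(key, default)

    def execute(self, key, fn):
        """'Code with data': run fn against the partition that owns key."""
        return fn(self._partition(key), key)


def increment(partition, key):
    partition[key] = partition.get(key, 0) + 1
    return partition[key]


grid = Grid(partitions=4)
grid.execute("storm", increment)
grid.execute("storm", increment)
```

Sending the increment to the owning partition avoids a read, a network hop, and a write back from the client, which is the point of colocating code with data.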

Page 28: Integrating with Storm

[Diagram. Input: web activity. Processing: Storm (spout, bolts). State: analytics data, research data, counters, and reference data.]

In Memory Data Grid (via Storm Trident State plug-in)

In Memory Data Stream (via Storm Spout plug-in)

Page 29: In Memory Streaming Word Count with Storm

Storm has a simple builder interface for creating stream processing topologies

Storm delegates persistence to external providers
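Delegating persistence usually means the stream processor talks to the store through a small batched interface, so any backend, including an in-memory grid, can be plugged in. The sketch below is loosely modeled on the batched get/put style of Trident's map states; InMemoryBackingMap, persist_counts, and the round_trips counter are hypothetical names for illustration, not the XAP plug-in's actual code.

```python
from collections import Counter


class InMemoryBackingMap:
    """Stand-in for an external store reached through batched reads/writes."""

    def __init__(self):
        self._data = {}
        self.round_trips = 0  # how many calls reached the store

    def multi_get(self, keys):
        self.round_trips += 1
        return [self._data.get(k) for k in keys]

    def multi_put(self, keys, values):
        self.round_trips += 1
        self._data.update(zip(keys, values))


def persist_counts(backing_map, batch_of_words):
    """Aggregate a batch locally, then read-modify-write in two round trips."""
    partial = Counter(batch_of_words)
    keys = list(partial)
    current = backing_map.multi_get(keys)
    updated = [(old or 0) + partial[k] for k, old in zip(keys, current)]
    backing_map.multi_put(keys, updated)


store = InMemoryBackingMap()
persist_counts(store, ["storm", "xap", "storm"])
persist_counts(store, ["storm"])
```

Each batch costs two calls to the store regardless of how many words it contains, which is why batching matters when the store is the slowest link.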

Page 30: Integrating with Hadoop, NoSQL DB..

[Diagram. Input: web activity, through the In Memory Data Stream. Processing: Storm (spout, bolts), connected through the Storm plug-in. State: analytics data, research data, counters, and reference data in the In Memory Data Grid. Backing stores: Hadoop, NoSQL, RDBMS,…]

Write Behind

LRU based Policy
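Read together, the two labels suggest that the grid acknowledges writes in memory and persists them to the backing store later (write behind), while an LRU policy decides what stays in memory. A minimal single-process sketch of that combination follows; the WriteBehindCache class is hypothetical, a dict stands in for the database, and flushing is triggered by hand rather than by a background thread.

```python
from collections import OrderedDict


class WriteBehindCache:
    def __init__(self, capacity, backing_store):
        self._capacity = capacity
        self._store = backing_store   # e.g. a dict standing in for a NoSQL DB
        self._cache = OrderedDict()   # least recently used entry first
        self._dirty = {}              # writes not yet persisted

    def put(self, key, value):
        self._cache[key] = value
        self._cache.move_to_end(key)
        self._dirty[key] = value      # persisted later, not on this call
        self._evict()

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)
            return self._cache[key]
        if key in self._dirty:
            value = self._dirty[key]  # evicted before it was flushed
        else:
            value = self._store[key]  # cache miss: read through
        self._cache[key] = value
        self._evict()
        return value

    def flush(self):
        """Push all pending writes to the backing store in one batch."""
        self._store.update(self._dirty)
        self._dirty.clear()

    def _evict(self):
        while len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # drop the least recently used


db = {}
cache = WriteBehindCache(capacity=2, backing_store=db)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)  # evicts "a" from memory; its write is still pending
```

Pending writes are tracked separately from the LRU cache, so evicting an entry never loses an update that has not reached the store yet.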

Page 31: Live Demo – Word Count At In Memory Speed

Page 32: Recent Benchmarks..

Gresham Computing plc achieved over 50,000 equity trade transactions per second of load and match into a database.

Page 34: References

Try the Cloudify recipe
• Download Cloudify: http://www.cloudifysource.org/
• Download the Recipe (apps/xapstream, services/xapstream): https://github.com/CloudifySource/cloudify-recipes

XAP – Cassandra Interface Details: http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency

Check out the source for the XAP Spout, a sample state implementation backed by XAP, and a Storm friendly streaming implementation on github: https://github.com/Gigaspaces/storm-integration

For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
• http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
• http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
• Part 3 coming soon.