What We Learned About Scaling with Apache Storm: Pushing the Performance Envelope

Manoj Chaudhary CTO & VP of Engineering August 2014

We’re the world’s most popular cloud-based log management service

§  More than 5,000 customers
§  Near real-time indexing of events

Distributed architecture, built on AWS

Initial production services in 2011
§  Loggly Generation 2 released in Sept 2013

What Loggly Does


§  The unique challenges of log management
§  Overview of the Loggly event pipeline
§  Use of open source technologies
§  Lessons we have learned
§  Why we removed Storm
§  Conclusions: the Storm 411

Agenda for this Presentation


Everyone starts with …
§  A bunch of log files (syslog, application specific)
§  On a bunch of machines

Management consists of doing the simple stuff:
§  Rotate files, compress and delete
§  The information is there, but specific events are awkward to find
§  Log retention policies evolve over time

How Log Management Starts


[Chart axes: Log Volume versus Self-Inflicted Pain]

“…hmmm, our logs are getting a bit bloated”

“…let’s spend time managing our log capacity”

“…how can I make this someone else’s problem!”

As Log Data Grows


Use existing logging infrastructure
§  Real-time syslog forwarding is built in
§  Application log file watching

Store logs in the cloud
§  Accessible when there is a system failure
§  Cost-effective data retention

Log messages in machine-parsable format
§  JSON encoding when logging structured information
§  Key-value pairs

Loggly Makes Log Management Much Easier
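
A minimal sketch of the two machine-parsable formats named above, JSON and key-value pairs. The helper names `format_event` and `format_key_value` and the sample fields are hypothetical, chosen only for illustration; they are not part of any Loggly API.

```python
import json

def format_event(message, **fields):
    """Serialize a log event as a single-line JSON object."""
    record = {"message": message}
    record.update(fields)
    # sort_keys keeps the output stable, which makes lines easy to diff
    return json.dumps(record, sort_keys=True)

def format_key_value(**fields):
    """Serialize fields as space-separated key=value pairs."""
    return " ".join(f"{key}={value}" for key, value in sorted(fields.items()))

# Hypothetical sample event
json_line = format_event("login failed", user="alice", attempts=3)
kv_line = format_key_value(user="alice", attempts=3)

print(json_line)
print(kv_line)
```

Either form lets a log service index individual fields instead of treating the whole line as free text.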


Gen1
•  2011-2013
•  AWS EC2 deployment
•  SOLR Cloud
•  ZeroMQ for message queue

Gen2
•  Launched September 2013
•  AWS deployment
•  Utilized ElasticSearch, Kafka, Storm

Incremental Improvements and Scale

Loggly’s Evolution


§  Big data
§  >750 billion events logged to date
§  Sustained bursts of 100,000+ events per second
§  Data space measured in petabytes
§  Need for high fault tolerance
§  Near real-time indexing requirements
§  Time-series index management

The Challenges of Log Management at Scale


Open sourced by Twitter in September 2011
§  Now an Apache Software Foundation project
§  Currently Incubator status

Framework is for stream processing
§  Distributed
§  Fault tolerant
§  Computation
§  Fail-fast components

About Apache Storm


Storm Logical View

§  Spouts emit the source stream
§  Bolts perform stream processing

[Diagram: Example Topology with one spout and four bolts]
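
The spout-and-bolt idea can be sketched with plain Python generators. This is not the Storm API: the function names and the sample log lines are assumptions made for illustration, and a real topology would run these components in parallel across a cluster.

```python
from collections import Counter

def log_spout(lines):
    """Spout: emits the source stream, one tuple per log line."""
    for line in lines:
        yield line

def parse_bolt(stream):
    """Bolt: splits each line into a (level, text) tuple."""
    for line in stream:
        level, _, text = line.partition(" ")
        yield level, text

def count_bolt(stream):
    """Bolt: counts how many tuples arrived per log level."""
    counts = Counter()
    for level, _text in stream:
        counts[level] += 1
    return counts

# Hypothetical input; the chain below is spout -> bolt -> bolt
lines = ["ERROR disk full", "INFO started", "ERROR timeout"]
counts = count_bolt(parse_bolt(log_spout(lines)))
print(counts)
```

Each stage only sees the stream handed to it, which is what lets a framework place stages on different machines.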


Storm Physical View

[Diagram: Nimbus, ZooKeeper nodes, and Supervisors on worker nodes; each worker process contains executors and tasks]

§  Nimbus: master daemon that distributes code, assigns tasks, and monitors failures
§  ZooKeeper: stores operational cluster state
§  Supervisor (one per worker node): daemon listening for work assigned to its node
§  Worker process: Java process executing a subset of a topology
§  Executor: Java thread spawned by a Worker; runs tasks of the same component
§  Task: component (spout / bolt) instance; performs the actual data processing


Log Ingestion and Processing Overview

[Diagram labels: Load Balancing, Storm Event Processing, Kafka Stage 2]


§  Storm provides Complex Event Processing
§  Where we run much of our secret sauce

§  Stage 1 contains the raw events
§  Stage 2 contains processed events
§  Snapshot the last day of Stage 2 events to S3

Event Pipeline in Summary


§  The spout-and-bolt model fit our network approach, in which logs either move from bolt to bolt sequentially or are consumed by several bolts in parallel
§  Guaranteed processing of the data stream
§  Allowed us to focus on writing the best possible code for different bolts
§  Dynamic deployment makes it easy to add or remove nodes to adjust for actual loads and requirements
§  Log data has peaks and valleys

What Attracted Us to Storm


Loggly Gen2 at Launch: Where Storm Fits In

[Diagram labels: Kafka Stage 1, Identify Customer, Summary Statistics, S3 Bucket, Kafka Stage 2]


What We Learned


Guaranteed delivery feature needed for log management resilience but…

Guaranteed Delivery Causes Big Performance Hit

[Diagram: Example Topology with one spout and four bolts, annotated with an ack on each stream]

2.5x hit to performance!!
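
To see where per-tuple acking adds work, here is a simplified model of tuple-tree tracking. Storm's acker keeps an XOR checksum of tuple ids per spout message and treats the message as fully processed when the checksum returns to zero. The `Acker` class, its method names, and the small fixed ids below are assumptions for illustration only (Storm uses random 64-bit ids and a different internal protocol).

```python
class Acker:
    """Simplified model: one XOR checksum per spout message."""

    def __init__(self):
        self.pending = {}  # root message id -> XOR of outstanding tuple ids
        self.updates = 0   # number of tracking operations performed

    def emit(self, root_id, tuple_id):
        # A new tuple in the tree XORs its id into the checksum.
        self.pending[root_id] = self.pending.get(root_id, 0) ^ tuple_id
        self.updates += 1

    def ack(self, root_id, tuple_id):
        # Acking XORs the same id again, which cancels it out.
        self.pending[root_id] ^= tuple_id
        self.updates += 1
        if self.pending[root_id] == 0:
            del self.pending[root_id]
            return True  # the whole tuple tree has been processed
        return False

acker = Acker()
ROOT, CHILD_A, CHILD_B = 0b001, 0b010, 0b100  # illustrative ids
for tuple_id in (ROOT, CHILD_A, CHILD_B):
    acker.emit("msg-1", tuple_id)
results = [acker.ack("msg-1", t) for t in (ROOT, CHILD_A, CHILD_B)]
print(results, acker.updates)
```

In this model three tuples cost six tracking updates, so the bookkeeping grows with every tuple emitted. That is the kind of overhead the slide above attributes to guaranteed delivery.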


Our Performance Testing

Preload Kafka broker
•  50 GB of raw log data from production cluster

Deploy Storm topology with Kafka spout
•  Kafka partitions with 8 spouts and 20 mapper bolts
•  4K provisioned IOPS backend AWS instance

Ack’ing per tuple turned off
•  TOPOLOGY_ACKERS set to 0
•  Kafka disks red hot

Ack’ing per tuple enabled
•  Kafka disks not saturated
•  Bolts not running at high capacity

[Bar chart: average events per second processed per cluster, without versus with guaranteed delivery; axis from 0 to 250,000]


§  Ack a set of logs instead of individual events
§  PROBLEM: not consistent with Storm’s semantics of a “message”

Potential Workaround: Batch Logs

It is not trivial to change the Kafka spout as well as each bolt to reinterpret a single message as a bunch of logs.
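
A minimal sketch of the batching idea, assuming a hypothetical helper `batch_events` and an arbitrary batch size of 4. It shows only the reduction in the number of acks, not the spout and bolt changes the slide describes as non-trivial.

```python
def batch_events(events, batch_size):
    """Group events into lists of at most batch_size items."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly partial, batch

# Hypothetical workload of ten events
events = [f"event-{i}" for i in range(10)]
batches = list(batch_events(events, batch_size=4))

acks_per_event = len(events)   # one ack per individual event
acks_per_batch = len(batches)  # one ack per batch
print(acks_per_event, acks_per_batch)
```

The trade-off is that a batch becomes the unit of failure: if one log in a batch fails, the whole batch would be replayed, and every bolt has to unpack the batch itself.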


[Diagram labels: Load Balancing, Loggly Custom Module, Kafka Stage 2]

Ultimate Solution: Build Custom Queue for Module-to-Module Communication


§  High-performance, reliable communication that implements our workflow

§  Supports sustained rates of 100K+ events per second

§  Relatively easy to port

Benefits of New Approach


Conclusions

Storm 0.8.2 has plenty of potential

But… Log management’s unique challenges drive the need for a custom framework


Log Management is Our Full-Time Job. It Shouldn’t Be Yours.

About Us: Loggly is the world’s most popular cloud-based log management solution, used by more than 5,000 happy customers to effortlessly spot problems in real-time, easily pinpoint root causes and resolve issues faster to ensure application success.

Unless You Want it to Be (Join us!) Check out our career page to see if there’s a great match for your skills! loggly.com/careers.

Try Loggly for Free! → http://bit.ly/ScaleApacheStorm

Visit us at loggly.com or follow @loggly on Twitter.

