What We Learned About Scaling with Apache Storm: Pushing the Performance Envelope

Log management as a service. Simplified Log Management.
Manoj Chaudhary, CTO & VP of Engineering. August 2014.


Log management isn’t easy to do at scale. We designed Loggly Gen2 using the latest social-media-scale technologies—including ElasticSearch, Kafka from LinkedIn, and Apache Storm—as the backbone of ingestion processing for our multi-tenant, geo-distributed, and real-time log management system. Since we launched Gen2, we’ve learned a lot more about these technologies. We regularly contribute back to the open source community, so we decided that it’s time to give an update on our experience with Storm and explain why we have dropped it from our platform, at least for now. Read full blog post here: http://bit.ly/ScaleApacheStorm

Transcript of What We Learned About Scaling with Apache Storm: Pushing the Performance Envelope

Page 1: What We Learned About Scaling with Apache Storm: Pushing the Performance Envelope

Apache Storm: What We Learned About Scaling with Apache Storm

Manoj Chaudhary, CTO & VP of Engineering
August 2014

Page 2: What Loggly Does

We’re the world’s most popular cloud-based log management service
• More than 5,000 customers
• Near real-time indexing of events

Distributed architecture, built on AWS
• Initial production services in 2011
• Loggly Generation 2 released in Sept 2013

Page 3: Agenda for this Presentation

• The unique challenges of log management
• Overview of the Loggly event pipeline
• Use of open source technologies
• Lessons we have learned
• Why we removed Storm
• Conclusions: the Storm 411

Page 4: How Log Management Starts

Everyone starts with …
• A bunch of log files (syslog, application specific)
• On a bunch of machines

Management consists of doing the simple stuff:
• Rotate files, compress and delete
• Information is there, but specific events are awkward to find
• Log retention policies evolve over time

Page 5: As Log Data Grows

[Chart: Log Volume vs. Self-Inflicted Pain]

• “…hmmm, our logs are getting a bit bloated”
• “…let’s spend time managing our log capacity”
• “…how can I make this someone else’s problem!”

Page 6: Loggly Makes Log Management Much Easier

Use existing logging infrastructure
• Real-time syslog forwarding is built in
• Application log file watching

Store logs in the cloud
• Accessible when there is a system failure
• Cost-effective data retention

Log messages in machine-parsable format
• JSON encoding when logging structured information
• Key-value pairs
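
To make the machine-parsable formats mentioned on this slide concrete, here is a minimal Python sketch that renders the same event as a JSON object and as key-value pairs. The helper names `format_event` and `format_key_value` and the field names are illustrative assumptions, not part of any Loggly library.

```python
import json


def format_event(message, **fields):
    """Serialize a log event as a single-line JSON object."""
    event = {"message": message}
    event.update(fields)
    # sort_keys keeps the output stable, which makes lines easy to compare.
    return json.dumps(event, sort_keys=True)


def format_key_value(**fields):
    """Render fields as space-separated key=value pairs."""
    return " ".join(f"{key}={value}" for key, value in sorted(fields.items()))


# Hypothetical event: the field names are made up for illustration.
print(format_event("user login", user_id=42, status="ok"))
print(format_key_value(user_id=42, status="ok"))
```

Either form lets a log pipeline extract fields without writing a custom parser for each application's free-text format.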

Page 7: Loggly’s Evolution

Gen1
• 2011-2013
• AWS EC2 deployment
• SOLR Cloud
• ZeroMQ for message queue

Gen2
• Launched September 2013
• AWS deployment
• Utilized ElasticSearch, Kafka, Storm

Incremental Improvements and Scale

Page 8: The Challenges of Log Management at Scale

• Big data
• >750 billion events logged to date
• Sustained bursts of 100,000+ events per second
• Data space measured in petabytes
• Need for high fault tolerance
• Near real-time indexing requirements
• Time-series index management

Page 9: About Apache Storm

Open sourced by Twitter in September 2011
• Now an Apache Software Foundation project
• Currently Incubator status

Framework is for stream processing
• Distributed
• Fault tolerant
• Computation
• Fail-fast components

Page 10: Storm Logical View

[Diagram: Example Topology, showing one spout and four bolts]

• Spouts emit source stream
• Bolts perform stream processing
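
As a rough illustration of the spout and bolt roles, the following plain-Python sketch models a tiny topology. It does not use Storm's API: the function names, the sample log lines, and the wiring are all hypothetical and only mirror the idea that a spout emits a stream while bolts process it, either one after another or side by side.

```python
def log_spout(lines):
    """Spout: emits the source stream, one tuple per raw log line."""
    for line in lines:
        yield line


def parse_bolt(stream):
    """Bolt: turns each raw line into a (level, text) tuple."""
    for line in stream:
        level, _, text = line.partition(" ")
        yield level, text


def count_bolt(tuples):
    """Bolt: counts events per log level."""
    counts = {}
    for level, _ in tuples:
        counts[level] = counts.get(level, 0) + 1
    return counts


def error_bolt(tuples):
    """Bolt: keeps only the text of ERROR events."""
    return [text for level, text in tuples if level == "ERROR"]


def run_topology(lines):
    # Sequential hop: spout -> parse_bolt.
    parsed = list(parse_bolt(log_spout(lines)))
    # Parallel consumption: two bolts read the same parsed stream.
    return count_bolt(parsed), error_bolt(parsed)


counts, errors = run_topology(["INFO started", "ERROR disk full", "INFO done"])
print(counts, errors)
```

In a real Storm cluster each of these functions would be a component running as many parallel tasks across machines; the sketch only shows the data flow.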

Page 11: Storm Physical View

[Diagram: Nimbus coordinating through ZooKeeper with Supervisors on worker nodes, each running worker processes]

• Nimbus: master daemon. Distributes code, assigns tasks, monitors failures.
• ZooKeeper: stores operational cluster state.
• Supervisor (worker node): daemon listening for work assigned to its node.
• Worker (worker process): Java process executing a subset of the topology.
• Executor: Java thread spawned by a Worker; runs tasks of the same component.
• Task: component (spout / bolt) instance; performs the actual data processing.

Page 12: Log Ingestion and Processing Overview

[Diagram labels: Load Balancing, Storm Event Processing, Kafka Stage 2]

Page 13: Event Pipeline in Summary

• Storm provides Complex Event Processing
• Where we run much of our secret-sauce

• Stage 1 contains the raw Events
• Stage 2 contains processed Events
• Snapshot the last day of Stage 2 events to S3

Page 14: What Attracted Us to Storm

• The spout-and-bolt model fit our network approach, where logs could move from bolt to bolt sequentially or be consumed by several bolts in parallel
• Guaranteed processing of the data stream
• Allowed us to focus on writing the best possible code for different bolts
• Dynamic deployment makes it easy to add or remove nodes to adjust for actual loads and requirements
• Log data has peaks and valleys

Page 15: Loggly Gen2 at Launch: Where Storm Fits In

[Diagram labels: Kafka Stage 1, S3 Bucket, Identify Customer, Summary Statistics, Kafka Stage 2]

Page 16: What We Learned

Page 17: Guaranteed Delivery Causes Big Performance Hit

Guaranteed delivery feature needed for log management resilience but…

[Diagram: the Example Topology of one spout and four bolts, annotated with acks]

• Spouts emit source stream
• Bolts perform stream processing

2.5x hit to performance!!
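
Storm implements guaranteed delivery by having an acker task keep, for each spout tuple, a running XOR of the ids of every tuple in its tree; the tree is fully processed when that value returns to zero. The toy model below shows the extra per-tuple tracking messages this implies. It is plain Python with a hypothetical `AckTracker` class, not Storm's own classes, and it makes no attempt to reproduce the 2.5x figure reported on this slide.

```python
import random


class AckTracker:
    """Toy model of XOR-based tuple-tree tracking (not Storm's classes)."""

    def __init__(self):
        self.pending = {}   # root id -> running XOR of outstanding tuple ids
        self.messages = 0   # tracking messages the acker had to handle

    def _update(self, root_id, value):
        self.messages += 1
        self.pending[root_id] = self.pending.get(root_id, 0) ^ value

    def emit(self, root_id):
        """Register a new tuple anchored to root_id and return its id."""
        tuple_id = random.getrandbits(64) or 1  # never use 0 as an id
        self._update(root_id, tuple_id)
        return tuple_id

    def ack(self, root_id, tuple_id):
        """Ack a tuple; returns True once the whole tree is processed."""
        self._update(root_id, tuple_id)
        if self.pending[root_id] == 0:
            del self.pending[root_id]
            return True
        return False


tracker = AckTracker()
root = "event-1"                       # hypothetical message id
a = tracker.emit(root)                 # spout emits tuple a
b = tracker.emit(root)                 # first bolt emits child b, then ...
done_after_a = tracker.ack(root, a)    # ... acks its input a
done_after_b = tracker.ack(root, b)    # last bolt acks b
print(done_after_a, done_after_b, tracker.messages)
```

Even in this two-hop example a single log event generates four tracking updates, which hints at why per-tuple acking becomes costly at 100,000+ events per second. Note that a bolt must register its children before acking its input, otherwise the XOR would reach zero too early.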

Page 18: Our Performance Testing

Preload Kafka broker
• 50 GB of raw log data from production cluster

Deploy Storm topology with Kafka spout
• Kafka partitions with 8 spouts and 20 mapper bolts
• 4K provisioned IOPS backend AWS instance

Ack’ing per tuple turned off
• TOPOLOGY_ACKERS set to 0
• Kafka disks red hot

Ack’ing per tuple enabled
• Kafka disks not saturated
• Bolts not running on high capacity

[Bar chart: average events per second processed per cluster, without guaranteed delivery vs. with guaranteed delivery; axis from 0 to 250,000]

Page 19: Potential Workaround: Batch Logs

• Ack a set of logs instead of individual events
• PROBLEM: not consistent with Storm’s semantics of a “message”

It is not trivial to change the Kafka spout as well as each bolt to reinterpret a single message as a bunch of logs.
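
The batching idea can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming a hypothetical batch size of 4; the helper names `batch` and `acks_needed` are invented here and are not Storm or Loggly code.

```python
def batch(events, size):
    """Group events into lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(events), size):
        yield events[start:start + size]


def acks_needed(event_count, batch_size):
    """Acks required when each batch, not each event, is one message."""
    return -(-event_count // batch_size)  # ceiling division


events = [f"log line {i}" for i in range(10)]
batches = list(batch(events, 4))
print([len(b) for b in batches], acks_needed(len(events), 4))
```

Fewer messages mean fewer acks, but every bolt now has to unpack a message into many logs, and a failed batch is replayed as a whole, which is the mismatch with Storm's per-message semantics noted on this slide.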

Page 20: Ultimate Solution: Build Custom Queue for Module-to-Module Communication

[Diagram labels: Load Balancing, Loggly Custom Module, Kafka Stage 2]

Page 21: Benefits of New Approach

• High-performance, reliable communication that implements our workflow
• Supports sustained rates of 100K+ events per second
• Relatively easy to port

Page 22: Conclusions

Storm 0.8.2 has plenty of potential

But… Log management’s unique challenges drive the need for a custom framework

Page 23: Log Management is Our Full-Time Job. It Shouldn’t Be Yours.

About Us: Loggly is the world’s most popular cloud-based log management solution, used by more than 5,000 happy customers to effortlessly spot problems in real time, easily pinpoint root causes, and resolve issues faster to ensure application success.

Unless You Want it to Be (Join us!) Check out our career page to see if there’s a great match for your skills: loggly.com/careers

Try Loggly for Free! → http://bit.ly/ScaleApacheStorm

Visit us at loggly.com or follow @loggly on Twitter.