Flume
Reliable Distributed Streaming Log Collection
Jonathan Hsieh, Henry Robinson, Patrick Hunt
Cloudera, Inc.
Hadoop World 2010, 10/12/2010

Flume
4 months after Hadoop World 2010
Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer, Bruce Mitchener
Cloudera, Inc.
Austin Hadoop Users Group, 2/17/2011
Who Am I?
• Cloudera:
– Software Engineer on the Platform Team
– Flume Project Lead / Designer / Architect
• U of Washington:
– “On Leave” from PhD program
– Research in Systems and Programming Languages
• Previously:
– Computer Security, Embedded Systems.
The basic scenario
• You have a bunch of servers generating log files.
• You figured out that your logs are valuable and you want to keep them and analyze them.
• Because of the volume of data, you’ve started using Apache Hadoop or Cloudera’s Distribution of Apache Hadoop.
• … and you’ve got some ad-hoc, hacked together scripts that copy data from servers to HDFS.
It’s log, log… everyone wants a log!
Ad-hockery gets complicated
• Reliability – Will your data still get there if your scripts fail? If your hardware fails? If HDFS goes down? If EC2 flakes out?
• Scale – As you add servers, will your scripts keep up with 100GBs per day? Will you end up with tons of small files? Tons of connections? Are you willing to suffer more latency to mitigate?
• Manageability – How do you know if the script failed on machine 172? What about logs from that other system? How do you monitor and configure all the servers? Can you deal with elasticity?
• Extensibility – Can you service custom logs? Send data to different places like HBase, Hive, or incremental search indexes? Can you do near-realtime?
• Blackbox – What happens when the person who wrote it leaves?
Cloudera Flume
Flume is a framework and conduit for collecting and quickly shipping data records from many sources to one centralized place for storage and processing.
Project Principles:
• Scalability
• Reliability
• Extensibility
• Manageability
• Openness
Flume: The Standard Use Case

[Diagram: many servers, each running a Flume agent (the agent tier), send events to a handful of collectors (the collector tier), which write to HDFS. A master coordinates the nodes.]
Flume’s Key Abstractions
• Data path and control path
• Nodes are in the data path
  – Nodes have a source and a sink
  – They can take different roles: a typical topology has agent nodes and collector nodes, and optionally processor nodes
• Masters are in the control path
  – Centralized point of configuration
  – Specify sources and sinks
  – Can control flows of data between nodes
  – Use one master, or many with a ZooKeeper-backed quorum

[Diagram: a master configures an agent node and a collector node; each node is a (source, sink) pair.]
Can I has the codez?

node001: tail("/var/log/app/log") | autoE2ESink;
node002: tail("/var/log/app/log") | autoE2ESink;
…
node100: tail("/var/log/app/log") | autoE2ESink;

collector1: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")
collector2: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")
collector3: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")
Outline
• What is Flume?
• Scalability – horizontal scalability of all nodes and masters
• Reliability – fault tolerance and high availability
• Extensibility – Unix principle; all kinds of data, sources, and sinks
• Manageability – centralized management supporting dynamic reconfiguration
• Openness – Apache v2.0 license and an active, growing community
Data path is horizontally scalable
• Add collectors to increase availability and to handle more data
  – Assumes a single agent will not dominate a collector
  – Fewer connections to HDFS
  – Larger, more efficient writes to HDFS
• Agents have mechanisms for machine-resource tradeoffs
  – Write logs locally to avoid collector disk-IO bottlenecks and catastrophic failures
  – Compression and batching (trade CPU for network)
  – Push computation into the event-collection pipeline (balance IO, memory, and CPU resource bottlenecks)

[Diagram: agents on servers → collector → HDFS]
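The compression-and-batching tradeoff above is expressed with decorators wrapping the agent sink. A hedged sketch in the Flume configuration language (the `{ deco => sink }` wrapping form and the `batch`/`gzip` decorator names follow the Flume OG user guide of this era; exact spellings may vary by version, and "collector1" is a placeholder host):

```
node001: tail("/var/log/app/log") | { batch(100) => { gzip => agentE2ESink("collector1") } };
```

Here 100 events are grouped per wire message and gzip-compressed before shipping, spending agent CPU to save network bandwidth.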
Tunable failure recovery modes
• Best effort
  – Fire and forget
• Store on failure + retry
  – Local acks, local errors detectable
  – Failover when faults detected
• End-to-end reliability
  – End-to-end acks
  – Data survives compound failures, and may be retried multiple times

[Diagram: agent → collector → HDFS under each recovery mode]
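These three recovery modes correspond to different agent sinks in the configuration language. A hedged sketch (the `agentBESink` / `agentDFOSink` / `agentE2ESink` names follow the Flume OG guide; check your version for exact signatures, and treat "collector1" as a placeholder):

```
nodeBE:  tail("/var/log/app/log") | agentBESink("collector1");
nodeDFO: tail("/var/log/app/log") | agentDFOSink("collector1");
nodeE2E: tail("/var/log/app/log") | agentE2ESink("collector1");
```

nodeBE fires and forgets, nodeDFO spools to local disk on failure and retries, and nodeE2E holds data until an end-to-end ack arrives.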
Load balancing and collector failover
• Agents are logically partitioned and send to different collectors
• Use randomization to pre-specify failovers when many collectors exist
  – Spread load if a collector goes down
  – Spread load if new collectors are added to the system

[Diagram: agents fan out across multiple collectors, with failover paths between them]
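Failover chains can be generated automatically by the master (the auto*Chain sinks such as autoE2EChain) or written explicitly. A hedged manual sketch (the `< primary ? backup >` failover combinator syntax is from the Flume OG guide; the collector names are placeholders, and whether agent sinks nest directly inside it may depend on the version):

```
agent1: tail("/var/log/app/log") | < agentE2ESink("collectorA") ? agentE2ESink("collectorB") >;
```

If collectorA becomes unreachable, the agent fails over to collectorB.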
Control plane is horizontally scalable
• A master controls dynamic configurations of nodes
  – Uses a consensus protocol to keep state consistent
  – Scales well for configuration reads
  – Allows for adaptive repartitioning in the future
• Nodes can talk to any master.
• Masters can talk to an existing ZooKeeper ensemble.

[Diagram: nodes talk to any of three masters; the masters share state through a ZK1/ZK2/ZK3 ensemble]
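Pointing nodes at a multi-master quorum amounts to listing the masters in the site configuration. A hedged sketch of a flume-site.xml entry (the `flume.master.servers` property name follows Flume OG; the hostnames are placeholders):

```xml
<property>
  <name>flume.master.servers</name>
  <value>masterA,masterB,masterC</value>
</property>
```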
Centralized Dataflow Management Interfaces
• One place to specify node sources, sinks, and data flows
• Basic web interface
• Flume shell
  – Command-line interface
  – Scriptable
• Cloudera Enterprise
  – Flume monitor app
  – Graphical web interface
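A scripted session with the Flume shell might look like the following (command names such as `connect`, `exec config`, and `getnodestatus` follow the Flume OG shell; verify against your version, and the hostnames are placeholders):

```
$ flume shell
[flume (disconnected)] connect master01:35873
[flume master01:35873] exec config node001 'tail("/var/log/app/log")' 'agentE2ESink("collector1")'
[flume master01:35873] getnodestatus
```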
Configuring Flume

node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ];

• A concise and precise configuration language for specifying dataflows in a node.
• Dynamic updates of configurations
  – Allows for live failover changes
  – Allows for handling newly provisioned machines
  – Allows for changing analytics

[Diagram: tail → filter → fanout → console, and → roll → hdfs]
Output bucketing
• Automatic output file management
  – Write HDFS files into time-bucketed directories based on event tags

node : collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")

HDFS output:
/logs/web/2010/0715/1200/data-xxx.txt
/logs/web/2010/0715/1200/data-xxy.txt
/logs/web/2010/0715/1300/data-xxx.txt
/logs/web/2010/0715/1300/data-xxy.txt
/logs/web/2010/0715/1400/data-xxx.txt
…
Flume is easy to extend
• Simple source and sink APIs
  – An event-streaming design
  – Many simple operations compose into complex behavior
• Plug-in architecture, so you can add your own sources, sinks, and decorators

[Diagram: sources and decorators fan out into multiple sinks]
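The source | decorator | sink composition model can be illustrated with a toy sketch in plain Python (this is NOT Flume's actual Java plugin API, just the streaming idea: simple pieces chained Unix-pipe style):

```python
# Toy illustration of Flume's event-streaming composition:
# sources produce events, decorators transform them in flight,
# sinks consume them.

def tail_source(lines):
    # A "source" produces a stream of events (here: dicts wrapping log lines).
    for line in lines:
        yield {"body": line}

def filter_decorator(events, predicate):
    # A "decorator" sits between source and sink, transforming or
    # dropping events as they stream past.
    for event in events:
        if predicate(event["body"]):
            yield event

def collect_sink(events):
    # A "sink" consumes events; here it just gathers bodies into a list.
    return [event["body"] for event in events]

# Compose: tail | filter(errors only) | collect
pipeline = filter_decorator(
    tail_source(["INFO start", "ERROR disk full", "INFO ok"]),
    predicate=lambda body: body.startswith("ERROR"),
)
print(collect_sink(pipeline))  # ['ERROR disk full']
```

Because each stage is a generator, events flow through one at a time rather than being materialized, mirroring the streaming design on this slide.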
Variety of Connectors
• Sources produce data
  – Console, Exec, Syslog, Scribe, IRC, Twitter
  – In the works: JMS, AMQP, pubsubhubbub/RSS/Atom
• Sinks consume data
  – Console, local files, HDFS, S3
  – Contributed: Hive (Mozilla), HBase (Sematext), Cassandra (Riptano/DataStax), Voldemort, Elastic Search
  – In the works: JMS, AMQP
• Decorators modify data sent to sinks
  – Wire batching, compression, sampling, projection, extraction, throughput throttling
  – Custom near-real-time processing (Meebo)
  – JRuby event modifiers (InfoChimps)
  – Cryptographic extensions (Rearden)
Flume: Multi Datacenter

[Diagram: in each datacenter, API servers and processor servers run agents that feed a local collector tier; collectors forward across datacenters, via a relay, into a central HDFS.]
Flume: Near-Realtime Aggregator

[Diagram: ad servers run agents feeding a collector; data lands in HDFS for Hive jobs and full reports, while a tracker feeds a DB for quick reports, with verification between the two paths.]
Flume: An Enterprise Story

[Diagram: Windows and Linux API servers run agents feeding a collector tier, which writes to a Kerberized HDFS cluster of datanodes; authentication is handled against Active Directory / LDAP.]
An Emerging Community Story

[Diagram: servers run agents feeding a collector that fans out to HDFS (Hive and Pig queries), HBase (key lookups and range queries), and an incremental search index (search and faceted queries).]
Flume is Open Source
• Apache v2.0 open source license
  – Independent from the Apache Software Foundation
• GitHub source code repository
  – http://github.com/cloudera/flume
  – Regular tarball version updates every 2-3 months
  – Regular CDH packaging updates every 3-4 months
• Review Board for code review
• New external committers wanted!
  – Cloudera folks: Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer
  – Independent folks: Bruce Mitchener
Growing user and developer community
• History:
– Initial Open Source Release, June 2010
• Growth:
– Pre-Hadoop Summit (Late June 2010):
• 4 followers, 4 forks (original authors)
– Pre-Hadoop World (October 2010):
• 174 followers, 34 forks
– Pre-CDH3B4 Release (February 2011):
• 288 followers, 51 forks
Support
• Community-based mailing lists for support
– “an answer in a few days”
– User: https://groups.google.com/a/cloudera.org/group/flume-user
– Dev: https://groups.google.com/a/cloudera.org/group/flume-dev
• Community-based IRC chat room
– “quick questions, quick answers”
– #flume in irc.freenode.net
• Commercial support with Cloudera Enterprise subscription
– Chat with sales@cloudera.com
Summary
• Flume is a distributed, reliable, scalable, extensible system for collecting and delivering high-volume continuous event data such as logs.
– It is centrally managed, which allows for automated and adaptive configurations.
– This design allows for near-real time processing.
– Apache v2.0 License with active and growing community
• Part of Cloudera’s Distribution for Hadoop, about to be refreshed for CDH3b4.
Questions? (and shameless plugs)
• Contact info:
  – jon@cloudera.com
  – Twitter: @jmhsieh
• Cloudera training in Dallas
  – Hadoop Training for Developers: March 14-16
  – Hadoop Training for Administrators: March 17-18
  – Sign up at http://cloudera.eventbrite.com
  – 10% discount code for classes: "hug"
• Cloudera is Hiring!