Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Apache Flume: Getting Logs/Data to Hadoop
Steve Hoffman, Chicago Hadoop User Group (CHUG)
2014-04-09T10:30:00Z
About Me
• Steve Hoffman
• twitter: @bacoboy else: http://bit.ly/bacoboy
• Tech Guy @Orbitz
• Wrote a book on Flume
Why do I need Flume?
• Created to deal with streaming data/logs to HDFS
• Can’t mount HDFS (usually)
• Can’t “copy” files to HDFS if the files aren’t closed yet (aka live log files)
• Need to buffer “some”, then write and close a file — repeat
• May involve multiple hops due to topology (# of machines, datacenter separation, etc.)
• A lot can go wrong here…
Agent
• Java daemon
• Has a name (usually ‘agent’)
• Receives data from sources and writes events to 1 or more channels
• Moves events from 1 channel to a sink; removes them from the channel only if successfully written
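The source-to-channel-to-sink flow above can be sketched as a toy model (plain Python, not the Flume API; all names here are illustrative):

```python
from collections import deque

# Toy model of a Flume agent: a source writes events into a channel;
# a sink drains the channel, removing an event only after a successful
# downstream write.
channel = deque()

def source_deliver(events):
    for event in events:
        channel.append(event)

def sink_drain(write):
    delivered = 0
    while channel:
        event = channel[0]          # peek, don't remove yet
        if not write(event):
            break                   # leave it in the channel; retry later
        channel.popleft()           # remove only after a successful write
        delivered += 1
    return delivered

source_deliver([{"headers": {}, "body": b"hello"}])
print(sink_drain(lambda event: True), len(channel))  # 1 0
```

If the write fails, the event stays in the channel and is retried later, which is what makes the handoff reliable.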
Events
• Headers = Key/Value Pairs (Map<String, String>)
• Body = byte array (byte[])
• For example, this Apache access log line:

```
10.10.1.1 - - [29/Jan/2014:03:36:04 -0600] "HEAD /ping.html HTTP/1.1" 200 0 "-" "-" "-"
```

becomes this Event (headers, then the body as hex):

```
{"timestamp":"1391986793111", "host":"server1.example.com"} 31302e31302e312e31202d202d205b32392f4a616e2f323031343a30333a33363a3034202d303630305d202248454144202f70696e672e68746d6c20485454502f312e312220323030203020222d2220222d2220222d22
```
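A quick sketch (assuming an event is just a headers map plus a bytes body, as above) shows the hex body decoding back to the original log line:

```python
# A Flume Event is headers (Map<String, String>) plus a byte[] body.
# Plain dict stand-in here, not the Flume API.
event = {
    "headers": {"timestamp": "1391986793111", "host": "server1.example.com"},
    "body": bytes.fromhex(
        "31302e31302e312e31202d202d205b32392f4a616e2f323031343a30333a33363a3034"
        "202d303630305d202248454144202f70696e672e68746d6c20485454502f312e312220"
        "323030203020222d2220222d2220222d22"
    ),
}
# Decoding the body recovers the original access log line.
print(event["body"].decode("ascii"))
```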
Channels
• Place to hold Events
• Memory or File Backed (also JDBC, but why?)
• Bounded - Size is configurable
• Resources aren’t infinite
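A minimal sketch of a bounded channel using Python's stdlib queue (illustrative only; a full Flume channel raises its own exception, not queue.Full, when it is at capacity):

```python
import queue

# A channel is a bounded buffer between sources and sinks; the bound is
# configurable because memory and disk aren't infinite.
channel = queue.Queue(maxsize=1000)   # think agent.channels.c1.capacity = 1000
channel.put_nowait({"headers": {}, "body": b"event-1"})

tiny = queue.Queue(maxsize=1)
tiny.put_nowait(b"first")
try:
    tiny.put_nowait(b"second")        # over capacity
except queue.Full:
    print("channel full: the source must back off and retry")
```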
Sources
• Feed data to one or more Channels
• Usually data is pushed to them (listening on a socket, e.g., the HTTP Source, or receiving from the Avro log4j appender)
• Or they can periodically poll another system and generate events (e.g., run a command every minute and parse its output into an Event, or query a DB/Mongo/etc.)
Sinks
• Move Events from a single Channel to a destination
• Only removes from Channel if write successful
• The HDFSSink is most likely the one you’ll use the most…
Configuration Sample

```
# Agent named 'agent'
# Input (source)
agent.sources.r1.type = seq
agent.sources.r1.channels = c1

# Output (sink)
agent.sinks.k1.type = logger
agent.sinks.k1.channel = c1

# Channel
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000

# Wire everything together
agent.sources = r1
agent.sinks = k1
agent.channels = c1
```
Startup
At startup the same configuration is read in this order:
• name.{sources|sinks|channels}
• Find instance name + type
• Connect channel(s)
• Apply type-specific configurations
RTM: Flume User Guide (https://flume.apache.org/FlumeUserGuide.html), or my book :)
Configuration Sample (logs)

```
Creating channels
Creating instance of channel c1 type memory
Created channel c1
Creating instance of source r1, type seq
Creating instance of sink: k1, type: logger
Channel c1 connected to [r1, k1]
Starting new configuration:{ sourceRunners:{r1=PollableSourceRunner: { source:org.apache.flume.source.SequenceGeneratorSource{name:r1,state:IDLE} counterGroup:{ name:null counters:{} } }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@19484a05 counterGroup:{ name:null counters:{} } }} channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} }
Event: { headers:{} body: 30 0 }
Event: { headers:{} body: 31 1 }
Event: { headers:{} body: 32 2 }
```

and so on…
Using Cloudera Manager
• Same stuff, just in a GUI
• Centrally managed in a database (instead of source control/Git)
• Distributed from a central location (instead of Chef/Puppet)
Multiple destinations need multiple channels
Channel Selector
• Used when more than 1 channel is specified on a Source
• Replicating (Each channel gets a copy) - default
• Multiplexing (Channel picked based on a header value)
• Custom (If these don’t work for you - code one!)
Channel Selector Replicating
• A copy is sent to all channels associated with the Source

```
agent.sources.r1.selector.type=replicating
agent.sources.r1.channels=c1 c2 c3
```

• Can specify “optional” channels

```
agent.sources.r1.selector.optional=c3
```

• The transaction succeeds if all non-optional channels take the event (in this case c1 & c2)
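The required-vs-optional semantics can be sketched like so (plain Python with made-up helper names, not Flume internals):

```python
# Replicating selector sketch: every required channel must accept the
# event; optional channel failures are swallowed silently.
def replicate(event, required, optional=()):
    for put in required:
        if not put(event):
            return False              # whole transaction fails
    for put in optional:
        try:
            put(event)
        except Exception:
            pass                      # optional channels may fail silently
    return True

c1, c2, c3 = [], [], []
accept = lambda ch: lambda e: ch.append(e) or True

def broken_channel(e):
    raise RuntimeError("optional channel is down")

ok = replicate("evt", required=[accept(c1), accept(c2)],
               optional=[accept(c3), broken_channel])
print(ok, c1, c2, c3)  # True ['evt'] ['evt'] ['evt']
```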
Channel Selector Multiplexing
• A copy is sent to only some of the channels

```
agent.sources.r1.selector.type=multiplexing
agent.sources.r1.channels=c1 c2 c3 c4
```

• Switch based on a header key (e.g., {“currency”:“USD”} → c1)

```
agent.sources.r1.selector.header=currency
agent.sources.r1.selector.mapping.USD=c1
agent.sources.r1.selector.mapping.EUR=c2 c3
agent.sources.r1.selector.default=c4
```
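The mapping above amounts to a lookup with a default; a sketch in plain Python (the real selector is configured, not coded):

```python
# Multiplexing selector sketch, mirroring the USD/EUR/default config.
mapping = {"USD": ["c1"], "EUR": ["c2", "c3"]}
default = ["c4"]

def select_channels(event, header="currency"):
    # Pick channels from the header value; fall back to the default list.
    return mapping.get(event["headers"].get(header), default)

print(select_channels({"headers": {"currency": "USD"}}))  # ['c1']
print(select_channels({"headers": {"currency": "EUR"}}))  # ['c2', 'c3']
print(select_channels({"headers": {}}))                   # ['c4']
```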
Interceptors
• Zero or more on a Source (before events are written to the channel)
• Zero or more on a Sink (after events are read from the channel)
• Or both
• Use for transformations of data in flight (headers OR body):

```java
public Event intercept(Event event);
public List<Event> intercept(List<Event> events);
```

• Return null or an empty List to drop Events
Interceptor Chaining
• Processed in the order listed in the configuration (source r1 example):

```
agent.sources.r1.interceptors=i1 i2 i3
agent.sources.r1.interceptors.i1.type=timestamp
agent.sources.r1.interceptors.i1.preserveExisting=true
agent.sources.r1.interceptors.i2.type=static
agent.sources.r1.interceptors.i2.key=datacenter
agent.sources.r1.interceptors.i2.value=CHI
agent.sources.r1.interceptors.i3.type=host
agent.sources.r1.interceptors.i3.hostHeader=relay
agent.sources.r1.interceptors.i3.useIP=false
```

• Resulting headers added before writing to the Channel:

```
{"timestamp":"1392350333234", "datacenter":"CHI", "relay":"flumebox.example.com"}
```
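A rough Python equivalent of that chain (stand-in functions, not Flume's interceptor classes; the header names match the config):

```python
import socket
import time

# Stand-ins for the timestamp, static, and host interceptors chained above.
def timestamp_interceptor(event, preserve_existing=True):
    if not (preserve_existing and "timestamp" in event["headers"]):
        event["headers"]["timestamp"] = str(int(time.time() * 1000))
    return event

def static_interceptor(event, key="datacenter", value="CHI"):
    event["headers"][key] = value
    return event

def host_interceptor(event, host_header="relay", use_ip=False):
    host = socket.gethostname()
    event["headers"][host_header] = (
        socket.gethostbyname(host) if use_ip else host)
    return event

event = {"headers": {}, "body": b"some log line"}
# Order matters: each interceptor runs before the event hits the channel.
for intercept in (timestamp_interceptor, static_interceptor, host_interceptor):
    event = intercept(event)
print(sorted(event["headers"]))  # ['datacenter', 'relay', 'timestamp']
```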
Morphlines
• Comes in Interceptor and Sink forms
• See the Cloudera website/blog
• Created to ease transforms and Cloudera Search/Flume integration
• An example:

```
# convert the timestamp field to "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
# The input may match one of "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
# or "yyyy-MM-dd'T'HH:mm:ss" or "yyyy-MM-dd".
convertTimestamp {
  field : timestamp
  inputFormats : ["yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", "yyyy-MM-dd'T'HH:mm:ss", "yyyy-MM-dd"]
  inputTimezone : America/Chicago
  outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
  outputTimezone : UTC
}
```
Avro
• Apache Avro: Data Serialization
• http://avro.apache.org/
• Storage format and wire protocol
• Self-describing (the schema is written with the data)
• Supports compression of the data (not the container, so it stays MapReduce friendly: “splittable”)
• Binary friendly: doesn’t require records separated by \n
Avro Source/Sink
• Preferred inter-agent transport in Flume
• Simple configuration (host + port for the sink, port for the source)
• Minimal transformation needed for Flume Events
• The Avro versions in client and server don’t need to match; only payload versioning matters (think Protocol Buffers vs. Java serialization)
Avro Source/Sink Config

```
foo.sources=…
foo.channels=channel-foo
foo.channels.channel-foo.type=memory
foo.sinks=sink-foo
foo.sinks.sink-foo.channel=channel-foo
foo.sinks.sink-foo.type=avro
foo.sinks.sink-foo.hostname=bar.example.com
foo.sinks.sink-foo.port=12345
foo.sinks.sink-foo.compression-type=deflate
```

```
bar.sources=datafromfoo
bar.sources.datafromfoo.type=avro
bar.sources.datafromfoo.bind=0.0.0.0
bar.sources.datafromfoo.port=12345
bar.sources.datafromfoo.compression-type=deflate
bar.sources.datafromfoo.channels=channel-bar
bar.channels=channel-bar
bar.channels.channel-bar.type=memory
bar.sinks=…
```
log4j Avro Sink
• Remember that web server pushing data to a Source?
• Use the Flume Avro log4j appender!
• Log level, category, etc. become headers in the Event
• The “message” String becomes the body
log4j Configuration
• log4j.properties sender (include flume-ng-sdk-1.X.X.jar in your project):

```
log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=example.com
log4j.appender.flume.Port=12345
log4j.appender.flume.UnsafeMode=true
log4j.logger.org.example.MyClass=DEBUG,flume
```

• Flume Avro receiver:

```
agent.sources=logs
agent.sources.logs.type=avro
agent.sources.logs.bind=0.0.0.0
agent.sources.logs.port=12345
agent.sources.logs.channels=…
```
Avro Client
• Send data to an AvroSource from the command line
• Run the flume-ng program with the avro-client parameter instead of agent:

```
$ bin/flume-ng avro-client -H server.example.com -p 12345 [-F input_file]
```

• Each line of the file (or stdin if no file is given) becomes an event
• Useful for testing or injecting data from outside Flume sources (e.g., ExecSource vs. a cron job that pipes output to avro-client)
HDFSSink
• Reads from the Channel and writes to a file in HDFS in chunks
• Until 1 of 3 things happens:
• some amount of time elapses (rollInterval)
• some number of records have been written (rollCount)
• some amount of data has been written (rollSize)
• Then closes that file and starts a new one
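The roll logic amounts to a three-way check; a sketch with made-up thresholds (Flume's actual defaults for these properties differ):

```python
import time

# Whichever trigger fires first closes the current HDFS file.
def should_roll(opened_at, event_count, bytes_written,
                roll_interval=60, roll_count=10000, roll_size=1024 * 1024):
    return (time.time() - opened_at >= roll_interval   # rollInterval
            or event_count >= roll_count               # rollCount
            or bytes_written >= roll_size)             # rollSize

now = time.time()
print(should_roll(now - 61, 0, 0))      # True: time elapsed
print(should_roll(now, 10000, 0))       # True: enough records
print(should_roll(now, 1, 512))         # False: keep writing
```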
HDFS Configuration

```
foo.sources=…
foo.channels=channel-foo
foo.channels.channel-foo.type=memory
foo.sinks=sink-foo
foo.sinks.sink-foo.channel=channel-foo
foo.sinks.sink-foo.type=hdfs
foo.sinks.sink-foo.hdfs.path=hdfs://NN/data/%Y/%m/%d/%H
foo.sinks.sink-foo.hdfs.rollInterval=60
foo.sinks.sink-foo.hdfs.filePrefix=log
foo.sinks.sink-foo.hdfs.fileSuffix=.avro
foo.sinks.sink-foo.hdfs.inUsePrefix=_
foo.sinks.sink-foo.serializer=avro_event
foo.sinks.sink-foo.serializer.compressionCodec=snappy
```
HDFS writing…

```
drwxr-x---   - flume flume    0 2014-02-16 17:04 /data/2014/02/16/23
-rw-r-----   3 flume flume    0 2014-02-16 17:04 /data/2014/02/16/23/_log.1392591607925.avro.tmp
-rw-r-----   3 flume flume 1877 2014-02-16 17:01 /data/2014/02/16/23/log.1392591607923.avro
-rw-r-----   3 flume flume 1955 2014-02-16 17:02 /data/2014/02/16/23/log.1392591607924.avro
-rw-r-----   3 flume flume 2390 2014-02-16 17:04 /data/2014/02/16/23/log.1392591798436.avro
```

• The zero-length .tmp file is the current file; you won’t see its real size until it closes (just like when you do a hadoop fs -put)
• Use …hdfs.inUsePrefix=_ to prevent open files from being included in MapReduce jobs
Event Serializers
• Define how the Event gets written by the Sink
• Just the body as a UTF-8 String:

```
agent.sinks.foo-sink.serializer=text
```

• Headers and body as a UTF-8 String:

```
agent.sinks.foo-sink.serializer=header_and_text
```

• Avro (the Flume record schema):

```
agent.sinks.foo-sink.serializer=avro_event
```

• Custom (if none of the above meets your needs)
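What the first two serializers roughly produce for one event can be sketched as follows (the exact formatting here is illustrative, not Flume's byte-for-byte output):

```python
# Sketch of the text vs. header_and_text serializers' effect on one event.
event = {"headers": {"host": "web1"}, "body": b"GET /ping"}

def serialize_text(e):
    # text: just the body as a UTF-8 string
    return e["body"].decode("utf-8")

def serialize_header_and_text(e):
    # header_and_text: headers, then the body
    return f"{e['headers']} {e['body'].decode('utf-8')}"

print(serialize_text(event))             # GET /ping
print(serialize_header_and_text(event))  # {'host': 'web1'} GET /ping
```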
Lessons Learned
Timezones are Evil
• Daylight saving time causes problems twice a year (in spring: no 2am hour; in fall: twice the data during the 2am hour. 02:15? Which one?)
• Date processing in MapReduce jobs: hourly jobs, filters, etc.
• Dated paths: hdfs://NN/data/%Y/%m/%d/%H
• Use UTC: -Duser.timezone=UTC
• Use one of the ISO 8601 formats, like 2014-02-26T18:00:00.000Z
• Sorts the way you usually want
• Every time library supports it* (and if not, it’s easy to parse)
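Why ISO 8601 UTC "sorts the way you usually want": string order matches time order. A quick check:

```python
from datetime import datetime, timezone

# ISO 8601 timestamps in UTC sort lexicographically in chronological
# order, which is exactly what you want in filenames and HDFS paths.
def iso8601(dt):
    # %f gives microseconds; trim to milliseconds and tag as UTC.
    return dt.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"

earlier = iso8601(datetime(2014, 2, 26, 2, 30, tzinfo=timezone.utc))
later = iso8601(datetime(2014, 2, 26, 18, 0, tzinfo=timezone.utc))
print(earlier)          # 2014-02-26T02:30:00.000Z
print(later)            # 2014-02-26T18:00:00.000Z
print(earlier < later)  # True: string order matches time order
```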
Generally Speaking…
• Async handoff doesn’t work under load when bad stuff happens
(Diagram: a writer hands off to a filesystem, queue, database, or whatever, and a reader drains it. That middle buffer is not ∞.)
Async Handoff Oops
(Diagram sequence: a Flume Agent runs tail -F foo.log while logrotate rotates the file underneath it: foo.log becomes foo.log.1, then foo.log.2. When the tail falls behind, a rotated file can be deleted before it has been fully read, and that data is gone.)
Don’t Use Tail
• Tailing a file for input is bad: it relies on assumptions that aren’t guarantees
• Direct tail support was removed during the Flume rewrite
• Handoff can go bad with files: when the writer is faster than the reader
• With a queue: when the reader doesn’t read before the expire time
• There’s no way to apply “back pressure” to tell tail there is a problem. It isn’t listening…
What can I use?
• If you can’t use the log4j Avro appender…
• Use logrotate to move old logs to a “spool” directory
• SpoolingDirectorySource
• Finally, a cron job to remove .COMPLETED files (for delayed delete), OR set deletePolicy=immediate
• Alternatively, use logrotate with avro-client? (There are probably other ways too…)
RAM or Disk Channels?
(Chart comparing memory and disk access speeds.)
Source: http://blog.scoutapp.com/articles/2011/02/10/understanding-disk-i-o-when-should-you-be-worried
Duplicate Events
• Transactions exist only at the Agent level
• You may see Events more than once
• Distributed transactions are expensive
• Just deal with duplicates in the query/scrub phase; it’s much less costly than trying to prevent them from happening
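A scrub-phase dedupe can be as simple as keeping a set of seen keys (illustrative; choosing a stable key depends on your data):

```python
# Keep the first copy of each event, keyed by fields that identify it.
events = [
    {"headers": {"timestamp": "1", "host": "a"}, "body": b"x"},
    {"headers": {"timestamp": "1", "host": "a"}, "body": b"x"},  # duplicate
    {"headers": {"timestamp": "2", "host": "a"}, "body": b"y"},
]

seen, unique = set(), []
for e in events:
    key = (e["headers"]["timestamp"], e["headers"]["host"], e["body"])
    if key not in seen:
        seen.add(key)
        unique.append(e)
print(len(unique))  # 2
```

In practice this runs in the query layer (Hive, Pig, MapReduce, etc.) rather than as a Python loop, but the idea is the same.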
Late Data
• Data could be “late”/delayed:
• Outages
• Restarts
• Acts of Nature
• The only sure thing is a “database”: a single write + ACK
• Depending on your monitoring, it could be REALLY LATE
Monitoring
• Know when it breaks so you can fix it before you can’t ingest new data (and it is lost)
• This time window is small if volume is high
• Flume monitoring is still a work in progress, but the hooks are there
Other Operational Concerns
• resource utilization - number of open files when writing (file descriptors), disk space used for file channel, disk contention, disk speed*
• number of inbound and outbound sockets - may need to tier (Avro Source/Sink)
• minimize hops if possible - another place for data to get stuck
Not everything is a nail
• Flume is great for handling individual records
• What if you need to compute an average?
• Get a stream processing system:
• Storm (Twitter’s)
• Samza (LinkedIn’s)
• Others…
• Flume can co-exist with these; use the most appropriate tool