DataEngConf SF16 - Collecting and Moving Data at Scale
![Page 1: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/1.jpg)
COLLECTING AND MOVING DATA AT SCALE
Sada Furuhashi, Chief Architect
Invented Fluentd, MessagePack
![Page 2: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/2.jpg)
BACKGROUND
![Page 3: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/3.jpg)
HIGH LEVEL ANALYTICS ARCHITECTURE
Collect → Store → Process → Visualize
![Page 4: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/4.jpg)
THE CHALLENGE
Collect → Store → Process → Visualize
How do we shorten the collection process?
Easier & Shorter Time
Excel, Tableau
![Page 5: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/5.jpg)
THE PROBLEM
![Page 6: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/6.jpg)
TYPICAL ARCHITECTURE BEFORE FLUENTD
[Diagram: three app servers, each running an application that writes log files; a log server]
High latency: must wait for a day
Hard to analyze: complex text parsers
![Page 7: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/7.jpg)
THE FALSE SOLUTION
![Page 8: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/8.jpg)
MULTIPLY CONNECTIONS / COMBINATION EXPLOSION
[Diagram: log file, syslog script, tweet-fetching script, filtering script, two scripts to parse data, two aggregation scripts, cron job for loading, rsync server]
![Page 9: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/9.jpg)
THE SOLUTION
![Page 10: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/10.jpg)
CENTRALIZED CONNECTIONS
LOGFILE
![Page 11: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/11.jpg)
FLUENTD INTERNAL ARCHITECTURE
![Page 12: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/12.jpg)
INTERNAL ARCHITECTURE (SIMPLIFIED)
Input → Filter → Buffer → Output (each stage is a plugin)
Event structure:
Time: 2012-02-04 01:33:51
Tag: myapp.buylog
Record: {"user": "me", "path": "/buyItem", "price": 150, "referer": "/landing"}
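The time / tag / record structure on this slide can be sketched in Python. This is only an illustration: the `Event` class and `tag_parts` helper are hypothetical names, not part of Fluentd itself.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List


@dataclass(frozen=True)
class Event:
    """Illustrative stand-in for one event: a timestamp, a tag, and a record."""
    time: datetime
    tag: str
    record: Dict[str, Any]


def tag_parts(tag: str) -> List[str]:
    """Split a dot-separated tag into its parts."""
    return tag.split(".")


# The example event shown on the slide.
event = Event(
    time=datetime(2012, 2, 4, 1, 33, 51),
    tag="myapp.buylog",
    record={"user": "me", "path": "/buyItem", "price": 150, "referer": "/landing"},
)
```

The tag identifies where the event came from, while the record carries the structured payload.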
![Page 13: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/13.jpg)
ARCHITECTURE: INPUT PLUGINS
Receive logs
Or pull logs from data sources
In a non-blocking manner
Examples: HTTP+JSON (in_http), File tail (in_tail), Syslog (in_syslog), …
![Page 14: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/14.jpg)
ARCHITECTURE: FILTER PLUGINS
Transform logs
Filter out unnecessary logs
Enrich logs
Examples: encrypt personal data, convert IP addresses to countries, parse User-Agent, …
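A minimal sketch of what a filter stage does, assuming each filter is a plain function that returns either a (possibly modified) record or `None` to drop it. The function names, the `/health` path, and the `IP_TO_COUNTRY` table are hypothetical and are not Fluentd APIs.

```python
from typing import Any, Callable, Dict, List, Optional

Record = Dict[str, Any]
Filter = Callable[[Record], Optional[Record]]

# Hypothetical lookup table; a real setup would consult a GeoIP database.
IP_TO_COUNTRY = {"203.0.113.7": "JP"}


def drop_health_checks(record: Record) -> Optional[Record]:
    """Filter out unnecessary logs: discard health-check requests."""
    return None if record.get("path") == "/health" else record


def add_country(record: Record) -> Optional[Record]:
    """Enrich logs: add a country field derived from the client IP."""
    enriched = dict(record)
    enriched["country"] = IP_TO_COUNTRY.get(record.get("ip"), "unknown")
    return enriched


def apply_filters(record: Record, filters: List[Filter]) -> Optional[Record]:
    """Run the record through each filter in order; stop if one drops it."""
    current: Optional[Record] = record
    for one_filter in filters:
        current = one_filter(current)
        if current is None:
            return None
    return current
```

Because each filter copies or drops rather than mutating its input, filters can be reordered or reused without side effects.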
![Page 15: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/15.jpg)
ARCHITECTURE: BUFFER PLUGINS
Improve performance
Provide reliability
Provide thread-safety
Examples: Memory (buf_memory), File (buf_file)
![Page 16: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/16.jpg)
ARCHITECTURE: OUTPUT PLUGINS
Write or send event logs
Examples: File (out_file), Amazon S3 (out_s3), MongoDB (out_mongo), …
![Page 17: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/17.jpg)
ARCHITECTURE: BUFFER PLUGINS
Improve performance
Provide reliability
Provide thread-safety
[Diagram: Input → Buffer (chunks) → Output]
![Page 18: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/18.jpg)
DIVIDE & CONQUER & RETRY
[Diagram: Batch and Stream, each with Error and Retry steps]
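One way to read "divide & conquer & retry" is: split the stream into chunks, send each chunk independently, and retry only the chunk that failed. The sketch below illustrates that idea under those assumptions. The function names and the `max_attempts` parameter are hypothetical, and no waiting between attempts is modeled.

```python
from typing import Any, Callable, List, Sequence


def split_into_chunks(events: Sequence[Any], chunk_size: int) -> List[list]:
    """Divide a sequence of events into fixed-size chunks."""
    return [list(events[i:i + chunk_size]) for i in range(0, len(events), chunk_size)]


def flush_with_retry(
    chunks: List[list],
    send: Callable[[list], None],
    max_attempts: int = 3,
) -> List[list]:
    """Send each chunk, retrying a failing chunk up to max_attempts times.

    Returns the chunks that never succeeded, so the caller can keep them.
    """
    undelivered = []
    for chunk in chunks:
        for _ in range(max_attempts):
            try:
                send(chunk)
                break  # this chunk is done; move on to the next one
            except IOError:
                continue  # retry the same chunk only
        else:
            undelivered.append(chunk)
    return undelivered
```

A failure in one chunk costs only a resend of that chunk, not of the whole stream.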
![Page 19: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/19.jpg)
EXAMPLE USE CASES
![Page 20: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/20.jpg)
STREAMING FROM APACHE TO MONGODB PT I
in_tail: /var/log/access.log
buf_file: /var/log/fluentd/buffer
![Page 21: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/21.jpg)
ERROR HANDLING
in_tail: /var/log/access.log
buf_file: /var/log/fluentd/buffer
Buffering for any outputs
Retrying automatically
With exponential wait and persistence on disk
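"Exponential wait" means the pause between retries grows by a constant factor after each failure. A minimal sketch of such a schedule, assuming a doubling factor and a cap; the parameter names here are illustrative and are not Fluentd configuration keys.

```python
from typing import List


def retry_waits(base_seconds: float, attempts: int, max_wait: float) -> List[float]:
    """Return the wait before each retry: base, 2*base, 4*base, ... capped at max_wait."""
    return [min(base_seconds * (2 ** n), max_wait) for n in range(attempts)]


# Example: start at 1 second, retry 5 times, never wait more than 10 seconds.
schedule = retry_waits(1.0, 5, 10.0)
```

Doubling the wait keeps a struggling destination from being hammered, while the cap bounds how long a chunk can sit between attempts.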
![Page 22: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/22.jpg)
TAILING FILE INPUT
Read a log file
Custom regexp
Custom parser in Ruby
Supported formats:
• apache • apache_error • apache2 • nginx
• json • csv • tsv • ltsv
• syslog • multiline • none
[Diagram: access.log, pos file]
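The pos file in the diagram can be understood as a saved byte offset, so that a restart resumes where the reader left off. The sketch below shows that idea only. It is not how in_tail is implemented, it ignores log rotation, and `read_new_lines` and the file names are hypothetical.

```python
import os
import tempfile


def read_new_lines(log_path, pos_path):
    """Return complete lines appended to log_path since the previous call."""
    offset = 0
    if os.path.exists(pos_path):
        with open(pos_path) as pos_file:
            offset = int(pos_file.read().strip() or 0)

    with open(log_path, "rb") as log_file:
        log_file.seek(offset)
        data = log_file.read()

    # Consume only up to the last newline so that a half-written line
    # is left for the next call.
    consumed = data.rfind(b"\n") + 1
    lines = data[:consumed].decode("utf-8").splitlines()

    with open(pos_path, "w") as pos_file:
        pos_file.write(str(offset + consumed))
    return lines


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    log_path = os.path.join(workdir, "access.log")
    pos_path = os.path.join(workdir, "access.log.pos")

    with open(log_path, "w") as f:
        f.write("GET /a\nGET /b\n")
    first_read = read_new_lines(log_path, pos_path)

    with open(log_path, "a") as f:
        f.write("GET /c\nGET /partial")
    second_read = read_new_lines(log_path, pos_path)
```

In the demonstration, the second call returns only the newly completed line; the unfinished `GET /partial` stays unread until its newline arrives.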
![Page 23: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/23.jpg)
OUT TO MULTIPLE LOCATIONS
Routing based on tags
Copy to multiple storage destinations
[Diagram: access.log, in_tail, buffer]
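Tag-based routing with copying can be sketched as a table of patterns, each mapped to one or more destinations. This is a simplification: it uses Python's `fnmatch` wildcards, which do not follow Fluentd's own tag-matching rules, and the plain lists standing in for storage destinations are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, List, Tuple

Output = List[Tuple[str, Dict[str, Any]]]  # a list stands in for a storage destination


def route(tag: str, record: Dict[str, Any], routes: List[Tuple[str, List[Output]]]) -> int:
    """Deliver the record to every output of the first route whose pattern matches.

    Returns the number of outputs that received a copy (0 if nothing matched).
    """
    for pattern, outputs in routes:
        if fnmatchcase(tag, pattern):
            for output in outputs:
                output.append((tag, dict(record)))  # each destination gets its own copy
            return len(outputs)
    return 0
```

Listing two outputs under one pattern is what "copy to multiple destinations" amounts to in this sketch; a catch-all `*` route at the end handles everything else.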
![Page 24: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/24.jpg)
H.A. CONFIGURATION (HIGH AVAILABILITY)
Retry automatically
Exponential retry wait
Persistent on a disk
Automatic fail-over
Load balancing
[Diagram: access.log, in_tail, buffer]
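Fail-over and load balancing can be combined by rotating through destinations and skipping any that fail. A minimal sketch under that assumption; the `ForwardPool` class is hypothetical and much simpler than a real forwarder (no health checks, weights, or standby nodes).

```python
from typing import Any, Callable, List


class ForwardPool:
    """Round-robin over destinations, failing over to the next one on error."""

    def __init__(self, senders: List[Callable[[Any], None]]):
        self._senders = list(senders)
        self._next = 0  # index of the destination to try first

    def send(self, chunk: Any) -> int:
        """Send the chunk and return the index of the destination that accepted it."""
        total = len(self._senders)
        for step in range(total):
            index = (self._next + step) % total
            try:
                self._senders[index](chunk)
            except ConnectionError:
                continue  # fail over to the next destination
            self._next = (index + 1) % total  # spread load across destinations
            return index
        raise ConnectionError("all destinations failed")
```

If every destination fails, the error is raised to the caller, which is where a buffer with retry would take over.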
![Page 25: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/25.jpg)
FOR HADOOP USERS
Retry automatically
Exponential retry wait
Persistent on a disk
Custom text formatter
Slice files based on time:
2016-01-01/01/access.log.gz
2016-01-01/02/access.log.gz
2016-01-01/03/access.log.gz
…
[Diagram: access.log, in_tail, buffer]
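The time-sliced file names on this slide follow a date/hour layout. A short sketch of how such paths could be derived from event timestamps; the function names are hypothetical and the hourly granularity is taken from the paths shown on the slide.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def sliced_path(event_time: datetime, filename: str = "access.log.gz") -> str:
    """Build a date/hour path such as 2016-01-01/01/access.log.gz."""
    return event_time.strftime("%Y-%m-%d/%H/") + filename


def group_by_slice(events: Iterable[Tuple[datetime, str]]) -> Dict[str, List[str]]:
    """Group (time, line) pairs by the hourly file they belong to."""
    slices = defaultdict(list)
    for event_time, line in events:
        slices[sliced_path(event_time)].append(line)
    return dict(slices)
```

Hourly slices let a downstream batch job pick up one completed hour at a time.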
![Page 26: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/26.jpg)
HADOOP INTEGRATION INTO S3
Retry automatically
Exponential retry wait
Persistent on a disk
Slice files based on time:
2016-01-01/01/access.log.gz
2016-01-01/02/access.log.gz
2016-01-01/03/access.log.gz
…
[Diagram: access.log, in_tail, buffer]
![Page 27: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/27.jpg)
3RD PARTY INPUT PLUGINS
dstat, df, AMQP, munin, jvmwatcher, SQL
![Page 28: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/28.jpg)
3RD PARTY OUTPUT PLUGINS
AMQP, Graphite
![Page 29: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/29.jpg)
REAL WORLD USE CASES
![Page 30: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/30.jpg)
HIGH-VOLUME FORWARDING
Treasure Data
- At-most-once / At-least-once
- HA (failover)
- Load-balancing
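The difference between the two delivery modes can be shown with a sender that either gives up after one attempt or keeps resending until it sees an acknowledgement. This is a conceptual sketch with hypothetical function names, not the forwarding protocol itself.

```python
from typing import Any, Callable


def deliver_at_most_once(chunk: Any, send: Callable[[Any], None]) -> bool:
    """Try once; on failure the chunk is dropped, so it is never duplicated."""
    try:
        send(chunk)
        return True
    except ConnectionError:
        return False


def deliver_at_least_once(chunk: Any, send: Callable[[Any], None], max_attempts: int = 5) -> bool:
    """Resend until send() returns without error (treated here as the acknowledgement).

    If the receiver stored the chunk but the acknowledgement was lost,
    the resend produces a duplicate on the receiving side.
    """
    for _ in range(max_attempts):
        try:
            send(chunk)
            return True
        except ConnectionError:
            continue
    return False
```

At-most-once can lose data but never duplicates it; at-least-once avoids loss at the cost of possible duplicates, so the receiver may need to deduplicate.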
![Page 31: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/31.jpg)
NEAR REALTIME AND BATCH COMBO
Hot data
All data
![Page 32: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/32.jpg)
EXAMPLE CONFIGURATION FOR REAL TIME BATCH COMBO
![Page 33: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/33.jpg)
CEP FOR STREAM PROCESSING
Norikra is a SQL-based CEP engine: http://norikra.github.io/
![Page 34: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/34.jpg)
CONTAINER LOGGING
Treasure Data
![Page 35: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/35.jpg)
FLUENTD IN PRODUCTION
![Page 36: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/36.jpg)
MICROSOFT
Operations Management Suite uses Fluentd: "The core of the agent uses an existing open source data aggregator called Fluentd. Fluentd has hundreds of existing plugins, which will make it really easy for you to add new data sources."
[Diagram: Linux computer with data sources (Syslog, Operating System, Apache, MySQL, Containers), omsagent, omsconfig (DSC), PS DSC Providers, and OMI Server (CIM Server); a firewall / proxy; OMS Service. Arrows: Upload Data (HTTPS), Pull configuration (HTTPS)]
![Page 37: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/37.jpg)
ATLASSIAN
"At Atlassian, we've been impressed by Fluentd and have chosen to use it in Atlassian Cloud's logging and analytics pipeline."
[Diagram: Kinesis, Elasticsearch cluster, ingestion service]
![Page 38: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/38.jpg)
AMAZON WEB SERVICES
The architecture of Fluentd (Sponsored by Treasure Data) is very similar to Apache Flume or Facebook’s Scribe. Fluentd is easier to install and maintain and has better documentation and support than Flume and Scribe.
Types of Data (Collect → Store)
Transactional: database reads & writes (OLTP), cache → Database
Search: logs, streams → Search
File: log files (/var/log), log collectors & frameworks → File Storage
Stream: log records, sensors & IoT data → Stream Storage
Sources: Web Apps, Mobile Apps, Applications, Logging, IoT
![Page 39: DataEngConf SF16 - Collecting and Moving Data at Scale](https://reader037.fdocuments.in/reader037/viewer/2022110109/587e47ef1a28abeb1a8b46e1/html5/thumbnails/39.jpg)
THANK YOU!