Log Data Analysis Platform
Uploaded by: valentin-kropov
Category: Data & Analytics
LOG DATA ANALYSIS PLATFORM
May, 2015
Agenda
1) User-Group Introduction
2) Problem Statement
3) Log Data Analysis System Overview
4) Task Analysis
5) Solution Architecture
6) Trade-off Analysis
7) Automation
8) Performance Testing
9) Outcome & Plans
PROBLEM STATEMENT
Demo Lab: Why did we start this project?
1) Increase Internal Experience
2) Create Reference Solution w/o NDA Limitations
3) Get Playground for Tests
4) Provide Demo Environment for Customers (using their data)
5) Decrease time to Market (by introducing automation)
LOG DATA ANALYSIS PLATFORM: OVERVIEW
Log Data Analysis Platform Details
Key Facts:
• ~270-300 Web Servers
• Log Types: HTTPD Access logs, Error logs, Application Server Servlet, OS Service Logs
• ~500K events per minute
• 150GB of data per day
Technologies:
• Flume
• Hadoop/HDFS, MapReduce
• Hive, Impala
• Oozie
• Elasticsearch, Kibana 3
• Tableau Analytics platform
• Puppet + Vagrant
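The two load figures in Key Facts imply a rough per-event size. The arithmetic below is only a back-of-the-envelope sketch: it assumes the ~500K events/minute rate is sustained around the clock and that "GB" means decimal gigabytes, neither of which the slides state.

```python
# Back-of-the-envelope figures derived from the Key Facts.
# Assumptions (not from the slides): constant rate all day, 1 GB = 10**9 bytes.
EVENTS_PER_MINUTE = 500_000       # "~500K events per minute"
BYTES_PER_DAY = 150 * 10**9       # "150GB of data per day"

events_per_second = EVENTS_PER_MINUTE / 60
events_per_day = EVENTS_PER_MINUTE * 60 * 24
avg_event_bytes = BYTES_PER_DAY / events_per_day

print(f"{events_per_second:,.0f} events/s")
print(f"{events_per_day:,} events/day")
print(f"{avg_event_bytes:.0f} bytes/event on average")
```

Under these assumptions the average event is on the order of 200 bytes, which is in line with the length of the sample log lines.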
Log Data Examples
Access log:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
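A line like the one above can be split into fields with a regular expression. This is a minimal sketch, assuming the Apache Common Log Format; the field names and the `parse_access_line` helper are illustrative and are not taken from the platform itself.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_access_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    # Month names such as "Oct" are parsed using the current (English/C) locale.
    event["time"] = datetime.strptime(event["time"], "%d/%b/%Y:%H:%M:%S %z")
    event["status"] = int(event["status"])
    # Apache writes "-" when no body bytes were sent.
    event["size"] = 0 if event["size"] == "-" else int(event["size"])
    return event

sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
event = parse_access_line(sample)
print(event["host"], event["user"], event["status"], event["size"])
```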
Error log:
[Sun Mar 7 20:58:27 2004] [info] [client 64.242.88.10] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 21:16:17 2004] [error] [client 24.70.56.49] File does not exist: /home/httpd/twiki/view/Main/WebHome
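Error log entries follow a different, bracketed layout. The sketch below assumes each entry sits on a single line in the classic Apache 2.x error log layout shown above; `parse_error_line` and its field names are hypothetical.

```python
import re
from datetime import datetime

# [timestamp] [level] [client ip] message   (the client part is optional)
ERROR_PATTERN = re.compile(
    r'\[(?P<time>[^\]]+)\] \[(?P<level>\w+)\] '
    r'(?:\[client (?P<client>[^\]]+)\] )?(?P<message>.*)'
)

def parse_error_line(line):
    """Return a dict with time, level, client and message, or None on mismatch."""
    match = ERROR_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%a %b %d %H:%M:%S %Y")
    return entry

sample = ('[Sun Mar 7 21:16:17 2004] [error] [client 24.70.56.49] '
          'File does not exist: /home/httpd/twiki/view/Main/WebHome')
entry = parse_error_line(sample)
print(entry["level"], entry["client"], entry["message"])
```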
vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 305416 260688  29160 2356920    2    2     4     1    0    0  6  1 92  2  0
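Because vmstat output is column-oriented, the column header can be zipped with each data row. This is a small sketch that assumes the three-line layout shown above (group header, column names, then data rows); `parse_vmstat` is an illustrative helper, not part of the platform.

```python
VMSTAT_SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 305416 260688  29160 2356920    2    2     4     1    0    0  6  1 92  2  0
"""

def parse_vmstat(text):
    """Map each data row to {column name: integer value}."""
    lines = text.strip().splitlines()
    columns = lines[1].split()          # second line holds the column names
    return [dict(zip(columns, map(int, row.split()))) for row in lines[2:]]

rows = parse_vmstat(VMSTAT_SAMPLE)
print(rows[0]["free"], rows[0]["id"])
```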
iostat
Linux 2.6.32-100.28.5.el6.x86_64 (dev-db) 07/09/2011
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.68    0.00    0.52    2.03    0.00   91.76
TASK ANALYSIS
Architecture Drivers: Use Cases
Architecture Drivers: Quality Attributes (1/3)
Architecture Drivers: Quality Attributes (2/3)
Architecture Drivers: Quality Attributes (3/3)
Architecture Drivers: Limitations
Demo Lab: Marketecture
SOLUTION ARCHITECTURE
Solution Architecture
[Diagram]
Layers: Batch Layer, Serving Layer, Speed Layer
Components: Raw Data Storage, Data Stream, Real-time Views, Static Views Precomputing, Precomputing, Ad-hoc Batch Views, Static Batch Views, Corporate BI Tool
Legend: layer boundary; data flow (with direction indicated); query flow
[Diagram]
Components: Apache HTTP Servers, Data Stream, Raw Data Storage, Pre-computing, Batch Views, Real-Time Processing and Aggregations, Real-Time Views, Dashboard/Search, BI Tool
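The split into batch, speed, and serving layers resembles what is commonly called a Lambda architecture. The toy sketch below only illustrates that general idea: a batch view recomputed from all raw data, a real-time view updated per event, and a query that combines them. All names are hypothetical, and the slides do not say whether the platform merges both views in a single query.

```python
from collections import Counter

def build_batch_view(raw_events):
    """Batch layer: recompute hits per status code from all stored raw events."""
    return Counter(e["status"] for e in raw_events)

class RealTimeView:
    """Speed layer: incrementally count events not yet covered by a batch run."""
    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["status"]] += 1

def query(batch_view, realtime_view, status):
    """Serving side: combine the precomputed batch view with the real-time view."""
    return batch_view[status] + realtime_view.counts[status]

# Events already in raw storage vs. events that arrived after the last batch run.
stored = [{"status": 200}, {"status": 200}, {"status": 404}]
batch_view = build_batch_view(stored)

realtime = RealTimeView()
for fresh in [{"status": 200}, {"status": 500}]:
    realtime.on_event(fresh)

print(query(batch_view, realtime, 200))
```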
Avro as a Raw Data Storage file format
Parquet as a Batch Views file format
Star schema as a Batch Views data model
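To make the star-schema note concrete, the sketch below splits parsed access events into two dimension tables and one fact table. The choice of dimensions (host, path) and measures (status, bytes) is an assumption for illustration; the slides do not list the actual tables, and real batch views would be written as Parquet rather than kept in memory.

```python
def to_star_schema(events):
    """Split flat events into dimension lookups and fact rows with surrogate keys."""
    dims = {"host": {}, "path": {}}
    facts = []

    def key(dim, value):
        table = dims[dim]
        # Assign the next surrogate id on first sight, reuse it afterwards.
        return table.setdefault(value, len(table) + 1)

    for e in events:
        facts.append({
            "host_id": key("host", e["host"]),
            "path_id": key("path", e["path"]),
            "status": e["status"],
            "bytes": e["size"],
        })
    return dims, facts

events = [
    {"host": "10.0.0.1", "path": "/index.html", "status": 200, "size": 512},
    {"host": "10.0.0.1", "path": "/missing", "status": 404, "size": 0},
    {"host": "10.0.0.2", "path": "/index.html", "status": 200, "size": 512},
]
dims, facts = to_star_schema(events)
print(dims["host"])
print(facts[2])
```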
Architecture: Flume Topology
Batch ETL
TRADE-OFF ANALYSIS
Distribution Selection
Hive Stinger vs Impala
Compression Ratio
Access Speed
AUTOMATION
Automation (saves time and money)
80%: Development and Debugging (Local Development)
20%: F&P Testing, Demo (Cloud Development)
vagrant up
Automation Process
Phase (Tool): Notes
1) VM Provisioning (Vagrant)
• Supports: VirtualBox, VMWare ESX, Amazon AWS
2) VM Bootstrapping (Puppet)
• Installs Cloudera Manager, Cloudera Distribution Hadoop, ElasticSearch+Kibana, Flume, Microstrategy, LogGenerator.
• Creates Cluster using Cloudera Manager API.
3) Configure ETL and BI (Puppet)
• Configures Flume, Oozie, ElasticSearch, Impala, Hive, Microstrategy Dashboards
4) Integration Tests (Puppet)
• Generates Workload and ensures data goes through.
• Checks Logs for errors.
• Calculates timing/throughput.
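The three integration-test checks (data goes through, no errors in the logs, timing/throughput) can be expressed as one small function. According to the slides the real tests are driven by Puppet, so the Python below is purely illustrative; `check_run`, its parameters, and the "ERROR" marker are all assumptions.

```python
def check_run(sent_events, received_events, log_lines, elapsed_seconds):
    """Summarize one test run: delivery, error lines, and throughput."""
    errors = [line for line in log_lines if "ERROR" in line]
    return {
        "all_delivered": received_events == sent_events,
        "errors": errors,
        "throughput_eps": received_events / elapsed_seconds,
    }

report = check_run(
    sent_events=1000,
    received_events=1000,
    log_lines=["INFO agent started", "ERROR sink failed"],
    elapsed_seconds=4.0,
)
print(report)
```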
PERFORMANCE TESTING
Log Generator
1 thread can generate:
• 4200 events / second (File source)
• 5500 events / second (TCP source)
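These per-thread rates give a rough idea of how many generator threads a target load needs. The sketch assumes throughput scales linearly with the number of threads, which the slides do not claim; `threads_needed` is a hypothetical helper.

```python
import math

# Per-thread rates quoted for the Log Generator (events per second).
EVENTS_PER_SEC_PER_THREAD = {"file": 4200, "tcp": 5500}

def threads_needed(target_events_per_minute, source):
    """Minimum thread count for a target rate, assuming linear scaling."""
    per_second = target_events_per_minute / 60
    return math.ceil(per_second / EVENTS_PER_SEC_PER_THREAD[source])

# ~500K events per minute is the production-like load from the overview.
print(threads_needed(500_000, "file"))
print(threads_needed(500_000, "tcp"))
```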
Accurate Sizing
[Chart] Load levels: 20k/min, 50k/min, 100k/min, 200k/min
Calculator!
OUTCOME & PLANS
Outcome
1) Demo lab, playground, testing platform (in 1 hour)
2) Sizing Calculator
3) Helped win 3 new customers (one is really, really huge)
4) Strategic Partnership with Cloudera
5) Tons of experience and fun
Plans
1) Add support for other Hadoop Distributions (Hortonworks, MapR)
2) Make Project Open-Source
Thank You!
SoftServe US Office
One Congress Plaza,
111 Congress Avenue, Suite 2700
Austin, TX 78701
Tel: 512.516.8880
Contacts: Valentyn Kropov
Tel: 866.687.3588 x4341