Log Data Analysis Platform by Valentin Kropov

LOG DATA ANALYSIS PLATFORM May, 2015

Transcript of Log Data Analysis Platform by Valentin Kropov


Agenda

1) User-Group Introduction

2) Problem Statement

3) Log Data Analysis System Overview

4) Task Analysis

5) Solution Architecture

6) Trade-off Analysis

7) Automation

8) Performance Testing

9) Outcome & Plans

PROBLEM STATEMENT

Demo Lab: Why did we start this project?

1) Increase Internal Experience

2) Create Reference Solution w/o NDA Limitations

3) Get Playground for Tests

4) Provide Demo Environment for Customers (using their data)

5) Decrease Time to Market (by introducing automation)

LOG DATA ANALYSIS PLATFORM : OVERVIEW

Log Data Analysis Platform Details

Key Facts:
• ~270-300 Web Servers
• Log Types: HTTPD Access logs, Error logs, Application Server Servlet, OS Service Logs
• ~500K events per minute
• 150GB of data per day

Technologies:
• Flume
• Hadoop/HDFS, MapReduce
• Hive, Impala
• Oozie
• Elasticsearch, Kibana 3
• Tableau Analytics platform
• Puppet + Vagrant
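The key facts above imply an average event size, which is handy as a sanity check when sizing. The sketch below does that arithmetic under two assumptions the slides do not state: the ~500K events/minute rate is sustained for the whole day, and 1 GB means 10**9 bytes.

```python
# Back-of-the-envelope check of the key facts (assumptions: the ~500K
# events/minute rate is sustained all day, and 1 GB = 10**9 bytes).
EVENTS_PER_MINUTE = 500_000
DATA_PER_DAY_BYTES = 150 * 10**9

events_per_second = EVENTS_PER_MINUTE / 60
events_per_day = EVENTS_PER_MINUTE * 60 * 24
avg_event_bytes = DATA_PER_DAY_BYTES / events_per_day

print(f"events/second: {events_per_second:,.0f}")
print(f"events/day: {events_per_day:,}")
print(f"average event size: {avg_event_bytes:.0f} bytes")
```

Under those assumptions the platform sees roughly 8,300 events per second, and an average event is a little over 200 bytes, which is plausible for a single access-log line.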

Log Data Examples

Access log:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

Error log:
[Sun Mar 7 20:58:27 2004] [info] [client 64.242.88.10] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 21:16:17 2004] [error] [client 24.70.56.49] File does not exist: /home/httpd/twiki/view/Main/WebHome

Vmstat:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 305416 260688 29160 2356920 2 2 4 1 0 0 6 1 92 2 0

iostat:
Linux 2.6.32-100.28.5.el6.x86_64 (dev-db) 07/09/2011
avg-cpu: %user %nice %system %iowait %steal %idle
5.68 0.00 0.52 2.03 0.00 91.76
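To make the access-log example concrete, here is a minimal sketch of turning one such line into structured fields. It assumes the Apache Common Log Format shown in the sample; the regular expression and the function name `parse_access_line` are illustrative and not taken from the platform itself.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "request" status size
ACCESS_LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_access_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = ACCESS_LOG_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # Apache writes "-" when no body bytes were sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
event = parse_access_line(sample)
print(event["host"], event["user"], event["status"], event["size"])
```

Returning None for lines that do not match keeps malformed input from stopping a batch; a real pipeline would likely count or quarantine such lines.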

TASK ANALYSIS

Architecture Drivers: Use Cases

Architecture Drivers: Quality Attributes (1/3)

Architecture Drivers: Quality Attributes (2/3)

Architecture Drivers: Quality Attributes (3/3)

Architecture Drivers: Limitations

Demo Lab: Marketecture

SOLUTION ARCHITECTURE

Solution Architecture

[Diagram: layered view. Layers: Batch Layer, Serving Layer, Speed Layer. Components: Data Stream, Raw Data Storage, Precomputing, Static Batch Views, Ad-hoc Batch Views, Real-time Views, Corporate BI Tool. Legend: layer boundary; data flow (with direction indicated); query flow.]

[Diagram: component view. Components: Apache HTTP Servers, Data Stream, Raw Data Storage, Pre-computing Batch Views, BI Tool, Real-Time Processing and Aggregations, Real-Time Views, Dashboard/Search.]
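The batch, speed, and serving layers named above match the split commonly called the Lambda architecture, although the slides do not use that name. As a rough sketch of the idea, the code below shows a query that merges a precomputed batch view with an incrementally updated real-time view. Every name and number in it is hypothetical.

```python
from collections import Counter

# Hypothetical batch view: hits per status code, precomputed from raw storage
# up to the last batch run.
batch_view = Counter({200: 1_000_000, 404: 12_000, 500: 300})

# Hypothetical real-time view: hits seen in the stream since that batch run.
realtime_view = Counter()

def on_stream_event(status):
    """Speed layer: update the real-time view incrementally per event."""
    realtime_view[status] += 1

def query_hits(status):
    """Serving side: merge the batch and real-time views at query time."""
    return batch_view[status] + realtime_view[status]

for status in (200, 200, 404, 500, 200):
    on_stream_event(status)

print(query_hits(200))
print(query_hits(404))
```

The batch view is complete but stale and the real-time view is fresh but partial, so combining the two at query time gives an answer that is both.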

Avro as a Raw Data Storage file format

Parquet as a Batch Views file format

Star schema as a Batch Views data model
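To illustrate what a star schema for batch views could look like, the sketch below builds one fact table with two dimension tables and runs a typical BI-style aggregation. It uses an in-memory SQLite database purely so the example is runnable; the deck names Parquet as the batch view format, and all table and column names here are hypothetical.

```python
import sqlite3

# In-memory SQLite is used only for illustration; it stands in for the
# Parquet-backed batch views named in the deck.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_host   (host_id INTEGER PRIMARY KEY, hostname TEXT);
CREATE TABLE dim_status (status_id INTEGER PRIMARY KEY, code INTEGER,
                         status_class TEXT);
CREATE TABLE fact_request (
    host_id    INTEGER REFERENCES dim_host(host_id),
    status_id  INTEGER REFERENCES dim_status(status_id),
    bytes_sent INTEGER
);
INSERT INTO dim_host VALUES (1, 'web-01'), (2, 'web-02');
INSERT INTO dim_status VALUES (1, 200, 'success'), (2, 404, 'client error');
INSERT INTO fact_request VALUES (1, 1, 2326), (1, 2, 512), (2, 1, 1024);
""")

# Typical star-schema query: join the fact table to its dimensions, then
# aggregate by dimension attributes.
rows = conn.execute("""
    SELECT h.hostname, s.status_class,
           COUNT(*) AS requests, SUM(f.bytes_sent) AS bytes
    FROM fact_request f
    JOIN dim_host h ON h.host_id = f.host_id
    JOIN dim_status s ON s.status_id = f.status_id
    GROUP BY h.hostname, s.status_class
    ORDER BY h.hostname, s.status_class
""").fetchall()

for row in rows:
    print(row)
```

Keeping measures in a narrow fact table and descriptive attributes in small dimension tables is what lets BI tools generate such joins mechanically.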

Architecture: Flume Topology

Batch ETL

TRADE-OFF ANALYSIS

Distribution Selection

Hive Stinger vs Impala

Compression Ratio

Access Speed

AUTOMATION

Automation (saves time and money)

80%: Local Development (Development and Debugging)

20%: Cloud Development (F&P Testing, Demo)

vagrant up

Automation Process

Phase (Tool): Notes

1) VM Provisioning (Vagrant): Supports VirtualBox, VMware ESX, Amazon AWS.

2) VM Bootstrapping (Puppet): Installs Cloudera Manager, Cloudera Distribution Hadoop, Elasticsearch + Kibana, Flume, MicroStrategy, Log Generator. Creates the cluster using the Cloudera Manager API.

3) Configure ETL and BI (Puppet): Configures Flume, Oozie, Elasticsearch, Impala, Hive, MicroStrategy Dashboards.

4) Integration Tests (Puppet): Generates workload and ensures data goes through. Checks logs for errors. Calculates timing/throughput.
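The integration-test phase above performs three checks: data flows through, logs contain no errors, and throughput is measured. The slides run these through Puppet and do not show the implementation, so the following is only a language-neutral sketch of that logic in Python. The function name, the log format, and the "ERROR" marker are assumptions.

```python
def check_run(log_lines, events_sent, elapsed_seconds):
    """Summarize one test run: error lines found and achieved throughput."""
    errors = [line for line in log_lines if "ERROR" in line]
    throughput = events_sent / elapsed_seconds if elapsed_seconds > 0 else 0.0
    return {
        "errors": errors,
        "events_per_second": throughput,
        # A run passes only if some data went through and no errors were logged.
        "passed": not errors and events_sent > 0,
    }

# Hypothetical log excerpt and counters from a generated workload.
sample_log = [
    "2015-05-01 10:00:00 INFO  sink started",
    "2015-05-01 10:00:05 ERROR failed to deliver batch",
    "2015-05-01 10:00:06 INFO  retry succeeded",
]
result = check_run(sample_log, events_sent=42_000, elapsed_seconds=10)
print(result["passed"], len(result["errors"]), result["events_per_second"])
```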

PERFORMANCE TESTING

Log Generator

1 Thread can generate:
• 4200 events / second (File source)
• 5500 events / second (TCP source)
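Given these per-thread rates, one can estimate how many generator threads a target workload needs. The sketch below assumes throughput scales linearly with the number of threads, which the slides do not claim; the function name `threads_needed` is illustrative.

```python
import math

# Per-thread generator rates from the slide (events per second).
RATE_PER_THREAD = {"file": 4200, "tcp": 5500}

def threads_needed(target_events_per_minute, source):
    """Smallest whole number of generator threads that meets the target rate.

    Assumes linear scaling across threads (an assumption, not a measured fact).
    """
    target_per_second = target_events_per_minute / 60
    return math.ceil(target_per_second / RATE_PER_THREAD[source])

for source in RATE_PER_THREAD:
    print(source, threads_needed(500_000, source))
```

For the ~500K events per minute quoted in the key facts, this works out to two threads for either source type.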

Accurate Sizing

Workload levels shown: 20k/min, 50k/min, 100k/min, 200k/min

Calculator!

OUTCOME & PLANS

Outcome

1) Demo lab, playground, testing platform (in 1 hour)

2) Sizing Calculator

3) Helped win 3 new customers (one is really, really huge)

4) Strategic Partnership with Cloudera

5) Tons of experience and fun

Plans

1) Add support for other Hadoop Distributions (Hortonworks, MapR)

2) Make Project Open-Source


Thank You!

SoftServe US Office: One Congress Plaza, 111 Congress Avenue, Suite 2700, Austin, TX 78701. Tel: 512.516.8880

Contacts Valentyn [email protected]: 866.687.3588 x4341