Log Data Analysis Platform


Transcript of Log Data Analysis Platform

Page 1: Log Data Analysis Platform

LOG DATA ANALYSIS PLATFORM

May, 2015

Page 2: Log Data Analysis Platform

Agenda

1) User-Group Introduction

2) Problem Statement

3) Log Data Analysis System Overview

4) Task Analysis

5) Solution Architecture

6) Trade-off Analysis

7) Automation

8) Performance Testing

9) Outcome & Plans

Page 3: Log Data Analysis Platform

PROBLEM STATEMENT

Page 4: Log Data Analysis Platform

Demo Lab: Why did we start this project?

1) Increase Internal Experience

2) Create Reference Solution w/o NDA Limitations

3) Get Playground for Tests

4) Provide Demo Environment for Customers (using their data)

5) Decrease Time to Market (by introducing automation)

Page 5: Log Data Analysis Platform

LOG DATA ANALYSIS PLATFORM: OVERVIEW

Page 6: Log Data Analysis Platform

Log Data Analysis Platform Details

Key Facts:

• ~270-300 Web Servers

• Log Types: HTTPD Access logs, Error logs, Application Server Servlet logs, OS Service Logs

• ~500K events per minute

• 150GB of data per day

Technologies:

• Flume

• Hadoop/HDFS, MapReduce

• Hive, Impala

• Oozie

• Elasticsearch, Kibana 3

• Tableau Analytics platform

• Puppet + Vagrant
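
The Key Facts above imply a few derived figures that are useful when reading the rest of the deck. The following is a rough sketch of that arithmetic, not a calculation taken from the slides: it assumes 1 GB = 10^9 bytes, a rate sustained evenly over 24 hours, and that the 150GB figure refers to the same events as the ~500K events per minute.

```python
# Figures quoted on the slide.
EVENTS_PER_MINUTE = 500_000
DATA_PER_DAY_GB = 150
WEB_SERVERS_MIN, WEB_SERVERS_MAX = 270, 300

# Assumptions (not from the slides): decimal gigabytes, constant load all day.
BYTES_PER_GB = 10**9
MINUTES_PER_DAY = 24 * 60

events_per_second = EVENTS_PER_MINUTE / 60
events_per_day = EVENTS_PER_MINUTE * MINUTES_PER_DAY
avg_event_bytes = DATA_PER_DAY_GB * BYTES_PER_GB / events_per_day

# Per-server rate range: fewer servers means more events per server.
per_server_low = events_per_second / WEB_SERVERS_MAX
per_server_high = events_per_second / WEB_SERVERS_MIN

print(f"events/second:      {events_per_second:,.0f}")
print(f"events/day:         {events_per_day:,}")
print(f"avg bytes/event:    {avg_event_bytes:.0f}")
print(f"events/s per server: {per_server_low:.1f} to {per_server_high:.1f}")
```

Under these assumptions the platform handles roughly 8,300 events per second, and an average event is on the order of 200 bytes, which is consistent with a single access-log line.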

Page 7: Log Data Analysis Platform

Log Data Examples

Access log:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

Error log:

[Sun Mar 7 20:58:27 2004] [info] [client 64.242.88.10] (104)Connection reset by peer: client stopped connection before send body completed

[Sun Mar 7 21:16:17 2004] [error] [client 24.70.56.49] File does not exist: /home/httpd/twiki/view/Main/WebHome

vmstat:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 305416 260688  29160 2356920    2    2     4     1    0    0  6  1 92  2  0

iostat:

Linux 2.6.32-100.28.5.el6.x86_64 (dev-db) 07/09/2011

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.68    0.00    0.52    2.03    0.00   91.76
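
To make the access-log and error-log examples above concrete, here is a minimal parsing sketch. The regular expressions and the function names `parse_access_line` and `parse_error_line` are illustrative assumptions; the slides do not show how the platform itself parses these lines.

```python
import re
from datetime import datetime

# Apache Common Log Format: host ident user [time] "request" status size
ACCESS_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Apache 2.0-style error log: [time] [level] [client ip] message
ERROR_RE = re.compile(
    r'\[(?P<time>[^\]]+)\] \[(?P<level>\w+)\] '
    r'\[client (?P<client>[^\]]+)\] (?P<message>.*)'
)


def parse_access_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    m = ACCESS_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    # A "-" size means no body was sent; treat it as zero bytes.
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec


def parse_error_line(line):
    """Return a dict of string fields, or None if the line does not match."""
    m = ERROR_RE.match(line)
    return m.groupdict() if m else None


access = parse_access_line(
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
error = parse_error_line(
    '[Sun Mar 7 21:16:17 2004] [error] [client 24.70.56.49] '
    'File does not exist: /home/httpd/twiki/view/Main/WebHome'
)
print(access["user"], access["status"], access["size"])
print(error["level"], error["client"])
```

Returning `None` for non-matching lines lets a caller count and route malformed input separately instead of failing the whole batch.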

Page 8: Log Data Analysis Platform

TASK ANALYSIS

Page 9: Log Data Analysis Platform

Architecture Drivers: Use Cases

Page 10: Log Data Analysis Platform

Architecture Drivers: Quality Attributes (1/3)

Page 11: Log Data Analysis Platform

Architecture Drivers: Quality Attributes (2/3)

Page 12: Log Data Analysis Platform

Architecture Drivers: Quality Attributes (3/3)

Page 13: Log Data Analysis Platform

Architecture Drivers: Limitations

Page 14: Log Data Analysis Platform

Demo Lab: Marketecture

Page 15: Log Data Analysis Platform

SOLUTION ARCHITECTURE

Page 16: Log Data Analysis Platform

Solution Architecture

[Diagram 1: layered view]

Layers: Batch Layer, Serving Layer, Speed Layer

Components: Raw Data Storage, Data Stream, Real-time Views, Static Views Precomputing, Precomputing, Ad-hoc Batch Views, Static Batch Views, Corporate BI Tool

Legend: Layer boundary; Data flow (with direction indicated); Query flow

[Diagram 2: component view]

Components: Apache HTTP Servers, Data Stream, Raw Data Storage, Pre-computing Batch Views, Real-Time Processing and Aggregations, Real-Time Views, Dashboard/Search, BI Tool

Avro as a Raw Data Storage file format

Parquet as a Batch Views file format

Star schema as a Batch Views data model
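
The split between a batch path and a speed path can be illustrated with a toy in-memory model. This is only a sketch of the general pattern the slide names: the classes `BatchLayer` and `SpeedLayer`, the `ingest` function, and the event fields are hypothetical, and it assumes that the data stream feeds both paths. The real platform uses Flume, HDFS, and Elasticsearch rather than Python lists and counters.

```python
from collections import Counter


class BatchLayer:
    """Stand-in for raw data storage plus precomputed batch views."""

    def __init__(self):
        self.raw_events = []  # append-only, like a raw data store

    def append(self, event):
        self.raw_events.append(event)

    def precompute_hits_by_status(self):
        # Batch style: recomputed from scratch over all raw events.
        return Counter(e["status"] for e in self.raw_events)


class SpeedLayer:
    """Stand-in for real-time processing and aggregations."""

    def __init__(self):
        self.hits_per_minute = Counter()  # incrementally updated view

    def process(self, event):
        self.hits_per_minute[event["minute"]] += 1


def ingest(stream, batch, speed):
    """Feed every event from the data stream to both paths."""
    for event in stream:
        batch.append(event)
        speed.process(event)


events = [
    {"minute": "13:55", "status": 200},
    {"minute": "13:55", "status": 404},
    {"minute": "13:56", "status": 200},
]
batch, speed = BatchLayer(), SpeedLayer()
ingest(events, batch, speed)
batch_view = batch.precompute_hits_by_status()

print(dict(batch_view))
print(dict(speed.hits_per_minute))
```

The point of the split is that the batch view can always be rebuilt from the raw events, while the speed view is cheap to update but covers only what it has seen.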

Page 17: Log Data Analysis Platform

Architecture: Flume Topology

Page 18: Log Data Analysis Platform

Batch ETL

Page 19: Log Data Analysis Platform

TRADE-OFF ANALYSIS

Page 20: Log Data Analysis Platform

Distribution Selection

Page 21: Log Data Analysis Platform

Hive Stinger vs Impala

Compression Ratio

Access Speed

Page 22: Log Data Analysis Platform

AUTOMATION

Page 23: Log Data Analysis Platform

Automation (saves time and money)

80%: Development and Debugging (Local Development)

20%: F&P Testing, Demo (Cloud Development)

Page 24: Log Data Analysis Platform

vagrant up

Page 25: Log Data Analysis Platform

Automation Process

Phase: VM Provisioning

Tool: Vagrant

Notes:

• Supports: VirtualBox, VMware ESX, Amazon AWS

Phase: VM Bootstrapping

Tool: Puppet

Notes:

• Installs Cloudera Manager, Cloudera Distribution Hadoop, ElasticSearch+Kibana, Flume, Microstrategy, LogGenerator.

• Creates Cluster using Cloudera Manager API.

Phase: Configure ETL and BI

Tool: Puppet

Notes:

• Configures Flume, Oozie, ElasticSearch, Impala, Hive, Microstrategy Dashboards

Phase: Integration Tests

Tool: Puppet

Notes:

• Generates Workload and ensures data flows through.

• Checks Logs for errors.

• Calculates timing/throughput.

Page 26: Log Data Analysis Platform

PERFORMANCE TESTING

Page 27: Log Data Analysis Platform

Log Generator

1 thread can generate:

• 4200 events / second (File source)

• 5500 events / second (TCP source)
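
The per-thread rates above give a quick estimate of how many generator threads a load test needs. The sketch below assumes throughput scales linearly with thread count, which the slides do not claim; the function name `threads_needed` is hypothetical.

```python
import math

# Per-thread generation rates quoted on the slide (events per second).
RATE_PER_THREAD = {"file": 4200, "tcp": 5500}


def threads_needed(target_events_per_minute, source):
    """Estimate generator threads, assuming linear scaling per thread."""
    target_per_second = target_events_per_minute / 60
    return math.ceil(target_per_second / RATE_PER_THREAD[source])


# Target: the ~500K events per minute quoted under Key Facts.
print(threads_needed(500_000, "file"))
print(threads_needed(500_000, "tcp"))
```

In practice the thread count should be confirmed by measurement, since disk or network contention can break the linear assumption.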

Page 28: Log Data Analysis Platform

Accurate Sizing

100k/min

50k/min

20k/min

200k/min

Calculator!
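
The sizing calculator itself is not shown in the transcript. As a loose illustration of the idea only, the sketch below extrapolates daily data volume for the event rates on this slide from the 150GB per day at ~500K events per minute quoted under Key Facts. Linear scaling, the function name `daily_volume_gb`, and the `retention_days` parameter are all assumptions, and replication and compression are ignored.

```python
# Baseline from Key Facts: ~500K events/minute produce 150 GB/day.
BASELINE_EVENTS_PER_MINUTE = 500_000
BASELINE_GB_PER_DAY = 150


def daily_volume_gb(events_per_minute):
    """Linearly extrapolate raw daily volume from the baseline."""
    return BASELINE_GB_PER_DAY * events_per_minute / BASELINE_EVENTS_PER_MINUTE


def retained_volume_gb(events_per_minute, retention_days):
    """Raw volume held for a hypothetical retention window."""
    return daily_volume_gb(events_per_minute) * retention_days


# Event rates shown on the slide.
for rate in (20_000, 50_000, 100_000, 200_000):
    print(f"{rate:>7,}/min -> {daily_volume_gb(rate):5.1f} GB/day")
```

A real calculator would presumably be calibrated against the measured performance tests rather than a single baseline point.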

Page 29: Log Data Analysis Platform

OUTCOME & PLANS

Page 30: Log Data Analysis Platform

Outcome

1) Demo lab, playground, testing platform (in 1 hour)

2) Sizing Calculator

3) Helped to win 3 new customers (one is really, really huge)

4) Strategic Partnership with Cloudera

5) Tons of experience and fun

Plans

1) Add support for other Hadoop Distributions (Hortonworks, MapR)

2) Make Project Open-Source

Page 31: Log Data Analysis Platform

Thank You!

SoftServe US Office

One Congress Plaza, 111 Congress Avenue, Suite 2700, Austin, TX 78701

Tel: 512.516.8880

Contacts: Valentyn Kropov

Tel: 866.687.3588 x4341