SoftServe's Hadoop Demo Lab

Post on 22-May-2015

SoftServe's Hadoop Demo Lab is a project that aggregates log files from 300 Apache HTTPD web servers and loads them into a Hadoop/ElasticSearch cluster for later analysis with Microstrategy and Kibana. The entire deployment is fully automated using Vagrant and Puppet.

Transcript of SoftServe's Hadoop Demo Lab

SOFTSERVE’S HADOOP DEMO LAB

Aug, 2014

Agenda

1) Who we are & What we do

2) Why we started this project

3) High-Level Task Overview

4) Task Analysis

5) Solution Architecture

6) Trade-off Analysis

7) Development Aspects

WHO WE ARE & WHAT WE DO

▪ Leading global Product and Application Development partner founded in 1993

▪ 3,300+ employees across North America, Ukraine and Western Europe

▪ Thousands of successful outsourcing projects!

SaaS/Cloud Solutions . Mobility Solutions . UX/UI . BI/Analytics/Big Data . Software Architecture . Security

Clients include:

Why SoftServe

• Dedicated Architecture Group (including BI and BigData, 40+ architects)

• Demo Hadoop Environment

• Reference architecture library

• 10+ successful BI/BigData projects

• Certified Big Data engineers (Hadoop, MongoDB)

• Partnership with major RDBMS, BI and BigData vendors

What we do: Services

1) Design & Assessment

2) Optimization & Modernization

3) POC & Prototyping

4) Development and Quality Control

5) Production and Non-Production Support

Technology Stack

Data Warehouse

Data Integration

Big Data and NoSQL

BI Platforms

Big Data Analytics Expertise

WHY WE STARTED THIS PROJECT

Demo Lab: Why?

1) Increase Internal Experience

2) Create Reference Solution w/o NDA Limitations

3) Get Playground for Tests

4) Provide Demo Environment for Customers (using their data)

5) Decrease time to Deliver (by introducing automation)

HADOOP DEMO LAB: HIGH-LEVEL TASK OVERVIEW

Demo Lab: Input Data

Data Volume
• 270-300 Web Servers (Apache HTTPD)
• 447 392 events per minute
• 644 245 094 events / day
• ~100-250 bytes per event
• 150 GB of data per day
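The volume figures above can be cross-checked with a few lines of arithmetic. The sketch below is not part of the original deck; the function names are illustrative, and it assumes a constant event rate and decimal gigabytes (10^9 bytes).

```python
# Figures taken from the slide; everything else is an illustrative assumption.
EVENTS_PER_MINUTE = 447_392
MINUTES_PER_DAY = 24 * 60


def events_per_day(events_per_minute: int) -> int:
    """Scale a per-minute event rate to a full day, assuming a constant rate."""
    return events_per_minute * MINUTES_PER_DAY


def daily_volume_gb(events: int, bytes_per_event: int) -> float:
    """Raw daily volume in decimal gigabytes (10**9 bytes)."""
    return events * bytes_per_event / 1e9


if __name__ == "__main__":
    per_day = events_per_day(EVENTS_PER_MINUTE)
    low = daily_volume_gb(per_day, 100)   # lower bound of the event size range
    high = daily_volume_gb(per_day, 250)  # upper bound of the event size range
    print(f"{per_day:,} events/day")
    print(f"{low:.1f} to {high:.1f} GB/day")
```

The product comes out marginally below the slide's 644 245 094 events per day, presumably because the per-minute rate on the slide is rounded. The stated 150 GB per day falls inside the computed range, toward its upper end.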

Log Types
1) Apache HTTPD access log
2) Apache HTTPD error log
3) Service log (CPU, RAM, Disk I/O, Disk Space)
4) Application server servlet log

Retention
• Last 30 days: Raw data
• Last 24 hours: per minute aggregation
• Whole period: per hour aggregation
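The per-minute and per-hour aggregation levels amount to truncating each event timestamp to a bucket and counting events per bucket. A minimal sketch follows, assuming simple event counts are the aggregate of interest; the deck does not say which metrics were aggregated, and the helper names are hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable


def bucket(ts: datetime, granularity: str) -> datetime:
    """Truncate a timestamp to the start of its minute or hour."""
    if granularity == "minute":
        return ts.replace(second=0, microsecond=0)
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported granularity: {granularity}")


def aggregate(timestamps: Iterable[datetime], granularity: str) -> Counter:
    """Count events per bucket at the requested granularity."""
    return Counter(bucket(ts, granularity) for ts in timestamps)


if __name__ == "__main__":
    events = [
        datetime(2014, 8, 1, 13, 55, 36),
        datetime(2014, 8, 1, 13, 55, 59),
        datetime(2014, 8, 1, 13, 56, 1),
    ]
    print(aggregate(events, "minute"))
    print(aggregate(events, "hour"))
```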

Demo Lab: Log Data Examples

Access log:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

Error log:
[Sun Mar 7 20:58:27 2004] [info] [client 64.242.88.10] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 21:16:17 2004] [error] [client 24.70.56.49] File does not exist: /home/httpd/twiki/view/Main/WebHome

Vmstat:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 305416 260688  29160 2356920    2    2     4     1    0    0  6  1 92  2  0

iostat:
Linux 2.6.32-100.28.5.el6.x86_64 (dev-db) 07/09/2011
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.68    0.00    0.52    2.03    0.00   91.76
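To show how an access log line like the one above breaks down into fields, here is a small parsing sketch. It is not the parser used in the lab (the deck does not show one); the regular expression and the function name are assumptions that target the Common Log Format layout of the sample line.

```python
import re
from datetime import datetime
from typing import Optional

# Pattern for the Common Log Format layout seen in the sample line.
ACCESS_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_line(line: str) -> Optional[dict]:
    """Return the fields of one access log line, or None if it does not match."""
    match = ACCESS_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # Apache writes "-" when no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


if __name__ == "__main__":
    sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
              '"GET /apache_pb.gif HTTP/1.0" 200 2326')
    print(parse_access_line(sample))
```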

TASK ANALYSIS

Architecture Drivers: Use Cases

Architecture Drivers: Quality Attributes (1/3)

Architecture Drivers: Quality Attributes (2/3)

Architecture Drivers: Quality Attributes (3/3)

Architecture Drivers: Limitations

Demo Lab: Marketecture

SOLUTION ARCHITECTURE

Reference Architecture Selection

Options:

• Traditional Relational
• Extended Relational
• Non-Relational
• Lambda Architecture (Hybrid)
• Data Refinery (Hybrid)

Lambda Architecture:

• Simultaneous access to real-time and historical data
• Isolated Design and Development
• Increased Fault Tolerance
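The "simultaneous access to real-time and historical data" point can be illustrated with a toy model of the Lambda pattern: an append-only master log, a batch view recomputed from that log, and a speed view covering events that arrived since the last batch run. This is a single-process sketch with hypothetical class and method names, not the lab's implementation.

```python
from collections import Counter


class LambdaCounts:
    """Toy Lambda-style counter: batch view + speed view, merged at query time."""

    def __init__(self) -> None:
        self.master: list = []        # append-only log of all events
        self.batch_view = Counter()   # recomputed from the master log
        self.speed_view = Counter()   # incremental counts since the last batch run

    def ingest(self, key: str) -> None:
        """Record an event in the master log and in the real-time view."""
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self) -> None:
        """Rebuild the batch view from scratch, then reset the speed view."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, key: str) -> int:
        """Serve a result that combines historical and real-time data."""
        return self.batch_view[key] + self.speed_view[key]


if __name__ == "__main__":
    counts = LambdaCounts()
    counts.ingest("status=404")
    counts.ingest("status=404")
    counts.run_batch()
    counts.ingest("status=404")
    print(counts.query("status=404"))
```

Because the batch view is always rebuilt from the master log, a fault in the speed view is corrected by the next batch run, which is one way to read the fault-tolerance point on the slide.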

Lambda Architecture

Data Flow

Infrastructure View

TRADE-OFF ANALYSIS

Distribution Selection

Hive Stinger vs Impala

Compression Ratio

Access Speed

BI Tool Selection

Options:
• Tableau
• JasperSoft
• Microstrategy
• QlikView

Microstrategy:
• Powerful and Feature-Rich BI Tool
• 31-day trial period w/o trial key
• Well-integrated with Hadoop (and Impala)
• Easy to install in silent mode (command-line)

DEVELOPMENT ASPECTS

Automation

Local Development: 80% (Development and Debugging)

Cloud Development: 20% (F&P Testing, Demo)

Log Generator

• 4200 events / second (File source)
• 5500 events / second (TCP source)
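The deck does not show how the log generator is built. As a rough stand-in, the sketch below emits synthetic access log lines in Common Log Format at a fixed nominal rate. The path list, status mix, address range, and function name are all made up for illustration.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Iterator

# Hypothetical request paths and status mix; not taken from the original lab.
PATHS = ["/", "/index.html", "/apache_pb.gif", "/missing"]
STATUSES = [200, 200, 200, 304, 404, 500]


def generate_access_lines(count: int, start: datetime,
                          events_per_second: int, seed: int = 0) -> Iterator[str]:
    """Yield `count` synthetic access log lines spaced at a constant rate."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    for i in range(count):
        ts = start + timedelta(seconds=i / events_per_second)
        host = f"10.0.{rng.randrange(256)}.{rng.randrange(1, 255)}"
        path = rng.choice(PATHS)
        status = rng.choice(STATUSES)
        size = rng.randrange(100, 5000)
        stamp = ts.strftime("%d/%b/%Y:%H:%M:%S %z")
        yield f'{host} - - [{stamp}] "GET {path} HTTP/1.1" {status} {size}'


if __name__ == "__main__":
    start = datetime(2014, 8, 1, tzinfo=timezone.utc)
    # 4200 matches the file-source rate quoted on the slide.
    for line in generate_access_lines(3, start, events_per_second=4200):
        print(line)
```

This only simulates timestamps at the target rate; it makes no claim about the throughput the script itself can sustain.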

Accurate Sizing

Load levels shown: 20k/min, 50k/min, 100k/min, 200k/min

Calculator!
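The sizing calculator itself is not reproduced in this transcript. A minimal sketch of the kind of estimate such a calculator might make is given below, assuming storage grows linearly with retention, an HDFS replication factor of 3 (the HDFS default), and a hypothetical compression ratio and per-node capacity. The function names are illustrative.

```python
import math


def raw_storage_tb(daily_gb: float, retention_days: int,
                   replication: int = 3, compression_ratio: float = 1.0) -> float:
    """Estimate stored volume in decimal TB for a given retention window.

    replication=3 is the HDFS default; compression_ratio is an assumption
    (1.0 means no compression).
    """
    return daily_gb * retention_days * replication / compression_ratio / 1000


def nodes_needed(storage_tb: float, usable_tb_per_node: float) -> int:
    """Round up to the number of data nodes needed to hold the volume."""
    return math.ceil(storage_tb / usable_tb_per_node)


if __name__ == "__main__":
    # 150 GB/day and 30 days of raw data come from the input data slide.
    total = raw_storage_tb(daily_gb=150, retention_days=30)
    print(f"{total:.1f} TB stored")
    # 4 TB usable per node is a made-up figure for illustration.
    print(nodes_needed(total, usable_tb_per_node=4), "nodes")
```

A real sizing exercise would also account for aggregated data, temporary space, and growth headroom, none of which are modeled here.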

Reference

1) Install Hadoop (CDH4) on 5 nodes with VMWare, CDH4, Cloudera Manager 4: https://www.youtube.com/watch?v=CobVqNMiqww

2) Puppet & Vagrant Tutorial: http://puppetlabs.com/blog/puppet-and-vagrant-tutorial

3) Hardware for Hadoop: http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.0/bk_cluster-planning-guide/content/hardware-selection-for-hbase.html

4) How to Refine and Visualize Server Log Data: http://hortonworks.com/hadoop-tutorial/how-to-refine-and-visualize-server-log-data/

5) Hadoop Cluster Sizing: http://hortonworks.com/wp-content/uploads/downloads/2013/06/Hortonworks.ClusterConfigGuide.1.0.pdf

Thank You!

SoftServe US Office
One Congress Plaza, 111 Congress Avenue, Suite 2700
Austin, TX 78701
Tel: 512.516.8880

Contacts
Valentyn Kropov
vkrop@softserveinc.com
Tel: 866.687.3588 x4341