Lessons Learned from Migration of a Large-analytics Platform from MPP Databases to Hadoop YARN

Hadoop Summit 2014 – Srinivas Nimmagadda & Roopesh Varier

Experiences in migrating a large analytics platform from an MPP database to Hadoop YARN

Srinivas Nimmagadda, Technical Director, CPE
Roopesh Varier, Director, CPE

Agenda

1. Introduction
2. Big Data Needs
3. MPP Platform and Challenges
4. New Platform based on Hadoop/YARN
5. Lessons learned during transition to Hadoop

Overview

• Symantec Cloud Platform Engineering (CPE)
  – Build consolidated cloud infrastructure and platform services for next-generation, data-powered Symantec applications
  – Open source components as building blocks
    • Hadoop and OpenStack
    • Bridge capability gaps and contribute back
• A big data platform for batch and stream analytics, integrated with OpenStack
  – Security, multi-tenancy, and reliability
• Using large-scale data analytics for security and data management workloads
  – Analytics: reputation-based security, Managed Security Services, fraud detection, dial-home application logs

Big Data Challenge

• Hundreds of millions of users
• Billions of files
  – Is the file good or not?
• Millions of URLs
  – Is the URL safe or not?
• Hundreds of thousands of applications
  – Stable or crashed?
• Constant feed of information
  – Real time
  – Across the globe
  – From our applications and appliances

Value from Volume

• Volume of data
  – Multi-petabyte historical datasets
  – Multi-terabyte daily incremental datasets
  – Wide variety of input data formats
  – How do we manage it?
• Variety of workloads
  – ETL jobs
  – Batch applications
  – Interactive ad-hoc analysis
• How do we extract value from volume in near real time?

MPP Platform

[Architecture diagram. Components: Data Sources, Raw Data Store, ETL Cluster, MPP DB Engine, Platform Services, Applications (Batch, Interactive)]

Legacy MPP Analytics Solution

• Custom platform services
  – Task/job management (DAG-based, fault-tolerant)
  – Functional and performance monitoring
  – Automatic data lifecycle management
  – Inter-cluster data transfers
  – Cluster tenancy management
• ETL cluster
• RDS (raw data store) on NAS
• MPP (Massively Parallel Processing) DB engine at the core
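
The slides do not describe how the custom DAG-based, fault-tolerant job manager was built. As a rough illustration of the idea only, the sketch below runs tasks in dependency order and retries a failed task a bounded number of times. The function name `run_dag`, the task names, and the retry policy are all hypothetical; a real service would also persist state and run independent tasks in parallel.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps, max_retries=2):
    """Run callables in dependency order, retrying each failed task.

    tasks: {name: zero-argument callable}
    deps:  {name: set of task names that must finish first}
    Returns the list of task names in the order they completed.
    """
    completed = []
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception:
                # Give up only once the retry budget is exhausted.
                if attempt == max_retries:
                    raise
    return completed
```

For example, with `deps = {"transform": {"extract"}, "load": {"transform"}}`, the tasks run as extract, transform, load, and a transient failure in `transform` is retried before `load` starts.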

Key Challenges

• Scalability
  – Supporting rapid data growth
  – No support for heterogeneous hardware
• Operational costs
  – OpEx and software licenses
• Supporting new use models
  – "Not Only SQL" patterns in analytics (columnar storage, search, streaming)
• Cluster operational challenges
  – Limited resource management (limits/quotas, utilization throttling)
  – Load balancing across multi-mode and multi-tenant workloads
  – Integrated secure tenancy services
  – HA and DR

MPP to Big Data Platform

[Diagram: the MPP platform components (Data Sources, Raw Data Store, ETL Cluster, MPP DB Engine, Platform Services, Batch and Interactive Applications) annotated with numbered callouts for their Hadoop counterparts:]

1. Commodity hardware
2. Hadoop cluster
3. HDFS
4. YARN (ETL, batch, and interactive workloads)
5. Job management (DAG): Oozie
6. State transfer: DistCp, Falcon
7. Tenancy guard: YARN/HDFS

Big Data Platform

[Diagram: multi-tenant big data platform between Data Sources and Applications (Batch, Interactive). Numbered layers: 1: Cluster Infrastructure, 2: Hadoop 2.x Stack, 3: HDFS, 4: YARN, 5: Oozie. Workloads: Raw Data Store, ETL jobs, batch, interactive, and ad-hoc. Platform features: role-based provisioning, unified logging, API.]

Cluster Build Experiences

• Node selection
  – Single node SKU; use commodity hardware components
  – Memory will be cheap; keep expansion options open
  – Spindle : core : LAN network ratios (1 : 2.5 : 1.5 Gbps)
• Balance mixed workloads using YARN
  – Large clusters are better for effective resource utilization
  – Balance between ETL, batch, and interactive jobs with YARN
• Platform features and best practices
  – Central monitoring, log aggregation, and alerting metrics (ELK stack)
  – Role-based automated deployment of OS and Hadoop configuration

Journey to Hadoop

• Goals
  – Open source platform
  – Scalable distributed processing
• Existing app base built around SQL
• Many technology choices in the Hadoop ecosystem
  – Technology choices: distributed query engines vs. fast MR
  – Evaluation with multi-PB data sets using 15 of our representative workloads
    • e.g., complex joins (data shuffle), queries with a variety of data
  – Criteria: scale, functionality, stability, performance, integration with the rest of the open source ecosystem
  – Hive was the only technology able to scale and provide easy migration from our SQL workloads
  – With Tez we had an acceptable performance trade-off

RDS and ETL Process

• Platform features for ETL
  – File ingestion and job management APIs
  – Secure tenancy, replication
• Conversion of a 5 GB log file (.gz to .bz2)
  1. Single node outside Hadoop: ~28 mins
  2. In Hadoop, single mapper, parallel read and write approach: ~5 mins
• A parallel RDS and ETL using YARN
  – Source file ingested from a remote location
  – Converted to bz2 and stored in the HDFS Raw Data Store (passive data)
  – Data is transformed and loaded into Hive (active data in ORC format)
  – Mix “active” and “passive” datasets in HDFS

Use YARN for managing ETL

[Diagram: API, NameNode (NN), and DataNodes (DN), comparing (1) local .gz to .bz2 conversion with (2) MR-based .gz to .bz2 conversion]
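
The slides give timings for the .gz to .bz2 conversion but not the code. A minimal sketch of the single-node variant (approach 1) is shown below, using only the Python standard library; the function name `gz_to_bz2` and the chunk size are assumptions. One likely reason for the target format is that bzip2 files are splittable by Hadoop input formats while gzip files are not, although the slides do not state the motivation. The Hadoop-side parallel read and write approach is not reproduced here.

```python
import bz2
import gzip
import shutil


def gz_to_bz2(src_path, dst_path, chunk_size=1 << 20):
    """Recompress a .gz file as .bz2 on a single node.

    Data is streamed in chunks (1 MiB by default), so the file is
    never held in memory in full.
    """
    with gzip.open(src_path, "rb") as src, bz2.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
```

Because decompression and recompression run in one thread here, this corresponds to the slow baseline; the faster figure on the slide comes from overlapping the read and write sides inside the cluster.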

Large Cluster YARN Performance Modeling

• Multi-mode
  – ETL jobs: guaranteed throughput, window computing
  – Ad-hoc queries: low latency, fast execution
  – Batch analytics applications: throughput
• Multi-level
  – Departments/projects, users
• How do we model and use YARN for the above workloads?

Performance Modeling Example

[Figure: example workload model characterizing the ETL, Batch, and Ad-hoc workloads by map tasks, reduce tasks, and HDFS storage]

Step 1: Compile your workload model
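
One way to write down such a workload model is sketched below. The three workload types and the three dimensions (map tasks, reduce tasks, HDFS storage) come from the slide; the class, the function, and every number are placeholders for illustration, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    map_tasks: int          # map tasks per run
    reduce_tasks: int       # reduce tasks per run
    hdfs_storage_tb: float  # HDFS footprint in terabytes


def summarize(workloads):
    """Return ({name: share of all tasks}, total HDFS storage in TB)."""
    total_tasks = sum(w.map_tasks + w.reduce_tasks for w in workloads)
    if total_tasks == 0:
        raise ValueError("workload model contains no tasks")
    shares = {
        w.name: (w.map_tasks + w.reduce_tasks) / total_tasks for w in workloads
    }
    total_storage = sum(w.hdfs_storage_tb for w in workloads)
    return shares, total_storage


# Placeholder values only; replace with measurements from your own jobs.
MODEL = [
    Workload("etl", map_tasks=600, reduce_tasks=200, hdfs_storage_tb=50.0),
    Workload("batch", map_tasks=800, reduce_tasks=200, hdfs_storage_tb=30.0),
    Workload("adhoc", map_tasks=150, reduce_tasks=50, hdfs_storage_tb=5.0),
]
```

The per-workload task shares from `summarize(MODEL)` give a first guess at how to split cluster capacity between queues in the next step.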

YARN Queue Model - 1

[Diagram labels: Root Queue; ETL Queue, Ad-hoc Queue, Batch Queue; Project Queues; Jobs. Metrics reported: cluster utilization, average latency, throughput (jobs).]

Step 2: Develop your YARN queue resource allocation hierarchy
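
The slides do not name the YARN scheduler or the capacities used. Assuming a Capacity Scheduler style hierarchy, where each queue's capacity is a percentage of its parent and siblings add up to 100, the sketch below flattens a nested queue tree into absolute cluster shares. The queue names, percentages, and the `absolute_capacities` helper are hypothetical.

```python
def absolute_capacities(tree, path="root", share=100.0):
    """Flatten a nested queue tree into {queue path: % of the whole cluster}.

    Each child declares "capacity" as a percentage of its parent, and the
    children of any one parent must sum to 100.
    """
    result = {path: share}
    children = tree.get("queues", {})
    if children:
        total = sum(child["capacity"] for child in children.values())
        if abs(total - 100.0) > 1e-9:
            raise ValueError(f"children of {path} sum to {total}, expected 100")
    for name, child in children.items():
        child_share = share * child["capacity"] / 100.0
        result.update(absolute_capacities(child, f"{path}.{name}", child_share))
    return result


# Hypothetical hierarchy: workload queues under root, project queues below.
QUEUES = {
    "queues": {
        "etl": {
            "capacity": 40,
            "queues": {"proj_a": {"capacity": 50}, "proj_b": {"capacity": 50}},
        },
        "batch": {"capacity": 40},
        "adhoc": {"capacity": 20},
    }
}
```

With this tree, `absolute_capacities(QUEUES)["root.etl.proj_a"]` is 20.0, i.e. half of the ETL queue's 40% of the cluster.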

YARN Queue Model - 2

[Diagram labels: Root Queue; ETL, Ad-hoc, Batch; Project Queues; Jobs. Metrics reported: cluster utilization, average latency, throughput (jobs).]

YARN Queue Model - 3

[Diagram labels: Root Queue; ETL Queue, Batch Queue, Ad-hoc; Project Queues; Jobs. Metrics reported: cluster utilization, average wait time, throughput (jobs).]

Step 3: Run jobs, iterate through the models, and pick the optimal one
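
The slides name the three metrics compared across queue models but not the rule used to choose between them. The sketch below shows one possible rule, stated as an assumption: discard models whose average wait time exceeds a limit, then prefer higher throughput, breaking ties on utilization. `ModelResult` and `pick_model` are hypothetical names, and any values passed in would come from your own test runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    name: str
    utilization: float    # fraction of the cluster in use, 0.0 to 1.0
    avg_wait_s: float     # average job wait time in seconds
    throughput_jobs: int  # jobs completed during the test window


def pick_model(results, max_wait_s):
    """Pick the queue model with the best throughput within the wait limit.

    Ties on throughput are broken by higher cluster utilization.
    """
    eligible = [r for r in results if r.avg_wait_s <= max_wait_s]
    if not eligible:
        raise ValueError("no queue model meets the wait-time limit")
    return max(eligible, key=lambda r: (r.throughput_jobs, r.utilization))
```

A different weighting, for example one that favors ad-hoc latency over batch throughput, would change only the `key` function.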

Right Balance

• The optimal solution is about the right balance of:
  – Cluster infrastructure
  – The right software stack from the Hadoop ecosystem
  – Data management
  – Application design and workload balancing with YARN
  – Good tools for monitoring and management
• Approach
  – Start small and iterate faster
  – When in doubt, experiment and get data to make decisions
  – Keep customer use cases in perspective

Summary

– Incremental transition from MPP to Big Data
– A journey towards open source distributed computing
– Uniform computing!
  • Infrastructure building blocks
  • Single large YARN cluster for a variety of compute and storage loads
– Open source: use and contribute
  • Work with the community to address gaps
– Share your ideas

Q & A