Lessons Learned from Migrating a Large Analytics Platform from MPP Databases to Hadoop YARN


Hadoop Summit 2014

Experiences in migrating a large analytics platform from an MPP database to Hadoop YARN

Srinivas Nimmagadda, Technical Director, CPE | Roopesh Varier, Director, CPE

Agenda

1. Introduction
2. Big Data Needs
3. MPP Platform and Challenges
4. New Platform based on Hadoop/YARN
5. Lessons learned during the transition to Hadoop


Overview

• Symantec Cloud Platform Engineering (CPE)
  – Builds consolidated cloud infrastructure and platform services for next-generation, data-powered Symantec applications
  – Uses open source components as building blocks
    • Hadoop and OpenStack
    • Bridge capability gaps and contribute back
• A big data platform for batch and stream analytics, integrated with OpenStack
  – Security, multi-tenancy, and reliability
• Large-scale data analytics for security and data-management workloads
  – Analytics: reputation-based security, Managed Security Services, fraud detection, dial-home application logs


Big Data Challenge

• Hundreds of millions of users
• Billions of files
  – Is a file good or not?
• Millions of URLs
  – Is a URL safe or not?
• Hundreds of thousands of applications
  – Stable or crashing?
• Constant feed of information
  – In real time
  – From across the globe
  – From our applications and appliances


Value from Volume

• Volume of data
  – Multi-petabyte historical datasets
  – Multi-terabyte daily incremental datasets
  – Wide variety of input data formats
  – How do we manage it all?
• Variety of workloads
  – ETL jobs
  – Batch applications
  – Interactive ad-hoc analysis
• How do we extract value from that volume in near real time?


MPP Platform

[Architecture diagram: data sources feed a Raw Data Store and an ETL cluster; custom platform services tie them together; an MPP DB engine at the core serves batch and interactive applications.]


Legacy MPP Analytics Solution
• Custom platform services
  – Task/job management (DAG-based, fault-tolerant)
  – Functional and performance monitoring
  – Automatic data lifecycle management
  – Inter-cluster data transfers
  – Cluster tenancy management
• ETL cluster
• RDS (Raw Data Store) on NAS
• MPP (Massively Parallel Processing) DB engine at the core


Key Challenges
• Scalability
  – Supporting rapid data growth
  – No support for heterogeneous hardware
• Operational costs
  – OpEx and software licenses
• Supporting new use models
  – Not-only-SQL patterns in analytics (columnar storage, search, streaming)
• Cluster operational challenges
  – Limited resource management (limits/quotas, utilization throttling)
  – Load balancing across multi-mode and multi-tenant workloads
  – Integrated secure tenancy services
  – HA and DR



MPP to Big Data Platform

[Diagram: the legacy MPP architecture (data sources, Raw Data Store, ETL cluster, platform services for job management, state transfer and tenancy guard, and an MPP DB engine serving batch and interactive applications) annotated with its Hadoop-based replacements: 1: commodity hardware, 2: Hadoop cluster, 3: HDFS, 4: YARN, 5: Oozie for DAG-based job management, 6: DistCp/Falcon for data transfers, 7: YARN/HDFS.]


Big Data Platform

[Diagram: the new multi-tenant platform. Numbered layers: 1: cluster infrastructure, 2: Hadoop 2.x stack, 3: HDFS hosting the Raw Data Store, 4: YARN balancing ETL, interactive, and batch workloads, 5: Oozie for job orchestration. Data sources feed the platform through an API; applications run ETL jobs, batch analytics, and interactive ad-hoc workloads; role-based provisioning and unified logging round out the platform services. A minimal Oozie workflow-submission sketch follows below.]
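To make the Oozie layer concrete, here is a minimal, hedged sketch of submitting a DAG-style workflow through the Oozie Java client API. The server URL, HDFS application path, and the `queueName`/`inputDir` parameters passed to the workflow are illustrative assumptions, not values from the deck.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitEtlWorkflow {
    public static void main(String[] args) throws Exception {
        // Oozie server endpoint (placeholder host/port).
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // Workflow configuration; APP_PATH points at an HDFS directory
        // containing the workflow.xml that defines the DAG of actions.
        Properties props = oozie.createConfiguration();
        props.setProperty(OozieClient.APP_PATH, "hdfs://namenode/apps/etl-workflow");
        props.setProperty("queueName", "etl");           // hypothetical workflow parameter
        props.setProperty("inputDir", "/rds/incoming");  // hypothetical workflow parameter

        // Submit and start the workflow, then report its initial status.
        String jobId = oozie.run(props);
        WorkflowJob info = oozie.getJobInfo(jobId);
        System.out.println("Submitted workflow " + jobId + ", status: " + info.getStatus());
    }
}
```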


Cluster Build Experiences
• Node selection
  – Single node SKU; use commodity hardware components
  – Memory will stay cheap; keep expansion options open
  – Spindle : core : LAN network ratios (1 : 2.5 : 1.5 Gb/s)
• Balance mixed workloads using YARN
  – Large clusters are better for effective resource utilization
  – Balance ETL, batch, and interactive jobs with YARN
• Platform features and best practices
  – Central monitoring, log aggregation, and alerting metrics (ELK stack)
  – Role-based automated deployment of the OS and Hadoop configuration


Journey to Hadoop
• Goals
  – Open source platform
  – Scalable distributed processing
• Existing application base built around SQL
• Many technology choices in the Hadoop ecosystem
  – Distributed query engines vs. fast MapReduce
  – Evaluation on multi-PB datasets using 15 of our representative workloads
    • e.g., complex joins (data shuffle), queries over a variety of data
  – Criteria: scale, functionality, stability, performance, integration with the rest of the open source ecosystem
  – Hive was the only technology that could scale and offer an easy migration path for our SQL workloads
  – With Tez, the performance trade-off was acceptable (a Hive-over-JDBC query sketch follows below)
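As a rough illustration of what "easy migration from SQL workloads" looks like in practice, the sketch below runs a join-style query against HiveServer2 over JDBC with the Tez execution engine enabled. The connection string, credentials, and table/column names are made-up examples, not the actual schemas from the talk.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveTezQuery {
    public static void main(String[] args) throws Exception {
        // Standard Hive JDBC driver; host, database, and user are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver2-host:10000/default", "analyst", "");
             Statement st = con.createStatement()) {

            // Run this session on Tez rather than classic MapReduce.
            st.execute("SET hive.execution.engine=tez");

            // A representative join with aggregation (illustrative table/column names).
            ResultSet rs = st.executeQuery(
                "SELECT f.file_hash, COUNT(*) AS detections "
              + "FROM file_events f JOIN reputation r ON f.file_hash = r.file_hash "
              + "WHERE r.verdict = 'bad' GROUP BY f.file_hash");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```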


RDS and ETL Process
• Platform features for ETL
  – File ingestion and job management APIs
  – Secure tenancy, replication
• Conversion of a 5 GB log file (.gz to .bz2)
  1. On a single node outside Hadoop: ~28 minutes
  2. In Hadoop, with a single mapper and a parallel read/write approach: ~5 minutes
• A parallel RDS and ETL pipeline using YARN
  – Source files are ingested from remote locations
  – Converted to .bz2 and stored in the HDFS Raw Data Store (passive data)
  – Data is transformed and loaded into Hive (active data, in ORC format)
  – Active and passive datasets are mixed in HDFS
• Use YARN for managing ETL (a map-only conversion sketch follows after the diagram)

[Diagram: an ingestion API in front of the HDFS NameNode and DataNodes, contrasting (1) local .gz-to-.bz2 conversion outside the cluster with (2) MapReduce-based .gz-to-.bz2 conversion inside the cluster.]
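The deck does not include the actual conversion code; a minimal map-only MapReduce job along these lines would reproduce the .gz-to-.bz2 step. A single mapper streams the non-splittable gzip input while the output is written bz2-compressed so downstream jobs can split it. Input and output paths are command-line arguments; everything else is a sketch.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GzToBz2 {

    // Identity mapper: pass each input line through unchanged.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws java.io.IOException, InterruptedException {
            ctx.write(NullWritable.get(), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "gz-to-bz2");
        job.setJarByClass(GzToBz2.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0);                  // map-only: no shuffle, no reducers
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. a .gz file in the RDS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // bz2-compressed output dir
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```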


Large Cluster YARN Performance Modeling

• Multi-mode workloads
  – ETL jobs: guaranteed throughput, window computing
  – Ad-hoc queries: low latency, fast execution
  – Batch analytics applications: throughput
• Multi-level tenancy
  – Departments/projects, users
• How do we model and use YARN for these workloads?


Performance Modeling Example

Step 1: Compile your workload model.
[Diagram: for each workload class (ETL, Batch, Ad-hoc), estimate the map tasks, reduce tasks, and HDFS storage it consumes. A tiny data-structure sketch of such a model follows below.]
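The slide only names the dimensions of the model; the sketch below (with entirely made-up numbers) shows the kind of per-workload figures that feed Step 2.

```java
import java.util.Arrays;
import java.util.List;

public class WorkloadModel {
    final String workload;     // ETL, Batch, or Ad-hoc
    final int mapTasks;        // typical concurrent map tasks
    final int reduceTasks;     // typical concurrent reduce tasks
    final long hdfsStorageGb;  // HDFS footprint in GB

    WorkloadModel(String workload, int mapTasks, int reduceTasks, long hdfsStorageGb) {
        this.workload = workload;
        this.mapTasks = mapTasks;
        this.reduceTasks = reduceTasks;
        this.hdfsStorageGb = hdfsStorageGb;
    }

    public static void main(String[] args) {
        // Example figures only; real numbers come from profiling representative jobs.
        List<WorkloadModel> model = Arrays.asList(
            new WorkloadModel("ETL",    2000, 500, 200_000),
            new WorkloadModel("Batch",  1500, 800, 400_000),
            new WorkloadModel("Ad-hoc",  300, 100,  50_000));
        model.forEach(w -> System.out.printf("%-7s maps=%d reduces=%d storageGB=%d%n",
                w.workload, w.mapTasks, w.reduceTasks, w.hdfsStorageGb));
    }
}
```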


YARN Queue Model - 1

Step 2: Develop your YARN queue resource allocation hierarchy.
[Diagram: a root queue with ETL, Ad-hoc, and Batch queues beneath it; project queues and individual jobs hang off those. Each model is scored on cluster utilization, average latency, and throughput (jobs). A CapacityScheduler-style configuration sketch follows below.]
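The deck does not give the actual queue configuration; below is a hedged sketch of what a Model-1-style hierarchy could look like using YARN CapacityScheduler properties. In a real cluster these settings live in capacity-scheduler.xml on the ResourceManager; they are written as Java Configuration calls here only to keep all examples in one language, and every capacity number and project name is made up.

```java
import org.apache.hadoop.conf.Configuration;

public class QueueModelOneSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration(false);

        // Top-level split of cluster resources between the three workload classes.
        conf.set("yarn.scheduler.capacity.root.queues", "etl,batch,adhoc");
        conf.set("yarn.scheduler.capacity.root.etl.capacity", "40");
        conf.set("yarn.scheduler.capacity.root.batch.capacity", "40");
        conf.set("yarn.scheduler.capacity.root.adhoc.capacity", "20");

        // Ad-hoc queries may borrow idle capacity, but only up to a ceiling.
        conf.set("yarn.scheduler.capacity.root.adhoc.maximum-capacity", "40");

        // Per-project sub-queues under the batch queue (project names are examples).
        conf.set("yarn.scheduler.capacity.root.batch.queues", "projectA,projectB");
        conf.set("yarn.scheduler.capacity.root.batch.projectA.capacity", "60");
        conf.set("yarn.scheduler.capacity.root.batch.projectB.capacity", "40");

        // Print the resulting property set, e.g. for pasting into capacity-scheduler.xml.
        conf.forEach(e -> System.out.println(e.getKey() + "=" + e.getValue()));
    }
}
```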


YARN Queue Model - 2

[Diagram: an alternative hierarchy that arranges the ETL, Batch, and Ad-hoc queues and the project queues differently under the root queue, evaluated on the same metrics: cluster utilization, average latency, and throughput (jobs).]


YARN Queue Model - 3

[Diagram: a third variant with ETL and Batch queues alongside per-project Ad-hoc queues under the root queue, with project queues and jobs beneath them.]

Step 3: Run jobs, iterate through the models, and pick the optimal one based on cluster utilization, average wait time, and throughput (jobs). (A queue-placement sketch for such test runs follows below.)
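For Step 3, each candidate model can be exercised by pointing representative jobs at its leaf queues and then measuring utilization, wait time, and throughput. A minimal sketch of the queue-placement side; the queue name is an example and must exist in the model under test.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.MRJobConfig;

public class QueuePlacementCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // With the CapacityScheduler, jobs are submitted to a leaf queue by name.
        conf.set(MRJobConfig.QUEUE_NAME, "projectA");   // example leaf queue

        Job job = Job.getInstance(conf, "queue-placement-check");
        System.out.println("Job will be submitted to queue: "
                + job.getConfiguration().get(MRJobConfig.QUEUE_NAME));
        // Configure mapper/reducer/input/output as usual, then call
        // job.waitForCompletion(true) against each candidate queue model.
    }
}
```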


Right Balance

• An optimal solution is about the right balance of
  – Cluster infrastructure
  – The right software stack from the Hadoop ecosystem
  – Data management
  – Application design and workload balancing with YARN
  – Good tools for monitoring and management
• Approach
  – Start small and iterate fast
  – When in doubt, experiment and gather data to make decisions
  – Keep customer use cases in perspective


Summary

• Incremental transition from MPP to big data
  – A journey towards open source distributed computing
  – Uniform computing!
    • Infrastructure building blocks
    • A single large YARN cluster for a variety of compute and storage loads
• Open source: use and contribute
  – Work with the community to address gaps
  – Share your ideas


Q & A