Running Hadoop as a Service on the Altiscale Platform


Experiences in Running Hadoop as a Service • chaiken@altiscale.com • #HadoopSherpa

DAVID CHAIKEN • 21 NOVEMBER 2014

Talk Outline

Altiscale Company Introduction and Perspective

Altiscale Architecture

Use Cases: Performance, Job Analysis, Scheduling

Infinite Hadoop

Challenges to the Hadoop Community


Corporate Background

Hadoop-as-a-Service (HaaS) innovator

Company founded in 2012 (Palo Alto & Chennai)

Founding team from Yahoo
• Raymie Stata, CEO, Former CTO
• David Chaiken, CTO, Former Chief Architect
• Charles Wimmer, Head of Operations, Former SRE

Employees from Yahoo, Google, Netflix, LinkedIn, VMware, and others

Top-tier investors

Altiscale Chennai

Long-term colleagues from Yahoo and before

IIT Madras Research Park (back gate of IIT-M)

Architecture, Core Development, Test (Apache Bigtop)

Control Plane agile development, 2-week sprints

Next: Test++, Customer Support, Operations


Everybody Loves Hadoop But…

Significant capital expenditure on infrastructure
• Complex to manage and maintain

Time to get a cluster up and running is long

Capacity planning is difficult

The skill set is difficult to recruit, train, and retain

What about the cloud?

True Hadoop-as-a-Service

Altiscale is the industry's first purpose-built, petabyte-scale Hadoop cloud
• Altiscale operates Hadoop for you
• Infrastructure optimized to run Hadoop fast and reliably
• Pay for the Hadoop service, not the infrastructure

We Team With You To Help Deliver Insights

Potential insights from a flood of data generated by the connected world

Our Operations Team and Hadoop Cloud help realize those insights

Customer + Altiscale

Customers


How We Do It

Virtual Hadoop Cluster: YARN Service, HDFS Service, More Apps

Data Connect: File Transfer, Kafka, Flume

Pre-configured Apps: Hive, Pig, Oozie

• Your data is migrated to HDFS and a virtual Hadoop cluster in our cloud
• We optimize the job to complete fast and cost-effectively
• Our Hadoop Helpdesk gives you access to Hadoop experts
• Our Hadoop Operations Team maintains the cluster and plans the job
• Our team monitors and manages the job through to completion
• We provide an uptime SLA so our Hadoop cloud is always available
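Getting the data into the HDFS service is usually a bulk copy. As a hedged sketch only (the bucket, credentials, and target path below are made-up placeholders, not Altiscale's actual transfer tooling), a customer-side copy from S3 into the virtual cluster might look like:

  # Illustrative only: copy a dataset from S3 into the virtual cluster's HDFS.
  hadoop distcp \
    -Dfs.s3n.awsAccessKeyId="$AWS_ACCESS_KEY" \
    -Dfs.s3n.awsSecretAccessKey="$AWS_SECRET_KEY" \
    s3n://example-customer-bucket/clickstream/2014/ \
    hdfs:///user/customer/clickstream/2014/

  # Confirm the copy landed where expected.
  hdfs dfs -du -h /user/customer/clickstream/2014/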

Altiscale Architecture: Data and Control Planes


Altiscale Architecture: Customer Environments


Altiscale Architecture: O&O Hadoop Cluster


Altiscale Architecture: Host Components


Altiscale Architecture: Workbenches


Altiscale Architecture: Data Transfer


Altiscale Architecture: Portal and REST API


Altiscale Architecture: Control Plane Databases


Altiscale Architecture: Control Plane Services


Altiscale Architecture: Hadoop-Based Analysis

Hadoop as a Service Offering

1. Data is migrated to our HDFS service (HDFS Service, Data Connectors)

2. Terminal access to the Hadoop cluster and associated apps
   Core Apps: Apache Hive, Apache Pig, Apache Oozie, Apache HCatalog, Apache Flume, R, JDK/JRE, Python, HttpFS, FUSE, LZOP, Snappy, gzip
   Foundry Apps: Apache Mahout, Cascading, Revolution R, Kafka/Camus, Avro, Pentaho Kettle, Matlab, Spark, Sqoop, H2O

3. Portal provides job status, billing, and support information
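As a hedged illustration of the terminal access in step 2 (the workbench hostname and paths are placeholders, not actual Altiscale endpoints), a session might look like:

  # Hypothetical workbench host; actual hostnames are assigned per customer.
  ssh analyst@workbench.example.altiscale.com

  # The remaining commands run on the workbench, where the clients are pre-installed.
  hdfs dfs -ls /user/analyst        # browse data in the HDFS service
  hive -e 'SHOW TABLES;'            # run Hive from the pre-configured apps
  pig -version                      # Pig, Oozie, etc. are also on the PATH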

Challenges…


Performance Challenges…

Disks: Configuration, Controllers, Density, Cost

Network: Jumbo Packet MTU

Memory: echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

Network: When does locality matter?

Flash: When to use SSD?
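The transparent huge page setting above does not survive a reboot. A minimal sketch of making it persistent, assuming a RHEL/CentOS 6-era image where the knob lives under redhat_transparent_hugepage (the path differs on other kernels):

  # Disable transparent huge pages now and on every boot (illustrative only).
  THP=/sys/kernel/mm/redhat_transparent_hugepage/enabled
  echo never > "$THP"
  grep -q "$THP" /etc/rc.local || echo "echo never > $THP" >> /etc/rc.local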

Customer Case Study: Analyze Query

Customer provided a Hive query and data sets (100s of GB to ~5 TB)
Needed help optimizing the query
Didn't rewrite the query immediately
Wanted to characterize query performance and isolate bottlenecks first

Analyze and Tune Execution

Ran the original query on the datasets in our environment:
• Two M/R stages: Stage-1, Stage-2

Long-running reducers run out of memory
• set mapreduce.reduce.memory.mb=5120
• Reduces the number of slots and extends reduce time

Query fails to launch Stage-2 with an out-of-memory error
• set HADOOP_HEAPSIZE=1024 on the client machine

Query has 250,000 mappers in Stage-2, which causes failures
• set mapred.max.split.size=5368709120 to reduce the number of mappers
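Collected in one place, a hedged sketch of the tuned run (query.hql is a placeholder for the customer's query, and the values are the ones from this case study, not general recommendations):

  # Larger client-side heap so the Stage-2 job can be planned and launched.
  export HADOOP_HEAPSIZE=1024

  # Apply the reducer-memory and split-size tuning for this run only.
  hive \
    --hiveconf mapreduce.reduce.memory.mb=5120 \
    --hiveconf mapred.max.split.size=5368709120 \
    -f query.hql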

Analysis: Job Execution Characteristics

Next challenge: how to visualize job execution?
Existing Hadoop/Hive logs were not sufficient for this task
Wrote internal tools to:
• parse job history files
• plot mapper and reducer execution
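A hedged sketch of the raw material those internal tools consume (Hadoop 2.x assumed; the job ID and the history server's "done" directory layout are illustrative):

  # Locate the finished job's history file under the Job History Server's done directory.
  JHIST=$(hdfs dfs -ls -R /mr-history/done | grep 'job_1416500000000_0042.*\.jhist' | awk '{print $NF}')

  # Dump per-task details; the map and reduce attempt start/finish times in this
  # output are what gets parsed and plotted as execution timelines.
  mapred job -history all "$JHIST"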

Analysis: Map (Stage-1)

Analysis: Reduce (Stage-1): Long Tail (single reduce task)

Analysis: Map (Stage-2)

Analysis: Reduce (Stage-2)

Analysis Execution: Findings

Lone, long-running reducer in the first stage of the query

Analyzed the input data:
• Query split input data by userId
• Bucketized input data by userId
• One very large bucket: "invalid" userId
• Discussed the "invalid" userId with the customer

An error value is a common pattern!
• Need to differentiate between "don't know and don't care" and "don't know and do care"
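A hedged sketch of the kind of check that surfaces such a bucket (the events table and userId column stand in for the customer's actual schema):

  # Find heavily skewed userId values before running the main query (names are illustrative).
  hive -e "
    SELECT userId, COUNT(*) AS rows_per_user
    FROM events
    GROUP BY userId
    ORDER BY rows_per_user DESC
    LIMIT 20;
  "
  # A single 'invalid' userId dominating this list explains the lone long-running
  # reducer: every row with that key is shuffled to one reduce task.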

Interactive (DRAM-centric) Processing Systems

Loading data into DRAM makes processing fast!
Examples: Spark, Impala, 0xdata, …, [SAP HANA], …
Streaming systems (Storm, DataTorrent) may be similar
Need to increase the YARN container memory size

Hive + Interactive: Watch Out for Container Size

Caution: larger YARN container settings for interactive jobs may not be right for batch systems like Hive
Container size needs to combine vcores and memory:
• yarn.scheduler.maximum-allocation-vcores
• yarn.nodemanager.resource.cpu-vcores
• …
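A small, hedged sketch of inspecting the knobs both kinds of tenants share (assuming a host with the cluster's client configuration installed; hdfs getconf simply reads the local config files, so it works for YARN keys as well):

  # Print the container-sizing settings that batch and interactive workloads contend over.
  for key in \
      yarn.scheduler.maximum-allocation-mb \
      yarn.scheduler.maximum-allocation-vcores \
      yarn.nodemanager.resource.memory-mb \
      yarn.nodemanager.resource.cpu-vcores
  do
    printf '%-45s %s\n' "$key" "$(hdfs getconf -confKey "$key")"
  done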

Hive + Interactive: Watch Out for Fragmentation

Attempting to schedule interactive systems and batch systems like Hive together may result in fragmentation
Interactive systems may require all-or-nothing scheduling
Batch jobs with small tasks may starve interactive jobs

Solutions for fragmentation:
• Reserve interactive nodes before starting batch jobs
• Reduce the interactive container size (if the algorithm permits)
• Node labels (YARN-726) and gang scheduling (YARN-624)
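Node labels are one way to reserve interactive nodes up front. Treat the commands below as a hedged sketch only: node labels shipped in later Hadoop 2.x releases than the one discussed here, the rmadmin syntax has changed across versions, and the hostname is a placeholder:

  # Define an 'interactive' label and pin a node to it (exact syntax varies by release).
  yarn rmadmin -addToClusterNodeLabels interactive
  yarn rmadmin -replaceLabelsOnNode "node17.example.com=interactive"

  # Capacity-scheduler queues can then be restricted to that label, so batch jobs
  # never fragment the nodes reserved for the interactive system.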

Altiscale: Hadoop Storage and Compute

Altiscale's point of view on Hadoop as a Service:
• Sell HDFS in increments of 10 TB
• Sell compute in increments of 10K TaskHours/Month

We market "Infinite Hadoop" and provide services so that customers need not worry about cluster nodes.

But Apache Hadoop user interfaces provide a node-oriented view of clusters…

ResourceManager User Interface


NameNode User Interface


Feedback from Customers

Storage plan is normally easy to estimate

Compute plan is hard to estimate
• Customer pain point: achieving necessary computation sometimes requires more peak compute capacity than the number of nodes required for storage provides
• Opportunity: average compute often requires fewer nodes than storage does

Solution: Change Altiscale's Product!

Make "Infinite" computation available to customers

Multitenancy implementation phases, each of which includes a milestone with production deliverables:
0. Automation for burn/add/remove nodes
1. Deploy Linux containers using Docker (sketched below)
2. Decouple compute/storage + manual bursting
3. Automation: orchestrate add/remove nodes according to the allocation plan from the capacity team
4. Optimized: predictive allocation, economic incentives
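For phase 1, the general shape is to run the NodeManager (and DataNode) as containers on the physical hosts. Purely as a hedged sketch: the image name, mounts, and launch command below are hypothetical, not Altiscale's actual packaging:

  # Illustrative only: run a NodeManager in a Docker container on a cluster host.
  # 'example/hadoop-nodemanager:2.4' is a made-up image; the mounts carry the cluster
  # configuration plus the local disks used for shuffle spill and container logs.
  docker run -d --net=host \
    --name nodemanager \
    -v /etc/hadoop/conf:/etc/hadoop/conf:ro \
    -v /data:/data \
    example/hadoop-nodemanager:2.4 \
    yarn nodemanager

  # With compute in a container, it can be added, removed, or resized per customer
  # without reimaging the host or disturbing a co-located DataNode.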


Physical Cluster per Customer


NM and DN in Docker Containers


Decouple Compute/Storage


What Customers Get

On-demand access to "Infinite" computation

Ability to handle unexpected needs without contacting Altiscale

"Access to a $10M cluster for just $1M"

Future…

Ability to package the Hadoop job environment using Docker (YARN-1964)

Challenges to the Hadoop Community

Hive + Hadoop debugging can get very complex
• Sifting through many logs and screens
• Automatic transmission versus manual transmission

Static partitioning induced by the Java Virtual Machine has benefits but also induces challenges. Where there are difficulties, there's opportunity:
• Better tooling, instrumentation, and integration of logs/metrics

YARN is still evolving into an operating system

Just starting to build real multitenancy into Hadoop

Hadoop as a Service: aggregate and share expertise