Managing large clusters resources - KTH...•Provide advance scheduling policies •Run efficiently...

ID2210

Gautier Berthou (SICS)

Managing large clusters resources

Big Data Processing with No Data Locality

Job(“/crawler/bot/jd.io/1”)

Workflow Manager

21 3 5 6 53 6 21 4 41 52 34 6

Compute Grid Node Job

submit

This doesn’t scale.Bandwidth is the bottleneck

1 6 3 2 54

MapReduce – Data Locality

Job Tracker

21 3 5 6 53 6 21 4 41 52 34 6

Task Tracker

submit

Job Job Job Job Job Job

DN DN DN DN DN DN

R RR R = esultFile(s)

First step

Hadoop 1.x

HDFS(distributed storage)

MapReduce(resource mgmt, job scheduler,

data processing)

Single Processing Framework

Batch Apps

06/05/2015 Managing large clusters resources, Gautier Berthou 4

MapReduce – Data Locality

Job Tracker

21 3 5 6 53 6 21 4 41 52 34 6

Task Tracker

submit

Job Job Job Job Job Job

DN DN DN DN DN DN

R RR R = esultFile(s)

The Job Tracker Job

•Distribute the map and reduce tasks on the nodes of the cluster

•Ensure fairness of the cluster resource attributions

•Track the progress of these tasks

•Authenticate job tenants and make sure that each job is isolated from the others

•Etc.

Limitations of MapReduce [Zaharia’11]

Reduce

Input Output

•MapReduce is based on an acyclic data flow from stable storage to stable storage.

- Slow writes data to HDFS at every stage in the pipeline

•Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:

-Iterative algorithms (machine learning, graphs)

-Interactive data mining tools (R, Excel, Python)

Limitations

•Only one programming model.

•The map reduce framework is not using the cluster at its maximum.

•The job tracker is a bottle neck.

•The job tracker is a single point of failure.

06/05/2015 Managing large clusters resources, Gautier Berthou

Goals for a new Scheduler

•Being able to run different frameworks

•Scale

•Provide advance scheduling policies

•Run efficiently with different kind of workloads

Second step

Hadoop 2.x

MapReduce(data processing)

HDFS(distributed storage)

YARN(resource mgmt, job scheduler)

Others(spark, mpi, giraph, etc)

Multiple Processing Frameworks

Batch, Interactive, Streaming …

Examples of scheduling Policies

•Capacity scheduler:

- Applications have different levels of priorities.

•Fair scheduler:

- Applications have different levels of priorities.

- Used resources can be preempted.

•Reservation-based scheduler(1):

- Applications can indicate how long they will run and whenthey have to be finished.

(1) Reservation-based Scheduling: If you’re late don’t blame us!, C. Curino & al., Microsoft tech-report

Scenario

•3 kinds of jobs:

- Emergency jobs: need to be run as soon as possible.

- Production jobs: have a deadline, a known running time and are very exigent on the nodes they can be scheduled on.

- Best effort jobs: interactive jobs that have lower priority, but on which users expect low latency.

Capacity scheduler

Best effort

production

Best effort

emergencyBest effort

Best effort

NowProduction work deadline

production

Fair scheduler

Best effort

emergency

Best effort

production

Reservation-based scheduler

Best effort

emergency

Best effort

productionBest effort

production

Scheduler Architectures

Omega: flexible, scalable schedulers for large compute clusters, Malte Schwarzkopf & al., EuroSys’13

The monolithic Scheduler

•Yarn:- Apache Hadoop YARN: Yet Another Resource Negotiator, V. K.

Vavilapalli & al., SoCC’13.

•Borg:- Large-scale cluster management at Google with Borg, A. Verma &

al., EuroSys’15.

Architecture 1/3

Architecture 2/3

Data node

Resources Manager

Architecture 3/3

Data node

Resources ManagerStandby

Resources Manager

StandbyResources Manager

MasterResources Manager

zookeeper

Pros and Cons

•Pros:

- Fine knowledge of the state of the cluster state -> optimal use of the cluster resources.

- Easy to implement new scheduling policies.

•Cons:

- Bottle neck.

- The failure of the master scheduler has a big impact on the cluster usage.

Two level Scheduler

•Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, B. Hindman & al., NSDI’11

Architecture 1/2

Architecture 2/2

Data node

Mesos Master

MapReduceScheduler

SparkScheduler

FlinkScheduler

Partial State

Pros and Cons

•Pros:

- Scale out by adding schedulers.

- Concurrent scheduling of tasks.

•Cons:

- Suboptimal use of the cluster. Especially when there exist long running tasks.

Shared State Scheduler

•Omega: flexible, scalable schedulers for large compute clusters, M. Schwarzkopf & al. EuroSys’13

Architecture 1/2

Architecture 2/2

Data node

State Manager

MapReduceScheduler

SparkScheduler

FlinkScheduler

Global state`

Architecture 2/2

Data node

State Manager

MapReduceScheduler

SparkScheduler

FlinkScheduler

Global state Conflict

Pros and Cons

•Pros

- Scalable.

- Good use of the cluster resources.

•Cons

- Unpredictable interaction between the different schedulers’ policies.

Comparison

Performance comparison 1/

Two-level

Simple monolithic Advence monolithic

Shared state

Performance Comparison 2/

What the previous evaluation does not show about the Two-levelscheduling:

Performance Comparison 3/

How to write a good paper.How to write an accepted paper.

Trying to handle more batch jobs in Omega by running several batch schedulers in parallel.

Sum up

•Two-Level and Shared state Schedulers scale better.

•Shared state Schedulers use the cluster resources more optimally than Two-level Schedulers.

•Monolithic Scheduler are a potential Bottleneck.

•But as Monolithic schedulers are easier to design, allow finer allocation of resources and more advance scheduling policies, they are the ones used in practice.

Making Yarn more scalable

•HOPS YARN: a one and a half level scheduler

Hadoop Yarn HA Implementation

Data node

Resources Manager

zookeeper

Hops Yarn HA Implementation3/3

Data node

Resources Manager

•Distributed, In-memory

•2-Phase Commit

- Replicate DB, not the Log!

•Real-time

- Low TransactionInactive timeouts

•Commodity Hardware

•Scales out

- Millions of transactions/sec

- TB-sized datasets (48 nodes)

•Split-Brain solved with Arbitrator Pattern

•SQL and Native Blocking/Non-Blocking APIs

MySQL Cluster (NDB) – Shared Nothing DB

SQL API NDB API

30+ million update transactions/second

on a 30-node cluster

Standby is boring

Data node

Dificulties 1/2

Difficulties 2/2

•Pulling from the database when the state is needed is inefficient.

•Having an independent thread that regularly pull from the database is difficult to tune and cause lock problems.

Solution

•Luckily NDB has an event API.

With streaming

Data node

Conclusion

•There exists three architectures for large cluster resource scheduling:

- Monolithic

- Two-levels

- Shared State

•Each of these architectures has pros and cons.

•The monolithic architectur is the one presently usedbecause it is easyer to use and develop.

•At KTH and SICS we are exploring the possibilitiesfor a new architecture ensuring more scalabilitywhile keeping the advantages of the monolithicarchitecture.

References

•Reservation-based Scheduling: If you’re late don’t blame us!, C. Curino & al., Microsoft tech-report

•Omega: flexible, scalable schedulers for large compute clusters, Malte Schwarzkopf & al., EuroSys’13

•Apache Hadoop YARN: Yet Another Resource Negotiator, V. K. Vavilapalli & al., SoCC’13.

•Large-scale cluster management at Google with Borg, A. Verma & al., EuroSys’15.

•Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, B. Hindman & al., NSDI’11

Managing large clusters resources - KTH...•Provide advance scheduling policies •Run efficiently...

Documents

Transcript of Managing large clusters resources - KTH...•Provide advance scheduling policies •Run efficiently...

HPE Reference Configuration for Elastic Platform for Analytics … · extend Hadoop beyond the o ften I/O intensive, high latency MapReduce batch analytics workloads, to also address

1. Introduction to MapReduce - UPMlsd.ls.fi.upm.es/.../IntroToMapReduce.pdf · Processing of massive data: MapReduce – 1. Introduction to MapReduce MapReduce has a 'low semantic

Synthetic Enterprise Application workloads · Enterprise Application Workloads . Test Profile – IOPS & RT levels of different synthetic workloads . 6 . Composite Enterprise . Workloads

Introduction to Hadoop - KTH · Limitations of MapReduce [Zaharia’11] Map Map Map Reduce Reduce Input Output •MapReduce is based on an acyclic data flow from stable storage to

MapReduce & Hadoop IIcslui/CMSC5702/mapreduce_hadoop2.pdf · MapReduce & Hadoop II ... MapReduce & Hadoop MapReduce Recap ... example, the combiners aggregate term counts across the

A Storage-Centric Analysis of MapReduce Workloads: File ...publish.illinois.edu/assured-cloudcomputing/files/2014/03/A-Storage... · log trace (Jun. 9th, 2011 through Dec. 8, 2011),

Workload Analysis, Implications and Optimization on a ...weisong.eng.wayne.edu/_resources/pdfs/ren13-taobao.pdf · Abstract—Understanding the characteristics of MapReduce workloads

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads Azza Abouzeid1, Kamil BajdaPawlikowski1, Daniel Abadi1, Avi.

Energy Efficiency for Large-Scale MapReduce Workloads with

Internet Security - KTH | V¤lkommen till KTH

BigDataBench Technical Report - ict.ac.cnprof.ict.ac.cn/BigDataBench/wp-content/uploads/... · analytics workloads using MapReduce, MPI, Spark, DataMPI, interactive an-alytics and

Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals

HadoopDB : An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

Design Insights for MapReduce from Diverse Production ... · Design Insights for MapReduce from Diverse Production Workloads Yanpei Chen, Sara Alspaugh, Randy Katz University of California,

MREv: An Automatic MapReduce Evaluation Tool for Big Data ... · MREv: an Automatic MapReduce Evaluation Tool for Big Data Workloads JorgeVeiga,RobertoR.Exp´osito,GuillermoL.Taboada,andJuanTouri˜no

Hadoop/MapReduce - 123seminarsonly.comHadoop MapReduce • MapReduce is a programming model and software framework first developed by Google (Google’s MapReduce paper submitted in

Social Network System Design - KTH | V¤lkommen till KTH

Data Management in Large-Scale Distributed Systems - MapReduce … · Introduction to MapReduce The Hadoop Eco-System HDFS Hadoop MapReduce 4. MapReduce at Google Publication The

MROrder: Flexible Job Ordering Optimization for Online MapReduce Workloads School of Computer Engineering Nanyang Technological University 30 th Aug 2013.

MapReduce and Hadoop File Systemnsrit.edu.in/admin/img/cms/10096mapreduce.pdf · The Outline Introduction to MapReduce From CS Foundation to MapReduce MapReduce programming model