Post on 15-Aug-2015
Dynamic Resource Allocation for Spark on YARN
Tsuyoshi Ozawa (ozawa@apache.org)
What’s YARN?
• A resource manager implementation for computer clusters
Hadoop Stack
[Diagram: HDFS at the bottom, YARN above it, and MapReduce / Spark / Tez running on top of YARN]
YARN overview
• All resources are managed by the ResourceManager
• All tasks are launched on NodeManagers
• Clients submit jobs via the ResourceManager
[Diagram: a client talking to the ResourceManager, which coordinates two NodeManagers]
Spark on YARN
• Two modes:
• yarn-cluster
• yarn-client
yarn-cluster mode
• Launches the Spark driver inside a YARN container
• Works well with spark-submit
[Diagram: (1) the client submits to the ResourceManager, (2) the ResourceManager launches the Spark AppMaster (which runs the Spark driver) in a container, (3) the AppMaster launches executors in containers on the NodeManagers]
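For example, a yarn-cluster submission with spark-submit might look like this (the jar path, main class, and executor count below are illustrative, not from the talk):

```
$ spark-submit \
    --master yarn-cluster \
    --num-executors 4 \
    --class org.apache.spark.examples.SparkPi \
    lib/spark-examples-*.jar 100
```

With `--master yarn-cluster`, spark-submit returns after handing the driver off to YARN; the driver runs inside the AppMaster container.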
yarn-client mode
• Launches the Spark driver on the client side
• Works well with spark-shell
[Diagram: (1) the client submits to the ResourceManager, (2) the ResourceManager launches the Spark AppMaster in a container, (3) the AppMaster launches executors, (4) the client-side Spark driver sends commands to the executors]
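For interactive use, the same mode is selected when starting spark-shell (the executor count here is only an example):

```
$ spark-shell --master yarn-client --num-executors 4
```

Because the driver stays on the client machine, closing the shell ends the application.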
Spark on YARN: yarn-cluster mode
[Diagram: the AppMaster and executor containers spread across Node1, Node2, and Node3]
Problem
• Inefficient resource management
• Containers cannot exit until the job exits
[Diagram: four containers across Node1 and Node2; in stage1 all four run at 100% utilization, but in stage2 only one is at 100% while the other three sit at 0%, still holding their resources]
Dynamic resource allocation (since v1.2)
• Allocates containers more dynamically
• The number of executors is decided by the workload
[Diagram: (1) the client submits to the ResourceManager, (2) the ResourceManager launches the Spark AppMaster (Spark driver), (3) the AppMaster launches or kills executors as the load changes]
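The scale-up and scale-down decisions are driven by a pair of timeouts; the property names below are from the Spark 1.x configuration documentation, and the values are only examples:

```
# Request more executors when tasks have been queued longer than this
spark.dynamicAllocation.schedulerBacklogTimeout  5s
# Remove an executor after it has been idle this long
spark.dynamicAllocation.executorIdleTimeout      60s
```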
Yak shaving
• Where should we hold the state of Spark RDDs (shuffle output)?
• If executors are killed, it’ll be lost…
[Diagram: executors on a NodeManager, each holding its own RDD/shuffle data]
External shuffle
• Saves Spark RDD intermediate (shuffle) files via the NodeManager
• NodeManager has an interface for this: the external shuffle plugin
• Now executors are stateless!
[Diagram: executors on a NodeManager write RDD intermediate files through the external shuffle plugin, so the files survive when executors are killed]
How to install (with Apache Hadoop)
• Copy the shuffle plugin jar to the NodeManager’s classpath
• Edit yarn-site.xml
• Edit spark-defaults.conf
Copy the shuffle jar to the NodeManager’s classpath

$ cp \
    lib/spark-*-yarn-shuffle.jar \
    /home/ubuntu/hadoop/share/hadoop/yarn/
Edit yarn-site.xml
• Add the shuffle plugin configuration
• Note that the documentation for 1.2 includes a typo (I sent a PR to fix it :-)
• See the documentation for 1.4
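The addition described in the Spark docs looks roughly like this (property names per the Spark "Running Spark on YARN" documentation; keep mapreduce_shuffle in the list if your cluster already uses it):

```
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```

After editing, restart the NodeManagers so the auxiliary service is loaded.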
Edit spark-defaults.conf
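Enabling dynamic allocation needs at least the following two settings; the min/max executor values are illustrative defaults, not from the talk:

```
spark.shuffle.service.enabled           true
spark.dynamicAllocation.enabled         true
spark.dynamicAllocation.minExecutors    1
spark.dynamicAllocation.maxExecutors    8
```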
We’re ready!!
• The number of executors is now decided automatically (no need to pass --num-executors)
Demo
Summary
• Spark on YARN has two modes:
• yarn-client mode
• yarn-cluster mode
• With dynamic allocation, Spark can launch jobs efficiently on YARN