Gandiva: Introspective Cluster Scheduling for Deep …patelp1/misc/gandiva...memory on resume - Or,...

- DL jobs have different sensitivities to resource affinity, due to network architecture or hyper- parameters (e.g., batch-size) - Other cases: Inter-server locality, 1-GPU interference, NIC interference Gandiva: Introspective Cluster Scheduling for Deep Learning Characteristics of Deep Learning Jobs Scheduling Mechanisms for Deep Learning Experimental Results Wencong Xiao ♠†* , Romil Bhardwaj †* , Ramachandran Ramjee † , Muthian Sivathanu † , Nipun Kwatra † , Zhenhua Han ♣† , Pratyush Patel†, Xuan Peng ♦† , Hanyu Zhao ●† , Quanlu Zhang † , Fan Yang † , Lidong Zhou † ♠ Beihang University, † Microsoft Research, ♣ The University of Hong Kong, ♦ Huazhong University of Science and Technology, ● Peking University Model Sensitivity Job3 Job4 … JobN Cluster Scheduler - Allocation reactively at job arrival and departure - Dedicated GPUs for a job in its whole lifetime - Jobs queued if no qualified resources Intra-job Predictability - GPU Memory usage follows a cyclic pattern aligned with mini-batch boundaries, usually with more than 10x difference in utilization within a mini-batch Feedback-driven exploration Job Queue - Opportunistically scale jobs to idle GPUs - Vacate GPUs on-demand - Depends on job capabilities to utilize additional GPUs Hyperparameter Search with Time-slicing Time to find a qualified model (minutes) Search across 12 dimensions - LeNet on CIFAR-10 Cluster Experiment: Time-slicing + Migration Cluster Experiment: Time-slicing + Packing 9-day trace from Microsoft servers on 100 GPUs CDF of job completion time Low overhead Suspend/Resume & Migration Migration time of real workloads Mixed PyTorch jobs on 180 Tesla GPUs Number of successful packing decisions Average GPU util(%) Time to find a qualified model (minutes) GPU Memory usage during training Intra-server locality - Deep learning experiments today use manual or automatic (AutoML) trial-and-error techniques to find the best model Job1 Job2 Job3 Job4 Job5 Job6 Which model is better? GPU cluster GPU Introspective Policies Up to 7x faster hyperparameter search 27% reduction in job completion time 13.6x faster AutoML in shared environments 26% increase in cluster GPU utilization 4.5x faster job feedback Over-subscription - Time-slice to allow multiple jobs to run simultaneously with a weighted time-share - Pack multiple jobs in the same server if jobs have light-weight resource requirements Profiling for introspection - Monitor resource utilization (e.g., GPU utilization and memory) - Non-invasive progress rate estimation for scheduling decisions - Copy GPU memory to CPU at mini-batch boundaries - Restore state from CPU memory on resume - Or, run multiple processes simultaneously on a GPU. - Generic GPU processes: Use CRIU to dump and restore process state across machines - Checkpoint-aware processes: repurpose TF checkpointing APIs to save and restore state. Pre-warm libraries for fast migration. Traditional GPU Allocation Suspend-Resume/Packing Migration Grow-Shrink Runtime adjustment - Migrate jobs at mini-batch boundary if better resources appear - Defrag GPUs to better compact resources for multi-GPU jobs - Grow to more resources when available and shrink when required Server 1 Server 2 Server 3 Server 4 J 1 J 1 J 2 J 2 Completed Job Multi-GPU Jobs Packed Jobs Migrate J 3 J 3

Upload
others
Category

Documents
view
2
download
0

Embed Size (px):

Transcript of Gandiva: Introspective Cluster Scheduling for Deep …patelp1/misc/gandiva...memory on resume - Or,...

Page 1: Gandiva: Introspective Cluster Scheduling for Deep …patelp1/misc/gandiva...memory on resume - Or, run multiple processes simultaneously on a GPU. -Generic GPU processes: Use CRIU

- DL jobs have different sensitivities to resource affinity, due to network architecture or hyper-parameters (e.g., batch-size)

- Other cases: Inter-server locality, 1-GPU interference, NIC interference

Gandiva: Introspective Cluster Scheduling for Deep Learning

Characteristics of Deep Learning Jobs

Scheduling Mechanisms for Deep Learning

Experimental Results

Wencong Xiao♠†*, Romil Bhardwaj†*, Ramachandran Ramjee†, Muthian Sivathanu†, Nipun Kwatra†, Zhenhua Han♣†, Pratyush Patel†, Xuan Peng♦†, Hanyu Zhao●†, Quanlu Zhang†, Fan Yang†, Lidong Zhou†

♠ Beihang University, † Microsoft Research, ♣ The University of Hong Kong,♦ Huazhong University of Science and Technology, ● Peking University

Model Sensitivity

Job3Job4…JobNCluster

Scheduler

- Allocation reactively at job arrival and departure

- Dedicated GPUs for a job in its whole lifetime

- Jobs queued if no qualified resources

Intra-job Predictability

- GPU Memory usage follows a cyclic pattern aligned with mini-batch boundaries, usually with more than 10x difference in utilization within a mini-batch

Feedback-driven exploration

Job Queue

- Opportunistically scale jobs to idle GPUs

- Vacate GPUs on-demand

- Depends on job capabilities to utilize additional GPUs

Hyperparameter Search with Time-slicing

Time to find a qualified model (minutes)

Search across 12 dimensions - LeNet on CIFAR-10

Cluster Experiment: Time-slicing + Migration

Cluster Experiment: Time-slicing + Packing

9-day trace from Microsoft servers on 100 GPUs

CDF of job completion time

Low overhead Suspend/Resume & Migration

Migration time of real workloads

Mixed PyTorch jobs on 180 Tesla GPUs

er o

f su

cces

sfu

l p

acki

dec

isio

Ave

rage

U u

til(

Time to find a qualified model (minutes)

GPU Memory usage during training

Intra-server locality

- Deep learning experiments today use manual or automatic (AutoML) trial-and-error techniques to find the best model

Job1Job2

Job3Job4

Job5 Job6

Which model is better?

GPU cluster

GPU

Introspective Policies

Up to 7x faster hyperparameter search

27% reduction in job completion time13.6x faster AutoML in shared environments

26% increase in cluster GPU utilization4.5x faster job feedback

Over-subscription

- Time-slice to allow multiple jobs to run simultaneously with a weighted time-share

- Pack multiple jobs in the same server if jobs have light-weight resource requirements

Profiling for introspection

- Monitor resource utilization (e.g., GPU utilization and memory)

- Non-invasive progress rate estimation for scheduling decisions

- Copy GPU memory to CPU at mini-batch boundaries

- Restore state from CPU memory on resume

- Or, run multiple processes simultaneously on a GPU.

- Generic GPU processes: Use CRIU to dump and restore process state across machines

- Checkpoint-aware processes:repurpose TF checkpointing APIs to save and restore state. Pre-warm libraries for fast migration.

Traditional GPU Allocation

Suspend-Resume/Packing

Migration

Grow-Shrink

Runtime adjustment

- Migrate jobs at mini-batch boundary if better resources appear

- Defrag GPUs to better compact resources for multi-GPU jobs

- Grow to more resources when available and shrink when required

Server 1

Server 2

Server 3

Server 4

Completed Job Multi-GPU Jobs Packed JobsMigrate

J3 J3

GPU Physics - Nvidiadeveloper.download.nvidia.com/.../siggraph/gpu_physics-siggraph-06.pdf · NVIDIA GPU Physics Multi-GPU configurations, mixed or same GPU type One GPU does both

GPU Computing with Matlab® @ CBI Laboratory. Overview GPU History & Hardware – GPU History – CPU vs. GPU Hardware – Parallelism Design Points GPU Software.

Property of American Airlines - Amazon S3 · GPU - 409 INFORMATION 1. Unit Description . A. GPU-406 and GPU-409 The GPU-406 and GPU-409 are self-contained, diesel enginedriven, ground

GPU-Accelerated Gaussian Processes for Object Detection

GPU GPU and Supercomputer - University of Rochester€¦ · 4/16/2015 9 Outline: GPU GPGPU and CUDA CPU + GPU SuperComputer (TianHe 1A) Latest Trends GPU accelerated supercomputer

GPU, GP-GPU, GPU computing

Gandiva: Introspective Cluster Scheduling ... - Romil Bhardwajromilbhardwaj.github.io/files/gandiva-poster-osdi18-v2.pdf · memory on resume - Or, run multiple processes simultaneously

GPU-Accelerated Gaussian Processes for Object Detection · Heriot-Watt University SSPD 2015 9th September 2015 1. IntroMethodResultsSummary Contents Introduction & Motivation Method

GPU Architecture and Programming. GPU vs CPU .

GPU Computing with MATLAB - GPU Technology Conference

PyCUDA: Even Simpler GPU Programming with Python · Python Code GPU Code GPU Compiler GPU Binary GPU Result Machine Human In GPU scripting, GPU code does not need to be a compile-time

GPU Acceleration for Seismic Interpretation Algorithmsdeveloper.download.nvidia.com/GTC/...GTC2012-GPU-Acceleration-Sei… · GPU Acceleration for Seismic ... GPU Acceleration for

$Multi-GPU Accelerated Refraction-Corrected Reflection ...on-demand.gputechconf.com/gtc/2015/...Reconstruction Stage Single CPU time Single GPU time Single GPU speedup Two GPU time$