Design of a Virtualization Framework to Enable GPU Sharing in Cluster Environments
Kittisak Sajjapongse and Michela Becchi, University of Missouri
nps.missouri.edu
GPUs in Clusters & Clouds
• Many-core GPUs are used in supercomputers
  – 3 out of the top 10 supercomputers use GPUs
  – Titan: > 20 petaflops, > 700 terabytes of memory; 18,688 nodes, each with a 16-core AMD CPU and 1 Nvidia Tesla K20 GPU
• Many-core GPUs are used in cloud computing
Different usage paradigms (accelerator model vs. cluster/cloud model):
  • 1 application vs. multi-tenancy
  • GPU as a dedicated resource vs. GPU as a shared resource
  • Explicit procurement of GPUs vs. resource virtualization & transparency
  • Static (or programmer-defined) binding of applications to GPUs vs. dynamic (or runtime) binding → better resource utilization and load balancing
  • Intra-application scheduling vs. intra- and inter-application scheduling
  • Memory management within an application vs. advanced memory management across applications
Context
[Diagram: GPU-accelerated applications (AMBER, GROMACS, NAMD, GPUBlast, LAMMPS) submitted by multiple users to a shared GPU cluster.]
We have designed a runtime that…
• Abstracts GPUs from end-users
• Schedules applications on GPUs
• Dynamically binds applications to GPUs
• Allows GPU sharing
• Provides memory management
• Provides dynamic recovery and load balancing in case of GPU failure/upgrade/downgrade
Deployment scenarios
• With cluster-level schedulers
  – E.g.: TORQUE, SLURM
[Diagram: on each node, CUDA applications are linked against an intercept library and talk to our runtime, which sits between them and the CUDA driver/runtime and the node's GPUs (GPU1..GPUn); a cluster-level scheduler dispatches applications across the nodes.]
• With VM-based systems for cloud computing
  – E.g.: Eucalyptus
[Diagram: each guest OS (VM1..VMn) runs a CUDA application linked against the intercept library; a VM manager places VMs on hosts, and on each host OS our runtime mediates access to the CUDA driver/runtime and GPUs (GPU1..GPUn).]
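The intercept library is what makes the runtime transparent to applications: it front-ends the CUDA runtime API and forwards calls to our runtime. A minimal sketch of the idea, assuming an LD_PRELOAD-style shim (the wrapper body and logging are illustrative, not the actual implementation):

    /* intercept.c - build: gcc -shared -fPIC intercept.c -o libintercept.so -ldl */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Intercept cudaMalloc: a real shim would register the request with the
     * runtime's memory manager (enabling delayed binding) instead of just
     * logging and forwarding to the real CUDA runtime symbol. */
    cudaError_t cudaMalloc(void **devPtr, size_t size)
    {
        static cudaError_t (*real)(void **, size_t) = NULL;
        if (!real)
            real = (cudaError_t (*)(void **, size_t))dlsym(RTLD_NEXT, "cudaMalloc");
        fprintf(stderr, "[intercept] cudaMalloc(%zu bytes)\n", size);
        return real(devPtr, size);
    }

Running an unmodified CUDA binary with LD_PRELOAD=./libintercept.so then routes every cudaMalloc call through the shim.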
GPU sharing
• Inter-kernel sharing [HPDC’11]
  – When: GPU underutilized within a kernel
  – Why: limited parallelism, small datasets
  – How: kernel consolidation across applications
• Inter-application sharing [HPDC’12]
  – When: GPU underutilized within an application
  – Why: long CPU phases
  – How: application multiplexing on the GPU
[Timelines: with inter-kernel sharing, blocks of k1 and k2 occupy the GPU side by side instead of k1 alone; with inter-application sharing, app2's GPU phases fill the gaps left by app1's CPU phases.]
GPU sharing
• Multi-process application sharing [HPDC’13]
  – When: GPU underutilized by multi-process applications (e.g. MPI)
  – Why: synchronization leads to intra- & inter-application imbalance
  – How: preempt some inactive processes to allow other processes to progress
[Timelines on GPU 0 and GPU 1: processes A0/A1 and B0/B1 of two applications; without preemption the processes serialize, while with preemption B0/B1 run in the gaps left while A0/A1 wait at synchronization points, shortening the schedule.]
Inter-kernel sharing [HPDC’11]
[Timeline: serialized execution runs app1's sequence (malloc, copyHD, k1, copyDH, free) followed by app2's (malloc, copyHD, k2, copyDH, free); inter-kernel sharing overlaps the two sequences and launches a combined k1 & k2 kernel. This assumes app1 and app2 have no conflicting memory requirements.]
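A minimal sketch of what kernel consolidation amounts to, assuming the two applications' kernels can be fused into one launch whose blocks are partitioned between them (the kernel bodies and the partitioning scheme are illustrative, not the HPDC’11 mechanism):

    // The first g1 blocks execute k1's work, the remaining blocks k2's,
    // so both applications' thread-blocks space-share the SMs in one launch.
    __global__ void combined_k1_k2(float *a, int n1, float *b, int n2, int g1)
    {
        if (blockIdx.x < g1) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n1) a[i] *= 2.0f;                 // k1's (illustrative) body
        } else {
            int i = (blockIdx.x - g1) * blockDim.x + threadIdx.x;
            if (i < n2) b[i] += 1.0f;                 // k2's (illustrative) body
        }
    }
    // Launched with the blocks of both kernels:
    //   combined_k1_k2<<<g1 + g2, 256>>>(a, n1, b, n2, g1);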
Space- vs. time-sharing: some results
[Chart: relative throughput benefit (roughly 1.3x to 2.1x) over serialized execution for the workload mixes BS+KM, BO+KNN, PDE+MD, EU+IP, BS+BO, KM+KNN, BO+EU, BS+MD, comparing space-sharing and time-sharing; runs span four batches over two GPUs.]
Molding
• Idea:
  – Downgrade the execution configuration of kernels so as to force beneficial sharing
  – Penalize a single application to improve overall throughput
    • Limiting # blocks → force space-sharing
    • Limiting # threads/block → force time-sharing with interleaved execution
[Diagram: kernel1 is downgraded from 4 blocks to 3 and kernel2 from 4 blocks to 2, after which kernel1 and kernel2 can space-share the GPU.]
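One way a kernel can tolerate a downgraded launch configuration is a grid-stride loop, so correctness does not depend on the block count; this is a sketch of the molding idea under that assumption, not necessarily the paper's mechanism:

    // A grid-stride kernel stays correct however many blocks it gets,
    // letting the runtime shrink the grid to force space- or time-sharing.
    __global__ void scale(float *a, int n, float s)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += gridDim.x * blockDim.x)      // each block picks up extra work
            a[i] *= s;
    }
    // Molded launch: 2 blocks instead of the 4 the application asked for.
    //   scale<<<2, 256>>>(a, n, 2.0f);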
Molding: some results
• Molding can improve overall throughput despite penalizing single applications
[Chart: relative throughput benefit (up to roughly 1.8x) for the workload mixes IP+BS, PDE+MD, IP+BO, BS+KM, with and without molding, under forced time-sharing and forced space-sharing on two GPUs.]
Inter-application sharing [HPDC’12]
[Timeline: app1 interleaves CPU phases with GPU phases (malloc, copyHD, k11, k12, copyDH, free); app2 does the same with kernels k21-k23. Serialized execution runs the two back to back; sharing overlaps one application's GPU phases with the other's CPU phases, both without and with conflicting memory requirements (the latter adds CPU→GPU and GPU→CPU swap transfers).]
Our runtime: node-level view
[Architecture diagram: applications app1..appN are linked against the intercept library and reach the runtime through a connection manager and offload control (which also handles node-to-node offloading). Each physical GPU GPU1..GPUn is exposed as k virtual GPUs (vGPU11..vGPUnk) on top of the CUDA driver/runtime; a per-GPU dispatcher manages queues of waiting, assigned, and failed contexts. A memory manager maintains a page table and a swap area.
  – Virtual GPUs → abstraction, GPU sharing
  – Dispatcher → scheduling, GPU binary registration
  – Memory manager → virtual memory handling]
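A data-structure sketch of this node-level view, assuming one dispatcher per vGPU and intrusive context queues (all names and the fixed vGPU count are illustrative, not the runtime's actual code):

    #define VGPUS_PER_GPU 4

    typedef struct context {        /* one per client application context */
        int             app_id;
        struct context *next;       /* intrusive queue link */
    } context_t;

    typedef struct vgpu {
        context_t *waiting;         /* contexts not yet bound to the GPU */
        context_t *assigned;        /* contexts currently using the GPU  */
        context_t *failed;          /* contexts hit by a GPU failure     */
    } vgpu_t;

    typedef struct gpu {
        int    cuda_device;         /* index passed to cudaSetDevice()   */
        vgpu_t vgpu[VGPUS_PER_GPU]; /* virtual GPUs sharing this device  */
    } gpu_t;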
Mapping and scheduling (FCFS)
[Diagram: app1 (threads t1, t2), app2 and app3 (threads t1-t3) open contexts c11..c33 through the FE library and the connection manager. The dispatcher assigns contexts in FCFS order to vGPU11/vGPU12 on GPU1, vGPU21/vGPU22 on GPU2, and vGPU31/vGPU32 on GPU3; contexts that do not fit wait in the waiting-context queue.]
Mapping and scheduling (FCFS)
[Diagram, continued: when context c11 completes, its vGPU is handed to the next waiting context (c32). The hardware configuration and the application-GPU mapping are abstracted from end-users, and vGPUs provide time-sharing of the GPUs.]
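Using the vgpu_t sketch above, the FCFS policy reduces to popping the oldest waiting context whenever a vGPU frees up; a minimal, illustrative version:

    /* Bind the oldest waiting context to a free vGPU (FCFS). Called by the
     * dispatcher when a context completes or a new context arrives. */
    void fcfs_dispatch(vgpu_t *v)
    {
        if (v->assigned == NULL && v->waiting != NULL) {
            context_t *c = v->waiting;   /* queue head = oldest context */
            v->waiting   = c->next;
            c->next      = NULL;
            v->assigned  = c;            /* context now owns the vGPU   */
        }
    }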
Delayed binding
[Diagram: app1 issues malloc1, copyHD11, copyHD12, kernel1, copyDH1, and app2 issues malloc2, copyHD2, kernel2, copyDH2. The memory manager records the allocations (d1, d2) in its page table and stages the data in the swap area; contexts c1 and c2 are bound to vGPU11/vGPU12 on GPU1 only when a kernel must actually run.]
• Deferral of application-GPU mapping
  – Better scheduling decisions
  – GPU memory allocated only when needed
• Memory manager in the runtime enables dynamic binding
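A sketch of delayed binding on the memory-manager side, assuming a page-table entry per virtual allocation with the isAllocated/toCopy2Dev/toCopy2Swap flags used later in the talk (field and function names are illustrative):

    #include <stdlib.h>
    #include <string.h>
    #include <cuda_runtime.h>

    typedef struct pte {
        void  *dev;          /* device pointer, NULL until bound to a GPU */
        void  *swap;         /* host-side swap-area buffer                */
        size_t size;
        int    isAllocated;  /* device memory currently allocated?        */
        int    toCopy2Dev;   /* swap copy newer than device copy?         */
        int    toCopy2Swap;  /* device copy newer than swap copy?         */
    } pte_t;

    /* Intercepted malloc: create the PTE and allocate swap space only;
     * the real cudaMalloc is deferred until a kernel launch needs it. */
    cudaError_t rt_malloc(pte_t *p, size_t size)
    {
        p->dev  = NULL;
        p->swap = malloc(size);
        p->size = size;
        p->isAllocated = p->toCopy2Dev = p->toCopy2Swap = 0;
        return p->swap ? cudaSuccess : cudaErrorMemoryAllocation;
    }

    /* Intercepted copyHD: stage the data in the swap area and mark it as
     * pending transfer to the device. */
    cudaError_t rt_copyHD(pte_t *p, const void *src, size_t size)
    {
        if (size > p->size) return cudaErrorInvalidValue;  /* size mismatch */
        memcpy(p->swap, src, size);
        p->toCopy2Dev = 1;
        return cudaSuccess;
    }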
Dynamic binding & swapping
[Diagram: apps 1-4 issue malloc/copyHD/kernel sequences whose buffers d1-d4 cannot all coexist in GPU1's memory. When GPU1 fills up, the memory manager swaps d1 out to the swap area so the next allocation can proceed, and contexts are rebound across vGPU11/vGPU12 on GPU1 and vGPU21/vGPU22 on GPU2.]
• GPU sharing among applications with conflicting memory requirements
• Migration of applications from slower to faster GPUs
• High availability in case of GPU failure
• Load balancing in case of GPU upgrade/downgrade
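Swapping a buffer out, in terms of the pte_t sketch above (illustrative; write-back only happens when the device copy is the freshest one):

    /* Evict one buffer from the GPU into the host swap area so another
     * context's allocation can proceed; restored lazily at the next launch. */
    cudaError_t rt_swap_out(pte_t *p)
    {
        if (p->toCopy2Swap)      /* device copy newer than swap copy */
            cudaMemcpy(p->swap, p->dev, p->size, cudaMemcpyDeviceToHost);
        if (p->isAllocated) {
            cudaFree(p->dev);
            p->dev = NULL;
            p->isAllocated = 0;
        }
        p->toCopy2Dev  = 1;      /* must be restored before the next launch */
        p->toCopy2Swap = 0;
        return cudaSuccess;
    }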
Experiments: sharing & swapping
• 2 Tesla C2050 and 1 Tesla C1060 GPUs
• 36 matmul jobs with 5 kernel calls and varying CPU phases
• Sharing increases performance by hiding CPU phases
[Chart: total execution time (0-500 sec) vs. fraction of CPU code (0-2) for serialized execution (1 vGPU) and GPU sharing (4 vGPUs).]
Experiments: cluster w/ TORQUE
• 3-node cluster with 2 GPU nodes (2 Tesla C2050s and 1 C1060)
• >2x performance improvement due to sharing, a further 20% due to offloading
[Chart: total and average execution time (0-1400 sec) for 16, 32, and 48 jobs under serialized execution, GPU sharing (4 vGPUs), and GPU sharing + load balancing.]
Load Imbalance in Multi-process applications [HPDC’13]
• Causes of load imbalance
  – Intrinsic load imbalance (intra-application)
  – Different GPU capabilities (intra-application)
  – Mismatch between the number of GPUs and the number of processes (inter-application)
  – Synchronization among processes
[Diagrams: examples of intra-application and inter-application imbalance.]
Preemption Policies
• Maximum idle time-driven preemption
  – Preempt a context (process) whenever it does not utilize the GPU for a predefined amount of time (see the sketch after this list)
  – Pros: easy implementation
  – Cons: the maximum-idle-time parameter must be tuned
• Synchronization call-driven preemption
  – Preempt a context (process) whenever a collective communication or synchronization call is serviced
  – Pros: no parameter setting
  – Cons: either requires bookkeeping (complex implementation, overhead) or causes unnecessary preemptions (e.g. when the last process enters the synchronization point)
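A minimal sketch of the idle-time policy, reusing the vgpu_t/context_t types above (the threshold, timing source, and requeue order are illustrative):

    #include <time.h>

    #define MAX_IDLE_SEC 2.0    /* the policy's tunable parameter */

    /* Preempt the assigned context if it has not used the GPU recently;
     * its state would be swapped out and it rejoins the waiting queue. */
    void idle_preempt(vgpu_t *v, time_t last_gpu_call)
    {
        if (v->assigned &&
            difftime(time(NULL), last_gpu_call) > MAX_IDLE_SEC) {
            context_t *c = v->assigned;
            v->assigned  = NULL;
            c->next      = v->waiting;  /* simple requeue; a real policy  */
            v->waiting   = c;           /* may preserve arrival order     */
        }
    }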
Experiment: Node-level
• Intra-application imbalance
  – The batch scheduler fails to capture intra-application imbalance
  – N-way sharing hides CPU execution behind the GPU execution phases of co-located processes
  – Combining 2-way sharing and preemption further improves performance
[Chart: overall execution time (150-290 seconds) vs. percentage imbalance (10-50%) for batch scheduling, 2-way sharing, 4-way sharing, preemptive sharing, and preemptive 2-way sharing; annotations mark improvements of 38% and 8.8%.]
Experiment: Node-level
• Inter-application imbalance
  – Batch scheduling causes GPU underutilization, leading to performance loss
  – N-way sharing provides an improvement only if the imbalance is high
  – Preemptive sharing corrects the imbalance and leads to a performance improvement
[Chart: overall execution time (150-290 seconds) for workload compositions 3x[4] + 1x[3], 2x[4] + 2x[3], 1x[4] + 3x[3], and 4x[3] under batch scheduling, 2-way sharing, 4-way sharing, preemptive sharing, and preemptive 2-way sharing.]
Experiment: Cluster-level
• 2 nodes with 7 GPUs
• The batch scheduler is unable to schedule jobs with more processes than GPUs
• 4-way sharing and preemptive sharing lead to 25-30% and 40-45% performance improvement, respectively
[Chart: overall execution time (0-900 seconds) for batch scheduling, 4-way sharing, and preemptive 2-way sharing, with 4, 6, and 8 processes per application.]
Conclusion
• Node-level runtime providing
  – GPU virtualization (Manageability)
  – GPU sharing (Utilization, Latency Hiding)
  – Flexible scheduling (Configurability)
  – Dynamic binding & preemption (Utilization, Latency Hiding)
• What lies ahead…
  – Integration with cluster-level schedulers
  – Dynamic scheduling at the cluster level
  – Power-efficiency considerations
Thanks
• My coauthors: Michela @ MU, Xiang @ MU, Ian @ MU, Adam @ MU, Vignesh @ AMD, Chak @ NEC
• You all for the attention!
Understanding GPU resource utilization (cont’d)
[Decision diagram: if a kernel uses fewer thread-blocks than SMs, some SMs sit idle and co-scheduled blocks from another kernel can SPACE-SHARE the GPU (best case). If all SMs are busy, thread-blocks TIME-SHARE: when co-scheduled thread-blocks have conflicting register/shared-memory requirements their execution is serialized (worst case); otherwise it is interleaved, hiding latencies.]
Experiments: runtime overhead
• 1 Tesla C2050 GPU, short-running jobs
• Overhead < 10% in the worst case, amortized through GPU sharing
[Chart: total execution time (0-25 sec) for 1, 2, 4, and 8 jobs on the bare CUDA runtime and on our runtime with 1, 2, 4, and 8 vGPUs.]
HPDC’13
• From single-process, single-threaded applications to multi-process/multi-threaded applications
• Challenge: synchronizations (e.g. barrier synchronizations, communication primitives) can introduce GPU underutilization
• Solution: preemptive GPU sharing
Scenario 1: Intra-application Imbalance
[Timelines across GPU0-GPU3 for applications A, B, and C with synchronization points syncA1, syncA2, syncB1, syncC1: (a) batch scheduling leaves GPUs idle while A's processes wait at synchronization points; (b) controlled 2-way sharing lets B's and C's processes fill part of that idle time; (c) preemptive sharing evicts waiting processes so B and C progress further, shortening the overall schedule.]
Scenario 2: Inter-application Imbalance
[Timelines across vGPUs on GPU0-GPU3 for applications A, B, and C under (a) batch scheduling, (b) controlled 2-way sharing, (c) preemptive sharing, and (d) preemptive 2-way sharing; the legend marks each application's synchronization points and idle time. Sharing and preemption progressively reduce idle time.]
Types of swapping operations
• Inter-application swapping
  – Time-sharing of the GPU among applications with conflicting memory requirements
• Intra-application swapping
  – The memory footprint of one application is the memory footprint of its “largest” kernel

    malloc(&A_d, size);
    malloc(&B_d, size);
    malloc(&C_d, size);
    copyHD(A_d, A_h, size);
    matmul(A_d, A_d, B_d);   // B_d = A_d * A_d
    matmul(B_d, B_d, C_d);   // C_d = B_d * B_d
    copyDH(B_h, B_d, size);
    copyDH(C_h, C_d, size);

On the bare CUDA runtime, the third malloc exceeds the GPU's memory capacity → runtime error.
On our runtime, the first matmul triggers the first memory allocation & data transfer to the GPU (A_d & B_d); the second triggers SWAP(A_d) & the memory allocation of C_d.
Experiments: load balancing w/ dynamic binding
• Unbalanced system: 2 Tesla C2050 and 1 Quadro 2000 GPUs
• Especially on small batches of jobs, dynamic binding improves performance
[Chart: total execution time (0-1400 sec) for 12, 24, and 36 jobs, with CPU fraction 0 and 1, without load balancing and with load balancing through dynamic binding.]
Runtime configurations
• Initial memory transfer deferral only
  – Only memory transfers before the 1st kernel call are deferred
  – Pros: overlaps computation and communication
  – Cons: more swapping overhead
• Unconditional memory transfer deferral
  – All memory transfers are deferred
  – Pros: less swapping overhead
  – Cons: no computation/communication overlap
Application call | Actions performed by the runtime   | Errors returned by the runtime
-----------------+------------------------------------+-------------------------------------
Malloc           | Create PTE                         | A virtual address cannot be assigned
                 | Allocate swap                      | Swap memory cannot be allocated
CopyHD           | Check valid PTE                    | No valid PTE
                 | Move data to swap                  | Swap-data size mismatch
CopyDH           | Check valid PTE                    | No valid PTE
                 | If (PTE.toCopy2Swap) cudaMemcpyDH  | -
Free             | Check valid PTE                    | No valid PTE
                 | De-allocate swap                   | Cannot de-allocate swap
                 | If (PTE.isAllocated) cudaFree      | -
Launch           | Check valid PTE                    | No valid PTE
                 | If (^PTE.isAllocated) cudaMalloc   | -
                 | If (PTE.toCopy2Dev) cudaMemcpyHD   | -
                 | cudaLaunch                         | -
Swap             | Check valid PTE                    | No valid PTE
                 | If (PTE.toCopy2Swap) cudaMemcpyDH  | -
                 | If (PTE.isAllocated) cudaFree      | -
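The Launch row is the heart of delayed binding; a sketch of that path in terms of the pte_t type above (error handling and the actual kernel forwarding are elided and illustrative):

    /* Lazily bind one buffer to the device before forwarding a launch. */
    cudaError_t rt_launch_prepare(pte_t *p)
    {
        if (!p->isAllocated) {                 /* If (^PTE.isAllocated) */
            cudaError_t e = cudaMalloc(&p->dev, p->size);
            if (e != cudaSuccess) return e;    /* out of device memory: swap */
            p->isAllocated = 1;
        }
        if (p->toCopy2Dev) {                   /* If (PTE.toCopy2Dev) */
            cudaMemcpy(p->dev, p->swap, p->size, cudaMemcpyHostToDevice);
            p->toCopy2Dev  = 0;
            p->toCopy2Swap = 1;                /* kernel may update the device copy */
        }
        return cudaSuccess;                    /* ... then cudaLaunch the kernel */
    }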
Flags for Page Table Entries handling
[State diagram over the flag triple isAllocated/toCopy2Dev/toCopy2Swap: states F/F/F, F/T/F, T/F/F, T/T/F, and T/F/T, with transitions driven by copyHD, copyDH, launch, and swap calls (e.g. copyHD sets toCopy2Dev, launch allocates device memory and clears it, and swap writes back to the swap area and frees the device copy).]