Design of a Virtualization Framework to Enable GPU Sharing in Cluster Environments
Kittisak Sajjapongse and Michela Becchi, University of Missouri
nps.missouri.edu
GPUs in Clusters & Clouds
• Many-core GPUs are used in supercomputers
  – 3 out of the top 10 supercomputers use GPUs
  – Titan: > 20 petaflops, > 700 terabytes of memory; 18,688 nodes, each with a 16-core AMD CPU and 1 Nvidia Tesla K20 GPU
• Many-core GPUs are used in cloud computing
Different usage paradigms (accelerator model vs. cluster/cloud model):
  • 1 application vs. multi-tenancy
  • GPU as a dedicated resource vs. GPU as a shared resource
  • Explicit procurement of GPUs vs. resource virtualization & transparency
  • Static (or programmer-defined) binding of applications to GPUs vs. dynamic (or runtime) binding → better resource utilization and load balancing
  • Intra-application scheduling vs. intra- and inter-application scheduling
  • Memory management within an application vs. advanced memory management across applications
Context
[Diagram: GPU-accelerated applications (AMBER, GROMACS, NAMD, GPUBlast, LAMMPS) submitted by multiple users to a shared GPU cluster.]
We have designed a runtime that…
• Abstracts GPUs from end-users
• Schedules applications on GPUs
• Dynamically binds applications to GPUs
• Allows GPU sharing
• Provides memory management
• Provides dynamic recovery and load balancing in case of GPU failure/upgrade/downgrade
Deployment scenarios
• With cluster-level schedulers
  – E.g.: TORQUE, SLURM
[Diagram: on each node, CUDA applications are linked against an intercept library and talk to our runtime, which sits between them and the CUDA driver/runtime and the node's GPUs (GPU1..GPUn); a cluster-level scheduler dispatches applications across the nodes.]
• With VM-based systems for cloud computing
  – E.g.: Eucalyptus
[Diagram: each guest OS (VM1..VMn) runs a CUDA application linked against the intercept library; a VM manager places VMs on hosts, and on each host OS our runtime mediates access to the CUDA driver/runtime and GPUs (GPU1..GPUn).]
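The intercept library is what makes the runtime transparent to applications: it front-ends the CUDA runtime API and forwards calls to our runtime. A minimal sketch of the idea, assuming an LD_PRELOAD-style shim (the wrapper body and logging are illustrative, not the actual implementation):

    /* intercept.c - build: gcc -shared -fPIC intercept.c -o libintercept.so -ldl */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Intercept cudaMalloc: a real shim would register the request with the
     * runtime's memory manager (enabling delayed binding) instead of just
     * logging and forwarding to the real CUDA runtime symbol. */
    cudaError_t cudaMalloc(void **devPtr, size_t size)
    {
        static cudaError_t (*real)(void **, size_t) = NULL;
        if (!real)
            real = (cudaError_t (*)(void **, size_t))dlsym(RTLD_NEXT, "cudaMalloc");
        fprintf(stderr, "[intercept] cudaMalloc(%zu bytes)\n", size);
        return real(devPtr, size);
    }

Running an unmodified CUDA binary with LD_PRELOAD=./libintercept.so then routes every cudaMalloc call through the shim.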
GPU sharing
• Inter-kernel sharing [HPDC’11]
  – When: GPU underutilized within a kernel
  – Why: limited parallelism, small datasets
  – How: kernel consolidation across applications
• Inter-application sharing [HPDC’12]
  – When: GPU underutilized within an application
  – Why: long CPU phases
  – How: application multiplexing on the GPU
[Timelines: with inter-kernel sharing, blocks of k1 and k2 occupy the GPU side by side instead of k1 alone; with inter-application sharing, app2's GPU phases fill the gaps left by app1's CPU phases.]
GPU sharing
• Multi-process application sharing [HPDC’13]
  – When: GPU underutilized by multi-process applications (e.g. MPI)
  – Why: synchronization leads to intra- & inter-application imbalance
  – How: preempt some inactive processes to allow other processes to progress
[Timelines on GPU 0 and GPU 1: processes A0/A1 and B0/B1 of two applications; without preemption the processes serialize, while with preemption B0/B1 run in the gaps left while A0/A1 wait at synchronization points, shortening the schedule.]
Inter-kernel sharing [HPDC’11]
[Timeline: serialized execution runs app1's sequence (malloc, copyHD, k1, copyDH, free) followed by app2's (malloc, copyHD, k2, copyDH, free); inter-kernel sharing overlaps the two sequences and launches a combined k1 & k2 kernel. This assumes app1 and app2 have no conflicting memory requirements.]
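A minimal sketch of what kernel consolidation amounts to, assuming the two applications' kernels can be fused into one launch whose blocks are partitioned between them (the kernel bodies and the partitioning scheme are illustrative, not the HPDC’11 mechanism):

    // The first g1 blocks execute k1's work, the remaining blocks k2's,
    // so both applications' thread-blocks space-share the SMs in one launch.
    __global__ void combined_k1_k2(float *a, int n1, float *b, int n2, int g1)
    {
        if (blockIdx.x < g1) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n1) a[i] *= 2.0f;                 // k1's (illustrative) body
        } else {
            int i = (blockIdx.x - g1) * blockDim.x + threadIdx.x;
            if (i < n2) b[i] += 1.0f;                 // k2's (illustrative) body
        }
    }
    // Launched with the blocks of both kernels:
    //   combined_k1_k2<<<g1 + g2, 256>>>(a, n1, b, n2, g1);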
Space- vs. time-sharing: some results
[Chart: relative throughput benefit (roughly 1.3x to 2.1x) over serialized execution for the workload mixes BS+KM, BO+KNN, PDE+MD, EU+IP, BS+BO, KM+KNN, BO+EU, BS+MD, comparing space-sharing and time-sharing; runs span four batches over two GPUs.]
Molding
• Idea:
  – Downgrade the execution configuration of kernels so as to force beneficial sharing
  – Penalize a single application to improve overall throughput
    • Limiting # blocks → force space-sharing
    • Limiting # threads/block → force time-sharing with interleaved execution
[Diagram: kernel1 is downgraded from 4 blocks to 3 and kernel2 from 4 blocks to 2, after which kernel1 and kernel2 can space-share the GPU.]
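One way a kernel can tolerate a downgraded launch configuration is a grid-stride loop, so correctness does not depend on the block count; this is a sketch of the molding idea under that assumption, not necessarily the paper's mechanism:

    // A grid-stride kernel stays correct however many blocks it gets,
    // letting the runtime shrink the grid to force space- or time-sharing.
    __global__ void scale(float *a, int n, float s)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += gridDim.x * blockDim.x)      // each block picks up extra work
            a[i] *= s;
    }
    // Molded launch: 2 blocks instead of the 4 the application asked for.
    //   scale<<<2, 256>>>(a, n, 2.0f);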
Molding: some results
• Molding can improve overall throughput despite penalizing single applications
[Chart: relative throughput benefit (up to roughly 1.8x) for the workload mixes IP+BS, PDE+MD, IP+BO, BS+KM, with and without molding, under forced time-sharing and forced space-sharing on two GPUs.]
Inter-application sharing [HPDC’12]
[Timeline: app1 interleaves CPU phases with GPU phases (malloc, copyHD, k11, k12, copyDH, free); app2 does the same with kernels k21-k23. Serialized execution runs the two back to back; sharing overlaps one application's GPU phases with the other's CPU phases, both without and with conflicting memory requirements (the latter adds CPU→GPU and GPU→CPU swap transfers).]
Our runtime: node-level view
[Architecture diagram: applications app1..appN are linked against the intercept library and reach the runtime through a connection manager and offload control (which also handles node-to-node offloading). Each physical GPU GPU1..GPUn is exposed as k virtual GPUs (vGPU11..vGPUnk) on top of the CUDA driver/runtime; a per-GPU dispatcher manages queues of waiting, assigned, and failed contexts. A memory manager maintains a page table and a swap area.
  – Virtual GPUs → abstraction, GPU sharing
  – Dispatcher → scheduling, GPU binary registration
  – Memory manager → virtual memory handling]
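A data-structure sketch of this node-level view, assuming one dispatcher per vGPU and intrusive context queues (all names and the fixed vGPU count are illustrative, not the runtime's actual code):

    #define VGPUS_PER_GPU 4

    typedef struct context {        /* one per client application context */
        int             app_id;
        struct context *next;       /* intrusive queue link */
    } context_t;

    typedef struct vgpu {
        context_t *waiting;         /* contexts not yet bound to the GPU */
        context_t *assigned;        /* contexts currently using the GPU  */
        context_t *failed;          /* contexts hit by a GPU failure     */
    } vgpu_t;

    typedef struct gpu {
        int    cuda_device;         /* index passed to cudaSetDevice()   */
        vgpu_t vgpu[VGPUS_PER_GPU]; /* virtual GPUs sharing this device  */
    } gpu_t;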
Mapping and scheduling (FCFS)
[Diagram: app1 (threads t1, t2), app2 and app3 (threads t1-t3) open contexts c11..c33 through the FE library and the connection manager. The dispatcher assigns contexts in FCFS order to vGPU11/vGPU12 on GPU1, vGPU21/vGPU22 on GPU2, and vGPU31/vGPU32 on GPU3; contexts that do not fit wait in the waiting-context queue.]
Mapping and scheduling (FCFS)
[Diagram, continued: when context c11 completes, its vGPU is handed to the next waiting context (c32). The hardware configuration and the application-GPU mapping are abstracted from end-users, and vGPUs provide time-sharing of the GPUs.]
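Using the vgpu_t sketch above, the FCFS policy reduces to popping the oldest waiting context whenever a vGPU frees up; a minimal, illustrative version:

    /* Bind the oldest waiting context to a free vGPU (FCFS). Called by the
     * dispatcher when a context completes or a new context arrives. */
    void fcfs_dispatch(vgpu_t *v)
    {
        if (v->assigned == NULL && v->waiting != NULL) {
            context_t *c = v->waiting;   /* queue head = oldest context */
            v->waiting   = c->next;
            c->next      = NULL;
            v->assigned  = c;            /* context now owns the vGPU   */
        }
    }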
Delayed binding
[Diagram: app1 issues malloc1, copyHD11, copyHD12, kernel1, copyDH1, and app2 issues malloc2, copyHD2, kernel2, copyDH2. The memory manager records the allocations (d1, d2) in its page table and stages the data in the swap area; contexts c1 and c2 are bound to vGPU11/vGPU12 on GPU1 only when a kernel must actually run.]
• Deferral of application-GPU mapping
  – Better scheduling decisions
  – GPU memory allocated only when needed
• Memory manager in the runtime enables dynamic binding
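A sketch of delayed binding on the memory-manager side, assuming a page-table entry per virtual allocation with the isAllocated/toCopy2Dev/toCopy2Swap flags used later in the talk (field and function names are illustrative):

    #include <stdlib.h>
    #include <string.h>
    #include <cuda_runtime.h>

    typedef struct pte {
        void  *dev;          /* device pointer, NULL until bound to a GPU */
        void  *swap;         /* host-side swap-area buffer                */
        size_t size;
        int    isAllocated;  /* device memory currently allocated?        */
        int    toCopy2Dev;   /* swap copy newer than device copy?         */
        int    toCopy2Swap;  /* device copy newer than swap copy?         */
    } pte_t;

    /* Intercepted malloc: create the PTE and allocate swap space only;
     * the real cudaMalloc is deferred until a kernel launch needs it. */
    cudaError_t rt_malloc(pte_t *p, size_t size)
    {
        p->dev  = NULL;
        p->swap = malloc(size);
        p->size = size;
        p->isAllocated = p->toCopy2Dev = p->toCopy2Swap = 0;
        return p->swap ? cudaSuccess : cudaErrorMemoryAllocation;
    }

    /* Intercepted copyHD: stage the data in the swap area and mark it as
     * pending transfer to the device. */
    cudaError_t rt_copyHD(pte_t *p, const void *src, size_t size)
    {
        if (size > p->size) return cudaErrorInvalidValue;  /* size mismatch */
        memcpy(p->swap, src, size);
        p->toCopy2Dev = 1;
        return cudaSuccess;
    }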
Dynamic binding & swapping
[Diagram: apps 1-4 issue malloc/copyHD/kernel sequences whose buffers d1-d4 cannot all coexist in GPU1's memory. When GPU1 fills up, the memory manager swaps d1 out to the swap area so the next allocation can proceed, and contexts are rebound across vGPU11/vGPU12 on GPU1 and vGPU21/vGPU22 on GPU2.]
• GPU sharing among applications with conflicting memory requirements
• Migration of applications from slower to faster GPUs
• High availability in case of GPU failure
• Load balancing in case of GPU upgrade/downgrade
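Swapping a buffer out, in terms of the pte_t sketch above (illustrative; write-back only happens when the device copy is the freshest one):

    /* Evict one buffer from the GPU into the host swap area so another
     * context's allocation can proceed; restored lazily at the next launch. */
    cudaError_t rt_swap_out(pte_t *p)
    {
        if (p->toCopy2Swap)      /* device copy newer than swap copy */
            cudaMemcpy(p->swap, p->dev, p->size, cudaMemcpyDeviceToHost);
        if (p->isAllocated) {
            cudaFree(p->dev);
            p->dev = NULL;
            p->isAllocated = 0;
        }
        p->toCopy2Dev  = 1;      /* must be restored before the next launch */
        p->toCopy2Swap = 0;
        return cudaSuccess;
    }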
Experiments: sharing & swapping
• 2 Tesla C2050 and 1 Tesla C1060 GPUs
• 36 matmul jobs with 5 kernel calls and varying CPU phases
• Sharing increases performance by hiding CPU phases
[Chart: total execution time (0-500 sec) vs. fraction of CPU code (0-2) for serialized execution (1 vGPU) and GPU sharing (4 vGPUs).]
Experiments: cluster w/ TORQUE
• 3-node cluster with 2 GPU nodes (2 Tesla C2050s and 1 C1060)
• >2x performance improvement due to sharing, a further 20% due to offloading
[Chart: total and average execution time (0-1400 sec) for 16, 32, and 48 jobs under serialized execution, GPU sharing (4 vGPUs), and GPU sharing + load balancing.]
Load Imbalance in Multi-process applications [HPDC’13]
• Causes of load imbalance
  – Intrinsic load imbalance (intra-application)
  – Different GPU capabilities (intra-application)
  – Mismatch between the number of GPUs and the number of processes (inter-application)
  – Synchronization among processes
[Diagrams: examples of intra-application and inter-application imbalance.]
Preemption Policies
• Maximum idle time-driven preemption
  – Preempt a context (process) whenever it does not utilize the GPU for a predefined amount of time (see the sketch after this list)
  – Pros: easy implementation
  – Cons: the maximum-idle-time parameter must be tuned
• Synchronization call-driven preemption
  – Preempt a context (process) whenever a collective communication or synchronization call is serviced
  – Pros: no parameter setting
  – Cons: either requires bookkeeping (complex implementation, overhead) or causes unnecessary preemptions (e.g. when the last process enters the synchronization point)
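A minimal sketch of the idle-time policy, reusing the vgpu_t/context_t types above (the threshold, timing source, and requeue order are illustrative):

    #include <time.h>

    #define MAX_IDLE_SEC 2.0    /* the policy's tunable parameter */

    /* Preempt the assigned context if it has not used the GPU recently;
     * its state would be swapped out and it rejoins the waiting queue. */
    void idle_preempt(vgpu_t *v, time_t last_gpu_call)
    {
        if (v->assigned &&
            difftime(time(NULL), last_gpu_call) > MAX_IDLE_SEC) {
            context_t *c = v->assigned;
            v->assigned  = NULL;
            c->next      = v->waiting;  /* simple requeue; a real policy  */
            v->waiting   = c;           /* may preserve arrival order     */
        }
    }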
Experiment: Node-level
• Intra-application imbalance
  – The batch scheduler fails to capture intra-application imbalance
  – N-way sharing hides CPU execution behind the GPU execution phases of co-located processes
  – Combining 2-way sharing and preemption further improves performance
[Chart: overall execution time (150-290 seconds) vs. percentage imbalance (10-50%) for batch scheduling, 2-way sharing, 4-way sharing, preemptive sharing, and preemptive 2-way sharing; annotations mark improvements of 38% and 8.8%.]
Experiment: Node-level
• Inter-application imbalance
  – Batch scheduling causes GPU underutilization, leading to performance loss
  – N-way sharing provides an improvement only if the imbalance is high
  – Preemptive sharing corrects the imbalance and leads to a performance improvement
[Chart: overall execution time (150-290 seconds) for workload compositions 3x[4] + 1x[3], 2x[4] + 2x[3], 1x[4] + 3x[3], and 4x[3] under batch scheduling, 2-way sharing, 4-way sharing, preemptive sharing, and preemptive 2-way sharing.]
Experiment: Cluster-level
• 2 nodes with 7 GPUs
• The batch scheduler is unable to schedule jobs with more processes than GPUs
• 4-way sharing and preemptive sharing lead to 25-30% and 40-45% performance improvement, respectively
[Chart: overall execution time (0-900 seconds) for batch scheduling, 4-way sharing, and preemptive 2-way sharing, with 4, 6, and 8 processes per application.]
Conclusion
• Node-level runtime providing
  – GPU virtualization (Manageability)
  – GPU sharing (Utilization, Latency Hiding)
  – Flexible scheduling (Configurability)
  – Dynamic binding & preemption (Utilization, Latency Hiding)
• What lies ahead…
  – Integration with cluster-level schedulers
  – Dynamic scheduling at the cluster level
  – Power-efficiency considerations
Thanks
• My coauthors: Michela @ MU, Xiang @ MU, Ian @ MU, Adam @ MU, Vignesh @ AMD, Chak @ NEC
• You all for the attention!
Understanding GPU resource utilization (cont’d)
[Decision diagram: if a kernel uses fewer thread-blocks than SMs, some SMs sit idle and co-scheduled blocks from another kernel can SPACE-SHARE the GPU (best case). If all SMs are busy, thread-blocks TIME-SHARE: when co-scheduled thread-blocks have conflicting register/shared-memory requirements their execution is serialized (worst case); otherwise it is interleaved, hiding latencies.]
Experiments: runtime overhead
• 1 Tesla C2050 GPU, short-running jobs
• Overhead < 10% in the worst case, amortized through GPU sharing
[Chart: total execution time (0-25 sec) for 1, 2, 4, and 8 jobs on the bare CUDA runtime and on our runtime with 1, 2, 4, and 8 vGPUs.]
HPDC’13
• From single-process, single-threaded applications to multi-process/multi-threaded applications
• Challenge: synchronizations (e.g. barrier synchronizations, communication primitives) can introduce GPU underutilization
• Solution: preemptive GPU sharing
Scenario 1: Intra-application Imbalance
[Timelines across GPU0-GPU3 for applications A, B, and C with synchronization points syncA1, syncA2, syncB1, syncC1: (a) batch scheduling leaves GPUs idle while A's processes wait at synchronization points; (b) controlled 2-way sharing lets B's and C's processes fill part of that idle time; (c) preemptive sharing evicts waiting processes so B and C progress further, shortening the overall schedule.]
Scenario 2: Inter-application Imbalance
[Timelines across vGPUs on GPU0-GPU3 for applications A, B, and C under (a) batch scheduling, (b) controlled 2-way sharing, (c) preemptive sharing, and (d) preemptive 2-way sharing; the legend marks each application's synchronization points and idle time. Sharing and preemption progressively reduce idle time.]
Types of swapping operations
• Inter-application swapping
  – Time-sharing of the GPU among applications with conflicting memory requirements
• Intra-application swapping
  – The memory footprint of one application is the memory footprint of its “largest” kernel

    malloc(&A_d, size);
    malloc(&B_d, size);
    malloc(&C_d, size);
    copyHD(A_d, A_h, size);
    matmul(A_d, A_d, B_d);   // B_d = A_d * A_d
    matmul(B_d, B_d, C_d);   // C_d = B_d * B_d
    copyDH(B_h, B_d, size);
    copyDH(C_h, C_d, size);

On the bare CUDA runtime, the third malloc exceeds the GPU's memory capacity → runtime error.
On our runtime, the first matmul triggers the first memory allocation & data transfer to the GPU (A_d & B_d); the second triggers SWAP(A_d) & the memory allocation of C_d.
Experiments: load balancing w/ dynamic binding
• Unbalanced system: 2 Tesla C2050 and 1 Quadro 2000 GPUs
• Especially on small batches of jobs, dynamic binding improves performance
[Chart: total execution time (0-1400 sec) for 12, 24, and 36 jobs, with CPU fraction 0 and 1, without load balancing and with load balancing through dynamic binding.]
Runtime configurations
• Initial memory transfer deferral only
  – Only memory transfers before the 1st kernel call are deferred
  – Pros: overlaps computation and communication
  – Cons: more swapping overhead
• Unconditional memory transfer deferral
  – All memory transfers are deferred
  – Pros: less swapping overhead
  – Cons: no computation/communication overlap
Application call | Actions performed by the runtime   | Errors returned by the runtime
-----------------+------------------------------------+-------------------------------------
Malloc           | Create PTE                         | A virtual address cannot be assigned
                 | Allocate swap                      | Swap memory cannot be allocated
CopyHD           | Check valid PTE                    | No valid PTE
                 | Move data to swap                  | Swap-data size mismatch
CopyDH           | Check valid PTE                    | No valid PTE
                 | If (PTE.toCopy2Swap) cudaMemcpyDH  | -
Free             | Check valid PTE                    | No valid PTE
                 | De-allocate swap                   | Cannot de-allocate swap
                 | If (PTE.isAllocated) cudaFree      | -
Launch           | Check valid PTE                    | No valid PTE
                 | If (^PTE.isAllocated) cudaMalloc   | -
                 | If (PTE.toCopy2Dev) cudaMemcpyHD   | -
                 | cudaLaunch                         | -
Swap             | Check valid PTE                    | No valid PTE
                 | If (PTE.toCopy2Swap) cudaMemcpyDH  | -
                 | If (PTE.isAllocated) cudaFree      | -
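The Launch row is the heart of delayed binding; a sketch of that path in terms of the pte_t type above (error handling and the actual kernel forwarding are elided and illustrative):

    /* Lazily bind one buffer to the device before forwarding a launch. */
    cudaError_t rt_launch_prepare(pte_t *p)
    {
        if (!p->isAllocated) {                 /* If (^PTE.isAllocated) */
            cudaError_t e = cudaMalloc(&p->dev, p->size);
            if (e != cudaSuccess) return e;    /* out of device memory: swap */
            p->isAllocated = 1;
        }
        if (p->toCopy2Dev) {                   /* If (PTE.toCopy2Dev) */
            cudaMemcpy(p->dev, p->swap, p->size, cudaMemcpyHostToDevice);
            p->toCopy2Dev  = 0;
            p->toCopy2Swap = 1;                /* kernel may update the device copy */
        }
        return cudaSuccess;                    /* ... then cudaLaunch the kernel */
    }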
Flags for Page Table Entries handling
[State diagram over the flag triple isAllocated/toCopy2Dev/toCopy2Swap: states F/F/F, F/T/F, T/F/F, T/T/F, and T/F/T, with transitions driven by copyHD, copyDH, launch, and swap calls (e.g. copyHD sets toCopy2Dev, launch allocates device memory and clears it, and swap writes back to the swap area and frees the device copy).]