Real-Time GPU Management
Heechul Yun
This Week
• Topic: General Purpose Graphic Processing Unit (GPGPU) management
• Today
– GPU architecture
– GPU programming model
– Challenges
– Real-Time GPU management
History
• GPU
– Graphics is embarrassingly parallel by nature
– GeForce 6800 (2003): 53 GFLOPS (MUL)
– Some PhD students tried to use GPUs for general-purpose computing, but they were difficult to program
• GPGPU
– Ian Buck (Stanford PhD, 2004) joined NVIDIA and created the CUDA language and runtime
– General purpose: (relatively) easy to program; many scientific applications
Discrete GPU
• Add-on PCIe cards in a PC
– GPU and CPU memories are separate
– GPU memory (GDDR) is much faster than CPU memory (DDR)
[Figure: discrete GPU setup: Nvidia Tesla K80 (4992 GPU cores, graphics DRAM) and Intel Core i7 (4 CPU cores, host DRAM) connected over PCIe 3.0]
Integrated CPU-GPU SoC
• Tighter integration of CPU and GPU
– Memory is shared by both CPU and GPU
– Good for embedded systems (e.g., smartphone)
[Figure: Nvidia Tegra K1 SoC: four CPU cores and the GPU cores behind a shared memory controller and shared DRAM]
NVIDIA Titan Xp
• 3840 CUDA cores, 12GB GDDR5X
• Peak performance: 12 TFLOPS
NVIDIA Jetson TX2
• 256 CUDA GPU cores + 4 CPU cores
Image credit: T. Amert et al., “GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed,” RTSS 2017
NVIDIA Jetson Platforms
CPU vs. GPGPU
• CPU
– Designed to run sequential programs fast
– High ILP: pipelining, superscalar, out-of-order execution, multi-level cache hierarchy
– Powerful, but complex and big
• GPGPU
– Designed to compute math faster on embarrassingly parallel data (e.g., pixels)
– No need for complex logic (no superscalar, out-of-order execution, or large caches)
– Simple and less powerful, but small, so many of them fit on one chip
Slides 10–15: “From Shader Code to a Teraflop: How GPU Shader Cores Work”, Kayvon Fatahalian, Stanford University
GPU Programming Model
• Host = CPU
• Device = GPU
• Kernel
– Function that executes on the device
– Multiple threads execute each kernel
Slides 17–20: Source: http://www.sdsc.edu/us/training/assets/docs/NVIDIA-02-BasicsOfCUDA.pdf
![Page 21: Real-Time GPU Managementheechul/courses/eecs753/S20/slides/W10...GPU with Preemption Support N. Capodieci, R. Cavicchioli, M. Bertogna, A. Paramakuru. RTSS, 2018 57 Pipelined Data-Parallel](https://reader031.fdocuments.in/reader031/viewer/2022012001/608469dc18e0bf43986c5987/html5/thumbnails/21.jpg)
Challenges for Discrete GPU
• Data movement problem
– Host memory <-> GPU memory
– Copy overhead can be high
• Scheduling problem
– Limited ability to prioritize important GPU kernels
– Most (older) GPUs don’t support preemption
– Newer GPUs support preemption within a process
[Figure: data path from a user buffer to a kernel buffer on the CPU side, then across PCIe 3.0 to graphics DRAM on the GPU side]
Data Movement Challenge
[Figure: Nvidia Tesla K80 (4992 GPU cores, 480 GB/s graphics DRAM) and Intel Core i7 (4 CPU cores, 25 GB/s host DRAM) linked by PCIe 3.0 at 16 GB/s]
Data transfer is the bottleneck
An Example
[Figure: an example dataflow alternating between CPU and GPU stages]
PTask: Operating System Abstractions To Manage GPUs as Compute Devices, SOSP'11
Inefficient Data Migration
[Figure: in the pipeline capture | xform | filter | detect, each stage moves data through read()/write() calls, user/kernel buffers, copyto/copyfrom GPU, and PCIe transfers, involving the camera driver, GPU driver, and HID driver]
#> capture | xform | filter | detect &
Acknowledgement: this slide is from the paper author’s slides
A lot of copies
Scheduling Challenge
CPU priorities do not apply to the GPU
A long-running GPU task (xform) is not preemptible, delaying a short GPU task (the mouse-pointer update)
PTask: Operating System Abstractions To Manage GPUs as Compute Devices, SOSP'11
Challenges for Integrated CPU-GPU
• Memory is shared by both CPU and GPU
• Data movement may be easier but…
[Figure: Nvidia Tegra X2 SoC: four CPU cores (each with a performance monitoring counter, PMC) and the GPU cores share a memory controller and DRAM at 16 GB/s]
Memory Bandwidth Contention
[Figure: GPU application performance drops as memory-intensive co-runners are added on the CPU cores]
Co-scheduling memory-intensive CPU tasks affects GPU performance on an integrated CPU-GPU SoC
Waqar Ali, Heechul Yun. Protecting Real-Time GPU Kernels on Integrated CPU-GPU SoC Platforms. Euromicro Conference on Real-Time Systems (ECRTS), 2018
Summary
• GPU Architecture
– Many simple in-order cores
• GPU Programming Model
– SIMD
• Challenges
– Data movement cost
– Scheduling
– Bandwidth bottleneck
– NOT time predictable!
Real-Time GPU Management
• Goal
– Time predictable and efficient GPU sharing in multi-tasking environment
• Challenges
– High data copy overhead
– Real-time scheduling support -- preemption
– Shared resource (bandwidth) contention
References
• TimeGraph: GPU scheduling for real-time multi-tasking environments. In USENIX ATC, 2011.
• Gdev: First-class GPU resource management in the operating system. In USENIX ATC, 2012.
• GPES: A preemptive execution system for GPGPU computing. In RTAS, 2015.
• GPUSync: A framework for real-time GPU management. In RTSS, 2013.
• A server based approach for predictable GPU access control. In RTCSA, 2017.
Real-Time GPU Scheduling
• Early real-time GPU schedulers
– TimeGraph
– Gdev
• GPU kernel slicing
– GPES
• Synchronization (lock) based approach
– GPUSync
• Server based approach
– GPU server
GPU Software Stack
Acknowledgement: this slide is from the paper author’s slides
Gdev: First-class GPU resource management in the operating system. In USENIX ATC, 2012.
TimeGraph
• First work to support “soft” real-time GPU scheduling.
• Implemented at the device driver level
S. Kato, K. Lakshmanan, R. R. Rajkumar, and Y. Ishikawa, “TimeGraph: GPU scheduling for real-time multi-tasking environments,” in USENIX ATC, 2011
TimeGraph Scheduling
• GPU commands are not immediately sent to the GPU if it is busy
• High-priority GPU commands are scheduled when the GPU becomes idle
Gdev
• Implemented at the kernel level on top of the stock GPU driver
Acknowledgement: this slide is from the paper author’s slides
Gdev: First-class GPU resource management in the operating system. In USENIX ATC, 2012.
TimeGraph Scheduling
• High-priority tasks can still suffer long delays
• Due to the lack of hardware preemption
Acknowledgement: this slide is from the paper author’s slides
Gdev: First-class GPU resource management in the operating system. In USENIX ATC, 2012.
Gdev’s BAND Scheduler
• Monitors consumed bandwidth and adds delay to wait for high-priority requests
• Non-work conserving; no real-time guarantee
Acknowledgement: this slide is from the paper author’s slides
Gdev: First-class GPU resource management in the operating system. In USENIX ATC, 2012.
GPES
• Based on Gdev
• Implements kernel slicing to reduce latency
(*) GPES: A Preemptive Execution System for GPGPU Computing, RTAS'15
GPUSync
Slides 39–43: http://www.ece.ucr.edu/~hyoseung/pdf/rtcsa17-gpu-server-slides.pdf
Hardware Preemption
• Recent GPUs (NVIDIA Pascal) support hardware preemption
– Problem solved?
• Issues
– Works only between GPU streams within a single address space (process)
– High context-switch overhead
• ~100 us per context switch (*)
(*) AnandTech, “Preemption Improved: Fine-Grained Preemption for Time-Critical Tasks”
Discussion
• Long running low priority GPU kernel?
• Memory interference from the CPU?
Challenges for Integrated CPU-GPU
• Memory is shared by both CPU and GPU
• Data movement may be easier but…
[Figure: Nvidia Tegra X2 SoC: four CPU cores (each with a PMC) and the GPU cores share a memory controller and DRAM at 16 GB/s]
References
• SiGAMMA: server based integrated GPU arbitration mechanism for memory accesses, RTNS, 2017
• GPUguard: towards supporting a predictable execution model for heterogeneous SoC, DATE, 2017
• Protecting Real-Time GPU Applications on Integrated CPU-GPU SoC Platforms, ECRTS, 2018
SiGAMMA
• Protects PREM-compliant real-time CPU tasks
• Throttles the GPU when the CPU is in a memory phase
SiGAMMA
• GPU is throttled by launching a high-priority spinning kernel
GPUGuard
• PREM-compliant GPU and CPU tasks
• CPU is throttled using MemGuard
• Needs GPU source code modification
BWLOCK++
• Protects real-time GPU tasks by throttling the CPU
• Runtime binary instrumentation overriding the CUDA API
– No source code modification is needed
Waqar Ali, Heechul Yun. Protecting Real-Time GPU Kernels on Integrated CPU-GPU SoC Platforms. Euromicro Conference on Real-Time Systems (ECRTS), 2018
Dynamic Instrumentation
• Begins and stops throttling by instrumenting CUDA API calls
[Figure: the CUDA calls of an unmodified binary (cudaMalloc, cudaMemcpy, kernel<<<...>>>, cudaFree) are intercepted via LD_PRELOAD; the memory bandwidth lock is acquired at cudaLaunch() and released at cudaSynchronize()]
No source code modification is needed
BWLOCK++
• Real-time GPU kernels are protected.
Summary
• Integrated CPU-GPU SoC
– Shared main memory is a source of interference
• SiGAMMA
– PREM-compliant CPU tasks
– Throttles the GPU to protect real-time CPU tasks
• GPUGuard
– PREM-compliant GPU (and CPU) tasks
– Needs GPU source code modification
• BWLOCK++
– Throttles the CPU (MemGuard) to protect real-time GPU tasks
– Runtime instrumentation (no code modification)
Discussion Papers
• Deadline-based Scheduling for GPU with Preemption Support, RTSS, 2018
• Pipelined Data-Parallel CPU/GPU Scheduling for Multi-DNN Real-Time Inference, RTSS, 2019
Deadline-based Scheduling for GPU with Preemption Support
N. Capodieci, R. Cavicchioli, M. Bertogna, A. Paramakuru.
RTSS, 2018
Pipelined Data-Parallel CPU/GPU Scheduling for Multi-DNN Real-Time Inference
Yecheng Xiang and Hyoseung Kim
RTSS’19
Cameras in Self-Driving Car
NVIDIA Jetson TX2
DNN Layer Processing Time
Problem
• Processing DNNs on a heterogeneous platform (e.g., TX2) is not efficient
– Using one type of resource (e.g., the GPU) may not always be the best
– The CPUs do little (if anything), and GPU utilization is low for many layers
– Also, for some layers the CPUs are faster than the GPU
• How to process multiple DNNs efficiently on a heterogeneous platform?
Key Ideas
• Group layers into multiple stages
• A stage can be processed on different computing nodes
• Dynamically match-make stages and nodes to maximize performance while respecting real-time requirements
Summary & Discussion
• Efficiently schedule multiple DNN models on a heterogeneous computing platform
• Exploit DNN layer-level differences on best computing resource configurations
• Improve throughput and support real-time scheduling
• Downsides?