Realtime Signal Processing on Nvidia TX2 using...
Transcript of Realtime Signal Processing on Nvidia TX2 using...
Zürcher Fachhochschule
Armin Weiss
Dr. Amin Mazloumian
Dr. Matthias Rosenthal
Institute of Embedded Systems
High Performance Multimedia Research Group
Zurich University of Applied Sciences
Realtime Signal Processing on Nvidia TX2
using CUDA
Zürcher Fachhochschule
System OverviewDigital Audio Mixing Console
2
Audio
I / O
Card
Control
Unit
Audio
Source
TX2Audio Processing
Audio
Sink
Zürcher Fachhochschule
Motivation
3
CPUGPU
• In Comparison to FPGA / DSP Solutions:
• Performance Gain: 100x (e.g. Analog Devices SHARC)
• Fast Development Time
• System on Single Chip
• Cost Effective
Nvidia TX Series
Zürcher Fachhochschule
Challenges
• Short and Deterministic Latency
• 32 Samples (0.33 ms @ 96 kHz)
• Video (60 Hz): 16.7 ms / Frame
• High Input and Output Data Rate
• 256 Channels ∗ 7 Inputs ∗32 BitInput
∗ 96 kHz = 5.5 Gb/s
• 1080p@60 (24-bit RGB): 3.0 Gb/s
4
Zürcher Fachhochschule
Short and Deterministic LatencyVariability in GPU Kernel Launch
5
< 0.1%
__global__void identity(float *input, float *output, int numElem) {
for (int index = 0; index < numElem; index++) {output[index] = input[index];
}}
numElem = 25
≈ 99.8% as expected
0.1 % - 0.2%
Outliers
Zürcher Fachhochschule
Short and Deterministic LatencyVariability in GPU Kernel Launch
6
Latency ~ Buffer Size
Outliers ≈ 50 ms
Zürcher Fachhochschule
Short and Deterministic LatencyProblems
• How to Improve Deterministic Behavior?
• Solution: Persistent CUDA Kernel
• Eliminate Launch Time
7
Zürcher Fachhochschule
Short and Deterministic LatencyPersistent Kernel
8
…
while (running) {
if (new_audio_samples() == true) {
send_sync_to_GPU();
wait_for_GPU_sync();
}
}
…
__global__ void audioKernel(…) {
…
// Infinite loop
while (true) {
wait_for_CPU_sync();
// Audio channel processing
…
wait_for_all_threads_to_finish();
send_sync_to_CPU();
}
}
CPUHost Application
GPUCUDA Kernel
Zürcher Fachhochschule
Short and Deterministic LatencyPersistent Kernel: Synchronization
9
__global__ void audioKernel(…, volatile int*
gpuToken) {
…
// Infinite loop
while (true) {
wait_for_CPU_sync();
// Audio channel processing
…
wait_for_all_threads_to_finish();
send_sync_to_CPU();
}
}
GPUCUDA Kernel
wait_for_CPU_sync() {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i == 0)
while (*gpuToken == NO_AUDIO);
synchronize_threads();
}
Memory Accessible
by CPU and GPU
Zürcher Fachhochschule
Short and Deterministic LatencyCUDA Memory Comparison Managed <-> Zero Copy
10
TX2
CPU GPU
DRAM 8GB
Memory ControllerIncl. SMMU
Cache Cache
GPU
Buffer
TX2
CPU GPU
DRAM 8GB
Memory ControllerIncl. SMMU
Cache Cache
GPU
Buffer
Zero Copy Managed Memory
Zürcher Fachhochschule
Short and Deterministic LatencyProblems
• Memory Accessible by CPU and GPU
• Use Zero Copy Memory
• What about Parameters?
11
Zürcher Fachhochschule
GPU
Short and Deterministic LatencyInfinite Loop: Parameter Communication
12
System Memory
CPU
Application
Writes at
Arbitrary
Time
Audio Processing Thread
Local Parameter
Copy
Parameter
Memory
(Zero Copy)
Slow
Access!
Periodic Update
Zürcher Fachhochschule
Short and Deterministic LatencyConclusion
• Use Memory Accessible by CPU and GPU
• Use Zero Copy Memory
• What about Parameters?
• Speed-up due to Local Copy
• How to Make CPUs Deterministic?
• CPU Core Isolation
• Initramfs built with Yocto
• No Flash Access Anymore During Runtime
• Minimize Influence from other Applications
13
Zürcher Fachhochschule
Challenges
• Short and Deterministic Latency
• High Input and Output Data Rate
14
Zürcher Fachhochschule
High Input / Output Data RateSeparate Buffers
15
TX2
CPU GPU
DRAM 8GB
Memory ControllerIncl. SMMU
Cache Cache
I/O
Buffer
GPU
Buffer
PCIe
Audio
I / O
Card
memcpy
Zürcher Fachhochschule
High Input / Output Data Ratememcpy() Measurements
16
1.E+00
1.E+03
1.E+06
1.E+09
memcpy() Time4096 bytes @ 48 kHz for 12h on 3 CPUs (A57)
TX1 ( Kernel v3.10.96) TX2 (Kernel v4.4.15)
75 % CPU Usage!
# O
ccurr
ences
Time (µs)
Zürcher Fachhochschule
High Input / Output Data RateShared Buffer
17
TX2
CPU GPU
DRAM 8GB
Memory ControllerIncl. SMMU
Cache Cache
I/O
Buffer
GPU
Buffer
PCIe
Audio
I / O
Card
memcpy
Zürcher Fachhochschule
High Input / Output Data RateShared Buffer
18
TX2
CPU GPU
DRAM 8GB
Memory ControllerIncl. SMMU
Cache Cache
Shared
Buffer
PCIe
Audio
I / O
Card
Zürcher Fachhochschule
High Input / Output Data RateShared Buffer
19
• Existing Solutions for Buffer Sharing
• GPUDirect RDMA
Zürcher Fachhochschule
DesktopDiscrete GPU
TX2Integrated GPU
High Input / Output Data RateGPUDirect RDMA
20
CPUs
System
Memory
3rd
Party
Device
GPU
Bridge
PCIe
CPUs
System
Memory
3rd
Party
Device
PCIe
Controller
PCIe
GPU
Memory Interconnect
Memory
GPUDirect
RDMA
Zürcher Fachhochschule
High Input / Output Data RateShared Buffer
21
• Existing Solutions for Buffer Sharing
• GPUDirect RDMA Not Available
• CudaHostRegister()
Zürcher Fachhochschule
High Input / Output Data RateCudaHostRegister()
22
TX2
CPU GPU
DRAM 8GB
Memory ControllerIncl. SMMU
Cache Cache
I/O
Buffer
PCIe
Audio
I / O
Card
CudaHostRegister()
Zürcher Fachhochschule
High Input / Output Data RateShared Buffer
23
• Existing Solutions for Buffer Sharing
• GPUDirect RDMA Not Available
• CudaHostRegister() Not Implemented on TX2
• Video4Linux2 Userptr Mode
Zürcher Fachhochschule
High Input / Output Data RateVideo4Linux - Userptr
24
TX2
CPU GPU
DRAM 8GB
Memory ControllerIncl. SMMU
Cache Cache
GPU
Buffer
Video
Input
Embedded
Camera
Userptr Mode
Mapped Access
Zürcher Fachhochschule
High Input / Output Data RateVideo4Linux - Userptr
25
TX2
CPU GPU
DRAM 8GB
Memory ControllerIncl. SMMU
Cache Cache
GPU
Buffer
PCIe
Audio
I / O
Card
Userptr Mode
Mapped Access
Zürcher Fachhochschule
High Input / Output Data RateShared Buffer
26
• Existing Solutions for Buffer Sharing
• GPUDirect RDMA Not Available
• CudaHostRegister() Not Implemented on TX2
• Video4Linux2 Userptr Mode✓
Zürcher Fachhochschule
Conclusion
27
• Feasibility of Low-Latency Signal Processing on GPU
• Professional Audio Mixer with 200 Channels
• Short and Deterministic Latency
• Persistent CUDA Kernel
• High Input / Output Data Rate
• Shared Buffer I/O <-> GPU (Video4Linux)
Zürcher Fachhochschule
Get started with signal processing on GPU!
28
Website: http://www.zhaw.ch/ines/
Blog: https://blog.zhaw.ch/high-performance/
Github: https://github.com/ines-hpmm
E-Mail: [email protected]
Questions