Zürcher Fachhochschule

Armin Weiss

Dr. Amin Mazloumian

Dr. Matthias Rosenthal

Institute of Embedded Systems

High Performance Multimedia Research Group

Zurich University of Applied Sciences

Realtime Signal Processing on Nvidia TX2 using CUDA


System Overview
Digital Audio Mixing Console

[Block diagram with the components: Audio Source, Audio I/O Card, TX2 (Audio Processing), Control Unit, Audio Sink]


Motivation

• In Comparison to FPGA / DSP Solutions:
  • Performance Gain: 100x (e.g. compared to Analog Devices SHARC)
  • Fast Development Time
  • System on Single Chip
  • Cost Effective

[Figure: Nvidia TX Series module, labelled CPU and GPU]


Challenges

• Short and Deterministic Latency
  • 32 Samples (0.33 ms @ 96 kHz)
  • Video (60 Hz): 16.7 ms / Frame
• High Input and Output Data Rate
  • 256 Channels × 7 Inputs × 32 Bit / Input × 96 kHz = 5.5 Gb/s
  • 1080p@60 (24-bit RGB): 3.0 Gb/s
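
The figures in the list above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not code from the talk: the function names are hypothetical, and 1080p is assumed to mean 1920 × 1080 pixels.

```cpp
#include <cassert>
#include <cstdint>

// Duration of one processing block in milliseconds.
double block_latency_ms(int samples_per_block, double sample_rate_hz) {
    assert(sample_rate_hz > 0.0);
    return 1000.0 * samples_per_block / sample_rate_hz;
}

// Raw audio input rate in bits per second.
std::int64_t audio_input_rate_bps(int channels, int inputs_per_channel,
                                  int bits_per_input, int sample_rate_hz) {
    return static_cast<std::int64_t>(channels) * inputs_per_channel *
           bits_per_input * sample_rate_hz;
}

// Uncompressed video rate in bits per second.
std::int64_t video_rate_bps(int width, int height, int bits_per_pixel, int fps) {
    return static_cast<std::int64_t>(width) * height * bits_per_pixel * fps;
}
```

With the numbers from the slide, `block_latency_ms(32, 96000.0)` gives about 0.33 ms, `audio_input_rate_bps(256, 7, 32, 96000)` gives 5,505,024,000 bit/s (about 5.5 Gb/s), and `video_rate_bps(1920, 1080, 24, 60)` gives 2,985,984,000 bit/s (about 3.0 Gb/s).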


Short and Deterministic Latency
Variability in GPU Kernel Launch

Test kernel (numElem = 25):

    __global__ void identity(float *input, float *output, int numElem) {
        for (int index = 0; index < numElem; index++) {
            output[index] = input[index];
        }
    }

[Plot: distribution of kernel launch times. Annotations: "≈ 99.8% as expected", "0.1% - 0.2%", "< 0.1%", "Outliers"]


Short and Deterministic Latency
Variability in GPU Kernel Launch

Latency ~ Buffer Size

Outliers ≈ 50 ms


Short and Deterministic Latency
Problems

• How to Improve Deterministic Behavior?
  • Solution: Persistent CUDA Kernel
  • Eliminate Launch Time


Short and Deterministic Latency
Persistent Kernel

CPU: Host Application

    …
    while (running) {
        if (new_audio_samples() == true) {
            send_sync_to_GPU();
            wait_for_GPU_sync();
        }
    }
    …

GPU: CUDA Kernel

    __global__ void audioKernel(…) {
        …
        // Infinite loop
        while (true) {
            wait_for_CPU_sync();
            // Audio channel processing
            …
            wait_for_all_threads_to_finish();
            send_sync_to_CPU();
        }
    }


Short and Deterministic Latency
Persistent Kernel: Synchronization

GPU: CUDA Kernel

    __global__ void audioKernel(…, volatile int* gpuToken) {
        …
        // Infinite loop
        while (true) {
            wait_for_CPU_sync();
            // Audio channel processing
            …
            wait_for_all_threads_to_finish();
            send_sync_to_CPU();
        }
    }

    wait_for_CPU_sync() {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i == 0)
            while (*gpuToken == NO_AUDIO);
        synchronize_threads();
    }

gpuToken: Memory Accessible by CPU and GPU
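
The token handshake above can be tried out without a GPU. The following is a minimal CPU-only sketch, not the CUDA implementation from the talk: a `std::thread` stands in for the persistent kernel and a `std::atomic<int>` stands in for the shared `gpuToken`. The names `persistent_worker`, `run_blocks`, `AUDIO_READY` and `STOP`, as well as the halving of each sample, are assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

enum Token : int { NO_AUDIO = 0, AUDIO_READY = 1, STOP = 2 };

// Stand-in for the persistent kernel: started once, then loops until told to stop.
void persistent_worker(std::atomic<int>& token, std::vector<float>& block) {
    while (true) {
        int t;
        // wait_for_CPU_sync(): spin until the host publishes a new token value.
        while ((t = token.load(std::memory_order_acquire)) == NO_AUDIO) {
            std::this_thread::yield();  // a real persistent kernel would busy-wait
        }
        if (t == STOP) return;
        for (float& s : block) s *= 0.5f;  // placeholder for audio channel processing
        token.store(NO_AUDIO, std::memory_order_release);  // send_sync_to_CPU()
    }
}

// Host side: hand num_blocks blocks to the worker and wait for each one.
int run_blocks(int num_blocks, std::vector<float>& block) {
    assert(num_blocks >= 0);
    std::atomic<int> token{NO_AUDIO};
    std::thread worker(persistent_worker, std::ref(token), std::ref(block));
    int completed = 0;
    for (int i = 0; i < num_blocks; ++i) {
        token.store(AUDIO_READY, std::memory_order_release);  // send_sync_to_GPU()
        while (token.load(std::memory_order_acquire) != NO_AUDIO) {
            std::this_thread::yield();  // wait_for_GPU_sync()
        }
        ++completed;
    }
    token.store(STOP, std::memory_order_release);
    worker.join();
    return completed;
}
```

The worker is launched only once, so the per-block cost is the token round trip rather than a launch. On the GPU, only one thread polls the token and the remaining threads wait at a barrier, as in `wait_for_CPU_sync()` above.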


Short and Deterministic Latency
CUDA Memory Comparison: Managed <-> Zero Copy

[Diagram: two TX2 block diagrams, one labelled "Zero Copy" and one labelled "Managed Memory". Each shows CPU and GPU with their caches and the GPU buffer in the 8 GB DRAM; the memory controller (incl. SMMU) sits between the processors and DRAM.]


Short and Deterministic Latency
Problems

• Memory Accessible by CPU and GPU
  • Use Zero Copy Memory
• What about Parameters?


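Short and Deterministic Latency
Infinite Loop: Parameter Communication

[Diagram: The CPU application writes at arbitrary times into the parameter memory (zero copy) in system memory. Access to this memory from the GPU is marked "Slow Access!". The audio processing thread on the GPU works on a local parameter copy, which receives a periodic update from the parameter memory.]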
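
The local-copy scheme can be sketched on the CPU as follows. This is an illustration under assumptions, not the authors' code: `SharedParams` stands in for the zero-copy parameter memory, a single `gain` value stands in for the full parameter set, and the update period is assumed to be one audio block.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Stand-in for the parameter memory that the application writes at arbitrary times.
struct SharedParams {
    std::atomic<float> gain{1.0f};
};

// Stand-in for the audio processing thread with its local parameter copy.
struct AudioProcessor {
    float local_gain = 1.0f;  // fast local copy used in the inner loop
    int shared_reads = 0;     // counts accesses to the slow shared memory

    void process_block(const SharedParams& shared, float* samples, std::size_t n) {
        assert(samples != nullptr);
        // Periodic update: read the shared parameter once per block.
        local_gain = shared.gain.load(std::memory_order_relaxed);
        ++shared_reads;
        // All per-sample work touches only the local copy.
        for (std::size_t i = 0; i < n; ++i) {
            samples[i] *= local_gain;
        }
    }
};
```

For a block of 32 samples this replaces 32 reads of the slow shared memory with one, and a parameter written in the middle of a block takes effect at the next block boundary.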


Short and Deterministic Latency
Conclusion

• Use Memory Accessible by CPU and GPU
  • Use Zero Copy Memory
• What about Parameters?
  • Speed-up due to Local Copy
• How to Make CPUs Deterministic?
  • CPU Core Isolation
  • Initramfs built with Yocto
  • No Flash Access Anymore During Runtime
  • Minimize Influence from other Applications


Challenges


• High Input and Output Data Rate


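High Input / Output Data Rate
Separate Buffers

[Diagram: TX2 with CPU and GPU (each with cache) and 8 GB DRAM. The audio I/O card is attached via PCIe and uses an I/O buffer in DRAM; data is moved between the I/O buffer and a separate GPU buffer with memcpy.]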


High Input / Output Data Rate
memcpy() Measurements

[Histogram: memcpy() time for 4096 bytes @ 48 kHz, measured for 12 h on 3 CPUs (A57). Axes: Time (µs) and # Occurrences, with logarithmic ticks from 1.E+00 to 1.E+09. Series: TX1 (Kernel v3.10.96) and TX2 (Kernel v4.4.15).]

75 % CPU Usage!
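
A measurement of this kind can be approximated with a small timing loop. The sketch below is not the benchmark used for the slide: the decade bucket layout, the function names, and the use of `std::chrono::steady_clock` are assumptions, and it runs on a single thread for a fixed number of iterations rather than at 48 kHz for 12 h.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Decade buckets: [0] < 1 µs, [1] 1-10 µs, [2] 10-100 µs, [3] 100-1000 µs, [4] >= 1000 µs.
std::size_t bucket_for_us(double us) {
    std::size_t b = 0;
    for (double limit = 1.0; b < 4 && us >= limit; limit *= 10.0) ++b;
    return b;
}

// Time `iterations` copies of `bytes` bytes and count how many fall into each bucket.
std::array<long, 5> memcpy_histogram(std::size_t bytes, int iterations) {
    assert(bytes > 0 && iterations > 0);
    std::vector<unsigned char> src(bytes, 0xA5), dst(bytes, 0);
    std::array<long, 5> hist{};
    for (int i = 0; i < iterations; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        std::memcpy(dst.data(), src.data(), bytes);
        auto t1 = std::chrono::steady_clock::now();
        // Read the result so the copy is less likely to be optimized away.
        volatile unsigned char sink = dst[bytes - 1];
        (void)sink;
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        ++hist[bucket_for_us(us)];
    }
    return hist;
}
```

Calling `memcpy_histogram(4096, 1000)` returns five counters that sum to 1000. Entries in the upper buckets correspond to the kind of outliers that matter for a deterministic audio path.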


High Input / Output Data Rate
Shared Buffer

[Diagram, two steps: first the TX2 with separate I/O buffer and GPU buffer in DRAM connected by memcpy, with the audio I/O card attached via PCIe; then the same system with a single shared buffer in DRAM in place of the two buffers and the memcpy.]

• Existing Solutions for Buffer Sharing

• GPUDirect RDMA


High Input / Output Data Rate
GPUDirect RDMA

[Diagram: two architectures side by side. Desktop (Discrete GPU): CPUs, System Memory, Bridge, PCIe, 3rd Party Device, GPU. TX2 (Integrated GPU): CPUs, System Memory, Memory Interconnect, PCIe Controller, PCIe, 3rd Party Device, GPU. Further labels: "Memory", "GPUDirect RDMA".]



• GPUDirect RDMA Not Available
• cudaHostRegister()


High Input / Output Data Rate
cudaHostRegister()

[Diagram: TX2 with CPU and GPU (each with cache) and 8 GB DRAM. The audio I/O card is attached via PCIe and uses an I/O buffer in DRAM; the I/O buffer is labelled cudaHostRegister().]



• cudaHostRegister() Not Implemented on TX2
• Video4Linux2 Userptr Mode


High Input / Output Data Rate
Video4Linux - Userptr

[Diagram: TX2 with CPU and GPU (each with cache) and 8 GB DRAM holding the GPU buffer. Video input from an embedded camera reaches the GPU buffer. Labels: "Userptr Mode", "Mapped Access".]


High Input / Output Data Rate
Video4Linux - Userptr

[Diagram: the same setup with the audio I/O card attached via PCIe in place of the embedded camera. The card reaches the GPU buffer in DRAM. Labels: "Userptr Mode", "Mapped Access".]



• cudaHostRegister() Not Implemented on TX2
• Video4Linux2 Userptr Mode ✓


Conclusion

• Feasibility of Low-Latency Signal Processing on GPU

• Professional Audio Mixer with 200 Channels


• Persistent CUDA Kernel

• High Input / Output Data Rate

• Shared Buffer I/O <-> GPU (Video4Linux)


Get started with signal processing on GPU!

Website: http://www.zhaw.ch/ines/

Blog: https://blog.zhaw.ch/high-performance/

Github: https://github.com/ines-hpmm

E-Mail: [email protected]

[email protected]

[email protected]

Questions



