High Performance Video Pipelining: A Flexible Architecture for … · 2014-04-07

High Performance Video Pipelining: A Flexible Architecture for GPU Processing of Broadcast Video

Peter Walsh, Chief Emerging Technology Engineer

ESPN

Overview

• Real-time GPU processing of broadcast video

– Maximize GPU utilization

– Maintain flexibility

• High Performance Video Pipeline

– CPU and GPU buffers

– Data transfer

Monday Night Football production truck

NASCAR production truck

Studio (BCS championship “Film Room”)

GPU Processing

• Segmentation (generating chromakey)

• Inserting graphics (linear and chromakeying)

• Field (camera) tracking

• Object (player) tracking

[Diagram of the pipeline across CPU and GPU. Blocks: Input Video, Segmentation, Field Tracking, Object Tracking, Rendering, Interop, GFX insertion, Output Video]

Background

• “Best Practices in GPU-Based Video Processing,” Tom True, NVIDIA, GTC 2013

• “Topics in GPU-Based Video Processing,” Tom True, NVIDIA, GTC 2014

Naïve Sequential Implementation

• Acquire

• Upload

• Process

• Download

• Output

1 Frame Time

Simultaneous Operations

• Acquire

• Upload

• Process

• Download

• Output

1 Frame Time
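A rough way to see why overlapping matters: in the sequential layout the five stage times add up and the total has to fit in one frame time, while in the overlapped layout each stage works on a different frame, so only the slowest stage has to fit. The sketch below is a host-only C++ illustration. The function names are hypothetical and no timings from the talk are implied.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Sequential layout: all stages of one frame run back to back,
// so the work per frame is the sum of the stage times.
inline double sequentialTimeMs(const std::vector<double>& stageMs) {
    return std::accumulate(stageMs.begin(), stageMs.end(), 0.0);
}

// Overlapped layout: every stage runs at once on a different frame,
// so the frame period is bounded by the slowest single stage.
inline double pipelinedPeriodMs(const std::vector<double>& stageMs) {
    return stageMs.empty()
               ? 0.0
               : *std::max_element(stageMs.begin(), stageMs.end());
}

// True when the given amount of work fits in one frame time.
inline bool fitsInFrame(double workMs, double framesPerSecond) {
    return workMs <= 1000.0 / framesPerSecond;
}
```

With made-up stage times of 4, 3, 8, 3 and 4 ms at 60 frames per second, the sequential total of 22 ms misses the budget of about 16.7 ms, while the overlapped period of 8 ms fits. The cost is latency: a frame needs roughly one frame time per stage to travel from Acquire to Output.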

Techniques

• Avoid CPU memory copies

• Use pinned system memory

• DMA Video I/O using pinned memory

• DMA between CPU and GPU

• Asynchronous – using multiple CUDA streams

• Double buffers for simultaneous R/W

Frame Buffers

Pinned System

System

GPU

Buffer Allocation

• Device

• System

• Pinned System

• 1D

• 2D (pitch specified)

• 2D (pitch determined by CUDA allocation)

Pitch
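As a reminder of what pitch means for addressing: pitch is the number of bytes from the start of one row to the start of the next, padding included, so it can be larger than the visible row width. The helpers below are a minimal host-only sketch. Their names and the alignment value passed in are assumptions for illustration. When CUDA performs the allocation, cudaMallocPitch() returns the pitch it actually chose.

```cpp
#include <cstddef>
#include <cstdint>

// Round a row width in bytes up to the next multiple of `alignment`.
// The alignment is a caller-supplied assumption, not a CUDA constant.
inline std::size_t alignedPitch(std::size_t widthBytes, std::size_t alignment) {
    return (widthBytes + alignment - 1) / alignment * alignment;
}

// Address of pixel (x, y) in a pitched buffer: rows are `pitch` bytes
// apart, pixels within a row are `bytesPerPixel` bytes apart.
inline std::uint8_t* pixelAt(std::uint8_t* base, std::size_t pitch,
                             std::size_t x, std::size_t y,
                             std::size_t bytesPerPixel) {
    return base + y * pitch + x * bytesPerPixel;
}
```

For a 1920-pixel row of 3-byte pixels (5760 bytes) and an assumed 512-byte alignment, the padded pitch comes out at 6144 bytes.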

CUDA API

Allocation: cudaMalloc(), cudaHostAlloc(), cudaMallocPitch()

Memory Copies: cudaMemcpy(), cudaMemcpy2D(), cudaMemcpyAsync(), cudaMemcpy2DAsync()

Buffer Transfers

B.Copy(A, pStream)

• Source and destination buffers

– System, pinned system, device

– Different pitches

• Supports Synchronous/Asynchronous transfers
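The point about different pitches can be illustrated with a host-only model of such a copy. This is a sketch, not the HPVP implementation: the FrameBuffer type is hypothetical, the stream argument is omitted, and std::memcpy stands in for the CUDA call. A real buffer class would presumably forward to a pitched copy such as cudaMemcpy2DAsync(), which takes separate source and destination pitches.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical host-only frame buffer; `pitch` is bytes per row with padding.
struct FrameBuffer {
    std::size_t widthBytes;
    std::size_t height;
    std::size_t pitch;
    std::vector<std::uint8_t> data;

    FrameBuffer(std::size_t w, std::size_t h, std::size_t p)
        : widthBytes(w), height(h), pitch(p), data(p * h, 0) {
        if (p < w) throw std::invalid_argument("pitch smaller than row width");
    }

    // Copy the visible bytes of `src` row by row, so source and
    // destination may use different pitches. Padding is left untouched.
    void Copy(const FrameBuffer& src) {
        if (src.widthBytes != widthBytes || src.height != height)
            throw std::invalid_argument("frame dimensions differ");
        for (std::size_t y = 0; y < height; ++y) {
            std::memcpy(data.data() + y * pitch,
                        src.data.data() + y * src.pitch,
                        widthBytes);
        }
    }
};
```

Keeping the pitch inside the buffer object is what lets the call site stay as short as B.Copy(A, pStream).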

CUDA Kernels

LaunchKernel(A, B, pStream, …)

• Buffers A and B are in device memory

• Sync/Async behavior controlled by pStream

Processing

[Diagram: buffers A and D on the CPU side, buffers B and C on the GPU side]

Acquire(A)

B.Copy(A, pUploadStream)

Process(B, C, pProcessingStream, params)

D.Copy(C, pDownLoadStream)

Output(D)
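The data flow through the four buffers can be traced with a toy host-only model. Everything here is a stand-in: plain vectors replace the CPU and GPU buffers, assignments replace the uploads and downloads, and the "kernel" just inverts each sample. It shows only the order in which a frame moves A, B, C, D, not how the real pipeline schedules its streams.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Frame = std::vector<std::uint8_t>;

// Stand-in for a GPU kernel: invert every sample of `in` into `out`.
inline void Process(const Frame& in, Frame& out) {
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = static_cast<std::uint8_t>(255 - in[i]);
    }
}

// One pass through the four buffers, in the order of the calls above.
inline Frame runFrame(const Frame& captured) {
    Frame A = captured;  // Acquire(A): input video lands in host memory
    Frame B = A;         // B.Copy(A, pUploadStream): host to device
    Frame C;
    Process(B, C);       // Process(B, C, pProcessingStream, params)
    Frame D = C;         // D.Copy(C, pDownLoadStream): device to host
    return D;            // Output(D)
}
```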

Double Buffering

[Diagram: two buffers exchange the Src and Dst roles between frame “i” and frame “i + 1”]

Double Buffering

[Diagram: the Processing chain across CPU and GPU, with each of the four buffers shown as a Src/Dst pair]
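The role swap can be sketched as a small ping-pong container: one stage writes into Dst while the next stage reads the previous frame from Src, and the two roles are exchanged at each frame boundary. This is a host-only illustration with hypothetical names. It leaves out the synchronization (for example CUDA stream or event waits) that a real pipeline needs before swapping.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical ping-pong pair: two equally sized buffers with swapped roles.
template <typename T>
class DoubleBuffer {
public:
    explicit DoubleBuffer(std::size_t size)
        : buffers_{{std::vector<T>(size), std::vector<T>(size)}} {}

    // Buffer the downstream stage reads during the current frame.
    std::vector<T>& src() { return buffers_[readIndex_]; }

    // Buffer the upstream stage writes during the current frame.
    std::vector<T>& dst() { return buffers_[1 - readIndex_]; }

    // Exchange the roles; call once per frame, after both stages finish.
    void swap() { readIndex_ = 1 - readIndex_; }

private:
    std::array<std::vector<T>, 2> buffers_;
    std::size_t readIndex_ = 0;
};
```

Because reads and writes never touch the same buffer within a frame, a transfer into Dst can overlap with processing of Src.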

Intel IPP: ippiFilter_8u_C1R(pSrcImgOffset, srcPitch, pDstImgOffset, dstPitch, roi, filterKernel, kernelSize, anchor, divisor);

NVIDIA NPP: nppiFilter_8u_C1R(pSrcImgOffset, srcPitch, pDstImgOffset, dstPitch, roi, filterKernel, kernelSize, anchor, divisor);

HPVP: Filter_8u_C1R(pSrc, pDest, roi, pFilterKernel);
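One plausible reading of the shorter HPVP call is that the buffer objects already carry their base pointer and pitch, so the wrapper can derive the offset pointers and pitches that the longer calls take explicitly. The sketch below shows that derivation for a 1-channel 8-bit image. All types and names are hypothetical and are not the HPVP, IPP or NPP definitions. The kernel size, anchor and divisor would presumably travel inside the filter-kernel object in the same way.

```cpp
#include <cstdint>

// Hypothetical region of interest, in pixels.
struct Roi { int x; int y; int width; int height; };

// Hypothetical buffer handle that knows its own pitch.
struct Buffer8u {
    std::uint8_t* data;  // first byte of the frame
    int pitch;           // bytes per row, including padding
};

// The per-call arguments a buffer-aware wrapper can compute itself.
struct ExpandedArgs {
    const std::uint8_t* pSrcImgOffset;
    int srcPitch;
    std::uint8_t* pDstImgOffset;
    int dstPitch;
};

// Derive ROI-origin pointers and pitches from the buffers
// (1 byte per pixel, so the column offset is simply roi.x).
inline ExpandedArgs expandArgs(const Buffer8u& src, const Buffer8u& dst,
                               const Roi& roi) {
    return ExpandedArgs{
        src.data + roi.y * src.pitch + roi.x, src.pitch,
        dst.data + roi.y * dst.pitch + roi.x, dst.pitch};
}
```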

Live Filtering

• Acquire(A)

• B.Copy(A, pUploadStream)

• Filter_8u_C3R(B, C, roi, pFilterKernel) *

• D.Copy(C, pDownLoadStream)

• Output(D)

* CUDA stream for processing already defined

References/Links

“Best Practices in GPU-Based Video Processing,” Tom True, NVIDIA, GTC 2013

“Topics in GPU-Based Video Processing,” Tom True, NVIDIA, GTC 2014

http://www.youtube.com/watch?v=QpEV-XVIxNw

http://frontrow.espn.go.com/2014/01/espns-advanced-replay-tool-art-graphically-enhances-sports-telecasts/

Questions

Peter Walsh ESPN pete.m.walsh@espn.com (860) 766-2908