VIDEO TRANSCODING WITH AML ON HETEROGENEOUS COMPUTE
Mike Schmit, AMD
Sr. Manager, Software Engineering, Office of the CTO, Multimedia
3 Video Transcoding with AML on Heterogeneous Compute | June 2011
AGENDA
AML Overview
Building a Binary Library of OpenCL™ Kernels
Decoder and Encoder Pipelines
Partitioning Workloads in a Heterogeneous Environment
Performance Measurements
Challenges
Q & A
AML OVERVIEW
AML (AMD MEDIA LIBRARY) OVERVIEW
AML is a collection of OpenCL media components for GPU shader-assisted video compression, post-processing, and image-related processing
– OpenCL-based low-level building blocks
– Build your own standards-based codec or a proprietary codec
– Pick and choose from the dozens of “tools” within an encoder
– Full control over CPU-GPU pipelining
– Optimized OpenCL kernels (binaries) will be installed on end users' systems as part of the GPU drivers
– A thin SDK layer allows ISV applications to easily access the library of kernels (called a kernel database or KDB file)
– Current development
    MPEG-2 & H.264 encode and decode up to 1080p
    JPEG decode up to 64 MP
APPLICATIONS AND USE CASES FOR AML (A FEW EXAMPLES)
Ultra-fast decode (800+ fps on 1080p)
– Smart movie navigation
– Video tapestries and video narratives (UI with blends from many scenes)
– Index movie with object detection, recognition, search, etc.
– Face detection and recognition (need to leave lots of compute cycles after the decode)
– Video skimming (smart fast forward)
Video editing
– Many decoder tracks, with one encoder
– Fast/smooth scrubbing and full-quality preview of complex effects
– Clip cataloguing
Transcoding
Video conferencing with one to many participants
BUILDING A BINARY LIBRARY OF OPENCL KERNELS
Working with OpenCL kernels
OpenCL works with a JIT (just-in-time) model for compiling and execution
– The compiler goes through several stages from the OpenCL front end to the final shader compiler (SC) that creates the ISA (instruction set architecture) binary for the precise GPU hardware target
– The application passes a CL source file in a buffer to the run-time compiler to be executed on the GPU
– We present a methodology to store intermediate binaries (LLVM) in a KDB (kernel database) file at build time for mass distribution
– At install time these kernels are compiled for the specific GPU installed on the system
– At run time the KDB file is opened (via a thin SDK), which checks for proper installation and delivers each kernel binary for execution via the OpenCL run-time
– Upgrades or swapping of GPUs is fully supported via the app startup installation check, where a quick recompile may occur (this is a rare event and may take about a minute)
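The startup installation check described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the AML SDK: `KernelDatabase`, `get_program`, `fake_compile`, and the device strings are all hypothetical names, and the compile step is faked with string formatting.

```python
from dataclasses import dataclass, field


def fake_compile(intermediate: str, device: str) -> str:
    """Stand-in for the real install-time compile (intermediate binary -> ISA)."""
    return f"ISA[{device}]<{intermediate}>"


@dataclass
class KernelDatabase:
    """Hypothetical KDB: generic intermediates plus a device-specific cache."""
    generic: dict                      # kernel name -> intermediate binary
    installed_device: str = ""         # device the cache was compiled for
    compiled: dict = field(default_factory=dict)
    recompiles: int = 0

    def install(self, device: str) -> None:
        # Compile every generic kernel for the GPU found on this system.
        self.compiled = {name: fake_compile(ir, device)
                         for name, ir in self.generic.items()}
        self.installed_device = device
        self.recompiles += 1

    def get_program(self, name: str, device: str) -> str:
        # Startup check: if the GPU changed since install, recompile once.
        if device != self.installed_device:
            self.install(device)
        return self.compiled[name]


kdb = KernelDatabase(generic={"Mpeg2Recon": "llvm-ir-0", "Mpeg2Diff": "llvm-ir-1"})
kdb.install("gpu-A")                             # install time
first = kdb.get_program("Mpeg2Recon", "gpu-A")   # normal run: no recompile
second = kdb.get_program("Mpeg2Recon", "gpu-B")  # GPU swapped: recompile
```

In this sketch the generic intermediates are never discarded, which is what makes the recompile after a GPU swap possible without shipping kernel source.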
KDB INSTALLATION AND DESIGN OVERVIEW
[Diagram. Labels: ISV Application; KDBSDK.lib / KDBSDK.dll; GetProgram; Prog Obj; AML Kernel Database; Installed Image; OpenCL API; OpenCL Runtime; OpenCL Environment; AMD Graphics; Driver installation]
OPENCL COMPILER FLOW
[Diagram. GPU path: source.cl -> OpenCL front end -> LLVM (see llvm.org) -> OpenCL back end -> IL (GPU intermediate language) -> Shader Compiler (SC) -> ISA -> GPU H/W. CPU path: LLVM -> JIT -> x86 -> CPU. Region labels: OpenCL Implementation; GPU H/W specific]
KDB (KERNEL DATA BASE) CREATION PROCESS
[Diagram. Make/Build: multiple source.cl files -> LLVM / IL -> KDB (generic), shipped in the install download package. Install (OpenCL -> SC): IL -> ISA for the GPU H/W -> KDB (ISA specific), stored post install on the HDD and used at runtime]
DECODER AND ENCODER PIPELINE EXAMPLES
VIDEO DECODER PIPELINE | With MPEG-2 as a Simple Example
[Diagram. Stages: VLD; Motion Compensation (Mpeg2McY8x8P kernel, Mpeg2McC8x4P kernel); Mpeg2Recon kernel. Data: block coeff info; block coeff data; MPEG-2 MC macroblock info; MPEG-2 MC motion vector info; Ref Frame 0; Ref Frame 1; motion comp output; decoded frame. Legend: CPU or GPU]
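The last two decoder stages named above (motion compensation, then reconstruction) can be illustrated with a minimal sketch. It assumes integer-pel motion vectors and a single reference frame, and it ignores half-pel interpolation, field modes, and chroma; `motion_compensate` and `reconstruct` are hypothetical helper names, not the AML kernels.

```python
def motion_compensate(ref, x, y, mv, size):
    """Copy a size x size block from ref, displaced by integer motion vector mv."""
    dx, dy = mv
    return [[ref[y + dy + r][x + dx + c] for c in range(size)]
            for r in range(size)]


def reconstruct(pred, residual):
    """Add the decoded residual to the prediction and clip to 8-bit range."""
    return [[max(0, min(255, p + d)) for p, d in zip(prow, drow)]
            for prow, drow in zip(pred, residual)]


# 4x4 reference frame with a simple gradient, and a 2x2 block at (1, 1).
ref_frame = [[10 * r + c for c in range(4)] for r in range(4)]
pred = motion_compensate(ref_frame, x=1, y=1, mv=(1, 0), size=2)
residual = [[5, -3], [300, -300]]   # exaggerated values to show clipping
decoded = reconstruct(pred, residual)
```

Every output pixel depends only on the reference frame and its own residual, so each block can be processed independently, which is why these stages map well to the GPU.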
VIDEO ENCODER PIPELINE | With MPEG-2 as a Simple Example
[Diagram. Stages: Frame Read; Mpeg2MdCalcBasic kernel; Mpeg2MotionSearchFullNxM*; Mode Decision kernels (Mpeg2MDRCI, Mpeg2MDRCP, Mpeg2MDRCB); MotionComp (Mpeg2McY8x8P kernel, Mpeg2McC8x4P kernel); Mpeg2Diff kernel; Mpeg2Recon kernel; VLE / entropy encode. Data: frame buffer; motion search MV info; mode decision MB info; MPEG-2 MC macroblock info; MPEG-2 MC motion vector info; motion comp output; Ref Frame 0; Ref Frame 1; block coeff info; block coeff data. Legend: CPU or GPU]
MOTION ESTIMATION DETAILS
A motion search implementation might use the following 8 of 50+ motion search kernels
– MotionMvScale
– Mpeg2MotionSearchFull16x16
– Mpeg2MotionSearchFull5x3Rect16x16
– Mpeg2MotionSearchFull5x3Rect16x16Avg
– Mpeg2MotionSearchFullNxMRect8x8
– Mpeg2MotionSearchHPel3x3Rect16x16
– Mpeg2MotionSearchHPel3x3Rect16x16Avg
– MotionMvSelect
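For readers unfamiliar with full-search motion estimation, the following is a minimal CPU sketch of the general technique using SAD (sum of absolute differences) as the cost. It is not the implementation of any kernel listed above; `sad`, `full_search`, the block size, and the search range are assumptions chosen for illustration. The motion vector is the displacement from the current block to its best match in the reference frame.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(abs(cur[cy + r][cx + c] - ref[ry + r][rx + c])
               for r in range(size) for c in range(size))


def full_search(cur, ref, cx, cy, size, search_range):
    """Test every integer displacement in the window; return (best_mv, best_sad)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


# Reference frame with a distinctive pattern; the current frame is the same
# content shifted right by 2 and down by 1 pixel.
W, H = 16, 16
ref = [[(7 * x + 13 * y + x * y) % 256 for x in range(W)] for y in range(H)]
cur = [[ref[(y - 1) % H][(x - 2) % W] for x in range(W)] for y in range(H)]
mv, cost = full_search(cur, ref, cx=6, cy=6, size=4, search_range=3)
```

Each candidate displacement is evaluated independently of the others, so the two loops over `dx` and `dy` are the natural unit of data parallelism on a GPU.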
SPLITTING THE WORKLOAD IN A HETEROGENEOUS COMPUTE ENVIRONMENT
PARTITIONING GPU LOAD AND CPU LOAD
PARTITIONING GPU LOAD AND CPU LOAD (AS A PERCENTAGE OF ORIGINAL CPU LOAD)
AML PERFORMANCE
MPEG-2 1080p VIDEO DECODER PERFORMANCE
[Chart: FPS (axis 0 to 1200) versus CPU threads (1 to 4). Series: HD 6970, HD 5670, Llano (2.4 GHz), 2.8 GHz CPU]
H.264 1080p VIDEO ENCODER PERFORMANCE
[Chart: FPS (axis 0 to 210). CPU software only: Llano (2.4 GHz), Phenom II (2.8 GHz). CPU + GPU: Llano (+GPU), Phenom II + HD 5670, Phenom II + HD 5770, Phenom II + HD 6970]
CHALLENGES
Memory bandwidth
– Need lots of CPU and GPU compute time relative to CPU->GPU and GPU->CPU data transfers
– Future platform directions: FSA (Fusion System Architecture) will address this
– But still a good practice to use lots of compute time per data load, just like for a CPU cache
Pipeline adds latency
– High throughput may be the tradeoff for longer latency
Pipeline with rate control
– Getting feedback on the bit consumption along with high parallelism/performance can be tricky
Motion estimation: lots of choices
– The biggest consumption of cycles in most algorithms
– Low-end GPUs do less; high-end GPUs do more
H.264 SPECIFIC ISSUES
Intra-prediction mode and deblocking
– Both of these H.264 features have strong dependencies on the neighboring macroblocks that are above and to the left
– These dependencies limit the amount of parallelism obtainable
– Thus, the more powerful the GPU, the lower its shader utilization will be
Solutions
– Encoding with multiple slices somewhat mitigates this
– Can encode multiple frames at once within the stream (B frames)
– Can encode multiple streams at once
– Future GPUs address this with the ability to schedule more than a single kernel at once
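A rough way to quantify the parallelism limit is to count how many macroblocks can run at each step of a wavefront schedule. The sketch below assumes each macroblock waits only for its left and upper neighbors, as stated above; dependencies on additional neighbors would reduce the parallelism further. `wavefront_profile` is a hypothetical helper.

```python
from collections import Counter


def wavefront_profile(mb_cols, mb_rows):
    """Count macroblocks per wave when each one waits for its left and upper neighbors."""
    # Macroblock (x, y) can start once (x-1, y) and (x, y-1) are done,
    # so all macroblocks with the same x + y can run in parallel.
    waves = Counter(x + y for y in range(mb_rows) for x in range(mb_cols))
    return [waves[i] for i in range(mb_cols + mb_rows - 1)]


# 1080p: 1920 / 16 = 120 macroblock columns, 1088 / 16 = 68 rows.
profile = wavefront_profile(120, 68)
num_waves = len(profile)             # 187 sequential steps
peak = max(profile)                  # 68 macroblocks at most in parallel
average = sum(profile) / num_waves   # about 43.6
```

Under this assumption a 1080p frame never offers more than 68 independent macroblocks at a time, which is consistent with the point that larger GPUs see lower utilization unless slices, frames, or streams are processed concurrently.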
INTRA-PREDICTION MODE DETAILS (16 X 16)
16 x 16 luma modes (similar to 8 x 8 chroma modes)
– 0: vertical
– 1: horizontal
– 2: DC (mean of H + V)
– 3: plane
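Three of the four modes above are simple enough to sketch directly. The code below is an illustration, not AML code: it implements vertical, horizontal, and DC prediction from the row of pixels above (`top`) and the column to the left (`left`), omits plane mode, and assumes for demonstration that an encoder picks the mode with the lowest SAD. All function names are hypothetical.

```python
def predict_vertical(top, left):
    """Mode 0: every row repeats the row of pixels above the block."""
    return [list(top) for _ in left]


def predict_horizontal(top, left):
    """Mode 1: every column repeats the column of pixels left of the block."""
    return [[l] * len(top) for l in left]


def predict_dc(top, left):
    """Mode 2: fill the block with the rounded mean of the top and left pixels."""
    n = len(top) + len(left)
    dc = (sum(top) + sum(left) + n // 2) // n
    return [[dc] * len(top) for _ in left]


def sad(block, pred):
    """Sum of absolute differences between a block and its prediction."""
    return sum(abs(b - p) for brow, prow in zip(block, pred)
               for b, p in zip(brow, prow))


def best_mode(block, top, left):
    """Pick the mode with the lowest SAD (plane mode is omitted from this sketch)."""
    candidates = {0: predict_vertical, 1: predict_horizontal, 2: predict_dc}
    costs = {mode: sad(block, fn(top, left)) for mode, fn in candidates.items()}
    return min(costs, key=costs.get), costs


# 4x4 toy block with vertical stripes (a real luma block would be 16x16).
top = [10, 50, 90, 130]
left = [10, 10, 10, 10]
block = [[10, 50, 90, 130] for _ in range(4)]
mode, costs = best_mode(block, top, left)
```

Because `top` and `left` come from already reconstructed neighbors, the prediction for one block cannot start until those neighbors are finished, which is the dependency discussed on the previous slide.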
INTRA-PREDICTION MODE DETAILS (4 X 4)
– 0: vertical
– 1: horizontal
– 2: DC
– 3: diagonal down-left
– 4: diagonal down-right
– 5: vertical right
– 6: horizontal down
– 7: vertical left
– 8: horizontal up
SUMMARY: ADVANTAGES OF THE AML MODEL
Pre-written, optimized OpenCL kernels provided for each AMD GPU family, including APUs
Performance designed to scale with more powerful GPUs for data parallel functions
Performance designed to scale from generation to generation
Specific compilation for the targeted GPU means the code automatically takes advantage of
– AMD instruction set improvements such as SAD (sum of absolute differences) instructions
– OpenCL compiler improvements
– SIMD engine (CU) design advancements, such as LDS (local data store)
ISVs do not have to become experts in data parallel programming optimization
Can get a high-performance codec up and running quickly
ISVs can mix and match with their own custom kernels
ADDITIONAL SESSIONS
Other sessions of interest
– 1721: M-JPEG Decoding using OpenCL on Fusion
– 2112: A Methodology for Optimizing Data Transfer in OpenCL
– 2322: Fusion Enabled Video and Imaging Pipelines
– 1741: Optimizing Video Editing Software with OpenCL
– 2116: Video Post Processing
– 2904: 1) High Quality and Efficient Post Processing on GPU Compute
– 2904: 2) Real-time H.264 Video Enhancement Using AMD APP SDK
– 2904: 3) Using Fusion System Architecture for Broadcast Video
QUESTIONS
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.
OpenCL is a trademark of Apple Inc. used by permission by Khronos.
© 2011 Advanced Micro Devices, Inc.