GPU Compute accelerated HEVC decoder on ARM® Mali™-T600 GPUs

Ittiam Systems Introduction

World’s most preferred DSP IP supplier in the DSP Professionals Survey by Forward Concepts (2004, 2005, 2006)

DSP Systems IP Company
• Multimedia + Communication Systems
• Multimedia Components, Systems, Hardware
• Focus on Broadcast, Video Communication, Video Security, Mobile

IP Licensing Business Model
• Founded in 2001
• Venture funded
• Flexible mix of one-time fees and royalties for licensing

300+ licensees Worldwide
• Fortune 100 companies, Tier 1 OEMs
• Consistently rated as Most Preferred DSP IP Supplier

250 strong Engineering Team
• World Class Talent
• Deep Multimedia and end application Expertise
• 29 patents issued
• 30+ patents filed

Ittiam Multimedia Overview

Multimedia Components
• Audio Codecs
• Video Codecs / Image Codecs
• Algorithms for Audio Effects, Acoustics, Imaging
• ARM CPU, NEON Optimized
• DSP + HW Accelerators + GPU expertise and capabilities

Middleware + SDKs
• System components: Parsers, Creators, Stacks, Subtitles
• Multimedia Integration: Android, Other Frameworks
• Use Case validation
• Enhancements to existing Middleware
• Application Specific SDKs

OEM Applications
• Complete Multimedia Applications
• Covers major Multimedia Use Cases
• Camera, Gallery, Editor, Players, Video Editor
• Production tested
• Customizable to requirements

Ittiam Multimedia Solutions and ARM

Strategic Platform
• Focus on Mobile, Home, Portable segments
• ARM Connected Community Member
• Strong Portfolio of IP
• Expertise in ARM architecture and optimizations for ARM

Long Investment
• Many years of development on ARM Platforms
• Covering ARM9E, ARM11, Cortex®-A8, A9, A15, A5, A7 and NEON™
• In-house developed reference C models for all IP
• Efficient, targeted for ARM, validated across multiple generations

Partnership
• Joint Benchmarking of implementations
• Early Access to Mali/OpenCL information
• Early involvement on new platforms

Ittiam Media Processing Elements

Audio Codecs
• Stereo and Multichannel
• MP12, AAC-LC/HE v1 & v2, AC3, DD+
• High Quality Resampler
• Post Processing and Audio Effects
• Field Proven

Video Codecs
• MPEG2, MPEG-4, H.264, HEVC / H.265
• Scalable across Multiple ARM Cores
• Optimized for bandwidth and CPU + NEON
• Error Resilience for Streaming Use cases
• In Production

Acoustics
• Voice Quality Enhancements with Echo Cancellation (AEC), Noise Reduction (ANR)
• Equalizer for Microphone & Speaker
• AGC, AVC, Audio De-Reverb
• Mic Beam Forming

Image Processing
• De-noise, Face detection, Red-eye correction
• Panorama, HDR, Low Light, 3D
• B&W, Sepia, Cross Process
• Exposure, Colours, Geometric, Filters

HEVC Overview

HEVC / H.265 Standard

HEVC aka H.265 is a video compression standard, jointly developed by ISO/IEC MPEG and ITU-T VCEG

MPEG and VCEG have established a Joint Collaborative Team on Video Coding (JCT-VC) to develop the HEVC standard

HEVC is the successor to the H.264 standard

HEVC can support ultra-high resolutions up to 8192 x 4320 pixels

HEVC offers substantially higher video compression ratio compared to existing standards

H.265 vs H.264

Tool | H.264 | H.265
Coding unit | 16x16 macroblocks; block coding structure | Coding tree blocks (64x64); quadtree coding structure
Transforms | 4x4 and 8x8 | 4x4, 8x8, 16x16 and 32x32
Inter prediction | 4x4 to 16x16; symmetric partitions | 4x4 to 64x64; asymmetric partitions
Intra prediction | 9 modes | 35 modes
Motion prediction | Spatial median | Advanced Motion Vector Prediction (spatial + temporal)
Luma motion compensation | 6 taps for half-pel positions + bilinear filter for quarter-pel positions | 8 taps for half-pel positions + 7-tap filter for quarter-pel positions
Chroma motion compensation | 2 taps | 4 taps
Slices | Slices for parallel parsing | Wavefront parallel processing; tiles and slices for parallel parsing
In-loop filters | Deblocking | Deblocking and SAO

HEVC compression

[Figure: bitrate over time (1990, 2000, 2010) for MPEG-2, H264/AVC and H265/HEVC]

• 35% reduction in bitrate for the same PSNR output when compared to H.264
• Perceptual video quality is subjective and cannot be measured with PSNR values
• Subjective tests have shown around 50% reduction in bitrate for similar perceptual video quality when compared to H.264

About 50% compression over H264 for video resolutions of 1080p and above. 30-40% compression over H264 for lower resolutions
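As a rough worked example of what these percentages mean, the sketch below scales an H.264 bitrate by the quoted savings. The 8000 kbps starting point and the function name `estimated_hevc_bitrate` are hypothetical and chosen only for illustration; actual savings depend on content and encoder settings.

```python
def estimated_hevc_bitrate(h264_kbps: float, saving: float) -> float:
    """Scale an H.264 bitrate by a fractional saving (0.35 means 35% lower)."""
    if not 0.0 <= saving < 1.0:
        raise ValueError("saving must be in the range [0, 1)")
    return h264_kbps * (1.0 - saving)


# Hypothetical 1080p H.264 stream; the figure is an assumption, not a measurement.
h264_kbps = 8000.0

# Savings quoted on the slide: 35% at equal PSNR, about 50% at similar perceived quality.
at_equal_psnr = estimated_hevc_bitrate(h264_kbps, 0.35)
at_similar_quality = estimated_hevc_bitrate(h264_kbps, 0.50)

print(f"H.264: {h264_kbps:.0f} kbps")
print(f"HEVC at equal PSNR: about {at_equal_psnr:.0f} kbps")
print(f"HEVC at similar perceived quality: about {at_similar_quality:.0f} kbps")
```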

HEVC Applications – Near Term

• The over-the-top (OTT) video services market is growing at a rapid pace, thanks to Netflix, Hulu, YouTube, etc.
• Smarter phones and tablets contribute significantly to OTT growth, with consumers opting to view videos on the go
• OTT video services are also popular on TVs and set-top boxes

• Rapid growth in the OTT market chokes network bandwidth
• One in five consumers abandons viewing due to slow feeds or a poor-quality viewing experience
• HEVC will enable a superior viewing experience with OTT video services

HEVC Applications – Long Term

• Higher quality video in traditional terrestrial and satellite broadcasts
• Video recording in cameras and mobile phones, for saving storage space or higher quality

• Broadcasting 1080p video at 50 or 60 frames per second in the same bandwidth as 1080i (25 or 30 fps)
• 4K and 8K Ultra-HD broadcasts for theatre-like quality

Need for Software HEVC Decoder

HEVC is a newly ratified standard and there is no hardware support in the current generation of Processors (Embedded / Mobile / Applications SoCs)

Dedicated HW accelerators for HEVC increase the silicon area and hence the cost significantly

Lack of HEVC content makes early HW implementation risky

Software decoding is a simpler and economically viable option for HEVC deployment NOW

Handling the HEVC decoder complexity on a wide range of processors, with constraints on power consumption, is the key challenge for the software decoder

Why use GPUs for Video Processing?

Decoding of high resolution videos in software involves high computational complexity and will load the CPU enormously

GPUs are highly compute capable and power efficient devices

GPUs are generally idle during video playout

GPU acceleration will free up the CPU to perform other (system) tasks

[Figure: CPU core(s) (ARM Cortex with NEON) alongside a Mali T600 / OpenCL compliant GPU]

HEVC Decoding on Capable GPUs

GPUs are massively multithreaded devices capable of handling hundreds or thousands of threads in parallel at any given time

Only the highly data-parallel algorithms of a video codec can be efficiently offloaded to the GPU for processing

[Figure: decoder block diagram with the stages Parsing & Entropy Decode, Motion Compensation, Intra Prediction, Inverse Quant, Inverse Transform, Recon, and Deblocking & SAO. The legend marks each stage as either not suitable for GPU execution or data parallel and suitable for GPU execution.]

Motion Compensation

The pixels of the current picture/frame are predicted from the pixels of a reference frame

The reference picture can be from past or future

The prediction happens on a block-by-block basis

And there can be multiple reference frames for each block
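The block-by-block prediction described above can be sketched as follows. This is a minimal illustration with integer motion vectors only, assuming frames stored as lists of rows; the function names `predict_block` and `average_blocks` are hypothetical. A real HEVC decoder keeps higher-precision intermediates and supports weighted prediction, which this sketch omits.

```python
def predict_block(ref, x, y, mv_x, mv_y, size):
    """Copy a size x size block from `ref`, displaced by an integer motion vector.

    Coordinates that fall outside the frame are clamped to the nearest edge pixel.
    """
    height, width = len(ref), len(ref[0])
    block = []
    for row in range(size):
        src_y = min(max(y + mv_y + row, 0), height - 1)
        line = []
        for col in range(size):
            src_x = min(max(x + mv_x + col, 0), width - 1)
            line.append(ref[src_y][src_x])
        block.append(line)
    return block


def average_blocks(a, b):
    """Combine two predictions (for example past and future) by rounded averaging."""
    return [[(pa + pb + 1) >> 1 for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


# Two small hypothetical reference frames: a ramp (past) and a flat frame (future).
past = [[r * 8 + c for c in range(8)] for r in range(8)]
future = [[100] * 8 for _ in range(8)]

from_past = predict_block(past, x=2, y=2, mv_x=1, mv_y=-1, size=2)
from_future = predict_block(future, x=2, y=2, mv_x=0, mv_y=0, size=2)
print(average_blocks(from_past, from_future))
```

Each block only reads from the reference frames, so blocks can be predicted independently of one another.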

The most compute intensive part of Motion compensation is sub-pixel interpolation

• Luma – 8 or 7 tap filter

• Chroma – 4 tap filter

Sub pixel interpolation is data parallel, i.e., interpolation of each block within a frame can happen in parallel and hence suited for GPU computing
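A minimal sketch of luma sub-pixel interpolation along one row is given below. The coefficient table is what I understand the HEVC luma filters to be (8 taps at the half-pel position, 7 non-zero taps at the quarter-pel positions) and should be checked against the standard before use. The sketch is one-dimensional, clamps at the row edges, and rounds straight to the output bit depth, whereas a real decoder filters in two dimensions with higher-precision intermediates.

```python
# Assumed HEVC luma interpolation taps, indexed by fractional position in quarter pels.
LUMA_FILTERS = {
    1: (-1, 4, -10, 58, 17, -5, 1, 0),    # quarter-pel (7 non-zero taps)
    2: (-1, 4, -11, 40, 40, -11, 4, -1),  # half-pel (8 taps)
    3: (0, 1, -5, 17, 58, -10, 4, -1),    # three-quarter-pel (7 non-zero taps)
}


def interpolate_row(samples, frac, bit_depth=8):
    """Interpolate one row of samples at a horizontal offset of frac/4 pixels."""
    if frac == 0:
        return list(samples)
    taps = LUMA_FILTERS[frac]
    last = len(samples) - 1
    max_val = (1 << bit_depth) - 1
    out = []
    for x in range(len(samples)):
        acc = 0
        for k, tap in enumerate(taps):
            pos = min(max(x + k - 3, 0), last)  # taps cover x-3 .. x+4, clamped at edges
            acc += tap * samples[pos]
        # Each filter sums to 64, so normalise with rounding and clip to the valid range.
        out.append(min(max((acc + 32) >> 6, 0), max_val))
    return out


ramp = [0, 10, 20, 30, 40, 50, 60, 70]
print(interpolate_row(ramp, frac=2))
```

Every output sample depends only on input samples, never on other outputs, which is why this step maps well onto many GPU threads.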

Inverse Quantization and Transform

• The residue values need to be inverse quantized

• A 2-D inverse DCT should be performed on the inverse-quantized data
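The two steps can be sketched for a single 4x4 block as follows. The matrix `DCT4` and the stage shifts (7, then 20 minus the bit depth) are what I understand the HEVC 4x4 core transform to use and should be verified against the standard. The `dequantize` helper is a deliberately simplified scale-and-shift with hypothetical parameters, not the standard's full scaling process.

```python
# Assumed HEVC 4x4 core transform matrix (integer approximation of the DCT).
DCT4 = (
    (64, 64, 64, 64),
    (83, 36, -36, -83),
    (64, -64, -64, 64),
    (36, -83, 83, -36),
)


def dequantize(levels, scale, shift):
    """Simplified inverse quantization: multiply by a scale, then shift with rounding."""
    rnd = 1 << (shift - 1)
    return [[(lv * scale + rnd) >> shift for lv in row] for row in levels]


def inverse_transform_4x4(coeffs, bit_depth=8):
    """Separable 2-D inverse transform: columns first, then rows."""
    shift1, shift2 = 7, 20 - bit_depth
    # Stage 1: transform down each column (tmp = DCT4 transposed times coeffs).
    tmp = [[0] * 4 for _ in range(4)]
    for col in range(4):
        for i in range(4):
            acc = sum(DCT4[k][i] * coeffs[k][col] for k in range(4))
            tmp[i][col] = (acc + (1 << (shift1 - 1))) >> shift1
    # Stage 2: transform along each row (out = tmp times DCT4).
    out = [[0] * 4 for _ in range(4)]
    for row in range(4):
        for j in range(4):
            acc = sum(tmp[row][k] * DCT4[k][j] for k in range(4))
            out[row][j] = (acc + (1 << (shift2 - 1))) >> shift2
    return out


# A block with only a DC level: the residual should come out flat.
levels = [[10, 0, 0, 0]] + [[0, 0, 0, 0] for _ in range(3)]
coeffs = dequantize(levels, scale=128, shift=1)  # hypothetical scale and shift
print(inverse_transform_4x4(coeffs))
```

Each transform block is independent of the others, so blocks can be processed in parallel.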

Recon & In-Loop Filters

• Reconstruction: the output of motion compensation and intra prediction should be added to the output of the inverse transform

• In-loop filters such as deblocking and SAO are applied to the reconstructed samples
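A minimal sketch of these two steps is shown below: prediction plus residual with clipping to the sample range, followed by an SAO band offset. The band-offset logic (32 equal bands, offsets applied to four consecutive bands starting at a signalled position) reflects my understanding of HEVC and is simplified; edge-offset SAO and deblocking are not shown. The function names are hypothetical.

```python
def reconstruct(pred, resid, bit_depth=8):
    """Add the residual to the prediction and clip to the valid sample range."""
    max_val = (1 << bit_depth) - 1
    return [
        [min(max(p + r, 0), max_val) for p, r in zip(pred_row, resid_row)]
        for pred_row, resid_row in zip(pred, resid)
    ]


def sao_band_offset(samples, band_position, offsets, bit_depth=8):
    """Apply four offsets to the four consecutive bands starting at band_position."""
    shift = bit_depth - 5            # 32 bands across the sample range
    max_val = (1 << bit_depth) - 1
    out = []
    for row in samples:
        new_row = []
        for s in row:
            idx = ((s >> shift) - band_position) & 31  # band index relative to the start
            if idx < 4:
                s = min(max(s + offsets[idx], 0), max_val)
            new_row.append(s)
        out.append(new_row)
    return out


pred = [[250, 10], [128, 64]]
resid = [[10, -20], [3, -4]]
recon = reconstruct(pred, resid)
print(recon)
print(sao_band_offset(recon, band_position=16, offsets=(2, 0, 0, -1)))
```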

Challenges in CPU+GPU Implementation

Efficient partitioning of work between CPU and GPU

• The effective FPS of the decoder will be the minimum of the FPS achieved by the CPU and the GPU for their respective work

• So the partitioning needs to be efficient, so that both perform their respective work at almost the same speed (FPS)
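The partitioning trade-off can be illustrated with a small sketch. All per-frame timings below are hypothetical numbers chosen for illustration, not measurements, and the helper names are my own. The sketch assumes the CPU and GPU work on different frames concurrently, so throughput is set by the slower side.

```python
from itertools import combinations


def effective_fps(cpu_ms, gpu_ms):
    """Pipelined throughput is bounded by whichever side takes longer per frame."""
    return 1000.0 / max(cpu_ms, gpu_ms)


def best_partition(stages):
    """Try every assignment of the movable stages and keep the fastest one.

    stages maps a stage name to (cpu_ms, gpu_ms); gpu_ms is None if the stage
    must stay on the CPU.
    """
    movable = [name for name, (_, gpu) in stages.items() if gpu is not None]
    best = None
    for count in range(len(movable) + 1):
        for on_gpu in combinations(movable, count):
            cpu = sum(c for name, (c, _) in stages.items() if name not in on_gpu)
            gpu = sum(stages[name][1] for name in on_gpu)
            fps = effective_fps(cpu, gpu)
            if best is None or fps > best[0]:
                best = (fps, set(on_gpu))
    return best


# Hypothetical per-frame costs in milliseconds: (on CPU, on GPU).
STAGES = {
    "entropy_decode": (12.0, None),   # serial, stays on the CPU
    "motion_comp": (14.0, 9.0),
    "inverse_transform": (6.0, 5.0),
    "deblock_sao": (8.0, 7.0),
}

fps, on_gpu = best_partition(STAGES)
print(f"{fps:.1f} fps with {sorted(on_gpu)} on the GPU")
```

With these numbers, moving every movable stage to the GPU is not the best choice, because the GPU then becomes the slower side.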

Efficient pipelining of data between CPU and GPU

• The algorithms running on the CPU will depend on the output of algorithms running on the GPU, and/or vice versa

• A good design should make sure neither the CPU nor the GPU spends any time waiting for the output of the other
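One common way to keep both sides busy is a bounded hand-off queue, sketched below with two threads standing in for the CPU and GPU stages. This is an illustration of the general pattern under my own assumptions, not the decoder's actual design; the stage bodies are placeholders and `run_pipeline` is a hypothetical name.

```python
import queue
import threading


def run_pipeline(num_frames, depth=2):
    """Run a two-stage pipeline with at most `depth` frames in flight between stages."""
    handoff = queue.Queue(maxsize=depth)
    finished = []

    def cpu_stage():
        for frame in range(num_frames):
            # Placeholder for parsing and entropy decode of one frame.
            parsed = {"frame": frame, "coeffs": [frame] * 4}
            handoff.put(parsed)  # blocks only when all `depth` slots are in use
        handoff.put(None)        # sentinel: no more frames

    def gpu_stage():
        while True:
            item = handoff.get()  # blocks only when the CPU stage has fallen behind
            if item is None:
                break
            # Placeholder for the data-parallel work on one frame.
            finished.append((item["frame"], sum(item["coeffs"])))

    threads = [threading.Thread(target=cpu_stage), threading.Thread(target=gpu_stage)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return finished


print(run_pipeline(5))
```

The queue depth sets how far the producer may run ahead. With a depth of two, one buffer can be filled while the other is being consumed.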

Cache coherency

• Cache coherency between CPU and GPU data needs to be ensured

Benefits of Mali T600 GPU

128-bit vector processing
• Suits DSP algorithms such as video processing

GPU cache instead of local memory
• No requirement for explicit data transfers from/to global memory. The cache behaves much like a CPU cache.

Flexible OpenCL workgroup size
• Works optimally for a large range of OpenCL workgroup sizes. Multiple block sizes in a video frame can be handled efficiently.

No divergent threads
• As in CPU code, conditional code can be used in OpenCL kernels. Different filter types, filter lengths, etc. in video decode can be handled efficiently.

Unified memory
• CPU and GPU share the same memory. Video YUV buffers are large, and there is no need for costly memory transfers of those buffers.
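To give a sense of the buffer sizes involved, the sketch below computes the size of one decoded frame. It assumes 4:2:0 chroma subsampling and 8-bit samples, which the slides do not state, and the 30 fps figure is likewise an assumption for illustration.

```python
def yuv420_frame_bytes(width, height, bytes_per_sample=1):
    """Size of one YUV 4:2:0 frame: a full-size luma plane plus two quarter-size chroma planes."""
    luma = width * height
    chroma = 2 * ((width + 1) // 2) * ((height + 1) // 2)
    return (luma + chroma) * bytes_per_sample


frame_1080p = yuv420_frame_bytes(1920, 1080)
frame_2160p = yuv420_frame_bytes(3840, 2160)

# Bytes that would have to be copied per second if every decoded 1080p frame
# were transferred once between separate CPU and GPU memories at 30 fps.
copy_per_second = frame_1080p * 30

print(frame_1080p, frame_2160p, copy_per_second)
```

With unified memory, a copy of this kind is avoided because both processors address the same buffers.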

MALI GPUs are well suited for Video Acceleration with significant power/performance benefits

Thank You

For more information visit www.ittiam.com or contact us at [email protected]