Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform


Description

This presentation gives a framework for parallelizing the motion estimation (ME) of HEVC on a multicore CPU plus GPU platform.

Transcript of Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform

Page 1: Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform

Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform

Xiangwen Wang1, Li Song2, Yanan Zhao2, Min Chen1

1Shanghai University of Electric Power

2 Shanghai Jiao Tong University

Page 2: Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform

Outline

Introduction

The proposed parallel VBSME of HEVC for GPU+CPU

Preliminary results and future work

Page 3: Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform

Introduction

HEVC is the newest video coding standard, introduced by ITU-T VCEG and ISO/IEC MPEG.

Compared with H.264/AVC, HEVC reduces the bitrate by about 50% on average while maintaining the same visual quality.

Figure: BQMall_832x480. Left: HEVC at 1.5 Mbps; right: x264 at 3.0 Mbps.

Page 4: Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform

Introduction

However, HEVC encoding is several times more complex than H.264/AVC:

• RDO: iterates over all mode and partition combinations to decide the best coding information

• RDOQ: iterates over many quantization candidates for each block

• Intra: luma prediction modes increased to 35

• SAO: works pixel by pixel

• Quadtree structure: larger block sizes and numerous partition modes

• Some other computationally intensive modules ...

As a result, the traditional approach of performing the encoding sequentially can no longer meet real-time demands, especially for HD (1920x1080) and UHD (3840x2160) video. Parallelism in the encoding procedure must be exploited extensively.

Page 5: Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform

Overview of VBSME in HEVC

• Three independent block concepts

• CU - Coding Unit

• PU - Prediction Unit

• TU - Transform Unit

• The total number of allowed PU sizes is 12 (from 64x64 down to 4x8/8x4)

• Up to 425 ME runs for one 64x64 CTU (5 + 4x5 + 16x5 + 64x5 = 425)
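To make the arithmetic explicit, here is a minimal host-side sketch (plain C++, also valid under nvcc) that reproduces the 425 count; the constant of 5 ME runs per CU at every depth is read off the formula above.

#include <cstdio>

// Count motion-estimation runs for one 64x64 CTU, assuming 5 PU
// ME runs per CU at every depth (as in 5 + 4x5 + 16x5 + 64x5).
int main() {
    const int kMePerCu = 5;  // assumed PU ME runs per CU
    int total = 0;
    // Depths 0..3 correspond to 64x64, 32x32, 16x16, 8x8 CUs;
    // a CTU contains 4^depth CUs at each depth.
    for (int depth = 0, cus = 1; depth < 4; ++depth, cus *= 4) {
        total += cus * kMePerCu;
    }
    std::printf("ME runs per CTU: %d\n", total);  // prints 425
    return 0;
}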

Figure: CU size and PU partition structure. Depth 1: 64x64, Depth 2: 32x32, Depth 3: 16x16, Depth 4: 8x8, each with its corresponding PU partitions.

Page 6: Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform

Two stages:

• ME to select best MV for candidate PUs

• CU depth and PU partition mode decision

MV selection criterion for each PU:

J_pred,SAD = SA(T)D + λ_pred * R_pred

CU size and PU partition mode decision:

J_mode = SSD + λ_mode * R_mode

To calculate J_mode for each PU, reconstruction and entropy coding of all syntax elements are necessary; this complexity is beyond the computational capability of common computers for real applications.

Mode decision with VBSME in HM

Page 7: Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform

The proposed parallel encoding framework

Figure: The proposed parallel encoding framework (CPU/GPU data flow and control flow). GPU side: the encoding frame fEnc is copied to GPU memory, interpolation and border padding fill the half/quarter-pixel image buffers, and the ME kernel produces PU MVs for block sizes 64x64 down to 8x8. CPU side: read the image, launch interpolation and synchronize with it, then, inside the frame loop, iterate over LCU (CTU) lines: launch ME for a new LCU line, synchronize with the ME of the last LCU line, and perform mode decision, MC, encoding and reconstruction (fRec), and entropy coding before proceeding to the next frame.
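The CPU/GPU interplay sketched in the figure can be illustrated with a minimal host-side CUDA skeleton. The helper names (interpolate_pad, vbs_me_line, cpu_encode_ctu_line), the buffer sizing, and the single-stream/per-line-event scheme are illustrative assumptions, not the authors' exact implementation.

#include <cuda_runtime.h>
#include <vector>

// Illustrative kernels (names/signatures assumed): half/quarter-pel
// interpolation with border padding, and VBSME over one CTU (LCU) line.
__global__ void interpolate_pad(const unsigned char* ref, short* subpel,
                                int width, int height) {}
__global__ void vbs_me_line(const unsigned char* cur, const short* subpel,
                            int ctuLine, int width, int height, int* puMvs) {}

// Placeholder for the CPU work of one CTU line: mode decision, MC,
// reconstruction and entropy coding stay on the host in the framework.
void cpu_encode_ctu_line(int line, const int* puMvs) {}

void encode_frame(const unsigned char* hCur, const unsigned char* hRef,
                  int width, int height) {
    const int ctuLines = (height + 63) / 64;
    const int ctusPerLine = (width + 63) / 64;
    const size_t frameBytes = (size_t)width * height;
    // Assumed layout: room for the MVs (x, y) of all candidate PUs per CTU.
    const size_t mvBytes = (size_t)ctuLines * ctusPerLine * 425 * 2 * sizeof(int);

    unsigned char *dCur, *dRef; short* dSubpel; int* dMvs;
    cudaMalloc((void**)&dCur, frameBytes);
    cudaMalloc((void**)&dRef, frameBytes);
    cudaMalloc((void**)&dSubpel, frameBytes * 16 * sizeof(short)); // sub-pel planes
    cudaMalloc((void**)&dMvs, mvBytes);

    cudaStream_t meStream;
    cudaStreamCreate(&meStream);
    std::vector<cudaEvent_t> lineDone(ctuLines);
    for (auto& e : lineDone) cudaEventCreate(&e);

    // Copy the frames and prepare the sub-pel reference on the GPU.
    cudaMemcpyAsync(dCur, hCur, frameBytes, cudaMemcpyHostToDevice, meStream);
    cudaMemcpyAsync(dRef, hRef, frameBytes, cudaMemcpyHostToDevice, meStream);
    interpolate_pad<<<dim3((width + 15) / 16, (height + 15) / 16),
                      dim3(16, 16), 0, meStream>>>(dRef, dSubpel, width, height);

    // Queue ME for every CTU line, recording an event when each line is done.
    for (int line = 0; line < ctuLines; ++line) {
        vbs_me_line<<<ctusPerLine, 256, 0, meStream>>>(dCur, dSubpel, line,
                                                       width, height, dMvs);
        cudaEventRecord(lineDone[line], meStream);
    }

    // CPU loop: as soon as the GPU has finished the ME of a line, the CPU
    // encodes that line (mode decision, MC, reconstruction, entropy coding).
    std::vector<int> hMvs(mvBytes / sizeof(int));
    for (int line = 0; line < ctuLines; ++line) {
        cudaEventSynchronize(lineDone[line]);
        cudaMemcpy(hMvs.data(), dMvs, mvBytes, cudaMemcpyDeviceToHost);
        cpu_encode_ctu_line(line, hMvs.data());
    }

    for (auto& e : lineDone) cudaEventDestroy(e);
    cudaStreamDestroy(meStream);
    cudaFree(dCur); cudaFree(dRef); cudaFree(dSubpel); cudaFree(dMvs);
}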

Page 8: Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform

Fast PU partition mode decision scheme

Figure: Flowchart of the fast PU partition mode decision scheme. Main elements: SKIP check, CBF_fast check, PART_2Nx2N check, fast CU partition using the 64x64~8x8 PU MVs and the half/quarter-pixel image buffers, RD cost calculation, CU depth == 4 / CU_idx == 4 checks, CU partition or next CU, MC, and synchronization with the ME of the last LCU line.

The MV and residual information are employed for the PU partition decision.

Two edge feature parameters, computed from the quadrant residual sums S_00, S_01, S_10, S_11 and normalized by N_8 x QP_step:

V = |S_00 + S_01 - S_10 - S_11| / (N_8 x QP_step)

H = |S_00 + S_10 - S_01 - S_11| / (N_8 x QP_step)

If (H == V && H != 0): PART_2Nx2N
Else if (H == V && H == 0): PART_NxN
Else if (H > V): PART_Nx2N
Else: PART_2NxN
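A compact sketch of the decision rule above (decide_pu_partition is a hypothetical helper). It assumes that S_00..S_11 are the quadrant residual sums and that H and V are truncated to integers after normalization by N_8 x QP_step; the slide does not give these exact definitions.

#include <cstdlib>

enum PartMode { PART_2Nx2N, PART_2NxN, PART_Nx2N, PART_NxN };

// H and V follow the reconstructed formulas; the integer division by
// (n8 * qpStep) is an assumed normalization.
PartMode decide_pu_partition(int s00, int s01, int s10, int s11,
                             int n8, int qpStep) {
    int v = std::abs(s00 + s01 - s10 - s11) / (n8 * qpStep);
    int h = std::abs(s00 + s10 - s01 - s11) / (n8 * qpStep);
    if (h == v && h != 0) return PART_2Nx2N;
    if (h == v && h == 0) return PART_NxN;
    if (h > v)            return PART_Nx2N;
    return PART_2NxN;
}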

Page 9: Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform

Parallel realization of VBSME on CUDA

Figure: Parallel realization of variable block size ME on CUDA, operating on the four 16x16 block lines of a CTU line. Stages: 8x8 block SAD calculation; 16x16 block J_pred calculation; integer-pixel J_pred comparison for 16x16 blocks; fractional-pixel MV refinement; variable block size J_pred generation; variable block size J_pred calculation; integer-pixel J_pred comparison.

The MV selection criterion is as follows:

J_pred = SAD + λ_pred * D_MV = SAD + λ_pred * (MV_C - PMV)

where MV_C is the MV of the current search point and PMV is the predicted MV (see the next slide).
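As an illustration of the first stage, here is a hedged CUDA sketch of the 8x8 SAD computation for one candidate MV, together with the MV cost term of the criterion above. The kernel name (sad8x8_kernel), the one-thread-per-8x8-block mapping, the mv_cost helper, and the border-padded reference are assumptions, not the authors' exact kernel.

#include <cuda_runtime.h>

// MV cost term of J_pred = SAD + lambda_pred * (MV_C - PMV), taken here
// as lambda times the absolute MV difference (an approximation).
__device__ int mv_cost(int lambda, int mvx, int mvy, int pmvx, int pmvy) {
    return lambda * (abs(mvx - pmvx) + abs(mvy - pmvy));
}

// SAD of every 8x8 block of the frame for one candidate MV; one thread per
// 8x8 block. The reference frame is assumed to be border-padded so that the
// displaced block never falls outside the buffer.
__global__ void sad8x8_kernel(const unsigned char* cur, const unsigned char* ref,
                              int width, int height, int stride,
                              int mvx, int mvy, int* sad8x8) {
    int bx = blockIdx.x * blockDim.x + threadIdx.x;  // 8x8 block column
    int by = blockIdx.y * blockDim.y + threadIdx.y;  // 8x8 block row
    int blocksPerRow = width / 8;
    if (bx >= blocksPerRow || by >= height / 8) return;

    int cx = bx * 8, cy = by * 8;       // current block origin
    int rx = cx + mvx, ry = cy + mvy;   // displaced reference block origin
    int sad = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)
            sad += abs((int)cur[(cy + y) * stride + cx + x] -
                       (int)ref[(ry + y) * stride + rx + x]);
    sad8x8[by * blocksPerRow + bx] = sad;
}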

Page 10: Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform

PMV for MV cost calculation

Figure: The five neighboring MVs MV0, MV1, MV2, MV3, MV4 used for prediction.

PMV = median(MV0, MV1, MV2, MV3, MV4)

One CTU (64x64) line is divided into four 16x16 block lines;

the ME process of each 16x16 line is done by the GPU sequentially;

the MVs of the 16x16 block size are used as the MV predictions for all other block sizes.
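A small sketch of the five-way median used for the PMV (compute_pmv and median5 are hypothetical helpers). Taking the median per component (x and y separately) is an assumption, since the slide does not state how the vector median is formed.

#include <algorithm>

struct Mv { int x, y; };

// Median of five integers via a sorted local copy.
static int median5(int a, int b, int c, int d, int e) {
    int v[5] = {a, b, c, d, e};
    std::sort(v, v + 5);
    return v[2];
}

// PMV = median(MV0..MV4), taken separately for each component (assumption).
Mv compute_pmv(const Mv m[5]) {
    Mv pmv;
    pmv.x = median5(m[0].x, m[1].x, m[2].x, m[3].x, m[4].x);
    pmv.y = median5(m[0].y, m[1].y, m[2].y, m[3].y, m[4].y);
    return pmv;
}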

Page 11: Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform

Variable block size SAD Generation

Figure: Variable block size SAD generation on CUDA. A 64x64 CTU is covered by 64 8x8 blocks (indices 0..63); their SADs are summed to form the 16x16, 32x32 and 64x64 block SADs.
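A minimal sketch of this merging step (merge_ctu_sads is a hypothetical helper), assuming the 64 8x8 SADs of a CTU are stored in raster order as in the figure; summing 2x2 groups yields the 16x16, 32x32 and 64x64 SADs, and rectangular PU SADs can be formed from the same partial sums.

// Merge the 64 per-8x8 SADs of one CTU (raster order, indices 0..63 as in
// the figure) into 16x16, 32x32 and 64x64 block SADs by summing 2x2 groups.
void merge_ctu_sads(const int sad8[64], int sad16[16], int sad32[4], int* sad64) {
    // 8x8 -> 16x16: each 16x16 block covers a 2x2 group of 8x8 blocks.
    for (int by = 0; by < 4; ++by)
        for (int bx = 0; bx < 4; ++bx)
            sad16[by * 4 + bx] = sad8[(2 * by) * 8 + 2 * bx]
                               + sad8[(2 * by) * 8 + 2 * bx + 1]
                               + sad8[(2 * by + 1) * 8 + 2 * bx]
                               + sad8[(2 * by + 1) * 8 + 2 * bx + 1];
    // 16x16 -> 32x32
    for (int by = 0; by < 2; ++by)
        for (int bx = 0; bx < 2; ++bx)
            sad32[by * 2 + bx] = sad16[(2 * by) * 4 + 2 * bx]
                               + sad16[(2 * by) * 4 + 2 * bx + 1]
                               + sad16[(2 * by + 1) * 4 + 2 * bx]
                               + sad16[(2 * by + 1) * 4 + 2 * bx + 1];
    // 32x32 -> 64x64
    *sad64 = sad32[0] + sad32[1] + sad32[2] + sad32[3];
}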

Page 12: Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform

Experimental Results

Platform: Z620 with NVIDIA Tesla [email protected], Win7 OS.

The CUDA driver version is 5.0 and the CUDA compute capability of the GPU is 2.0.

The search range is 64x64, using a full search strategy for the integer MV (IMV) and 24 fractional-pixel positions around the IMV.

Sequence                  CPU (fps)  GPU (fps)  Speedup ratio
Traffic_2560x1600_crop    0.21       23.77      113.2
ParkScene_1920x1080_24    0.69       77.76      112.7

The speed-up ratio is about 113 times

Page 13: Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform

Experimental Results: RD comparison

Note 1: The proposed algorithm is implemented on the x265 encoder, one of the first open-source encoder implementations of HEVC (x265 project, http://code.google.com/p/x265/).

Note 2: Cactus_Proposed denotes the RD curve generated by the x265 encoder with the proposed algorithm.

Page 14: Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform

Conclusion

We present a parallel-friendly VBSME (variable block size motion estimation) scheme which makes full use of the available computational resources of both the CPU and the GPU.

Preliminary results show a speed-up ratio of over 100x compared with a single-threaded CPU-only solution.

We will continue to exploit parallelism, targeting a 4K@30fps real-time HEVC encoder on a multicore CPU plus GPGPU platform.

Page 15: Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform

Thanks!