Analysis, Fast Algorithm, and VLSI Architecture Design for H.264/AVC Intra Frame Coder

Yu-Wen Huang, Bing-Yu Hsieh, Tung-Chien Chen, and Liang-Gee Chen

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 2005

Outline

Introduction
H.264/AVC Intra Coding
Computation Reduction
Hardware Architecture

Introduction

[Figure: H.264/AVC encoder block diagram. The input video signal is split into macroblocks of 16x16 pixels. Blocks: coder control, transform/scaling/quantization, scaling and inverse transform, de-blocking filter, intra-frame prediction, motion compensation, motion estimation (multiple reference frames and variable block sizes), intra/inter selection, entropy coding, and the embedded decoder. Signals: control data, quantized transform coefficients, motion data, output video signal.]

Introduction

Source → Prediction → Transform → Quantization → Entropy Coding → Compressed Data

Prediction: 4×4 / 16×16 luma, 8×8 chroma
Transform: 4×4 DCT
Quantization: scalar, nonuniform Q
Entropy coding: CAVLC / CABAC

[Figure labels: lossy, lossless, (bits per pixel)]

Introduction

H.264/AVC I-Frame Coder (CAVLC) vs. JPEG 2000 (DWT 5/3)

Computational complexity

Block-based coding vs. frame-based coding

Hardware-friendly

Memory-wasting

Introduction

Comparison between different image coding standards

JPEG, JPEG 2000 (DWT 5/3), H.264 I-Frame (CAVLC)

0.225 bpp

Introduction

Two solutions for platform-based design of H.264/AVC intra frame coder

Fast algorithm for software implementation
  Reduce 45% complexity
  PSNR drop 0.3 dB

Hardware accelerator
  Max. clock rate 55 MHz
  31 fps for 4:2:0 SDTV (all intra frames)

H.264/AVC Intra Coding

Intra Prediction

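I4MB (4×4): nine prediction modes, 0 to 8

[Figure: prediction directions of modes 0, 1, 3, 4, 5, 6, 7, and 8 around the current block; mode 2 is DC]

I16MB (16×16): four prediction modes: 0 (vertical), 1 (horizontal), DC, and plane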
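The three simplest I4MB modes can be sketched in a few lines. This is a minimal illustration, assuming the neighbors are always available; the function name `predict_4x4` and the argument layout are hypothetical, and the six directional modes (3 to 8) are left out.

```python
def predict_4x4(mode, top, left):
    """Return a 4x4 predictor block (list of rows) for I4MB mode 0, 1, or 2.

    top  -- the 4 reconstructed pixels directly above the block
    left -- the 4 reconstructed pixels directly to the left of the block
    """
    if mode == 0:
        # Vertical: every row repeats the pixels above the block.
        return [list(top) for _ in range(4)]
    if mode == 1:
        # Horizontal: every row is filled with its left neighbor.
        return [[left[y]] * 4 for y in range(4)]
    if mode == 2:
        # DC: rounded mean of the 8 neighboring pixels.
        dc = (sum(top) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("only modes 0, 1, and 2 are sketched here")
```

Each mode produces a full 4×4 predictor from at most eight neighboring pixels, which is why a hardware PE can emit one predictor pixel per cycle with very little state.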

H.264/AVC Intra Coding

Mode Decision

Low complexity mode
  Distortion: SATD (original pels – predictors)
  Rate: bits of mode information

High complexity mode
  Distortion: MSE (original pels – reconstructed pels)
  Rate: mode information + residual

\[
J(MB_k, Mode_k \mid QP, \lambda_{MODE}) = \mathrm{Distortion}(MB_k, Mode_k \mid QP) + \lambda_{MODE} \cdot \mathrm{Rate}(MB_k, Mode_k \mid QP)
\]
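The low complexity cost can be sketched as follows. This is an illustration under stated assumptions: the SATD is taken as the sum of absolute values of the 4×4 Hadamard-transformed difference with no normalization, and the names `satd_4x4`, `low_complexity_cost`, `mode_bits`, and `lam` are hypothetical.

```python
# 4x4 Hadamard matrix (symmetric, so it equals its own transpose).
H4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]

def matmul4(a, b):
    """Multiply two 4x4 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def satd_4x4(orig, pred):
    """Sum of absolute Hadamard-transformed differences of two 4x4 blocks."""
    diff = [[orig[y][x] - pred[y][x] for x in range(4)] for y in range(4)]
    transformed = matmul4(matmul4(H4, diff), H4)
    return sum(abs(v) for row in transformed for v in row)

def low_complexity_cost(orig, pred, mode_bits, lam):
    """J = distortion + lambda * rate, with SATD as the distortion term."""
    return satd_4x4(orig, pred) + lam * mode_bits
```

The high complexity mode would replace the SATD term with the error against the reconstructed block and count residual bits as well, which requires a full transform, quantization, and entropy coding pass per candidate.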

H.264/AVC Intra Coding

Transform and Quantization

4×4 integer transform
  DCT-based integer transform
  Hadamard transform
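For reference, the DCT-based 4×4 integer transform can be written as Y = Cf · X · Cf^T with the well-known H.264 core matrix. The sketch below omits the scaling that H.264 folds into quantization; the function name `forward_transform_4x4` is hypothetical.

```python
# Core matrix of the H.264 4x4 forward integer transform.
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]

def forward_transform_4x4(x):
    """Return CF * x * CF^T for a 4x4 block x given as a list of rows."""
    cft = [list(col) for col in zip(*CF)]  # transpose of CF
    tmp = [[sum(CF[i][k] * x[k][j] for k in range(4)) for j in range(4)]
           for i in range(4)]
    return [[sum(tmp[i][k] * cft[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]
```

All entries of the matrix are ±1 or ±2, so the transform needs no multipliers.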

H.264/AVC Intra Coding

Entropy Coding
  Context-Based Adaptive Binary Arithmetic Coding (CABAC)
  Context-Based Adaptive Variable Length Coding (CAVLC)

H.264/AVC Intra Coding

Run-time percentage

720×480, 4:2:0, 30 fps: 10829 MIPS

Computation Reduction

Intra prediction: table look-up

Cost generation: sub-sampling

Computation Reduction

Fast Intra Prediction
  The smaller the mode number, the more likely the mode is to occur.
  Global statistics cannot reflect the correlation of local modes.
  Local statistics of neighboring blocks are applied instead.

Computation Reduction

Fast Intra Prediction
  Skip unlikely candidates
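One way to picture candidate skipping is a short list built from the modes of the left and upper neighbors. The selection rule below is a hypothetical stand-in and not the table used in the slides: it keeps the neighbors' modes and DC, then pads with the smallest mode numbers. The names `candidate_modes` and `max_candidates` are assumptions.

```python
DC_MODE = 2  # I4MB mode number of DC prediction

def candidate_modes(left_mode, up_mode, max_candidates=4):
    """Return a reduced list of I4MB modes to search for the current block.

    left_mode / up_mode -- best modes of the neighboring 4x4 blocks,
                           or None when a neighbor is unavailable.
    """
    candidates = []
    # Local statistics first: modes of the neighbors, then DC.
    for mode in (left_mode, up_mode, DC_MODE):
        if mode is not None and mode not in candidates:
            candidates.append(mode)
    # Pad with small mode numbers, which are more likely to occur.
    for mode in range(9):
        if len(candidates) >= max_candidates:
            break
        if mode not in candidates:
            candidates.append(mode)
    return candidates[:max_candidates]
```

With `max_candidates=4`, only four of the nine I4MB modes are evaluated per block, so cost generation drops by more than half for partial-search blocks.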

Computation Reduction

Rate-distortion under different numbers of local-searched I4MB modes without insertion of full-search blocks

[Figure: rate-distortion curves; legend entries include 1, 2, 4, 6, and "All DC modes"]

Computation Reduction

Fast Intra Prediction
  Prevention of error propagation

Periodic insertion of full-search 4×4 blocks

Adaptive threshold on the distortion for a MB
  If min SATD of P > THMinSATD, then search all modes.
  THMinSATD = (min SATD of F) × 2.0

F: full-search block, P: partial-search block

F P P P
F P P P
F P P P
F P P P
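The two safeguards above can be sketched as follows. This is an illustration only: it assumes the threshold is the minimum SATD of a full-search block scaled by 2.0, that blocks are addressed by their column index within the macroblock, and the function names are hypothetical.

```python
def block_search_type(bx, period=4):
    """Return "F" (full search) or "P" (partial search) for column bx of a MB.

    With period 4, the first 4x4 block of every row is fully searched,
    matching the F P P P pattern.
    """
    return "F" if bx % period == 0 else "P"

def threshold_from_full_search(min_satd_f, factor=2.0):
    """Adaptive threshold derived from a full-search block's best SATD."""
    return factor * min_satd_f

def needs_all_modes(min_satd_p, th_min_satd):
    """True when a partial-search result is poor enough to search all modes."""
    return min_satd_p > th_min_satd
```

A partial-search block whose best SATD stays below the threshold keeps its reduced result; otherwise it falls back to the full nine-mode search, which limits how far a bad local decision can propagate.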

Computation Reduction

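Subsampling Patterns

Full pattern, using all 16 residuals of a 4×4 block:

\[
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix}
\begin{bmatrix} r_{00} & r_{01} & r_{02} & r_{03} \\ r_{10} & r_{11} & r_{12} & r_{13} \\ r_{20} & r_{21} & r_{22} & r_{23} \\ r_{30} & r_{31} & r_{32} & r_{33} \end{bmatrix}
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix}
\]

Subsampled pattern, using 8 residuals, each repeated once:

\[
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix}
\begin{bmatrix} r_{00} & r_{00} & r_{02} & r_{02} \\ r_{11} & r_{11} & r_{13} & r_{13} \\ r_{20} & r_{20} & r_{22} & r_{22} \\ r_{31} & r_{31} & r_{33} & r_{33} \end{bmatrix}
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix}
\]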
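The subsampled pattern on this slide keeps r00 and r02 in row 0, r11 and r13 in row 1, r20 and r22 in row 2, and r31 and r33 in row 3, and repeats each kept residual into the neighboring column. A minimal sketch of that index mapping, with the hypothetical name `subsample_residual`:

```python
def subsample_residual(r):
    """Return the 4x4 block in which each kept residual fills a column pair.

    Even rows keep the left sample of each column pair (columns 0 and 2);
    odd rows keep the right sample (columns 1 and 3).
    """
    return [[r[y][(x // 2) * 2 + (y % 2)] for x in range(4)]
            for y in range(4)]
```

Only 8 of the 16 residuals have to be computed, so predictor generation and subtraction for cost estimation are halved.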

Computation Reduction

Saved Computation and PSNR Drop

Global: subsampling + partial search using global statistics
Local: subsampling + partial search
Proposed: subsampling + partial search + periodic insertion of full search + adaptive SATD threshold

PSNR drop < 0.3 dB

Hardware Architecture

Assumptions
  A RISC can execute one instruction per cycle, except multiplication, which requires two.
  A processing element (PE) can generate the predictor of one pixel per cycle.

Hardware Architecture

Solutions

[Table: requirements at 30 fps; columns: number of modes, average cycles per predictor; rows: luma, chroma]

Produce all modes per cycle
Produce one mode per cycle
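A rough cycle budget shows why the degree of parallelism matters. The sketch below combines figures stated elsewhere in the slides (720×480 frames, 55 MHz, 31 fps) with the assumption that every luma mode (9 I4MB + 4 I16MB) and every chroma mode (4) is generated for every pixel; it ignores transform, reconstruction, and entropy coding, and the function names are hypothetical.

```python
def cycles_per_macroblock(clock_hz=55_000_000, fps=31, width=720, height=480):
    """Cycles available per 16x16 macroblock at the given clock and frame rate."""
    macroblocks = (width // 16) * (height // 16)
    return clock_hz // (macroblocks * fps)

def predictor_pixels_per_macroblock():
    """Predictor pixels to generate per macroblock if every mode is evaluated."""
    luma = 256 * (9 + 4)     # 256 luma pixels, 9 I4MB + 4 I16MB modes
    chroma = 2 * 64 * 4      # two 8x8 chroma blocks, 4 modes each
    return luma + chroma
```

Under these assumptions a single PE producing one predictor pixel per cycle would exceed the per-macroblock budget, while four PEs in parallel would need 3840 / 4 = 960 cycles for prediction alone.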

Hardware Architecture

Comparisons in different degrees of parallelism

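Hardware Architecture

[Figure: neighboring pixels of a 4×4 block, labeled A to M (M: above-left corner; A to H: row above; I to L: column to the left), with Register and DRAM blocks]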

Hardware Architecture

Four-Parallel Reconfigurable Intra Prediction Generator

[Figure labels: 8-bit adder, 9-bit adder]

Hardware Architecture

Intra Prediction Generator

[Figure: intra prediction generator with neighboring pixels A to M as inputs]

Hardware Architecture

I16MB DC Prediction Mode (T: top neighbors, L: left neighbors)

         PE0             PE1             PE2              PE3
Cycle 1  T0+T4+T8+T12    T1+T5+T9+T13    T2+T6+T10+T14    T3+T7+T11+T15
Cycle 2  +L0+L4+L8       +L1+L5+L9       +L2+L6+L10       +L3+L7+L11
Cycle 3  +L12            +L13            +L14             +L15
Cycle 4  PE0 + PE1 + PE2 + PE3 (the four partial sums are added)
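The four-cycle schedule can be mimicked in software to check that it covers all 32 neighbors. This is a sketch, not the circuit: the final rounding `(sum + 16) >> 5` comes from the H.264 definition of the 16×16 DC mode rather than from the slide, and the function name is hypothetical.

```python
def dc_16x16_four_pe(top, left):
    """Compute the I16MB DC predictor with four accumulators, one per PE.

    top / left -- the 16 reconstructed pixels above and to the left of the MB.
    """
    acc = [0, 0, 0, 0]
    for pe in range(4):   # cycle 1: four top pixels per PE
        acc[pe] = top[pe] + top[pe + 4] + top[pe + 8] + top[pe + 12]
    for pe in range(4):   # cycle 2: three left pixels per PE
        acc[pe] += left[pe] + left[pe + 4] + left[pe + 8]
    for pe in range(4):   # cycle 3: the remaining left pixel per PE
        acc[pe] += left[pe + 12]
    total = sum(acc)      # cycle 4: add the four partial sums
    return (total + 16) >> 5
```

Each PE touches exactly eight neighbors, so the load is balanced and no PE idles before the final reduction.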

Hardware Architecture

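I16MB Plane Prediction Mode

\[
\begin{aligned}
\mathrm{Pred}[y, x] &= \mathrm{Clip1}\big((a + b\,(x - 7) + c\,(y - 7) + 16) \gg 5\big) \\
a &= 16\,(p[-1, 15] + p[15, -1]) \\
b &= (5H + 32) \gg 6 \\
c &= (5V + 32) \gg 6 \\
H &= \sum_{x'=0}^{7} (x' + 1)\,(p[-1, 8 + x'] - p[-1, 6 - x']) \\
V &= \sum_{y'=0}^{7} (y' + 1)\,(p[8 + y', -1] - p[6 - y', -1])
\end{aligned}
\]

[Figure labels: A0, A1, A2, A3; Pred[0,0], Pred[0,4], Pred[0,8], Pred[0,12]]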
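The plane mode can be checked with a direct software version of the formulas. The sketch assumes 8-bit samples, uses [y, x] indexing as on the slide, and includes the rounding offset of 16 from the H.264 specification; the names `plane_predict_16x16`, `top`, `left`, and `corner` are hypothetical.

```python
def clip1(value):
    """Clip to the 8-bit sample range."""
    return max(0, min(255, value))

def plane_predict_16x16(top, left, corner):
    """Return the 16x16 plane predictor as a list of rows.

    top    -- p[-1, 0..15], the row above the macroblock
    left   -- p[0..15, -1], the column to the left of the macroblock
    corner -- p[-1, -1], the above-left pixel
    """
    def t(j):  # top row with index -1 mapped to the corner pixel
        return corner if j < 0 else top[j]

    def l(i):  # left column with index -1 mapped to the corner pixel
        return corner if i < 0 else left[i]

    h = sum((k + 1) * (t(8 + k) - t(6 - k)) for k in range(8))
    v = sum((k + 1) * (l(8 + k) - l(6 - k)) for k in range(8))
    a = 16 * (top[15] + left[15])
    b = (5 * h + 32) >> 6
    c = (5 * v + 32) >> 6
    return [[clip1((a + b * (x - 7) + c * (y - 7) + 16) >> 5)
             for x in range(16)] for y in range(16)]
```

Because the predictor is linear in x and y, neighboring outputs differ by the constants b and c, so hardware can step from one pixel to the next with additions instead of evaluating the full expression each time.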

Hardware Architecture

[Figure labels: A0, A1, A2, A3]

Hardware Architecture

Hardware Architecture

Transform

DCT, iDCT, and Hadamard transforms (implemented by shifters and adders)
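A multiplier-free forward transform can be sketched with a common butterfly factorization of the 4-point H.264 core transform. This is an illustration of the shift-and-add idea, not the authors' circuit; the function names are hypothetical.

```python
def transform_1d(x0, x1, x2, x3):
    """4-point H.264 core transform using only additions and shifts."""
    s0 = x0 + x3          # stage 1: sums and differences
    s1 = x1 + x2
    d0 = x0 - x3
    d1 = x1 - x2
    return (s0 + s1,          # row [1,  1,  1,  1]
            (d0 << 1) + d1,   # row [2,  1, -1, -2]
            s0 - s1,          # row [1, -1, -1,  1]
            d0 - (d1 << 1))   # row [1, -2,  2, -1]

def forward_transform_4x4_shift_add(block):
    """Apply the 1-D transform to every row, then to every column."""
    rows = [transform_1d(*row) for row in block]
    cols = [transform_1d(*col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

Each 1-D pass costs eight additions and two shifts; the Hadamard transform follows the same structure with the shifts removed.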

Hardware Architecture