Evaluation of Data-Parallel Splitting Approaches for H.264 Decoding

Florian H. Seitner, Michael Bleyer, Ralf M. Schreier, Margrit Gelautz

International Conference on Advances in Mobile & Multimedia (MoMM 2008)
Outline

• Introduction
• Parallel H.264 Decoding
• Evaluated Methods
• Experimental Results
• Conclusions

Introduction

• The H.264 video standard is currently used in a wide range of video-related areas
– Video content distribution
– Television broadcasting

• High coding efficiency
– Quarter-pixel (Qpel) motion estimation
– Variable block size
– Multiple reference frames

• These coding tools significantly increase CPU and memory loads

Introduction

• Using multi-core systems to increase system performance
– How can the H.264 decoding algorithm be distributed among multiple processing units?
  • The decoding load should be distributed equally
  • Data dependency issues
  • Inter-communication
  • Synchronization

Introduction

• The aim of this work is to evaluate the behavior of different decoding approaches
– Run-time complexity
– Efficient core usage
– Data transfers

Parallel H.264 Decoding: Functional and Data-parallel Splitting

• Functionally partitioned decoding system
– Decoding tasks are assigned to individual processing cores
  • Each processing unit can be optimized for a certain task
  • Unequal workload distribution
  • High transfer rate for inter-communication

Parallel H.264 Decoding: Functional and Data-parallel Splitting

• Data-parallel decoding system
– Distributing MBs among multiple processing units
  • Data dependencies between different cores must be minimized
  • The MB distribution onto the processing cores must achieve an equal workload balance

Parallel H.264 Decoding: The H.264 Decoder

• The H.264 decoding process

[Figure: block diagram of the H.264 decoder. Input: encoded bitstream. Blocks: Stream Parsing, Entropy Decoder, Inverse Quantization, Inverse DCT, Spatial Prediction, Motion Compensation, Reference Frames, Deblocking. The diagram labels a Parser and a Reconstructor and marks the part handled by data-parallel processing.]

Parallel H.264 Decoding: Macroblock Dependencies

• Data-parallel splitting of the decoder’s reconstruction module is challenging due to spatial and temporal dependencies

[Figure: macroblock dependencies for intra prediction, deblocking, and inter prediction]

Evaluated Methods: Overview

• Comparing the performance of five different approaches for accomplishing data-parallel splitting of the decoder’s reconstructor module
– Single row approach
– Multi-column approach
– Blocking slice-parallel method
– Nonblocking slice-parallel method
– Diagonal approach

Evaluated Methods: Single Row Approach

• The assignment of MBs to processors
– N is the number of processors
– Processor i ( i = 0, 1, …, N - 1 ) is responsible for decoding the yth row of MBs if ( y mod N ) = i

[Figure: single row assignment for 2, 4, and 8 cores]
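The row rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function name `single_row_owner` and the frame height of 8 MB rows are assumptions made for the example.

```python
def single_row_owner(y: int, num_procs: int) -> int:
    """Return the processor index i that decodes MB row y, using (y mod N) = i."""
    return y % num_procs


frame_rows = 8  # hypothetical frame height in MB rows, chosen for illustration

for n in (2, 4, 8):
    # One entry per MB row: the processor that owns that row.
    print(n, [single_row_owner(y, n) for y in range(frame_rows)])
```

With this rule, vertically adjacent rows always belong to different processors whenever N > 1, which is why every row boundary is also a processor boundary.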

Evaluated Methods: Single Row Approach

• An example of the SR approach ( 2 cores )
– Processing a macroblock takes a constant 1 unit of time

[Figure: decoding progress at T = 2, T = 3, T = 8, T = 10, and T = 34]

Evaluated Methods: Single Row Approach

• Advantages
– Simplicity
– Only a small start delay

• Disadvantage
– Many dependencies across processor assignment borders

Evaluated Methods: Multi-column Approach

• The assignment of MBs to processors
– w is the width of a multi-column
– Processor i ( i = 0, 1, …, N - 1 ) is responsible for decoding a MB of the xth column if iw ≤ x < ( i + 1 )w

[Figure: multi-column assignment for 2, 4, and 8 cores]
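A small sketch of the column rule follows. It reads the condition as the half-open interval iw ≤ x < (i + 1)w so that every column, including x = iw, has exactly one owner; that reading, the function name `multi_column_owner`, and the frame width of 8 MB columns are assumptions for illustration.

```python
def multi_column_owner(x: int, col_width: int) -> int:
    """Return the processor i with i*w <= x < (i+1)*w for MB column x."""
    return x // col_width


frame_width = 8  # hypothetical frame width in MB columns

for n in (2, 4, 8):
    w = frame_width // n  # width of one multi-column (frame width assumed divisible by n)
    print(n, [multi_column_owner(x, w) for x in range(frame_width)])
```

Integer division gives the half-open interval directly: column x = w is the first column of processor 1, not a column without an owner.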

Evaluated Methods: Multi-column Approach

• An example of the MC approach ( 2 cores )

[Figure: decoding progress at T = 4, T = 5, T = 8, and T = 36]

• Advantage
– Fewer dependencies across processors
  • A processor has to wait for results only at the column boundaries

Evaluated Methods: Slice-parallel Approach

• The assignment of MBs to processors
– h is the height of a slice
– Processor i ( i = 0, 1, …, N - 1 ) is responsible for decoding a MB of the yth row if ih ≤ y < ( i + 1 )h

[Figure: slice-parallel assignment for 2, 4, and 8 cores]

Evaluated Methods: Slice-parallel Approach

• An example of the SP approach in the blocking version ( 2 cores )

[Figure: decoding progress at T = 26, T = 32, and T = 58]

• Disadvantages
– Long start delay
– Cores sit idle, which lowers core usage
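The final time stamps of the two-core examples (T = 34 for SR, T = 36 for MC, T = 58 for blocking SP, T = 32 for non-blocking SP) can be reproduced with a small scheduling sketch. Several things here are assumptions rather than statements from the slides: the frame size of 8 × 8 MBs (inferred from the time stamps), the dependency set (left, upper-left, upper, and upper-right neighbour), raster-order processing within each processor, and the function name `simulate`.

```python
from typing import Callable, Dict, List, Tuple

MB = Tuple[int, int]  # (x, y) macroblock coordinates


def simulate(width: int, height: int, owner: Callable[[int, int], int],
             num_procs: int, cross_deps: bool = True) -> int:
    """Return the time at which the last MB finishes (1 time unit per MB)."""
    # Assumption: each processor decodes its own MBs in raster order.
    queues: List[List[MB]] = [[] for _ in range(num_procs)]
    for y in range(height):
        for x in range(width):
            queues[owner(x, y)].append((x, y))

    done: Dict[MB, int] = {}    # MB -> finish time
    free_at = [0] * num_procs   # time at which each processor becomes idle
    next_idx = [0] * num_procs  # position of each processor in its queue
    remaining = width * height

    while remaining:
        progressed = False
        for p in range(num_procs):
            if next_idx[p] == len(queues[p]):
                continue  # this processor has finished all of its MBs
            x, y = queues[p][next_idx[p]]
            # Assumed neighbours: left, upper-left, upper, upper-right.
            deps = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
            deps = [d for d in deps if 0 <= d[0] < width and 0 <= d[1] < height]
            if not cross_deps:
                # Non-blocking variant: ignore MBs owned by other processors.
                deps = [d for d in deps if owner(*d) == p]
            if any(d not in done for d in deps):
                continue  # stalled until another processor catches up
            start = max([free_at[p]] + [done[d] for d in deps])
            done[(x, y)] = free_at[p] = start + 1
            next_idx[p] += 1
            remaining -= 1
            progressed = True
        if not progressed:
            raise RuntimeError("deadlock: circular dependency between processors")
    return max(free_at)


print(simulate(8, 8, lambda x, y: y % 2, 2))                     # SR: 34
print(simulate(8, 8, lambda x, y: x // 4, 2))                    # MC: 36
print(simulate(8, 8, lambda x, y: y // 4, 2))                    # blocking SP: 58
print(simulate(8, 8, lambda x, y: y // 4, 2, cross_deps=False))  # non-blocking SP: 32
```

In the blocking slice-parallel case the second core cannot start until the first core has nearly finished its last row, which accounts for the long delay and the idle core noted above.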

Evaluated Methods: Slice-parallel Approach

• An example of the SP approach in the non-blocking version ( 2 cores )
– No dependencies are considered across slice boundaries (the slices are completely independent)
– NBSP requires full control over the encoder

[Figure: decoding progress at T = 1 and T = 32]

Evaluated Methods: Diagonal Approach

• The assignment of MBs to processors
– The first line of MBs is divided into equally-sized columns
– The assignments for the subsequent lines are derived by left-shifting the MB assignment of the line above

[Figure: diagonal assignment for 2, 4, and 8 cores]
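The left-shift construction can be sketched as follows. Because row y is row 0 shifted left by y MBs, MB (x, y) takes the owner that row 0 gives to column x + y. The slides do not say what fills the positions that open up at the right frame edge; wrapping around to the first processor is an assumption made here, as are the function name `diagonal_owner` and the 8 × 4 example frame.

```python
def diagonal_owner(x: int, y: int, col_width: int, num_procs: int) -> int:
    """Return the processor that decodes MB (x, y) under the diagonal split."""
    # Row y is row 0 shifted left by y MBs, so (x, y) inherits the owner of (x + y, 0).
    # Assumption: the pattern wraps around at the right frame edge.
    return ((x + y) // col_width) % num_procs


width, height, n = 8, 4, 2  # hypothetical frame size in MBs and core count
w = width // n              # width of one column in the first MB line

for y in range(height):
    print("".join(str(diagonal_owner(x, y, w, n)) for x in range(width)))
```

Under these assumptions the four printed rows are 00001111, 00011110, 00111100, and 01111000, which shows the diagonal bands.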

Evaluated Methods: Diagonal Approach

• An example of the DG approach

[Figure: decoding progress at T = 4, 10, 12, 13, 16, 18, 20, 23, 24, and 43]

Evaluated Methods: Diagonal Approach

• Comparing the inter-processor dependencies introduced by the DG and MC approaches
– Diagonal approach: dependencies for CPU 2 originate solely from MBs assigned to CPU 1
– Multi-column approach: MBs assigned to CPU 2 are also dependent on CPU 3

[Figure: inter-processor dependencies for the diagonal approach and the multi-column approach]
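The difference between the two dependency patterns can be checked with a short sketch that lists, for each processor, which other processors own MBs it depends on. It only considers an assumed spatial neighbourhood (left, upper-left, upper, upper-right) and leaves out deblocking and inter prediction. Processors are numbered from 0 here, the diagonal assignment wraps at the frame edge, and the 8 × 8 frame with 4 cores is hypothetical.

```python
from typing import Callable, Dict, Set


def cross_processor_deps(width: int, height: int,
                         owner: Callable[[int, int], int]) -> Dict[int, Set[int]]:
    """Map each processor to the set of other processors it has to wait for."""
    result: Dict[int, Set[int]] = {}
    for y in range(height):
        for x in range(width):
            p = owner(x, y)
            # Assumed neighbours: left, upper-left, upper, upper-right.
            for dx, dy in ((-1, 0), (-1, -1), (0, -1), (1, -1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height:
                    q = owner(nx, ny)
                    if q != p:
                        result.setdefault(p, set()).add(q)
    return result


W, H, N, w = 8, 8, 4, 2  # hypothetical frame size, core count, column width

mc = cross_processor_deps(W, H, lambda x, y: x // w)
dg = cross_processor_deps(W, H, lambda x, y: ((x + y) // w) % N)

print("MC, processor 1 waits for:", sorted(mc[1]))  # [0, 2]
print("DG, processor 1 waits for:", sorted(dg[1]))  # [0]
```

In the multi-column split the upper-right neighbour of a boundary MB belongs to the next processor, so each inner processor waits on both neighbours. In the diagonal split the upper-right neighbour stays on the same processor, so the waiting is one-directional.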

Experimental Results: Overview

• Test sequences

• Parameters
– GOP size = 14
– Search range = +/- 16 pixels
– 5 reference frames

Experimental Results: Run-time Complexity

• Two major indicators for the efficiency of a multi-core decoding system
– Decoder’s run-time
  • A low run-time indicates high decoding performance
– Number of data-dependency stalls occurring during the decoding process
  • The number of stalls provides an estimate of how efficiently the system’s computational resources are used

Experimental Results: Run-time Complexity

• Speed-up in run-time
– The speed increase for each parallelization approach in multiples of the single-core performance
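Expressed as a formula, the speed-up is the single-core run-time divided by the multi-core run-time. The sketch below applies it to the idealized two-core single row example from the earlier slide (T = 34), assuming a frame of 64 MBs at 1 time unit each so that a single core needs 64 units. That frame size is inferred, not stated in the slides, and the result is not one of the measured values.

```python
def speed_up(single_core_time: float, multi_core_time: float) -> float:
    """Return the run-time speed-up in multiples of single-core performance."""
    if multi_core_time <= 0:
        raise ValueError("multi_core_time must be positive")
    return single_core_time / multi_core_time


# Hypothetical run-times in MB time units: 64 on one core, 34 on two cores.
print(round(speed_up(64, 34), 2))  # 1.88
```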

Experimental Results: Run-time Complexity

• Stall cycles caused by data dependencies between the cores

Experimental Results: Inter-communication

• Memory transfers to and from the external DRAM and between the cores’ local memories are expensive in terms of power consumption and transfer time
– Core inter-communication
– Loading reference data and deblocking pixels

Experimental Results: Inter-communication

• Data transfer volume for reference data and deblocking information

Conclusions

• In this study, we have evaluated five data-parallel approaches for the H.264 decoder

• The run-time of each parallelization approach is influenced by the sizes and shapes of the frame partitions

• Large, dependency-minimizing partitions cause less inter-communication between cores