Evaluation of Data-Parallel Splitting Approaches for H.264 Decoding
Florian H. Seitner, Michael Bleyer, Ralf M. Schreier, Margrit Gelautz
International Conference on Advances in Mobile & Multimedia (MoMM 2008)
Outline
• Introduction
• Parallel H.264 Decoding
• Evaluated Methods
• Experimental Results
• Conclusions
Introduction
• The H.264 video standard is currently used in a wide range of video-related areas
  – Video content distribution
  – Television broadcasting
• High coding efficiency
  – Quarter-pixel (Qpel) motion estimation
  – Variable block size
  – Multiple reference frames
• These features significantly increase CPU and memory loads
Introduction
• Using multi-core systems to increase system performance
  – How should the H.264 decoding algorithm be distributed among multiple processing units?
    • The decoding load should be distributed equally
    • Data dependency issues
    • Inter-communication
    • Synchronization
Introduction
• The aim of this work is to evaluate the behavior of different decoding approaches
  – Run-time complexity
  – Efficient core usage
  – Data transfers
Parallel H.264 Decoding: Functional and Data-parallel Splitting
• Functionally partitioned decoding system
  – Decoding tasks are assigned to individual processing cores
    • Each processing unit can be optimized for a certain task
    • Unequal workload distribution
    • High transfer rate for inter-communication
Parallel H.264 Decoding: Functional and Data-parallel Splitting
• Data-parallel decoding system
  – Distributing MBs among multiple processing units
    • Data dependencies between different cores must be minimized
    • The MB distribution onto the processing cores must achieve an equal workload balance
Parallel H.264 Decoding: The H.264 Decoder
• The H.264 decoding process
[Figure: block diagram of the H.264 decoder. Input: encoded bitstream. Blocks: Stream Parsing, Entropy Decoder, Inverse Quantization, Inverse DCT, Spatial Prediction, Motion Compensation, Reference Frames, Deblocking. Group labels: Parser, Reconstructor, Data-Parallel Processing.]
Parallel H.264 Decoding: Macroblock Dependencies
• Data-parallel splitting of the decoder’s reconstruction module is challenging due to spatial and temporal dependencies
[Figure: MB dependencies for intra prediction, deblocking, and inter prediction]
Evaluated Methods: Overview
• Comparing the performance of five different approaches for accomplishing data-parallel splitting of the decoder’s reconstructor module
  – Single row approach
  – Multi-column approach
  – Blocking slice-parallel method
  – Nonblocking slice-parallel method
  – Diagonal approach
Evaluated Methods: Single Row Approach
• The assignment of MBs to processors
[Figure: MB-to-processor assignment for 2, 4, and 8 cores]
N is the number of processors. Processor i (i = 0, 1, …, N - 1) is responsible for decoding the yth row of MBs if (y mod N) = i.
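The row-to-processor rule above can be sketched in a few lines. This is only an illustration of the (y mod N) = i rule; the function names `single_row_owner` and `single_row_map` are hypothetical and do not come from the slides.

```python
from typing import List


def single_row_owner(y: int, num_procs: int) -> int:
    """Return the processor index i that decodes MB row y, i.e. i = y mod N."""
    return y % num_procs


def single_row_map(mb_rows: int, num_procs: int) -> List[int]:
    """Owner of every MB row in a frame that is mb_rows macroblocks high."""
    return [single_row_owner(y, num_procs) for y in range(mb_rows)]


# Rows alternate between the two cores.
print(single_row_map(8, 2))  # [0, 1, 0, 1, 0, 1, 0, 1]
```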
Evaluated Methods: Single Row Approach
• An example of the SR approach (2 cores)
  – Processing a macroblock takes a constant 1 unit of time
[Figure: decoding progress at T = 2, T = 3, T = 8, T = 10, and T = 34]
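The unit-time example can be explored with a small simulation. The sketch below is not the authors' simulator; it rests on stated assumptions: an 8x8-MB frame (the frame size is not given in the transcript), each processor working through its MBs in raster order, and each MB waiting for its left, top-left, top, and top-right neighbours (the usual spatial neighbours for intra prediction). The name `simulate` is hypothetical.

```python
from typing import Callable, List, Set, Tuple

MB = Tuple[int, int]  # (x, y) macroblock coordinates


def simulate(width: int, height: int, num_procs: int,
             owner: Callable[[int, int], int]) -> int:
    """Return the time units needed to decode a width x height MB frame.

    Assumptions (not from the slides): 1 time unit per MB, raster order
    per processor, dependencies on left/top-left/top/top-right neighbours.
    """
    # Build one raster-ordered work queue per processor.
    queues: List[List[MB]] = [[] for _ in range(num_procs)]
    for y in range(height):
        for x in range(width):
            queues[owner(x, y)].append((x, y))

    done: Set[MB] = set()
    next_idx = [0] * num_procs
    t = 0
    while len(done) < width * height:
        t += 1
        finished: List[Tuple[int, MB]] = []
        for p in range(num_procs):
            if next_idx[p] == len(queues[p]):
                continue  # this processor has no work left
            x, y = queues[p][next_idx[p]]
            neighbours = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
            # Only neighbours inside the frame count; all must be finished
            # in an earlier time step, otherwise the processor stalls.
            if all(n in done for n in neighbours
                   if 0 <= n[0] < width and n[1] >= 0):
                finished.append((p, (x, y)))
        for p, mb in finished:
            done.add(mb)
            next_idx[p] += 1
    return t


single_core = simulate(8, 8, 1, lambda x, y: 0)
two_core_sr = simulate(8, 8, 2, lambda x, y: y % 2)
print(single_core, two_core_sr)  # 64 34
```

Under these assumptions the two-core single row assignment finishes after 34 time units, which is consistent with the last snapshot listed in the example; the second core can only start once the first two MBs of row 0 are decoded, which accounts for the small start delay.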
Evaluated Methods: Single Row Approach
• Advantages
  – Simplicity
  – Only a small start delay
• Disadvantage
  – Many dependencies across processor assignment borders
Evaluated Methods: Multi-column Approach
• The assignment of MBs to processors
[Figure: MB-to-processor assignment for 2, 4, and 8 cores]
w is the width of a multi-column. Processor i (i = 0, 1, …, N - 1) is responsible for decoding an MB of the xth column if iw ≤ x < (i + 1)w.
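The column rule can be sketched as an integer division, assuming column indices start at 0 and the lower bound is inclusive so that every column has exactly one owner. The function name `multi_column_owner` is hypothetical.

```python
from typing import List


def multi_column_owner(x: int, w: int) -> int:
    """Return the processor i with i*w <= x < (i + 1)*w for MB column x."""
    return x // w


def multi_column_map(mb_cols: int, w: int) -> List[int]:
    """Owner of every MB column in a frame that is mb_cols macroblocks wide."""
    return [multi_column_owner(x, w) for x in range(mb_cols)]


# An 8-MB-wide frame split into two multi-columns of width 4.
print(multi_column_map(8, 4))  # [0, 0, 0, 0, 1, 1, 1, 1]
```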
Evaluated Methods: Multi-column Approach
• An example of the MC approach (2 cores)
• Advantage
  – Fewer dependencies across processors
    • A processor has to wait for results only at the column boundaries
[Figure: decoding progress at T = 4, T = 5, T = 8, and T = 36]
Evaluated Methods: Slice-parallel Approach
• The assignment of MBs to processors
[Figure: MB-to-processor assignment for 2, 4, and 8 cores]
h is the height of a slice. Processor i (i = 0, 1, …, N - 1) is responsible for decoding an MB of the yth row if ih ≤ y < (i + 1)h.
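Seen from the processor's side, the slice rule gives each core one contiguous band of MB rows. The sketch below assumes row indices start at 0 and an inclusive lower bound; `slice_rows` and `slice_owner` are hypothetical names.

```python
def slice_rows(i: int, h: int, mb_rows: int) -> range:
    """MB rows y with i*h <= y < (i + 1)*h, clipped to the frame height."""
    return range(i * h, min((i + 1) * h, mb_rows))


def slice_owner(y: int, h: int) -> int:
    """Processor index that decodes MB row y when slices are h rows high."""
    return y // h


# Two cores, slice height 4, frame height 8 MB rows.
print(list(slice_rows(0, 4, 8)))  # [0, 1, 2, 3]
print(list(slice_rows(1, 4, 8)))  # [4, 5, 6, 7]
```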
Evaluated Methods: Slice-parallel Approach
• An example of the SP approach in the blocking version (2 cores)
• Disadvantages
  – Long delay
  – Idle CPUs, lower core usage
[Figure: decoding progress at T = 26, T = 32, and T = 58]
Evaluated Methods: Slice-parallel Approach
• An example of the SP approach in the non-blocking version (2 cores)
  – No dependencies are considered across slice boundaries (the slices are completely independent)
  – NBSP requires full control over the encoder
[Figure: decoding progress at T = 1 and T = 32]
Evaluated Methods: Diagonal Approach
• The assignment of MBs to processors
  – The first line of MBs is divided into equally sized columns
  – The assignments for the subsequent lines are derived by left-shifting the MB assignments of the line above
[Figure: MB-to-processor assignment for 2, 4, and 8 cores]
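One possible reading of the left-shift rule is sketched below. Two details are assumptions, not statements from the slides: the shift is one MB per line, and assignments shifted out on the left wrap around to the right edge. The name `diagonal_owner` is hypothetical, and the frame width is assumed to be a multiple of the number of processors.

```python
from typing import List


def diagonal_owner(x: int, y: int, mb_cols: int, num_procs: int) -> int:
    """Owner of MB (x, y) under a shift-by-one, wrap-around diagonal layout.

    Assumption: line y equals line 0 shifted left by y MBs, cyclically.
    """
    w = mb_cols // num_procs          # width of each column in line 0
    return ((x + y) % mb_cols) // w   # look up the shifted position in line 0


def diagonal_map(mb_cols: int, mb_rows: int, num_procs: int) -> List[List[int]]:
    """Owner of every MB, one inner list per MB line."""
    return [[diagonal_owner(x, y, mb_cols, num_procs) for x in range(mb_cols)]
            for y in range(mb_rows)]


for line in diagonal_map(8, 3, 2):
    print(line)
# [0, 0, 0, 0, 1, 1, 1, 1]
# [0, 0, 0, 1, 1, 1, 1, 0]
# [0, 0, 1, 1, 1, 1, 0, 0]
```

With a shift of one MB per line, the top-right neighbour (x + 1, y - 1) of an MB maps to the same position in line 0 as the MB itself, so it always belongs to the same processor in this sketch.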
Evaluated Methods: Diagonal Approach
• An example of the DG approach
[Figure: decoding progress at T = 4, 10, 12, 13, 16, 18, 20, 23, 24, and 43]
Evaluated Methods: Diagonal Approach
• Comparing the inter-processor dependencies introduced by the DG and MC approaches
  – Diagonal approach: dependencies for CPU 2 originate solely from MBs assigned to CPU 1
  – Multi-column approach: MBs assigned to CPU 2 are also dependent on CPU 3
Experimental Results: Overview
• Test sequences
• Parameters
  – GOP size = 14
  – Search range = +/- 16 pixels
  – 5 reference frames
Experimental Results: Run-time Complexity
• Two major indicators for the efficiency of a multi-core decoding system
  – The decoder’s run-time
    • A low run-time indicates a high system decoding performance
  – The number of data-dependency stalls occurring during the decoding process
    • The number of stalls provides an estimate of how efficiently the system’s computational resources are used
Experimental Results: Run-time Complexity
• Speed-up in run-time
  – The speed increase for each parallelization approach in multiples of the single-core performance
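Speed-up in this sense is the single-core run-time divided by the multi-core run-time. The sketch below only illustrates that ratio; the function name `speed_up` is hypothetical, and the two input values are illustrative time-unit counts rather than measured results.

```python
def speed_up(single_core_time: float, multi_core_time: float) -> float:
    """Speed increase in multiples of the single-core performance."""
    if multi_core_time <= 0:
        raise ValueError("multi_core_time must be positive")
    return single_core_time / multi_core_time


# Illustrative values only: 64 time units on one core, 34 on two cores.
print(round(speed_up(64, 34), 2))  # 1.88
```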
Experimental Results: Run-time Complexity
• Stall cycles caused by data dependencies between the cores
Experimental Results: Inter-communication
• Memory transfers to and from the external DRAM and between the cores’ local memories are expensive in terms of power consumption and transfer time
  – Core inter-communication
  – Loading reference data and deblocking pixels
Experimental Results: Inter-communication
• Data transfer volume for reference data and deblocking information