Optimization of Arithmetic Coding By Kolluru Krishna Bharath.


Outline

Objective
Motivation
Optimizations w.r.t. platforms
◦ Predication
Optimizations w.r.t. algorithm (Arithmetic Coding)
◦ Sequential AC
◦ Parallel AC
Conclusion

Objective

To study the performance of the algorithm on different platforms.

To optimize the algorithm to achieve better performance.

Motivation

MQ Coding

MQ Coding:
1. Resizing of the interval to eliminate the need for high precision in the range calculation
2. Adaptive probability (MPS) calculation (requires only one pass)
3. Integer arithmetic
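The interval resizing in point 1 can be illustrated with a small integer-only sketch. It assumes the convention used by the JPEG 2000 MQ coder, where the interval register is kept at or above 0x8000 by left shifts; the function name `renormalize` and its return shape are hypothetical, and the byte-output step of a real coder is left out.

```python
def renormalize(a, c):
    """Double the interval register `a` until it is at least 0x8000.

    The code register `c` is shifted by the same amount so the two stay
    aligned. Returns the new a, the new c, and the number of shifts.
    """
    if a <= 0:
        raise ValueError("interval width must be positive")
    shifts = 0
    while a < 0x8000:
        a <<= 1          # integer doubling, no floating point needed
        c <<= 1
        shifts += 1
    return a, c, shifts


# A width that has shrunk to a quarter of the minimum needs two doublings.
print(renormalize(0x2000, 0x0001))
```

Because the width is rescaled whenever it gets small, a fixed-width register is enough no matter how long the message is.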

Machine specific optimizations

Compilers take advantage of the architecture underneath.

Examples of machine specific optimizations are
◦ Predication
◦ Software pipelining

The ARM core supports predication.

Predication

Objective
◦ Eliminate hard-to-predict branches.
◦ Increase ILP.

Advantage of using ARM:
◦ Supports predication (condition codes), and
◦ Gives the option of setting the flags on every arithmetic and logical instruction.

Predication-Examples

UPDATING THE COUNT
    LDRB  R4,[R1]      ; R4 HAS THE SYMBOL.
    LDRB  R5,[R2]      ; R5 HAS THE COUNT OF NUMBER OF ZEROS.
    LDRB  R6,[R3]      ; R6 HAS THE COUNT OF NUMBER OF ONES.
    CMP   R4,#0        ; CHECK IF THE VALUE THAT IS READ FROM THE SOURCE IS 1/0.
    ADDEQ R5,R5,#1     ; IF ZERO, ADD 1 TO COUNT_0.
    ADDNE R6,R6,#1     ; IF ONE, ADD 1 TO COUNT_1.
    STRB  R5,[R2]      ; COUNT_0 HAS BEEN UPDATED WITH THE NEW VALUE.
    STRB  R6,[R3]      ; COUNT_1 HAS BEEN UPDATED WITH THE NEW VALUE.

EVALUATING THE MPS AND THE LPS.
    CMP   R5,R6        ; CHECK IF COUNT_0 >= COUNT_1.
    MOVGE R4,#0        ; IF YES, 0 IS THE MPS: MOVE 0 INTO R4.
    MOVLT R4,#1        ; IF NO, 1 IS THE MPS: MOVE 1 INTO R4.
    LDR   R0,=(MPS)
    STRB  R4,[R0]

IF NO PREDICATION, THE CODE WOULD LOOK LIKE
    LDR   R0,=(MPS)
    CMP   R5,R6
    BGE   LOOP1        ; HIGHLY UNPREDICTABLE BRANCH.
    MOV   R4,#1
    STRB  R4,[R0]
    B     EXIT
LOOP1: MOV   R4,#0
    STRB  R4,[R0]
EXIT
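For readers who do not read ARM assembly, the data flow of the listings can be sketched in Python. This is only an illustration: Python has no predicated instructions, so the sketch replaces each if/else with arithmetic on a 0/1 flag to mirror the branch-free shape. The function names are hypothetical, and the tie rule (0 is stored when count_0 >= count_1) follows the non-predicated listing.

```python
def update_counts(symbol, count0, count1):
    """Update the 0/1 counts without an if/else on the symbol."""
    is_zero = int(symbol == 0)   # 1 when the symbol is 0, else 0
    count0 += is_zero            # plays the role of one conditional ADD
    count1 += 1 - is_zero        # plays the role of the opposite one
    return count0, count1


def select_mps(count0, count1):
    """Return the more probable symbol; 0 wins when count0 >= count1."""
    return int(count0 < count1)


count0 = count1 = 0
for bit in [0, 1, 1, 0, 1]:
    count0, count1 = update_counts(bit, count0, count1)
print(count0, count1, select_mps(count0, count1))  # 2 3 1
```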

Predication - Continued

What is the advantage of using predication for MQ coding?
◦ The algorithm has small loops, and
◦ The branches are highly unpredictable.

This favors predication.

Performance analysis shows that replacing a conditional branch instruction with predicated instructions gives a speedup of 2.75.

Predication – Continued

Can predication be used for all algorithms?

No. Predication pays off only when certain characteristics are present, such as
◦ Highly unpredictable branches
◦ Small loops (preferable); otherwise the cost of executing both directions could exceed the cost of a misprediction.
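The trade-off in the last bullet can be made concrete with a toy cost model. Everything here is an assumption made for illustration: the two function names, the cycle counts, the 50% misprediction rate, and the 3-cycle penalty are hypothetical and were not measured on any ARM core.

```python
def branch_cost(path_cycles, mispredict_rate, mispredict_penalty):
    """Expected cycles for a branchy if/else: one arm plus the expected penalty."""
    return path_cycles + mispredict_rate * mispredict_penalty


def predicated_cost(then_cycles, else_cycles):
    """Cycles for the predicated form: both arms are issued every time."""
    return then_cycles + else_cycles


# Short arms with an unpredictable branch: predication is cheaper.
print(predicated_cost(1, 1), branch_cost(1, 0.5, 3))

# Long arms with the same branch: issuing both arms costs more than mispredicting.
print(predicated_cost(20, 20), branch_cost(20, 0.5, 3))
```

In this model predication wins only while the extra arm is shorter than the expected misprediction penalty, which is why small bodies are preferred.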

Optimization of the Algorithm

Sequential AC-Data Flow Diagram

Begin
  L=0; H=1; F(0)=0;
  for (j=1 to N) {
    i=index_of_symbol(j);
    L(j+1)=L(j)+(H(j)-L(j))*F(i-1);
    H(j+1)=L(j)+(H(j)-L(j))*F(i);
  }
  output((L+H)/2);
end

The dependence matrix is given by [1 1].
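The sequential loop can be sketched in Python with exact fractions, which sidesteps the precision problem for short messages. The helper names `cumulative` and `encode_sequential` and the three-symbol probability model are assumptions chosen for the example.

```python
from fractions import Fraction


def cumulative(probs):
    """Build F(0..K) from symbol probabilities, with F(0) = 0."""
    F = [Fraction(0)]
    for p in probs:
        F.append(F[-1] + p)
    return F


def encode_sequential(indices, F):
    """Narrow [L, H) once per symbol; `indices` are 1-based symbol indices."""
    L, H = Fraction(0), Fraction(1)
    for i in indices:
        width = H - L
        # Both new bounds are computed from the old L, as in L(j+1) and H(j+1).
        L, H = L + width * F[i - 1], L + width * F[i]
    return L, H


F = cumulative([Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)])
L, H = encode_sequential([1, 3, 2], F)
print(L, H, (L + H) / 2)  # 7/16 15/32 29/64
```

Each iteration reads the L and H written by the previous one, so the loop carries a dependence of distance 1 and its iterations cannot run concurrently as written.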

Sequential AC-Dependence Graph

[Figure: dependence graphs for the 1-dimensional loop and the 2-dimensional loop, with axes labeled I and J.]

1. Inner loop & outer loop parallelism are absent.

2. Loop interchange doesn’t help.

Parallel AC- Data Flow Graph

Parallel AC – Dependence Graph

Do all i=1 to 2
  Do j=1 to 2
  {
    l=index_of_symbol(j,i);
    L(j,i)=L(j,i)+(H(j,i)-L(j,i))*F(l-1);
    H(j,i)=L(j,i)+(H(j,i)-L(j,i))*F(l);
  }
  Enddo
Enddoall

L_final = L12 + (H12-L12)*L22;
H_final = L12 + (H12-L12)*H22;
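A minimal Python sketch of the same idea follows, under stated assumptions: the message is split into fixed-size blocks, each block is encoded independently from [0, 1), and the results are nested with the L_final and H_final formulas. The names `encode_block`, `combine`, and `encode_parallel`, the block size, and the probability model are hypothetical. Both bounds in each step are computed from the old L. The thread pool only marks where the "do all" sits; in CPython it does not by itself deliver the speedups reported in these slides.

```python
from concurrent.futures import ThreadPoolExecutor
from fractions import Fraction

# Hypothetical model: cumulative probabilities F(0..3) for three symbols.
F = [Fraction(0), Fraction(1, 2), Fraction(3, 4), Fraction(1)]


def encode_block(indices):
    """Encode one block from [0, 1); needs nothing from any other block."""
    L, H = Fraction(0), Fraction(1)
    for i in indices:
        width = H - L
        L, H = L + width * F[i - 1], L + width * F[i]
    return L, H


def combine(first, second):
    """Nest the second block's interval inside the first block's interval."""
    (L1, H1), (L2, H2) = first, second
    return L1 + (H1 - L1) * L2, L1 + (H1 - L1) * H2


def encode_parallel(message, block_size):
    blocks = [message[k:k + block_size]
              for k in range(0, len(message), block_size)]
    with ThreadPoolExecutor() as pool:        # the "do all" over blocks
        intervals = list(pool.map(encode_block, blocks))
    result = (Fraction(0), Fraction(1))
    for interval in intervals:                # short sequential combine step
        result = combine(result, interval)
    return result


message = [1, 3, 2, 1]
print(encode_parallel(message, 2) == encode_block(message))  # True
```

The combine step is still sequential, but it costs one multiply-add pair per block rather than per symbol.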

Figure (1): Dependence graph for Parallel AC (axes I and J).

Dependence Matrix for Figure(1)

Performance – Arithmetic Coding

Parallel arithmetic coding of a text message of length 1800 showed the following speedups:
◦ 4.875 (without the overhead of loading the values into separate processors)
◦ 1.66 (with the overhead)

For a text message of length ~10000, the parallel version showed a speedup of 2 (with overhead).

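Conclusion

Running the MQ coder on the ARM core improves the performance of the algorithm: predication gave a speedup of 2.75 on each conditional branch it replaced.

Tuning the AC for parallel execution gave speedups of 1.66 to 2 with the processor-loading overhead included, and 4.875 without it.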

Thank you

Questions?

Backup Slides

Predication

Performance analysis shows that replacing a conditional branch instruction with predicated instructions gives a speedup of 2.75, i.e. a reduction from 0.264us to 0.096us (11 cycles to 4 cycles) at a clock frequency of 41MHz (24ns per cycle). Each time a symbol (1/0) is encoded, we save 7 cycles (i.e. 7 cycles/run) for every predicated sequence used instead of a branch instruction. When this is executed on, say, a black & white image of size 256x256, we save ~0.5M cycles.
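The figures on this slide can be checked with a few lines of integer arithmetic. The variable names are hypothetical. The sketch assumes exactly one symbol per pixel and one replaced branch per symbol, which is how the slide's estimate reads.

```python
CYCLE_NS = 24                                # one cycle at roughly 41 MHz, as quoted

branch_cycles = 264 // CYCLE_NS              # 0.264 us with the branch
predicated_cycles = 96 // CYCLE_NS           # 0.096 us with predication
saved_per_symbol = branch_cycles - predicated_cycles

symbols = 256 * 256                          # one symbol per pixel (assumption)
total_saved = saved_per_symbol * symbols

print(branch_cycles, predicated_cycles, saved_per_symbol)  # 11 4 7
print(264 / 96)                                            # 2.75
print(total_saved)                                         # 458752
```

The total of 458,752 cycles is what the slide rounds to ~0.5M.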

Performance – Arithmetic Coding

The sequential and parallel arithmetic coders show a large difference in execution time on the same testbench:
◦ Sequential – 0.078 seconds
◦ Parallel – 0.016 seconds
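As a quick cross-check, the ratio of these two timings reproduces the 4.875 speedup quoted without overhead on the main performance slide. The variable names are hypothetical, and the times are expressed in milliseconds to keep the inputs as integers.

```python
sequential_ms = 78    # 0.078 seconds
parallel_ms = 16      # 0.016 seconds

speedup = sequential_ms / parallel_ms
print(speedup)  # 4.875
```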

References

Howard & Vitter, “Arithmetic Coding for Data Compression”.

David Sehr, Jay Bharadwaj, Jim Pierce, Priti Shrivastav, Carole Dulong, “IA-64 Compiler Technology”.

Utpal Banerjee, “Loop Parallelization”.

Pierre Boulet, Darte and Silber, “Loop parallelization Algorithms: from parallelism extraction to code generation”.

Supol and Melichar, “Arithmetic Coding in Parallel”.
