Software Optimization

EECE 360 – Prof. Schamiloglu
The University of New Mexico
Vikas Chaudhary, MA 471

Steps to Create an Executable

High-level code -> Intermediate code -> Assembly code -> (Assembler) -> Object code -> Executable

Higher Optimizations

Optimizations can be applied over progressively larger scopes:

• Within basic blocks.
• Within single and nested loop structures.
• Over an entire procedure, including all of its blocks and structures.
• Across a file (inter-procedural analysis within one source file).
• Across files (inter-procedural analysis across all procedures in all source files).

Compiler Options

There are no strict rules about what each optimization level means, but generally:

• O0 performs a direct one-to-many translation of each source statement into machine instructions, with no optimization.
• O1 performs basic block optimizations.
• O2 performs loop optimizations.
• O4 performs interfile optimizations.

Some compilers also provide an option such as +odataprefetch, which inserts prefetch instructions that move data from memory into the cache before it is needed.
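As a rough illustration of what inserted prefetch instructions do, the following C sketch issues them by hand instead of relying on a compiler option. It assumes a GCC- or Clang-compatible compiler, since it uses the `__builtin_prefetch` builtin; the function name `sum_with_prefetch` and the prefetch distance of 16 elements are hypothetical choices, not values from these slides.

```c
#include <stddef.h>

/* Hypothetical prefetch distance, in elements. The best value depends on
   memory latency and loop cost; 16 is only an assumption. */
#define PREFETCH_DISTANCE 16

/* Sum an array while asking the hardware to start fetching the element
   needed PREFETCH_DISTANCE iterations from now. __builtin_prefetch is a
   GCC/Clang builtin; it is only a hint and never changes the result. */
double sum_with_prefetch(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            /* 0 = prefetch for reading, 3 = high temporal locality */
            __builtin_prefetch(&x[i + PREFETCH_DISTANCE], 0, 3);
        s += x[i];
    }
    return s;
}
```

Because a prefetch is only a hint, the sum is identical with or without it; whether it actually saves time depends on the machine.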

Increasing Register Pressure

When a computation needs more registers than are available, the compiler must spill values to memory and later restore them from memory. These extra loads and stores degrade performance.

If the assembly code generated with the compiler's -S option contains an inordinate number of load and store instructions, the compiler is probably generating too many spills.

Use the register storage class carefully.

Dead Code Elimination

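Dead code elimination is the removal of code that can never execute or whose results are never used. In the following fragment, i is known to be 0, so the condition is always false and the call can be removed:

i = 0;

if (i != 0) deadcode(i);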
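A minimal before/after sketch in C of the fragment above. The body of `deadcode` and the `calls` counter are hypothetical, added only so the code compiles and the removal can be checked.

```c
/* Hypothetical stand-in for deadcode(): counts how often it is called. */
static int calls = 0;
static void deadcode(int i) { (void)i; calls++; }

/* Before: i is always 0, so the test can never be true. */
int before_dce(void)
{
    int i = 0;
    if (i != 0)
        deadcode(i);
    return i;
}

/* After: the branch and the call it guards have been removed. */
int after_dce(void)
{
    return 0;
}
```

Both functions return the same value and neither ever calls `deadcode`, which is why the compiler may delete the branch.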

Constant Folding and Propagation

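Constant folding evaluates expressions whose operands are all constants at compile time. For example,

a = 1 + 2

will be replaced by a = 3. Constant propagation then substitutes the known value of a into later expressions that use a, which can expose further folding.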
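A small C sketch of both transformations. The extra statement `b = a * 4` and the function names are hypothetical additions that show propagation feeding a second round of folding.

```c
/* Before: every operand is ultimately a constant. */
int before_fold(void)
{
    int a = 1 + 2;   /* folding: 1 + 2 becomes 3 at compile time */
    int b = a * 4;   /* propagation: a is known to be 3, so this folds to 12 */
    return b;
}

/* After folding and propagation, only the final constant remains. */
int after_fold(void)
{
    return 12;
}
```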

Common Subexpression Elimination

Common subexpression elimination finds identical subexpressions that are computed more than once, computes the value once into a temporary variable, and reuses that temporary. In

a = b + (c + d)

f = e + (c + d)

the subexpression c + d needs to be computed only once.
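The before/after forms of the example, written as C functions. The function names, the pointer outputs, and the temporary `t` are illustrative assumptions.

```c
/* Before: c + d is evaluated twice. */
void before_cse(int b, int c, int d, int e, int *a, int *f)
{
    *a = b + (c + d);
    *f = e + (c + d);
}

/* After: the common subexpression is evaluated once into a temporary. */
void after_cse(int b, int c, int d, int e, int *a, int *f)
{
    int t = c + d;   /* compiler-introduced temporary */
    *a = b + t;
    *f = e + t;
}
```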

Strength Reductions

Strength reduction means replacing expensive operations with cheaper ones, for example:

• Replacing integer multiplication or division by constants with shift operations.
• Replacing 32-bit integer division with 64-bit floating-point division.
• Replacing floating-point multiplications by small constants with floating-point additions.
• Replacing the power function with floating-point multiplications.
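A sketch of three of these replacements in C (the function names are hypothetical; the integer-to-floating-point division case is not shown). Unsigned types are used for the shifts because, for negative signed integers, a right shift does not round the same way as division.

```c
/* Multiplication and division by powers of two, as shifts. */
unsigned mul8(unsigned x)       { return x * 8u; }
unsigned mul8_shift(unsigned x) { return x << 3; }
unsigned div4(unsigned x)       { return x / 4u; }
unsigned div4_shift(unsigned x) { return x >> 2; }

/* Floating-point multiplication by the small constant 2, as an addition. */
double twice(double x)     { return 2.0 * x; }
double twice_add(double x) { return x + x; }

/* A cube written as two multiplications instead of a call to pow(x, 3.0). */
double cube_mul(double x)  { return x * x * x; }
```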

Filling Branch Delay Slots

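A branch delay slot is the instruction slot immediately after a branch; the instruction placed there is always executed, whether or not the branch is taken. An optimizing compiler tries to fill the slot with useful work.

If the compiler is used with no optimization, it will probably insert a nop into the branch delay slot instead.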

Induction Variable Optimization

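for (i = 0; i < n; i += 2)
    ia[i] = i * k + m;

Here i is the induction variable. Because i increases by 2 on every iteration, i * k + m increases by 2 * k, so the multiplication can be replaced by an addition:

ic = m;
for (i = 0; i < n; i += 2)
{
    ia[i] = ic;
    ic = ic + 2 * k;
}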
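Both versions as runnable C, so they can be compared element by element. The function names are hypothetical; note that ic advances by 2 * k per iteration because i advances by 2.

```c
/* Original: one multiplication per iteration. */
void fill_mul(int *ia, int n, int k, int m)
{
    for (int i = 0; i < n; i += 2)
        ia[i] = i * k + m;
}

/* After induction-variable optimization: ic always equals i * k + m. */
void fill_add(int *ia, int n, int k, int m)
{
    int ic = m;
    int step = 2 * k;          /* loop-invariant increment, computed once */
    for (int i = 0; i < n; i += 2) {
        ia[i] = ic;
        ic += step;
    }
}
```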

Loop Fission

This technique is often used when an inner loop consists of a large number of lines and the compiler has difficulty generating code without spilling.

It can also improve cache performance. Consider

for (i = 0; i < n; i++)
    y[i] = y[i] + x[i] + x[i + m];

Suppose x[i] and x[i + m] map to the same cache location, as can happen in a direct-mapped cache. The two references then evict each other on every iteration, which causes cache thrashing.

The loop can be split as

for (i = 0; i < n; i++)
    y[i] = y[i] + x[i];

for (i = 0; i < n; i++)
    y[i] = y[i] + x[i + m];

This technique may be less useful when the cache is n-way set associative, because x[i] and x[i + m] can then reside in the same set at the same time.
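A C sketch of the fused and split loops (hypothetical function names). It assumes x holds at least n + m elements and that y does not overlap x; if the arrays could overlap, splitting could reorder dependent reads and writes, so a compiler could not apply the transformation without proving they are disjoint.

```c
#include <stddef.h>

/* Fused loop: x[i] and x[i + m] are read in the same iteration. */
void fused(double *y, const double *x, size_t n, size_t m)
{
    for (size_t i = 0; i < n; i++)
        y[i] = y[i] + x[i] + x[i + m];
}

/* After fission: each loop streams through only one region of x.
   The additions happen in the same order, so results are identical. */
void split(double *y, const double *x, size_t n, size_t m)
{
    for (size_t i = 0; i < n; i++)
        y[i] = y[i] + x[i];
    for (size_t i = 0; i < n; i++)
        y[i] = y[i] + x[i + m];
}
```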

Loop Unrolling

This technique reduces the effect of branches and instruction latency, and potentially the number of cache misses. The loop

      DO I = 1, N
        Y(I) = X(I)
      ENDDO

after unrolling by four becomes

      NEND = 4 * (N / 4)
      DO I = 1, NEND, 4
        Y(I)   = X(I)
        Y(I+1) = X(I+1)
        Y(I+2) = X(I+2)
        Y(I+3) = X(I+3)
      ENDDO
      DO I = NEND + 1, N
        Y(I) = X(I)
      ENDDO

The second loop handles the leftover iterations when N is not a multiple of 4.
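The same unrolling sketched in C, with a hypothetical function name. `nend` plays the role of NEND = 4 * (N / 4), and the body loads four x values before performing the four stores.

```c
#include <stddef.h>

/* y[i] = x[i], unrolled by 4, followed by a remainder loop for the
   n % 4 leftover elements. */
void copy_unrolled(double *y, const double *x, size_t n)
{
    size_t nend = 4 * (n / 4);
    size_t i;
    for (i = 0; i < nend; i += 4) {
        /* load all four x values first, then store them */
        double x0 = x[i], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
        y[i] = x0;
        y[i + 1] = x1;
        y[i + 2] = x2;
        y[i + 3] = x3;
    }
    for (; i < n; i++)          /* remainder loop */
        y[i] = x[i];
}
```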

• Loading all the values of X before storing the values of Y reduces the possibility of cache thrashing.

• Unrolling can reduce the number of software prefetch instructions required.

• Excessive unrolling increases register pressure and will cause data to be spilled from registers to memory.

• Unrolling increases the size of the object code, which might cause too many instruction cache misses.

Clock Cycles in an Unrolled Loop

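Original order    CC    Modified order    CC
Load X(1)          1    Load X(1)          1
Store Y(1)         7    Load X(2)          2
Load X(2)          8    Load X(3)          3
Store Y(2)        14    Load X(4)          4
Load X(3)         15    Store Y(1)         7
Store Y(3)        21    Store Y(2)         8
Load X(4)         22    Store Y(3)         9
Store Y(4)        28    Store Y(4)        10

In this example each store issues six cycles after the load that feeds it. Issuing all four loads before the stores overlaps those latencies, so the four copies finish at cycle 10 instead of cycle 28.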
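The cycle counts in the table can be reproduced with a tiny scheduling model. This is a hypothetical sketch, not a real pipeline: it assumes one instruction issues per cycle and that a store cannot issue until 6 cycles after the load that feeds it, which is the latency the table implies.

```c
/* Assumed load-to-use latency implied by the table. */
#define LOAD_LATENCY 6

/* Original order: each load is immediately followed by its dependent store. */
int cycles_original(int n)
{
    int cycle = 0;
    for (int k = 0; k < n; k++) {
        int load = cycle + 1;          /* next free issue cycle */
        cycle = load + LOAD_LATENCY;   /* store waits for its load */
    }
    return cycle;                      /* cycle of the last store */
}

/* Modified order: all loads first, then all stores. Assumes n <= 64. */
int cycles_modified(int n)
{
    int load_cycle[64];
    int cycle = 0;
    for (int k = 0; k < n; k++)
        load_cycle[k] = ++cycle;       /* loads issue back to back */
    for (int k = 0; k < n; k++) {
        int ready = load_cycle[k] + LOAD_LATENCY;
        /* a store issues at the later of "next cycle" and "data ready" */
        cycle = (cycle + 1 > ready) ? cycle + 1 : ready;
    }
    return cycle;
}
```

For four elements, the model gives 28 cycles for the original order and 10 for the modified order, matching the last row of the table.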

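Loop Peeling

Compilers use this technique to handle boundary conditions: iterations that need special treatment are peeled off the loop so that the loop body no longer has to test for them. The loop

      DO I = 1, N
        IF (I .EQ. 1) THEN
          X(I) = 0
        ELSE IF (I .EQ. N) THEN
          X(I) = N
        ELSE
          X(I) = X(I) + Y(I)
        ENDIF
      ENDDO

becomes, after loop peeling,

      X(1) = 0
      DO I = 2, N-1
        X(I) = X(I) + Y(I)
      ENDDO
      X(N) = N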
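A C sketch of the same transformation, translated to 0-based indices, so the last element is x[n - 1]. The function names are hypothetical, and the peeled version assumes n >= 2; for n == 1 the two versions would differ, so a compiler would need to guard that case.

```c
/* Original: boundary tests inside the loop. */
void with_tests(double *x, const double *y, int n)
{
    for (int i = 0; i < n; i++) {
        if (i == 0)
            x[i] = 0;
        else if (i == n - 1)
            x[i] = n;
        else
            x[i] = x[i] + y[i];
    }
}

/* Peeled: first and last iterations moved out of the loop (n >= 2). */
void peeled(double *x, const double *y, int n)
{
    x[0] = 0;
    for (int i = 1; i < n - 1; i++)
        x[i] = x[i] + y[i];
    x[n - 1] = n;
}
```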

Software Pipelining

Software pipelining is a technique for restructuring loops so that each iteration of the software-pipelined code is made from instructions chosen from different iterations of the original loop.

(Figure: iterations 0 through 3 of the original loop, overlapped in time.)

Software pipelining is an optimization that is impossible to duplicate with high-level code, because the multiple assembly-language instructions that a single line of high-level code creates are moved around extensively.

Compilers create software pipelines only at high optimization levels.
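A real software pipeline is built from individual machine instructions, so the following C is only a source-level approximation of its shape, not what a compiler emits. It is a hypothetical two-stage example: the load for iteration i + 1 sits in the same loop body as the multiply and store for iteration i, with a prologue before the loop and an epilogue after it.

```c
#include <stddef.h>

/* Plain loop: each iteration loads x[i], multiplies, stores y[i]. */
void scale_plain(double *y, const double *x, size_t n, double a)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i];
}

/* Two-stage pipelined shape: stage 1 = load, stage 2 = multiply and store. */
void scale_pipelined(double *y, const double *x, size_t n, double a)
{
    if (n == 0)
        return;
    double cur = x[0];                    /* prologue: first load */
    for (size_t i = 0; i + 1 < n; i++) {  /* kernel */
        double next = x[i + 1];           /* stage 1 of iteration i + 1 */
        y[i] = a * cur;                   /* stage 2 of iteration i */
        cur = next;
    }
    y[n - 1] = a * cur;                   /* epilogue: last store */
}
```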

Compiler Speculation with Hardware Support

Modern compilers try to speculate, moving instructions ahead of the branches that guard them, either to improve scheduling or to increase the issue rate.

Hurdle: conditional instructions.

When moving instructions across a branch, the compiler must ensure that exception behavior is not changed and that dynamic data dependences remain the same.

The compiler also finds out which registers are not being used and uses those registers for renaming.
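A C sketch of why exception behavior limits speculation (all function names are hypothetical). Hoisting an unsigned multiply above its branch is safe, because it cannot raise an exception. Hoisting the division in `guarded_div` above its test would be unsafe: division by zero is undefined in C and traps on many processors, exactly on the path the branch was protecting.

```c
/* Guarded multiply, as written. */
unsigned guarded_mul(int cond, unsigned b, unsigned c)
{
    unsigned r = 0;
    if (cond)
        r = b * c;
    return r;
}

/* Speculated version: the multiply is hoisted above the branch.
   Safe, because unsigned multiplication wraps and never traps. */
unsigned speculated_mul(int cond, unsigned b, unsigned c)
{
    unsigned t = b * c;
    return cond ? t : 0;
}

/* Must NOT be speculated: computing b / c before the test would fault
   when c == 0, changing the program's exception behavior. */
unsigned guarded_div(unsigned b, unsigned c)
{
    return (c != 0) ? b / c : 0;
}
```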

Thank you!