Transcript of Day01 HPC Workshop: Compiler Optimisation

HPC Workshop 2004: Optimisation Techniques

High Performance Computing Workshop
Day 1: October 5, 2004

Uni-Processor Optimization: Code Restructuring and Loop Optimization Techniques


Lecture Outline

Why you need to optimize sequential code
The memory hierarchy and how a code's performance depends on it
Optimization techniques
Loop optimization techniques: collapsing, fission, fusion, unrolling, interchange, invariant code extraction, de-factorization, overheads of if/while/goto, neighbour data dependency
Arithmetic optimization
Compiler optimizations; use of tuned math libraries
Performance of selected applications and benchmarks
Conclusions


Improving Single Processor Performance

How much sustained performance can one achieve for a given program on a machine?

It is the programmer's job to take as much advantage as possible of the CPU's hardware and software characteristics to boost the performance of the program.

Quite often, just a few simple changes to one's code improve performance by a factor of 2, 3 or better.

Also, simply compiling with some of the optimization flags (-O3, -fast, ...) can improve the performance dramatically.


The Memory Sub-system: Access Time is Important

A lot of time is spent accessing and storing data from/to memory. It is important to keep in mind the approximate access times of the different memory levels:

CPU registers: 0 cycles (that's where the work is done!)
L1 cache: 1 cycle (separate data and instruction caches); repeated access to the cache takes only 1 cycle
L2 cache (static RAM): 3-5 cycles
Memory (DRAM): about 10 cycles on a cache miss; 30-60 cycles for a Translation Lookaside Buffer (TLB) update
Disk: about 100,000 cycles!
Other nodes: depends on network latency

[Figure: CPU registers, L1 I-cache/D-cache, L2 cache, RAM, disk]


Hierarchical Memory

A four-level memory hierarchy for a large computer system. Capacity and access time increase going down the hierarchy; cost per bit increases going up:

Level 0: registers and internal caches in the CPU
Level 1 (M1): external cache (SRAMs)
Level 2 (M2): main memory (DRAMs)
Level 3 (M3): disk storage (magnetic)
Level 4 (M4): tape units (magnetic)


Loop Optimization Techniques

Classical optimization techniques: the compiler does these
Memory reference optimization: the compiler does this to some extent
Loop optimizations: the compiler does these to some extent
  Loop Fission and Loop Fusion
  Loop Interchange
  Loop Alignment
  Loop Collapsing
  Loop Unrolling


Loop Collapsing

Loop collapsing attempts to create one (larger) loop out of two or more small ones. This may be profitable if each of the original loops is too small for efficient vectorization, but the resulting single loop can be profitably vectorized.

Loop collapsing is done with multi-dimensional arrays to avoid loop overheads.

Before:

      REAL A(5,5), B(5,5)
      DO 10 J = 1, 5
        DO 10 I = 1, 5
          A(I,J) = B(I,J) + 2.0
10    CONTINUE

After:

      REAL A(25), B(25)
      DO 10 JI = 1, 25
        A(JI) = B(JI) + 2.0
10    CONTINUE


Loop Collapsing (Contd.)

Before:

      DO 10 J = 1, N
        DO 10 I = 1, M
          A(I,J) = B(I,J) + 2.0
10    CONTINUE

After:

      DO 10 L = 1, N*M
        I = (L-1)/M + 1
        J = MOD(L-1, M) + 1
        A(I,J) = B(I,J) + 2.0
10    CONTINUE

Using this technique, the code may be transformed into a single loop, regardless of the sizes of M and N. This may require some additional statements to restart the code properly.

General versions of this technique are useful for computing systems which support only a single (not nested) DOALL statement.


Loop Collapsing (Contd.)

Loop collapsing is done with multi-dimensional arrays to avoid loop overheads.

Assume the declaration a[50][80][4]. The uncollapsed version iterates over the three dimensions with a triple-nested loop; the collapsed version walks the array with a single loop. (The code on the original slide is truncated in this transcript; a sketch of both forms follows.)
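A minimal sketch of what the uncollapsed and collapsed loops likely looked like; the initialization body is an assumption for illustration, not taken from the slide:

    #include <stddef.h>

    #define NI 50
    #define NJ 80
    #define NK 4

    float a[NI][NJ][NK];

    /* Uncollapsed: three nested loops, three sets of loop overhead. */
    void init_uncollapsed(void)
    {
        for (size_t i = 0; i < NI; i++)
            for (size_t j = 0; j < NJ; j++)
                for (size_t k = 0; k < NK; k++)
                    a[i][j][k] = 1.0f;
    }

    /* Collapsed: one loop over the contiguous storage of the array,
     * paying the loop overhead only once. */
    void init_collapsed(void)
    {
        float *p = &a[0][0][0];
        for (size_t n = 0; n < (size_t)NI * NJ * NK; n++)
            p[n] = 1.0f;
    }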


Loop Fission and Loop Fusion

Loop Fusion: transforms two adjacent loops into one, on the basis of information obtained from data-dependence analysis. Two statements will be placed into the same loop if there is at least one variable or array which is referenced by both.

Loop Fission: attempts to break a single loop into several loops in order to optimize data transfer (the behaviour of main memory, cache and registers). The primary objective of this optimization is data transfer; a sketch follows below.

Remark: loop fission and loop fusion are techniques related to strip mining and loop collapsing.
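A minimal sketch of loop fission, assuming a loop whose two statements touch different arrays; splitting it lets each resulting loop stream through fewer arrays at a time (the array names are illustrative, not from the slide):

    #define N 100000

    double a[N], b[N], c[N], d[N];

    /* Before fission: one loop streams through four arrays at once. */
    void before(void)
    {
        for (int i = 0; i < N; i++) {
            a[i] = a[i] + b[i];
            c[i] = c[i] * d[i];
        }
    }

    /* After fission: each loop streams through only two arrays,
     * which can reduce pressure on cache and registers. */
    void after(void)
    {
        for (int i = 0; i < N; i++)
            a[i] = a[i] + b[i];
        for (int i = 0; i < N; i++)
            c[i] = c[i] * d[i];
    }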


Loop Fusion (Contd.)

Loop fusion is the merging of several loops into a single loop.

Example, untuned (two separate loops):

    for (i = 0; i < 100000; i++)
        x = x * a[i] + b[i];
    for (i = 0; i < 100000; i++)
        y = y * a[i] + c[i];

Example, tuned (fused loop):

    for (i = 0; i < 100000; i++) {
        x = x * a[i] + b[i];
        y = y * a[i] + c[i];
    }

The tuned code runs at least 10 times faster on an UltraSPARC (both versions compiled with the -O3 flag).


Loop Fusion (Contd.)

Advantages:

The loop overhead is reduced by a factor of two in the above case.
It allows for better instruction overlap in loops with dependencies.
Cache misses can be decreased if both loops reference the same array.

Disadvantages:

It has the potential to increase cache misses if the fused loop contains references to more than four arrays and the starting elements of those arrays map to the same cache line,
e.g.: x = x * a[i] + b[i] * c[i] + d[i] / e[i]


Loop Optimizations: Basic Loop Unrolling

Loop unrolling is performing multiple loop iterations per pass.

Loop unrolling is one of the most important optimizations that can be done on a pipelined machine.

Loop unrolling helps performance because it fattens up a loop with calculations that can be done in parallel (a sketch follows below).

Remark: never unroll an inner loop.
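For illustration, this is the transformation that unrolling performs; per the remark above, this is typically what the compiler does to an inner loop rather than something to do by hand. The array names, the factor of 4 and the preconditioning loop are assumptions, not from the slide:

    #define N 100000

    double a[N], b[N], c[N];

    void update_unrolled(void)
    {
        int i;
        /* Preconditioning loop: handle the N % 4 leftover iterations. */
        for (i = 0; i < N % 4; i++)
            a[i] = a[i] + b[i] * c[i];
        /* Main loop: four independent updates per pass, so the pipeline
         * has more work to overlap and the loop overhead is paid once
         * per four iterations. */
        for (; i < N; i += 4) {
            a[i]     = a[i]     + b[i]     * c[i];
            a[i + 1] = a[i + 1] + b[i + 1] * c[i + 1];
            a[i + 2] = a[i + 2] + b[i + 2] * c[i + 2];
            a[i + 3] = a[i + 3] + b[i + 3] * c[i + 3];
        }
    }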


Outer and Inner Loop Unrolling

Loop nest: loops enclosed within other loops.

Remark: the loop or loops in the centre are called the inner loops, and the surrounding loops are called the outer loops.

(The loop-nest code on the original slide is truncated in this transcript.)


Outer and Inner Loop Unrolling (Contd.)

Reasons for applying outer loop unrolling are:

To expose more computations
To improve memory reference patterns

(The code on the original slide is truncated in this transcript; a sketch follows.)
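A minimal sketch of outer loop unrolling, assuming a nested loop over 2-D arrays; unrolling the outer loop by 2 exposes two independent row updates inside the inner loop (the names are illustrative, not from the slide, and N is assumed even):

    #define N 1000

    double a[N][N], b[N][N];

    void outer_unrolled(void)
    {
        /* Outer loop unrolled by 2: each inner iteration now carries
         * two independent computations and touches two rows of each
         * array, improving instruction overlap and memory reuse. */
        for (int i = 0; i < N; i += 2) {
            for (int j = 0; j < N; j++) {
                a[i][j]     = a[i][j]     + b[i][j];
                a[i + 1][j] = a[i + 1][j] + b[i + 1][j];
            }
        }
    }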


Loop Unrolling and Sum Reduction

Loop unrolling should be used to reduce data dependency. Different accumulator variables can be used to eliminate the data dependency.

(The reduction loop on the original slide, beginning "a = 0.0;", is truncated in this transcript; a sketch follows.)
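A minimal sketch of the idea, assuming a simple sum over an array; the four partial sums are independent, so the additions can overlap in the pipeline (the names are illustrative and N is assumed to be a multiple of 4):

    #define N 100000

    double x[N];

    double sum_unrolled(void)
    {
        /* Four independent accumulators break the single chain of
         * dependent additions a = a + x[i]. */
        double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
        for (int i = 0; i < N; i += 4) {
            a0 += x[i];
            a1 += x[i + 1];
            a2 += x[i + 2];
            a3 += x[i + 3];
        }
        return a0 + a1 + a2 + a3;
    }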


Qualifying Candidates for Loop Unrolling

The previous example is an ideal candidate for loop unrolling. Next, consider the categories of loops that are generally not prime candidates for unrolling:

Loops with low trip counts
Fat loops
Loops containing branches
Recursive loops
Vector reductions


Qualifying Candidates for Loop Unrolling (Contd.)

Loops with low trip counts: to be effective, loop unrolling requires that there be a fairly large number of iterations in the original loop. When the trip count of a loop is low, the preconditioning loop is doing a proportionally large amount of work.

Loops containing procedure calls: loops containing subroutine or function calls generally are not good candidates for unrolling.

First: they often contain a fair number of instructions already, and the function call itself adds many more.
Second: when the calling routine and the subroutine are compiled separately, it is impossible for the compiler to intermix their instructions.
Last: function call overhead is expensive; registers have to be saved and argument lists have to be prepared. The time spent calling and returning from a subroutine can be much greater than the loop overhead itself.


Qualifying Candidates for Loop Unrolling (Contd.)

A loop containing procedure calls is not suitable for unrolling.

Original:

      DO 10 I = 1, N
        CALL SHORT(A(I), B(I), C)
10    CONTINUE

      SUBROUTINE SHORT(A, B, C)
      A = A + B + C
      RETURN
      END

Unrolled by hand:

      II = IMOD(N, 4)
      DO 9 I = 1, II
        CALL SHORT(A(I), B(I), C)
9     CONTINUE
      DO 10 I = 1+II, N, 4
        CALL SHORT(A(I),   B(I),   C)
        CALL SHORT(A(I+1), B(I+1), C)
        CALL SHORT(A(I+2), B(I+2), C)
        CALL SHORT(A(I+3), B(I+3), C)
10    CONTINUE


Qualifying Candidates for Loop Unrolling (Contd.)

Fat loops: if a particular loop is already fat, unrolling is not going to help much, and the loop overhead is already spread over a fair number of instructions.

A good rule of thumb is to look elsewhere for performance when the loop innards exceed three or four statements, unless the code indicates that inlining is feasible.


Qualifying Candidates for Loop Unrolling: Recursive Loops (Contd.)

Original:

      DO 10 I = 2, N
        A(I) = A(I) + A(I-1) * B
10    CONTINUE

Every iteration needs the result of the previous iteration; unrolling simply exposes the chain of dependencies:

      A(I)   = A(I)   + A(I-1) * B
      A(I+1) = A(I+1) + A(I)   * B
      A(I+2) = A(I+2) + A(I+1) * B
      A(I+3) = A(I+3) + A(I+2) * B

The dependency can be reduced by deriving a new set of recursive equations, decreasing the dependencies at the expense of creating more work. This is an example of vector recursion.

Modified:

      DO 10 I = 2, N, 2
        A(I+1) = A(I+1) + A(I) * B + A(I-1) * B * B
        A(I)   = A(I)   + A(I-1) * B
10    CONTINUE

A good compiler can make the rolled (original) version go faster by recognizing the dependency as an opportunity to save memory traffic.


Negatives of Loop Unrolling

Loop unrolling always adds some run time to the program. If you unroll a loop and see the performance dip a little, you can assume that either:

The loop wasn't a good candidate for unrolling in the first place, or
A secondary effect absorbed your performance increase.

Other possible reasons:

Unrolling by the wrong factor
Register spilling
Instruction cache misses
Other hardware delays
Outer loop unrolling


Loop Interchange

Loop interchange is a technique for rearranging a loop nest so that the "right stuff" is at the centre. What the right stuff is depends upon what you are trying to accomplish.

Use loop interchange to move computations to the centre of the loop nest.

It is also good for improving memory access patterns: iterating on the wrong subscript can cause a large stride and hurt performance.

By inverting the loops, so that the iteration variables causing the smaller strides are in the centre, you can get a performance win.


Loop Interchange (Contd.)

Loop interchange to move computations to the centre:

Before:

      PARAMETER (IDIM=1000, JDIM=1000, KDIM=4)
      DO 10 K = 1, KDIM
        DO 20 J = 1, JDIM
          DO 30 I = 1, IDIM
            D(I,J,K) = D(I,J,K) + V(I,J,K)*DT
30        CONTINUE
20      CONTINUE
10    CONTINUE

After:

      PARAMETER (IDIM=1000, JDIM=1000, KDIM=4)
      DO 10 I = 1, IDIM
        DO 20 J = 1, JDIM
          DO 30 K = 1, KDIM
            D(I,J,K) = D(I,J,K) + V(I,J,K)*DT
30        CONTINUE
20      CONTINUE
10    CONTINUE

Frequently, the interchange of nested loops permits a significant increase in the amount of parallelism.

This example is straightforward: it is easy to see that there are no inter-iteration dependencies.


Loop Interchange (Contd.)

For a C array declared float a[2][40][2000], the rightmost subscript varies fastest in memory, so the loop over that subscript should be innermost. (The loops on the original slide are truncated in this transcript; a sketch follows.)
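A minimal sketch of the untuned and tuned loop orders for that declaration, assuming a simple initialization body (the body itself is illustrative, not from the slide):

    float a[2][40][2000];

    /* Untuned: the loop over the last (contiguous) subscript is
     * outermost, so consecutive inner iterations stride through
     * memory by 40*2000 floats. */
    void untuned(void)
    {
        for (int k = 0; k < 2000; k++)
            for (int j = 0; j < 40; j++)
                for (int i = 0; i < 2; i++)
                    a[i][j][k] = 0.0f;
    }

    /* Tuned: loops interchanged so that the contiguous subscript is
     * innermost and memory is walked with stride 1. */
    void tuned(void)
    {
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 40; j++)
                for (int k = 0; k < 2000; k++)
                    a[i][j][k] = 0.0f;
    }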


Loop Invariant Code Extraction

Statements that do not change within an inner loop can be moved outside of the loop. (Compiler optimizations can usually detect these.) The example on the original slide is truncated in this transcript; a sketch follows.
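A minimal sketch, assuming the invariant expression is a product of scalars that does not depend on the loop index (the names are illustrative, not from the slide):

    #define N 100000

    double a[N], b[N];

    void untuned(double c, double d)
    {
        /* c * d is recomputed every iteration even though it never changes. */
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c * d;
    }

    void tuned(double c, double d)
    {
        /* The invariant product is hoisted out of the loop. */
        double cd = c * d;
        for (int i = 0; i < N; i++)
            a[i] = b[i] + cd;
    }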


Loop De-factorization

Loop de-factorization consists of moving common multiplicative factors outside of inner loops. (The example on the original slide is truncated in this transcript; a sketch follows.)
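A minimal sketch, assuming a sum in which every term carries the same multiplicative factor (the names are illustrative, not from the slide):

    #define N 100000

    double a[N], b[N];

    double untuned(double c)
    {
        /* One extra multiplication by c per iteration. */
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += c * a[i] * b[i];
        return s;
    }

    double tuned(double c)
    {
        /* The common factor c is pulled outside: one multiplication total.
         * Results may differ in the last bits because the association of
         * the floating-point operations changes. */
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += a[i] * b[i];
        return c * s;
    }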


Loop Optimization: IF, WHILE, and DO Loops

Avoid IF/GOTO loops and WHILE loops. They inhibit compiler optimizations and they introduce unnecessary overheads.

Untuned loop (IFs and GOTOs):

      I = 0
10    I = I + 1
      IF (I .GT. 100000) GOTO 30
      A(I) = A(I) + B(I)*C(I)
      GOTO 10
30    CONTINUE

Tuned loop:

      I = 0
10    I = I + 1
      A(I) = A(I) + B(I)*C(I)
      IF (I .LE. 100000) GOTO 10

Another untuned loop (WHILE loop):

      I = 0
      DO WHILE (I .LT. 100000)
        I = I + 1
        A(I) = A(I) + B(I)*C(I)
      END DO

Tuned loop:

      DO I = 1, 100000
        A(I) = A(I) + B(I)*C(I)
      END DO


Example: data wrap-around, untuned version. The untuned code begins with jwrap = ARRAY_SIZE - 1; the rest of the loop on the original slide is truncated in this transcript. A sketch of the usual form of this optimization follows.
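A minimal sketch of a wrap-around loop and its tuned form, assuming the conventional version of this example (the loop body is an assumption, not taken from the slide); the wrap-around scalar jwrap carries a value between iterations, and peeling off the first iteration removes it:

    #define ARRAY_SIZE 100000

    double a[ARRAY_SIZE], b[ARRAY_SIZE];

    /* Untuned: jwrap carries a value from one iteration to the next,
     * which obscures the access pattern from the compiler. */
    void untuned(void)
    {
        int jwrap = ARRAY_SIZE - 1;
        for (int i = 0; i < ARRAY_SIZE; i++) {
            b[i] = (a[i] + a[jwrap]) * 0.5;
            jwrap = i;
        }
    }

    /* Tuned: peel off the first (wrap-around) iteration; the remaining
     * loop simply references a[i-1]. */
    void tuned(void)
    {
        b[0] = (a[0] + a[ARRAY_SIZE - 1]) * 0.5;
        for (int i = 1; i < ARRAY_SIZE; i++)
            b[i] = (a[i] + a[i - 1]) * 0.5;
    }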


Programming Techniques: Managing the Cache

Original code (matrix multiply):

      DO 10, J = 1, N
        DO 10, I = 1, N
          DO 10, K = 1, N
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
10    CONTINUE

We can modify the code to make better use of the cache by blocking the loops with block size NB:

      DO 10, JB = 1, N, NB
        DO 10, IB = 1, N, NB
          DO 10, KB = 1, N, NB
            DO 10, J = JB, JB + NB - 1
              DO 10, I = IB, IB + NB - 1
                DO 10, K = KB, KB + NB - 1
                  C(I,J) = C(I,J) + A(I,K) * B(K,J)
10    CONTINUE

This is most useful as a simple example of cache blocking. Most compilers will automatically cache-block the original code as part of ordinary optimization.


Loop Optimizations: Advantages

Loop optimizations accomplish three things:

Reduce loop overhead
Increase parallelism
Improve memory reference patterns

Understanding your tools and how they work is critical for using them with peak effectiveness. For performance, a compiler is your best friend.


Recap of Arithmetic Optimization

Replace frequent divisions by multiplications with the inverse.
Multiplications/divisions by integer powers of 2 can be replaced by bit shifts to the left/right (compilers can usually do this).
Small integer exponentials such as a^n should be replaced by repeated multiplications a*a*a*a (compilers will usually do this).
Reorganize (or eliminate) repeated (or useless) operations.
Use Horner's rule to evaluate polynomials.

Example:

A*x^5 + B*x^4 + C*x^3 + D*x^2 + E*x + F can be written as
((((A*x + B)*x + C)*x + D)*x + E)*x + F

This saves more time in C (speed increases by a factor greater than 10) than in Fortran (an improvement of only about 30%), because of the way the C language handles (poorly) the function pow(x,5). A sketch follows.
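A minimal sketch of the two evaluations in C; the naive version calls pow() for each term, while the Horner version uses only multiplies and adds (the function names are illustrative):

    #include <math.h>

    /* Naive evaluation: one pow() call per term. */
    double poly_naive(double A, double B, double C, double D, double E,
                      double F, double x)
    {
        return A * pow(x, 5) + B * pow(x, 4) + C * pow(x, 3)
             + D * pow(x, 2) + E * x + F;
    }

    /* Horner's rule: 5 multiplies and 5 adds, no library calls. */
    double poly_horner(double A, double B, double C, double D, double E,
                       double F, double x)
    {
        return ((((A * x + B) * x + C) * x + D) * x + E) * x + F;
    }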


Compiler Optimizations

Compiler optimization (from Wikipedia, the free encyclopedia):

Compiler optimization is used to improve the efficiency (in terms of running time or resource usage) of the executables output by a compiler.

It allows programmers to write source code in a straightforward manner, expressing their intentions clearly, while allowing the computer to make choices about implementation details that lead to efficient execution.

It may or may not result in executables that are perfectly "optimal" by any measure.

Ref: http://en.wikipedia.org/wiki/Compiler_optimization


Sun Workshop Compiler 6.2

-O          : set the optimization level
-fast       : select a set of flags likely to improve speed
-stackvar   : put local variables on the stack
-xlibmopt   : link optimized math libraries
-xarch      : specify the instruction set architecture
-xchip      : specify the target processor for use by the optimizer
-native     : compile for best performance on the local host
-xprofile   : collect data for a profile, or use a profile to optimize
-fns        : turn on the SPARC nonstandard floating-point mode
-xunroll n  : unroll loops n times


Basic Compiler Techniques: Optimizations

-O   Optimize at the level most likely to give close to the maximum performance for many realistic applications (currently -O3).
-O1  Do only the basic local optimizations (peephole).
-O2  Do basic local and global optimization. This level usually gives minimum code size.
-O3  Adds global optimizations at the function level. In general, this level and -O4 usually result in the minimum code size when used with the -xspace option.
-O4  Adds automatic inlining of functions in the same file. -g suppresses automatic inlining.
-O5  Does the highest level of optimization, suitable only for the small fraction of a program that uses the largest fraction of computer time. Uses optimization algorithms that take more compilation time or that do not have as high a certainty of improving execution time. Optimization at this level is more likely to improve performance if it is done with profile feedback; see -xprofile=collect|use.


Basic Compiler Techniques: Local Variables on the Stack

-stackvar

Tells the compiler to put most variables on the stack rather than statically allocating them.
-stackvar is almost always a good idea, and it is crucial when parallelizing.
You can control stack versus static allocation for each variable.
Variables that appear in DATA, COMMON, SAVE, or EQUIVALENCE statements will be static regardless of whether you specify -stackvar.


Basic Compiler Techniques (Contd.)

-xchip
Specifies the target chip. Specifying the chip lets the compiler know certain implementation details, such as specific instruction timings and the number of functional units.

-xarch
Specifies the target architecture. A target architecture includes the instruction set, but may not include implementation details such as instruction timing.
-xarch=v8plus on Sun produces an executable file that will take full advantage of some UltraSPARC features.

-native
Directs the compiler to produce the best executable (performance) that it can for the system on which the program is being compiled.


Basic Compiler Techniques (Contd.)

-fast

Builds the program with a reasonable level of optimization; it strikes a balance between speed, portability, and safety.
-fast is often a good way to get a first-cut approximation of how fast your program can run with a reasonable level of optimization.
-fast should not be used to build production code:
  The meaning of -fast will often change from one release to another.
  As with -native, -fast may change its meaning on different machines.


Basic Compiler Techniques (Contd.)

-fsimple (simple floating-point model)
Tells the compiler to use a simplified floating-point model, letting the optimizer assume, for example, that the code does not depend on IEEE special values.

-xvector
Vectorization enables the compiler to transform vectorizable loops from scalar to vector form. It is generally faster, but can be slower for short vectors.

-xlibmil
Tells the compiler to inline certain mathematical operations such as floor, ceiling, and complex absolute value.

-xlibmopt
Tells the linker to use an optimized math library. This may produce slightly different answers than the regular math library; these libraries may get their speed by sacrificing accuracy.


Advanced Compiler Techniques

-xcrossfile
Enables the compiler to optimize and inline source code across different files.
It may compile code to be optimal for the set of files that are compiled together.
Produces a very fast executable.

-xpad
Directs the compiler to insert padding (unused space) between adjacent variables in common blocks and local variables, to try to improve cache performance.


Using Your Compiler Effectively: Classical Optimizations

The compiler performs the classical optimizations, plus a number of architecture-specific optimizations:

Copy propagation
Constant folding
Dead code removal
Strength reduction
Induction variable elimination
Common sub-expression elimination


Classical Optimizations (Contd.)

The compiler also performs:

Loop invariant code motion
Induction variable simplification
Register variable detection
Inlining
Loop fusion
Loop unrolling


Classical Optimizations: Copy Propagation and Constant Folding

Copy propagation:

Copy propagation is an optimization that occurs both locally and globally. Given

    x = y
    z = 1.0 + x

the compiler may be able to propagate the copy across the flow graph:

    x = y
    z = 1.0 + y

Constant folding:

A clever compiler can find constants throughout your program, e.g.

      PROGRAM MAIN
      INTEGER I, K
      PARAMETER (I=200)
      K = 200
      J = I + K
      END

Here the compiler can fold I + K into the constant 400 at compile time.


Classical Optimizations: Dead Code Removal and Strength Reduction

Dead code removal:

Dead code comes in two types:
  Instructions that are unreachable.
  Instructions that produce results which are never used.

      PROGRAM MAIN
      I = 2
      WRITE (*,*) I
      STOP
      I = 4
      WRITE (*,*) I
      END

Everything after the STOP is unreachable and can be removed.

Strength reduction:

Operations or expressions have various time costs associated with them, and there are many opportunities for compiler-generated strength reductions, e.g.

    Y = X**2   becomes   Y = X*X
    J = K*2    becomes   J = K + K


Classical Optimizations: Variable Renaming and Common Sub-expression Elimination

Variable renaming:

Example: observe the variable x in the following fragment of code.

    x = y * z
    q = r + x + x
    x = a + b

Renaming gives:

    xx = y * z
    q = r + xx + xx
    x = a + b

Variable renaming is an important technique because it clarifies that the calculations are independent of each other, which increases the number of things that can be done in parallel.

Common sub-expression elimination:

    D = C * (A + B)
    E = (A + B) / 2

becomes

    temp = A + B
    D = C * temp
    E = temp / 2

Different compilers go to different lengths to find common sub-expressions.


Classical Optimizations (Contd.)

Loop invariant code motion:

The compiler will look for every opportunity to move calculations out of a loop and into the surrounding code. Loop invariant code motion is simply the act of moving repeated, unchanging calculations outside the loop.

      DO 10 I = 1, N
        A(I) = B(I) + C*D
        E = G(K)
10    CONTINUE

becomes

      temp = C*D
      DO 10 I = 1, N
        A(I) = B(I) + temp
10    CONTINUE
      E = G(K)

Induction variable simplification:

Loops can contain what are called induction variables.

      DO 10 I = 1, N
        K = I*4 + M
10    CONTINUE

becomes

      K = M
      DO 10 I = 1, N
        K = K + 4
10    CONTINUE


Classical Optimizations: Associative Transformations and Reductions

Example: dot product of two vectors.

      SUM = 0.0
      DO 10 I = 1, N
        SUM = SUM + A(I)*B(I)
10    CONTINUE

The loop is recursive on that single variable: every iteration needs the result of the previous iteration. Because the assignment is made to a scalar, unrolling isn't as straightforward as before; the obvious way is to calculate several iterations at a time:

      SUM = 0.0
      DO 10 I = 1, N, 4
        SUM = SUM + A(I)*B(I) + A(I+1)*B(I+1)
     &            + A(I+2)*B(I+2) + A(I+3)*B(I+3)
10    CONTINUE


Classical Optimizations: Dependence Analysis

Dependence analysis is a technique whereby the syntactic constructs of a program are analyzed with the aim of determining whether certain values may depend on other previously computed values. The real objective of dependence analysis is to determine whether two statements are independent of each other.

Example:
    S1: A = C - A
    S2: A = B + C
    S3: B = A + C

DOALL transformation: this transformation converts every iteration of a loop into a process that is independent of all the others. It assumes that there are no loop-carried dependencies. The DOALL transformation is very efficient if it can be applied; however, many loops carry dependencies.


Classical Optimizations: Register Variable Detection

On many CISC processors there were few general-purpose registers. On RISC designs there are many more registers to choose from, and everything has to be brought into a register anyway, so all variables will be register resident at some point.

The new challenge is to determine which variables should live the greater portion of their lives in registers.

The compiler performs the classical optimizations, plus a number of architecture-specific optimizations.


Classical Optimizations: Inlining, Loop Fusion, Induction Values

Inlining:
Inlining is the substitution of the body of a subprogram for the call of that subprogram. This eliminates function call overhead. To enable inlining by the Sun compilers, use -fast or -xO4, e.g.
    f77 -fast a.f

Loop fusion:
Loop fusion is the process of fusing two adjacent loops with the same loop bounds, which is usually a good thing.

Induction values:
Induction values are values that can be computed as a function of the loop count variable and possibly other values.


Parallel Programming: Compilation Switches for Automatic and Directive-based Parallelization

The switches -xautopar, -xexplicitpar and -xparallel tell the compiler to parallelize your program.

-xautopar: tells the compiler to do only those parallelizations that it can do automatically.
-xexplicitpar: tells the compiler to do only those parallelizations that you have directed it to do with pragmas in the source.
-xparallel: tells the compiler to parallelize both automatically and under pragma control.
-xreduction: tells the compiler that it may parallelize reduction loops. A reduction loop is a loop that produces output with a smaller dimension than the input.


Parallel Programming Compiler Switches: Remarks

In some cases, parallelizing a reduction loop can give different answers depending on the number of processors on which the loop is run.
Compiler directives can usually overcome artificial barriers to parallelization.
Compiler directives can also override legitimate barriers to parallelization, which introduces errors.
The efficiency and effectiveness of automatic compiler parallelization can be significantly improved by supplying the appropriate switches.


Use of Math Libraries

BLAS, IMSL, NAG, LINPACK, LAPACK, ScaLAPACK, etc.

Calls to these math libraries can often simplify coding.
They are portable across different platforms.
They are usually fine-tuned to the specific hardware as well as to the sizes of the array variables that are passed to them.

Examples: Sun Performance Library (-xlic_lib=sunperf), IBM ESSL and ESSLSMP. A call sketch follows.
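A minimal sketch of calling a tuned BLAS routine instead of hand-coding a dot product, assuming a CBLAS interface is available; the header name and the way the library is linked depend on the platform and are not taken from the slide:

    #include <cblas.h>   /* assumed CBLAS header; link against the vendor BLAS */

    #define N 100000

    double x[N], y[N];

    double dot(void)
    {
        /* cblas_ddot(n, x, incx, y, incy): the tuned library routine
         * replaces the hand-written SUM = SUM + X(I)*Y(I) loop. */
        return cblas_ddot(N, x, 1, y, 1);
    }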


Performance of a Selected Application: CFD

Optimization of the unsteady-state 3-D compressible Navier-Stokes equations, solved by a finite difference method.

Computing system used: Sun UltraSPARC workstation (each node is a quad-CPU Ultra Enterprise 450 server operating at 300 MHz).

Grid size    Iterations and settings                              Time in seconds
192*16*16    1000 (no compiler options)                           4930
192*16*16    1000 (with compiler optimization)                    2620
192*16*16    1000 (code restructuring and compiler optimization)  680

Conclusion: restructuring the code and using the proper compiler optimizations reduces the execution time by a factor of 8.0.


Systems used for the benchmarks that follow:

IBM p630:
4-way SMP, POWER4 at 1.0 GHz
8 GB main memory (16 GB max)
AIX 5.1 and PPC Linux
XL F77, F90, C, C++
Performance libraries: BLAS 1, 2, 3, BLACS, ESSL

IBM p690:
32-way SMP, POWER4 at 1.1 GHz
64 GB main memory (256 GB max)
AIX 5.1 and PPC Linux
XL F77, F90, C, C++
Performance libraries: BLAS 1, 2, 3, BLACS, ESSL


LLCBench: Performance on IBM p630 (benchmark chart not reproduced in this transcript)


LLCBench: Performance on IBM p690 (benchmark chart not reproduced in this transcript)


Conclusions

Reducing memory overheads is important for the performance of sequential and parallel programs.
Minimization of memory traffic is the single most important goal.
For multi-dimensional arrays, access will be fastest if you iterate on the array subscript offering the smallest stride or step size.
Data reuse in the memory sub-system increases performance.
Basic and advanced compiler optimization flags can be used to improve performance.
Write code so that the compiler finds it easy to locate optimizations.
The compiler performs the classical optimization techniques and some loop optimization techniques.


References

1. Ernst L. Leiss, Parallel and Vector Computing: A Practical Introduction, McGraw-Hill Series on Computer Engineering, New York (1995).
2. Albert Y. H. Zomaya, Parallel and Distributed Computing Handbook, McGraw-Hill Series on Computer Engineering, New York (1996).
3. Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin/Cummings, Redwood City, CA (1994).
4. William Gropp, Rusty Lusk, Tuning MPI Applications for Peak Performance, Pittsburgh (1996).
5. Ian T. Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Addison-Wesley (1995).
6. Kai Hwang, Zhiwei Xu, Scalable Parallel Computing (Technology, Architecture, Programming), McGraw-Hill, New York (1997).
7. David E. Culler, Jaswinder Pal Singh, with Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers (1999).
