Compiler Techniques for Single Processor Tuning

TM

Compiler Techniques for Compiler Techniques for Single Processor Tuning Single Processor Tuning

An introduction

TM

OutlineOutline Compiler is the primary tool of computer program optimization:

• structure of the compiler and the compilation process

• compiler optimizations

• steering of the compilation process - compiler options

• structure of the run time libraries and scientific libraries

• computational domain and computation accuracy

TM

Compiler manages processor resources:

• registers

• integer/floating-point execution units

• load/store/prefetch for data flow in/out of processor

• The implementation details of processor and system architecture are build into the compiler

User Program (C/C++/Fortran, etc.)– high level representation

– low level representation Machine instructions

The CompilerThe Compiler

Solving:–data dependencies–control flow dependencies–parallelization–compactification of code–optimal scheduling of the code

Compilationprocess

Intermediate representation

TM

MIPSpro Compiler ComponentsMIPSpro Compiler Components

• There are no source-to-source optimizers or parallelizers

• source code is translated to WHIRL (Winning Hierarchical Intermediate Representation Language); – same IR for different levels of representation – whirl2f and whirl2c translates back into Fortran or C from IRs

• Inter-Procedural analyser requires final translation at link time

F77/f90 cc/CCdriver

Macropre-

processor

Front-end

(source to WHIRLformat) Inter-

ProceduralAnalyser

Loopnest

optimizer

Paralleloptimizer

LinkerInter-

ProceduralAnalyser

source

Globaloptimizer

Codegenerator

Executable object

TM

Compiler OptimizationsCompiler Optimizations• Global optimizer:

– dead code elimination– copy propagation– loop normalisation

• stride one loops• single induction variable

– memory alias analysis– strength reduction

• Inter-Procedural Analyser:– cross-file function inlining– dead function elimination– dead variable elimination– padding of variables in common blocks– inter-procedural constant propagation

• Automatic parallelizer– loop level work distribution

• Loop Nest optimizer:– loop unrolling (outer)– loop interchange– loop fusion/fission– loop blocking– memory prefetch– padding local variables

• Code Generator:– software pipelining– inner loop unrolling– if-conversion– read/write optimization– recurrence breaking– instruction scheduling

inside basic blocks

TM

SGI Architecture, ABI, LanguagesSGI Architecture, ABI, Languages• Instruction Set Architecture (ISA):

– -mips4 (R1x000, R8000, R5000 processors)– -mips3 (R4400)– -mips[1|2] (R3000, R4000 processors, invokes old ucode compiler)

• ABI (Application Binary Interface):– -n32 (32 bit pointers, 4 byte integers, 4 byte real)– -64 (64 bit pointers, 4 byte integers, 4 byte real)

• Languages:– C– C++– Fortran 77– Fortran 90

Variable C size[bit] F size[bit]-n32 -64 -n32 -64

char/character 8 8 8 8short 16 16 int/integer 32 32 32 32long 32 64long long 64 64logical 32 32float/real 32 32 32 32double 64 64 64 64pointer 32 64

TM

Options: ABI & ISAOptions: ABI & ISA Option Functionality

-n32 invoke the MIPSpro Compiler, use 32 bit addressing -64 invoke the MIPSpro Compiler, use 64 bit addressing -o32/-32 invoke the old ucode compiler, 32 bit addressing -mips[1234] ISA; -mips[12] implies ucode compiler

There are two more ways to define the ABI and ISA:• environment variable “SGI_ABI” can be set to -n32 or -64

• the ABI/ISA/Processor/optimization can be set in a file ~/compiler.defaults or /etc/compiler.defaults. In addition, the location of the file can be defined by “COMPILER_DEFAULTS_PATH” environment variable. The file should contain a line like: -DEFAULT:abi=n32:isa=mips4:proc=r10000:arith=3:opt=O3

There is a way to find which compiler flags were used:

dwarfdump -i file.o | grep DW_AT_producer

TM

Optimization LevelsOptimization Levels Compilation speed degrades with higher optimization• -O0 turn off all optimizations

• -O1 only local optimizations

• -O2 or -O extensive but conservative optimizations

• -O3 aggressive optimizations, LNO, software pipelining

• -ipa inter-procedural analysis (only at -O2 and -O3)

• -apo automatic parallelization option (same as -pfa)

• -g[0|3] debugging switch: -g0 forces -O0-g3 to debug with -O3

TM

Options: PerformanceOptions: Performance Option Functionality -r10000 Generate optimal instruction schedule for the R10000 proc -r8000 Generate optimal instruction schedule for the R8000 proc

-O[0|1|2|3] Set optimization Level to 0, 1, 2, 3

-Ofast=[ipXX] Select best optimization for the given architecture

-mp Enable multi-processing directives -mpio Support I/O from a parallel region -apo Invoke automatic parallelization option

XX machine (output of the hinv -c processor command)27 Origin2000 (all cpu frequencies and cache sizes)35 Origin3000 (all cpu frequencies and cache sizes)

optimizations may differ on the version of the compiler. Currently:

-O3 -IPA -TARG:platform=ip27 -n32 -OPT:Olimit=0:roundoff=3:div_split=ON:alias=typed

(thus -Ofast switch invokes the Interprocedural Analyser)

TM

Options: PortingOptions: Porting Option Functionality -d8/d16 Double precision variables as 8 or 16 bytes -r8 Convert REAL to REAL*8 and COMPLEX to COMPLEX*16 (1)

-i8 Convert INTEGER to INTEGER*8 and LOGICAL to 8 byte sizes (1)

-static Local variables will be initialised in fixed locations on the heap -col[72|120] Source line is 72 or 120 columns

-Dname Define name for the pre-processor -Idir Define include directory dir -alignN Assume alignment on the N=8,16,32,64,128 bit boundary

-version Show compiler version -show Put the compiler in verbose mode: all switches are displayed

(1) Note: explicit sizes are preserved, i.e. REAL*4 remains 32 bit

TM

Options: DebuggingOptions: Debugging Option Functionality -g Disable optimization and keep all symbol tables -DEBUG: the DEBUG group option (man DEBUG_GROUP):• check_div=n n=1 (default) check integer divide by zero

n=2 check integer overflow n=3 check integer divide by zero and overflow• subscript_check (default ON) to check for subscripts out of range

C/C++: produces trap #8 f77: aborts run and dumps core f90: aborts run if setenv F90_BOUNDS_CHECK_ABORT• verbose_runtime(default OFF) to give source line number of failures• trap_uninitialized (default OFF) initialise all variables to 0xFFFA5A5

when used as pointer - access violation when used as fp values - NaN causes fp trap Example: f77 -n32 -mips4 -g file.f \ -DEBUG:subscript_check:verbose_runtime=ON \ -DEBUG:check_div=3 -DEBUG:trap_uninitialized=ON

TM

Compilation ExamplesCompilation Examples 1. Produce executable a.out with default compilation options: f77 source.f cc source.f

be aware of the defaults setting (e.g. /etc/compiler.defaults ) same flags for Fortran and C

2. Options for debugging: f77/cc -o prog -n32 -g -static source.f

2. Explicit setting of ABI/ISA/Processor, highest opt:

f77/cc -o prog -n32 -mips4 -r10000 -O3 source.f

3. Detailed control of the optimization process with the group options : f77/cc -o prog -64 -mips4 -O3 -Ofast=ip27 -OPT:round=3:IEEE_arith=3 -IPA:dfe=on ...

TM

Compiler performs many sophisticated optimizations on the source code under certain assumption about the program. Typically:• program data is large (does not fit into the cache)• program does not violate language standard• program is insensitive to roundoff errors• all data in the program is alias-ed, unless it can prove otherwise

if one or more of these assumptions does not hold, compiler should be tuned to the program with the compiler options. Most important:• OPT for general optimizations assumptions• LNO for the Loop Nest optimizer options• IPA for the Inter-Procedural Analyser options

Additional options that help to tune the compiler properly:• TENV, TARG for the target machine and environment description• LIST, DEBUG for the listing and debugging options

Fine Tuning Compiler ActionsFine Tuning Compiler Actions

TM

Group OptionsGroup Options Compiler options can be set with the key=value expressions on the command line. These options are combined in logical groups. Multiple key=val expressions are colon separated; same group headings can be specified several times, the effects are cumulative:

E.g.: -OPT:roundoff=2:alias=restrict -OPT:IEEE_arithmetic=3 etc.

Group Heading Reference page Usage comments -OPT:key=val cc(1) f77(1) opt(5) optimizations -TENV:key=val cc(1) f77(1) Control target environment -TARG:key=val cc(1) f77(1) Control target architecture -FLIST/CLIST cc(1) f77(1) Listing control -LIST:key=val cc(1) f77(1) Options to control listing -DEBUG:key=val debug_group(5) Debugging options -IPA:key=val ipa(5) Inter-Procedural Analyser control -INLINE:key=val ipa(5) Procedure inliner control -LNO:key=val lno(5) Loop Nest optimizer control -MP:key=val cc(1) f77(1) parallelization control -LANG: cc(1) f77(1) language compatibility features

TM

Compiler man PagesCompiler man Pages• Primary man pages:

man f77(1) f90(1) cc(1) CC(1) ld(1)• some of the compiler option groups are rather large and deserve their

own man pages

man opt(5)

man lno(5)

man ipa(5)

man DEBUG_GROUP(5)

man mp(3F)

man pe_environ(5)

man sigfpe(3C)

TM

The Run-Time Library StructureThe Run-Time Library Structure

/usr

lib/

lib32/

lib64/

*.a, *.so

ucode-compiler

nonshared/*.a

*.a, *.so

Cmplrs/mongoose-compiler

nonshared/*.a

mips3

mips4

*.a, *.sononshared/*.a


*.a, *.so

Cmplrs/mongoose-compiler

nonshared/*.a

mips3

mips4



n32

n64

o32

Environment variables:LD_LIBRARY_PATHLD_LIBRARY32_PATHLD_LIBRARY64_PATH

TM

The Scientific LibrariesThe Scientific Libraries Standard scientific libraries containing:• Basic Linear Algebra operations and algorithms:

– BLAS1, BLAS2, BLAS3 (see man intro_blas1,_blas2,_blas3)– LAPACK (see man intro_lapack)

• Fast Fourier Transformations (FFT):– 1D, 2D, 3D, multiple 1D transformations (see man intro_fft)

• Convolutions (Signal Processing, e.g. man SIIR2D)• Sparse Solvers (see man solvers; man PSLDLT)

To use: – -lscs serial versions– -lscs_mp -mp for parallel versions– man intro_scsl for detailed description– -lcomplib.sgimath or -lcomplib.sgimath_mp for older versions– man complib.sgimath for detailed desctiption

TM

Computational DomainComputational Domain Range of numbers (from /usr/include/limits.h): FLT_DIG 6 /* decimal digits of precision of a float */ FLT_MAX 3.40282347E+38F FLT_MIN 1.17549435E-38F

DBL_DIG 15 /* decimal digits of precision of a double */ DBL_MAX 1.7976931348623157E+308 DBL_MIN 2.2250738585072014E-308

LONGLONG_MIN -9223372036854775807LL-1LL LONGLONG_MAX 9223372036854775807LL ULONGLONG_MAX 18446744073709551615LLU

The extended precision (REAL*16) is available and supported by the compiler. But this mode of calculation is slow (by factor ~40)

TM

Underflow and Denormal NumbersUnderflow and Denormal Numbers When de-normalised numbers emerge in a computation

(i.e. numbers x<DBL_MIN) they are flushed to zero by default:

will print zero. To force IEEE-754 gradual underflow it is necessary to manipulate status register on the R1x000 cpu. Calling no_flush at the beginning of the program will print 0.22250738585072014D-308 Flush-to-zero property can lead to x-y=0, while xy . Keeping de-normalised numbers in computations will avoid that condition, but will cause fp exception, that must be processed in software.

It is a performance issue - not to manipulate the de-normalized numbers in calculations.

Program denormreal*8 a,b

a = 2.2250738585072014D-308b = a/10.0D0write(6,10) bend

#include <sys/fpu.h>void no_flush_(){

union fpc_csr f;f.fc_word = get_fpc_csr();f.fc_struct.flush = 0;set_fpc_csr(f.fc_word);

}

TM

Overflow ExampleOverflow Example Program example that generates overflows and underflows:

Output with all exceptions ignored by default:

setenv TRAP_FPE „UNDERFL=TRACE; OVERFL=TRACE“ will trap at Overflow and Underflow and produce traceback (Link -lfpe).

Parameter (N=20)INCLUDE “/usr/include/limits.h”Real*8 A(N),B(N)complex*16 C(N)

do I=1,N A(I) = (FLT_MAX/10)*I ! single precision range B(I) = (FLT_MIN*10)/I ! will fit into doubleenddo

C = CMPLX(A,B) ! Standard requires passing from base precision: real*4 !

write (0,’(I3,2(2G22.15/))’) (I,A(I),B(I),C(I),I=1,N)

10 0.340282347000000E+39 0.117549435000000E-37 A,B0.340282346638529E+39 0.117549435082229E-37 Cr,Ci

11 0.374310581700000E+39 0.106863122727273E-37 Infinity 0.000000000000000

12 etc…

Compile with: f77 -n32 -mips4 -O3Note: Compilation with -r8 avoids the error.

Flush to zero!Overflow!

TM

Floating Point ExceptionsFloating Point Exceptions A fp status register flag is set when fpu is has an illegal condition:• division by zero• overflow• underflow• invalid• inexact

By default, all exceptions are ignored! (e.g. for 1/0 NaN value is set and execution continues)

The status register can be programmed to raise a Floating Point Exception. If an FPE occurs, the system can take a specified action:• abort• ignore the exception• repair the illegal condition

You can manipulate the status register to select action:• with calls to the FPE library, link with -lfpe• with environment variable TRAP_FPE

see man handle_sigfpes

TM

Compiler-Generated ExceptionsCompiler-Generated Exceptions compiler can do more optimizations if it is allowed to generate code that could cause exceptions (-TENV:X=0..4) X = 0 no speculative code motion X = 1 IEEE-754 underflow and inexact FPE disabled (default -O0 and -O2) X = 2 all IEEE-754 exceptions disabled except 1/0 (default -O3) X = 3 all IEEE-754 exceptions are disabled X = 4 memory access exceptions are disabled

IF-conversion with conditional moves for Software Pipelining (with -O3):

Removing IF(…) will cause divide by zero!In this case this exception must be ignored

Do i=1,Nif(a(i) .lt. eps) then

x = x + 1/epselse

x = x + 1/a(i)endifenddo

#put eps in $f1 and (1/eps) in $f0

ldc1 $f5,-8($2) load a(i)c.lt.d $fcc0,$f5,$f1 if(a(i) < eps)recip.d $f2,$f5 1/a(i)movt.d $f2,$f0,$fcc0 y = 1/a or 1/eps add.d $f1,$f1,$f2 x = x+ y

Note: transf applied already at -O3 & X=1!

TM

IEEE_754 Compliance IEEE_754 Compliance The MIPS4 instruction set contains IEEE-754 non-compliant instructions:• recip.s/d (reciprocal 1/x) instruction is accurate to 1 ulp

• rsqrt.s/d (reciprocal-sqrt: 1/sqrt(x)) instruction to 2 ulp

-OPT:IEEE_arithmetic=X specify degree of non-compliance and what to do with inf and NaN operands X=1 strict IEEE-754 compliance; does not use the recip and rsqrt instructions (-O1,2)

X=2 optimize 0*x=0 and x/x=1 while x can be NaN (default at -O3)

X=3 any mathematically valid transformation is allowed, including recip & rsqrt instr.

Do i=1,n x = x + a(i)/yenddo

y_tmp = 1/ydo i=1,n x = x + a(I)*y_tmpenddo

21 cycles/iteration; 4% peak1 cycles/iteration; 100% peak

-O3 -OPT:IEEE_arithmetic=3

Note: X=3is required!

TM

Rounding AccuracyRounding Accuracy Rounding mode can be specified with -OPT:roundoff=X switch: X = 0 no optimizations that affect fp behaviour (default at -O1 -O2)

X = 1 allows simple transformations with limited round-off and overflow differences

X = 2 allows reordering of reduction loops (default at -O3)

X = 3 any mathematically valid transformation is allowed

do i=1,n x = x + a(i)enddo

do i=1,n,8 x0 = x0 + a(i) x1 = x1 + a(i+1) …enddox = x0 + x1 + …

-O3 -OPT:roundoff=2(default at -O3)

With -O3 -OPT:roundoff=12 cycles/iter; 25% peak 1 cycles/iter; 50% peak

Recommendation: Your program should work correctly when compiled with -O3 -OPT:IEEE_arithmetic=3:roundoff=3

TM

SummarySummary

Compiler is the primary tool of program optimization

• compilation is the process of lowering the code representation from high level to low, I.e. processor level

• the MipsPro compiler targets the MIPS R1x000 processor and has build in the features of the processor and Origin2000 architecture

• A large number of options exist to steer the compilation process– ABI, ISA and optimization options selections

– setting of assumptions about the program behaviour

• there are optimized and parallelized libraries of subroutines for scientific computation

• when programming for a digital computer, it is important to remember the limitations due to limited validity range of the floating point calculations

Compiler Techniques for Single Processor Tuning

Documents

Transcript of Compiler Techniques for Single Processor Tuning