EFFECTIVE COMPILER UTILIZATION
Linux Supercluster User's Conference · September 12, 2000 · Albuquerque, NM
Eric Stoltz, PhD – Senior Compiler Engineer, The Portland Group, Inc. (PGI)
[email protected] · http://www.pgroup.com
Tutorial Overview
• The PGI compiler
• Vector parallelization
• Multiple processor parallelization
• Additional optimization techniques
The PGI Compilers: PGF77, PGF90, PGCC, PGC++, PGHPF
Optimizing core: global optimization, auto-parallelization, OpenMP parallelization, interprocedural opts, profile feedback, function inlining, loop unrolling, vectorization, pipelining
Targets (existing and in development): IA-32, IA-64, SPARC, DSPs
Other platforms (F77/F90): Compaq Tru64, CRAY T3E, HP Exemplar, IBM SP, NEC SX, SGI Origin, Sun E10K
PGI Linux/IA32 Fortran/C/C++
• Fortran 90 – robust, commercially-supported, current Fortran
• Native full HPF – high-level data parallel programming
• EDG 2.40 C++ – state-of-the-art C++ language support
• Auto-parallel for multi-CPU SMP nodes – F77, F90, C, C++
• Native OpenMP for multi-CPU SMP nodes – F77, F90, C, C++
• Pentium II/III optimizations – including Pentium III SSE
• Pentium III/Athlon prefetch – 32-bit and 64-bit vector data
• RISC/UNIX compatibility – byte-swapping I/O, symbols, options
• GNU interoperability – can cross-link gcc/g77 objects/libraries
• PGDBG graphical debugger and PGPROF graphical profiler
• Infrastructure – NAG, MPI-Pro, TotalView, VAMPIR, ISVs
Important Compiler Options
-O2           Scheduling, register allocation, global optimization
-tp p6        Pentium II/III specific optimizations/instructions
-Munroll      Unroll loops
-fast         Equivalent to "-O2 -Munroll -Mnoframe"
-Minline      Inline functions and subroutines
-Mconcur      Try to auto-parallelize loops for SMP systems
-mp           Process OpenMP/SGI directives and pragmas
-Mvect=sse    Try to use Pentium III SSE instructions
-byteswapio   Swap bytes from big-endian to little-endian or vice versa upon input/output of unformatted data files
-Minfo        Compile-time optimization/parallelization messages
-Mprof        Enable function or line-level profiling
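The transformation behind -byteswapio is easy to picture in code. A minimal sketch (the function name swap32 is ours, not a PGI routine): each 4-byte element of an unformatted record has its byte order reversed on the way in or out.

```c
#include <stdint.h>

/* Reverse the byte order of one 32-bit word -- the per-element
 * operation -byteswapio applies to unformatted data so files
 * written on big-endian RISC systems read correctly on IA-32. */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) <<  8) |
           ((v & 0x00FF0000u) >>  8) |
           ((v & 0xFF000000u) >> 24);
}
```

Applying the swap twice is the identity, which is why the same option works for both reading and writing.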
Important Compiler Options (cont)
-pc {32|64|80}, -Kieee       Set precision of ops on FP stack, strict IEEE
-Mcache_align                Align unconstrained objects >= 16 bytes to a cache line boundary
-Mneginfo={concur|loop}      Display messages that indicate why loops are not parallelized or vectorized
-Msecond_underscore, -g77libs    g77 compatibility
-Mx,4,4                      Use 3 FP stack locations as global registers (default is to use only 1); try it and see if it helps
Using the -Kieee and -pc Options
-pc {32 | 64 | 80}
Perform FP operations on the floating-point stack using the specified number of bits of precision; the default is 80, with values rounded when stored back to memory
-Kieee
Perform FP ops in strict compliance with the IEEE 754 standard; disables copy propagation and reciprocal division, and makes function calls to perform all transcendental operations
Generally, -pc 64, or -pc 64 in combination with -Kieee, produces arithmetic results that match most RISC/UNIX systems
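Why intermediate precision matters can be seen without any compiler flags at all. A small illustrative example (not PGI-specific; function names are ours): the width of the significand decides which values round, and the 80-bit x87 stack carries 64 significand bits, more than either the 24-bit float or 53-bit double memory formats.

```c
/* The significand width decides which integers are exact:
 * float (24 bits) cannot hold 2^24 + 1, double (53 bits) can.
 * The x87 stack at -pc 80 keeps 64 significand bits, so
 * intermediates can be more precise than either memory format --
 * the source of cross-platform differences that -pc 64 and
 * -Kieee remove. */
float  f_val(void) { return (float)16777217.0; }  /* rounds to 2^24 */
double d_val(void) { return 16777217.0; }         /* exact in double */
```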
Vector Parallelization
• PIII SSE
• Other processors: AMD 3D-Now! instructions, Pentium 4, AMD x86-64
Pentium III Streaming SIMD Enhancements (SSE)
• Eight 128-bit registers + new instructions
• Length-4 vector regs/ops for 32-bit FP data
• Instructions include add, subtract, multiply, divide, sqrt, max, min, mov, conversion; still no multiply/accumulate
• PGI compilers utilize SSE via vectorization – idioms and inline SSE code generation
• Some Linux distributions are not SSE-enabled; see http://sourceware.cygnus.com/gdb/papers/linux/linux-sse.html
SSE Registers
Current SSE Constraints
• Single precision
• Stride-1 references
• 16-byte alignment
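A loop that satisfies all three constraints looks like the following sketch (the function name saxpy is ours): 32-bit floats, unit stride, no dependences between iterations, so the vectorizer can pack four elements per SSE operation.

```c
/* A candidate for -Mvect=sse: 32-bit FP data, stride-1 references,
 * and a countable trip count with independent iterations.  With
 * -Mcache_align the arrays can also meet the 16-byte alignment
 * constraint, letting the compiler emit 4-wide packed operations. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)      /* stride-1, no dependences */
        y[i] = a * x[i] + y[i];
}
```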
Using the -Mvect=sse Option
Candidate vector loops operate on 32-bit FP data:

% pgcc -fast -Mvect=sse -Mcache_align -Minfo matmul32.c
main:
     29, Interchange produces reordered loop nest: 30, 29
         Loop unrolled 10 times
     30, Outer loop unroll (4 times) and jam
     34, Interchange produces reordered loop nest: 35, 34
         Loop unrolled 10 times
     35, Outer loop unroll (4 times) and jam
     46, Outer loop unroll (4 times) and jam
     47, Loop unrolled 10 times
     53, Loop unrolled 10 times
     58, Call to __pgi_adotp4 generated
Linking:
%
32-bit Matrix Multiply Performance
PIII 550, 512KB, 100MHz bus

Size        MFLOPs SSE    MFLOPs No SSE
64x64       317           203
256x256     167           108 - 150
1024x1024   161           58
Other Processors
• AMD – 3D Now!, overlays FP stack
• Pentium 4 – can vectorize 64-bit doubles
• AMD x86-64 – will support SSE, add 8 new integer registers, and 16 128-bit SSE registers. More info: www.x86-64.org
SMP and Distributed Memory Parallelization with Multiple Processors
SMP
• Low-level thread calls
• Lightweight is good – Linux so-so
• Scaling presents a problem
• Threads = number of processors?
Low-Overhead Parallel Regions
SGI, Sun, Compaq Results from Measuring Synchronization and Scheduling Overheads in OpenMP, J.M. Bull, EPCC, http://www.epcc.ed.ac.uk/research/openmpbench;
PGI Results from runs of EPCC bmk on Quad-PIII 550 node at OSC (Linux 2.2.12)
PARALLEL Region Startup/Shutdown on 2 CPUs
[Chart: overhead in cycles (0 to 7000) for SGI/MIPS, KAI/SPARC, Compaq/Alpha, and PGI/Linux]
Using the -Mconcur Option
-Mconcur[=option[,option]] where option is:
altcode:n Instructs the parallelizer to generate alternate scalar code for parallelized loops
noaltcode Do not generate alternate scalar code
dist:block Parallelize with block distribution (default)
dist:cyclic Parallelize with cyclic distribution
cncall Loops containing calls are candidates for parallelization
noassoc Disables parallelization of loops with reductions
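The reason noassoc exists can be shown in two lines: parallelizing a reduction reorders (reassociates) the additions, and FP addition is not associative. A minimal illustration (function names are ours):

```c
/* FP addition is not associative, so a parallelized (reassociated)
 * sum can change the result.  With 53-bit doubles, whether the
 * small addends survive depends entirely on grouping: */
double grouped_left(void)  { return (1.0e16 + 1.0) + 1.0; }  /* each add rounds away the 1.0 */
double grouped_right(void) { return 1.0e16 + (1.0 + 1.0); }  /* the 2.0 survives */
```

-Mconcur=noassoc keeps such loops serial so results stay bit-identical to the sequential run.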
What is automatically parallelized?
• Countable loops
• No calls or conditionals
• No loop-carried dependences
• Automation must be conservative
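The criteria above separate loops like the following pair (function names are ours, chosen for illustration):

```c
/* Countable, no calls, no loop-carried dependence:
 * a candidate for -Mconcur auto-parallelization. */
void scale(int n, double *a, double s)
{
    for (int i = 0; i < n; i++)
        a[i] = s * a[i];
}

/* Loop-carried dependence: a[i] needs a[i-1] from the previous
 * iteration, so a conservative auto-parallelizer must run it serially. */
void prefix_sum(int n, double *a)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}
```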
Auto-Parallel LINPACK 100
Quad PIII 550, 512KB Cache, Linux 2.2.12
Compiler options: -fast –Mconcur -Minline=saxpy,sscal
Procs = 0 is the non-parallel version on 1 CPU (no –Mconcur)
Results courtesy of Ohio Supercomputer Center
Processors   MFLOPs
0            123
1            120
2            185
3            229
4            263
Using the -mp Option
• PGI supports full OpenMP in F77, F90, C, and C++, and is thus OpenMP compliant
• Also supports SGI C$DOACROSS in Fortran and SGI pragmas in C/C++
• Use –Mnosgimp to ignore SGI, and –Mnoopenmp to ignore OpenMP directives in programs that have both (e.g. MM5)
OpenMP directives/pragmas
• Sentinel-based way to direct SMP parallel code (!$OMP or #pragma omp)
• C/C++ current version 1.0, October 98
• Fortran current version 1.1, September 99
• Fortran 2.0 version out for public comment
• More information: www.openmp.org
for (l=1; l<=NTIMES; l++) {
  #pragma omp parallel private (j,i,ii,k)
  {
    #pragma omp for
    for (j=0; j<p; j++) {
      for (i=0; i<m; i++) {
        c[j][i] = 0.0;
      }
    }
    for (i=0; i<m; i++) {
      #pragma omp for
      for (ii=0; ii<n; ii++) {
        arow[ii] = a[ii][i];
      }
      #pragma omp for
      for (j=0; j<p; j++) {
        for (k=0; k<n; k++) {
          c[j][i] = c[j][i] + arow[k] * b[j][k];
        }
      }
    }
  }
  (void) dummy(c);
}
Programming in OpenMP
• Concept of master thread
• Scoping of variables (private, shared)
• Private within local stack space of each processor (off the stack pointer)
• Shared in global memory (off frame pointer)
Key OMP constructs
• firstprivate
• lastprivate
• threadprivate
• shared
• private
• Loop iterators private by default
• Many others (barrier, critical sections)
• Task parallelism available using sections
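One construct worth showing alongside the list is reduction, since it combines private and shared scoping in one clause. A sketch (function name is ours; compiled without OpenMP support the pragma is simply ignored and the loop runs serially with the same result):

```c
/* OpenMP reduction: each thread accumulates into a private copy of
 * sum, and the partial sums are combined at the join point.  The
 * loop iterator is private by default. */
double dot(int n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```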
main(){
int i, a[100];
#pragma omp parallel for
for(i=0; i< 100; i++)
a[i] = i;
}
/* identical semantics */
main(){
int i, a[100];
#pragma omp parallel private(i)
{
#pragma omp for
for(i=0; i< 100; i++)
a[i] = i;
}
}
pgcc -mp -Minfo par.c
main:
5, Parallel region activated
6, Parallel loop activated; static block iteration allocation
7, Parallel region terminated
EPCC OpenMP Microbenchmarks
Results from runs of EPCC Microbenchmarks on Quad-PIII 550 node at OSC (Linux 2.2.12)
[Chart: overhead in cycles per OpenMP construct (parallel do, barrier, single, critical, lock/unlock, atomic, reduction) on 1, 2, 3, and 4 CPUs; range 0 to 4000 cycles]
LS-DYNA on Linux/IA32 - OpenMP
• LS-DYNA = over 750,000 lines of Fortran and C source
• 1.74X speedup on 2 CPUs, 2.74X speedup on 4 CPUs
• Compiled using PGI Workstation Fortran and OpenMP
• 111,029 elements; 4-CPU 500 MHz PIII Xeon
• 1 CPU time = 119955 seconds; 2 CPU time = 69113 seconds; 4 CPU time = 43864 seconds
[Chart: LS-DYNA speed-up on 1, 2, and 4 CPUs]
Distributed Parallelism
• Linux clusters MIMD by definition
• Must use communication between nodes
• MPI – do-it-yourself message passing; must link in MPI libraries
• HPF – High Performance Fortran, puts in communication for you
MPI Examples
• PGI CDK uses MPI-CH
• mpirun script sends copies of executable to each node
• Sample program, 'Who am I?'
#include <unistd.h>
#include "mpi.h"
main(int argc, char **argv){
int len,ierr;
char hname[32];
len = 32;
MPI_Init( &argc, &argv );
gethostname(hname,len);
printf("My name is %s\n",hname);
MPI_Finalize( );
}
pgcc myname.c -lmpich -o myname
Linking:
mpirun -nolocal -np 4 myname
My name is bigfoot3.pgroup.com
My name is bigfoot2.pgroup.com
My name is bigfoot4.pgroup.com
My name is bigfoot1.pgroup.com
High Performance Fortran (HPF)
• Set of Fortran extensions expressing parallel execution at a high level
• Data-to-processor mapping done in two stages, using the DISTRIBUTE and ALIGN operations
2-Level Parallelism
• Distribute work across nodes
• Split up work within each node
• Need a programming paradigm
• Must determine number of processors within each node
4-node cluster at PGI named BIGFOOT
• Master node 500 MHz Celeron, bigfoot0
• 4 compute nodes, dual 550 MHz PIIIs with 256 MB of memory, named bigfoot1 - bigfoot4
• Connected with fast ethernet
#include <unistd.h>
#include "mpi.h"
#include <omp.h>
main(int argc, char **argv){
int len = 32, ierr;
char hname[32];
omp_set_num_threads( omp_get_num_procs() );
MPI_Init( &argc, &argv );
gethostname(hname,len);
printf("My name is %s\n",hname);
#pragma omp parallel
{
printf("...thread number %d\n", omp_get_thread_num() );
}
MPI_Finalize( );
}
[stoltz@bigfoot]$ pgcc -mp -Minfo myname.c -o myname -lmpich
main:
12, Parallel region activated
14, Parallel region terminated
Linking:
[stoltz@bigfoot]$ mpirun -np 4 myname
My name is bigfoot.pgroup.com
...thread number 0
My name is bigfoot2.pgroup.com
...thread number 0
...thread number 1
My name is bigfoot1.pgroup.com
...thread number 0
...thread number 1
My name is bigfoot3.pgroup.com
...thread number 0
...thread number 1
[stoltz@bigfoot]$ mpirun -np 4 -nolocal myname
My name is bigfoot3.pgroup.com
...thread number 0
...thread number 1
My name is bigfoot2.pgroup.com
...thread number 0
...thread number 1
My name is bigfoot4.pgroup.com
...thread number 0
...thread number 1
My name is bigfoot1.pgroup.com
...thread number 0
...thread number 1
A Real Problem
• Matrix multiply, arrays 1000x1000
• Use HPF Distribute and Align
• OMP calls to set number of threads
• Timer from master thread
program matmul
include 'lib3f.h'
integer size, ntimes, m, n, p, i, j, k
integer OMP_GET_NUM_PROCS
parameter (size=1000, m=size,n=size,p=size,ntimes=5)
real*8 a, b, c, arow
dimension a(m,n), b(n,p), c(n,p), arow(n)
!hpf$ distribute (*,block) :: a, b
!hpf$ align c(:,:) with b(:,:)
integer l
real walltime, mflops
integer hz, clock0, clock1, clock2
real etime
real tarray(2), time0, time1, time2
do i = 1,m
do j = 1, n
a(i,j) = 1.0
enddo
enddo
do i = 1, n
do j = 1, p
b(i,j) = 1.0
enddo
enddo
time0 = etime(tarray)
do l = 1, ntimes
do j = 1, p
do i = 1, m
c(i,j) = 0.0
enddo
enddo
do i = 1, m
do ii = 1, n
arow(ii) = a(i,ii)
enddo
!hpf$ independent
do j = 1, p
do k = 1, n
c(i,j) = c(i,j) + arow(k) * b(k,j)
enddo
enddo
enddo
call dummy(c)
enddo
time1 = etime(tarray)
do i = 1, ntimes
call dummy(c)
enddo
time2 = etime(tarray)
walltime = ((time1 - time0) - (time2 - time1)) / real(ntimes)
mflops = (m*p*(2*n-1)) / (walltime * 1.0e+06)
print *, walltime, time0, time1, time2
print *, "M =",M,", N =",N,", P =",P
print *, "MFLOPS = ", mflops
print *, "c(1,1) = ", c(1,1)
end
subroutine dummy(c)
return
end
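The timing scheme in the listing subtracts the dummy-call overhead, and the rate follows from the flop count: each of the m*p result elements takes n multiplies and n-1 adds. A C transcription of that formula (function name is ours) reproduces the 1-processor number printed below, about 62 MFLOPS from the 32.264-second run:

```c
/* MFLOPS as computed in the listing: m*p*(2n-1) flops divided by
 * the measured wall time in microseconds. */
double mflops(int m, int n, int p, double walltime)
{
    double flops = (double)m * p * (2.0 * n - 1.0);
    return flops / (walltime * 1.0e6);
}
```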
pghpf -fast -Mautopar -Minfo -Mmpi matmul.f -o matmul
matmul:
22, 1 FORALL generated
27, 1 FORALL generated
34, Invariant communication calls hoisted out of loop
35, 1 FORALL generated
40, Invariant communication calls hoisted out of loop
41, 1 FORALL generated
no parallelism: replicated array, arow
45, Independent loop parallelized
65, expensive communication: scalar communication (get_scalar)
matmul:
23, Loop unrolled 10 times
36, Loop unrolled 10 times
46, Loop unrolled 3 times
[stoltz@bigfoot]$ mpirun -nolocal -np 1 matmul
32.26400 0.1500000 161.4700 161.4700
M = 1000, N = 1000, P = 1000
MFLOPS = 61.95760
c(1,1) = 1000.000000000000
[stoltz@bigfoot]$ mpirun -nolocal -np 2 matmul
16.93800 7.9999998E-02 84.77000 84.77000
M = 1000, N = 1000, P = 1000
MFLOPS = 118.0187
c(1,1) = 1000.000000000000
[stoltz@bigfoot]$ mpirun -nolocal -np 4 matmul
9.502000 4.9999997E-02 47.56000 47.56000
M = 1000, N = 1000, P = 1000
MFLOPS = 210.3768
c(1,1) = 1000.000000000000
pghpf -fast -Mconcur -Mautopar -Minfo -Mmpi matmul.f -o matmul
22, 1 FORALL generated
27, 1 FORALL generated
34, Invariant communication calls hoisted out of loop
35, 1 FORALL generated
40, Invariant communication calls hoisted out of loop
41, 1 FORALL generated
45, Independent loop parallelized
23, Parallel code for non-innermost loop generated; block distribution
Loop unrolled 10 times
36, Parallel code for non-innermost loop generated; block distribution
Loop unrolled 10 times
45, Parallel code for non-innermost loop generated; block distribution
46, Loop unrolled 3 times
[stoltz@bigfoot]$ mpirun -nolocal -np 1 a.out
21.33000 0.1200000 106.7700 106.7700
M = 1000, N = 1000, P = 1000
MFLOPS = 93.71777
c(1,1) = 1000.000000000000
[stoltz@bigfoot]$ mpirun -nolocal -np 4 matmul
6.584001 2.9999999E-02 32.95000 32.95000
M = 1000, N = 1000, P = 1000
MFLOPS = 303.6148
c(1,1) = 1000.000000000000
do i = 1, m
!$omp do
do ii = 1, n
arow(ii) = a(i,ii)
enddo
!hpf$ independent
!$omp do
do j = 1, p
c$mem prefetch arow(1),b(1,j)
c$mem prefetch arow(5),b(5,j)
c$mem prefetch arow(9),b(9,j)
do k = 1, n, 4
c$mem prefetch arow(k+12),b(k+12,j)
c(i,j) = c(i,j)+arow(k)*b(k,j)
c(i,j) = c(i,j)+arow(k+1)*b(k+1,j)
c(i,j) = c(i,j)+arow(k+2)*b(k+2,j)
c(i,j) = c(i,j)+arow(k+3)*b(k+3,j)
enddo
enddo
enddo
PGI compilers allow directive-based prefetch, or automatically generate prefetch instructions in patterns similar to the above
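The scheduling idea in the Fortran loop above, unroll the inner product by 4 so prefetches can run ahead, translates directly to C. A sketch (function name is ours; the c$mem prefetch directives have no portable C analog, so here a compiler would have to insert the prefetches itself, roughly 25 cycles ahead of each reference; n is assumed to be a multiple of 4, as in the 1000x1000 case):

```c
/* Inner product unrolled by 4, the same schedule as the Fortran
 * k-loop above.  The four independent multiply-adds per iteration
 * give the hardware prefetcher and scheduler room to work. */
double dot4(int n, const double *arow, const double *bcol)
{
    double sum = 0.0;
    for (int k = 0; k < n; k += 4) {
        sum += arow[k]     * bcol[k];
        sum += arow[k + 1] * bcol[k + 1];
        sum += arow[k + 2] * bcol[k + 2];
        sum += arow[k + 3] * bcol[k + 3];
    }
    return sum;
}
```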
Pentium III Prefetch Instruction
• First available on Pentium III, not backward-compatible
• Allows prefetch of a cache line so it will be in cache the next time a memory reference is made to an address w/in the line
• Can be used on any type of data, so applies to optimization of both 32-bit FP- and 64-bit FP-intensive loops
• Prefetching should occur about 25 cycles prior to the actual memory reference; prefetching too late can cause a big performance penalty
• Development version of PGI compilers supports a directive c$mem prefetch(<data element>) that causes prefetches to be inserted in the generated assembly; not in rel 3.1!
• Code generator being enhanced to automatically use prefetch in combination with vectorization and loop unrolling
Using the -Minline Option
-Minline[=option[,option]] where option is:
size:n Instructs the inliner to inline functions or subroutines with n or fewer lines.
[name:]f1[,f2[,…]] Inline functions/subroutines with names f1, f2
levels:n Instructs the inliner to inline functions up to n levels deep. The default value of n is 2.
[lib:]<file.ext> Inline functions/subroutines from the inline library file.ext.
64-bit Matrix Multiply Performance
PIII 550, 512KB, 100MHz bus

Size        MFLOPs Prefetch    MFLOPs Prefetch Too Early    MFLOPs No Prefetch
64x64       162                160                          150
192x192     144                104                          130 - 133
1024x1024   121                68                           77