EFFECTIVE COMPILER UTILIZATION

EFFECTIVE COMPILER UTILIZATION
Linux Supercluster User’s Conference, September 12, 2000, Albuquerque, NM
Eric Stoltz, PhD – senior compiler engineer
The Portland Group, Inc (PGI)
[email protected]
http://www.pgroup.com

Transcript of EFFECTIVE COMPILER UTILIZATION

Page 1:

EFFECTIVE COMPILER UTILIZATION

Linux Supercluster User’s Conference, September 12, 2000, Albuquerque, NM

Eric Stoltz, PhD – senior compiler engineer
The Portland Group, Inc (PGI)

[email protected]
http://www.pgroup.com

Page 2:

Tutorial Overview

• The PGI compiler

• Vector parallelization

• Multiple processor parallelization

• Additional optimization techniques

Page 3:

[Diagram: The PGI Compilers – PGF77, PGF90, PGCC, PGC++, PGHPF]

Optimizing core: Global Optimization, Auto-Parallelization, OpenMP Parallelization, InterProcedural Opts, Profile Feedback, Func Inlining, Loop Unrolling, Vectorization, Pipelining

Targets (existing / in development): IA-32, IA-64, SPARC, DSPs

Other platforms (F77/F90): Compaq Tru64, CRAY T3E, HP Exemplar, IBM SP, NEC SX, SGI Origin, Sun E10K

Page 4:

PGI Linux/IA32 Fortran/C/C++
• Fortran 90 – robust, commercially-supported, current Fortran
• Native full HPF – high-level data parallel programming
• EDG 2.40 C++ – state-of-the-art C++ language support
• Auto-parallel for multi-CPU SMP nodes – F77, F90, C, C++
• Native OpenMP for multi-CPU SMP nodes – F77, F90, C, C++
• Pentium II/III optimizations – including Pentium III SSE
• Pentium III/Athlon prefetch – 32-bit and 64-bit vector data
• RISC/UNIX compatibility – byte-swapping I/O, symbols, options
• GNU interoperability – can cross-link gcc/g77 objects/libraries
• PGDBG graphical debugger and PGPROF graphical profiler
• Infrastructure – NAG, MPI-Pro, TotalView, VAMPIR, ISVs

Page 5:

Important Compiler Options
-O2           Scheduling, register allocation, global optimization
-tp p6        Pentium II/III specific optimizations/instructions
-Munroll      Unroll loops
-fast         Equivalent to “-O2 -Munroll -Mnoframe”
-Minline      Inline functions and subroutines
-Mconcur      Try to auto-parallelize loops for SMP systems
-mp           Process OpenMP/SGI directives and pragmas
-Mvect=sse    Try to use Pentium III SSE instructions
-byteswapio   Swap bytes from big-endian to little-endian or vice versa upon input/output of unformatted data files
-Minfo        Compile-time optimization/parallelization messages
-Mprof        Enable function or line-level profiling

Page 6:

Important Compiler Options (cont)
-pc {32|64|80}, -Kieee         Set precision of ops on FP stack, strict IEEE
-Mcache_align                  Align unconstrained objects >= 16 bytes to a cache line boundary
-Mneginfo={concur|loop}        Display messages that indicate why loops are not parallelized or vectorized
-Msecond_underscore, -g77libs  g77 compatibility
-Mx,4,4                        Use 3 FP stack locations as global registers (default is to use only 1); try it and see if it helps

Page 7:

Using the –Kieee and -pc Options

-pc {32 | 64 | 80}

Perform FP operations on the floating-point stack using the specified number of bits of precision; default is 80, with values being rounded when stored back to memory

-Kieee

Perform FP ops in strict compliance to the IEEE 754 standard; disables copy propagation and reciprocal division, and makes function calls to perform all transcendental operations

Generally, –pc 64, or –pc 64 in combination with –Kieee, produces arithmetic results that match most RISC/UNIX systems

Page 8:

Vector Parallelization
• PIII SSE
• Other processors: AMD 3D-Now! instructions, Pentium 4, AMD x86-64

Page 9:

Pentium III Streaming SIMD Enhancements (SSE)

• Eight 128-bit registers + new instructions
• Length-4 vector regs/ops for 32-bit FP data
• Instructions include add, subtract, multiply, divide, sqrt, max, min, mov, conversion; still no multiply/accumulate
• PGI compilers utilize SSE via vectorization – idioms and inline SSE code generation
• Linux distributions generally not SSE-enabled; http://sourceware.cygnus.com/gdb/papers/linux/linux-sse.html

Page 10:

SSE Registers

[Diagram of the eight 128-bit SSE (XMM) registers]

Page 11:

Current SSE Constraints
• Single precision
• Stride-1 references
• 16-byte alignment

Page 12:

Using the –Mvect=sse Option

Candidate vector loops operate on 32-bit FP data:

% pgcc –fast –Mvect=sse –Mcache_align –Minfo matmul32.c
main:
     29, Interchange produces reordered loop nest: 30, 29
         Loop unrolled 10 times
     30, Outer loop unroll (4 times) and jam
     34, Interchange produces reordered loop nest: 35, 34
         Loop unrolled 10 times
     35, Outer loop unroll (4 times) and jam
     46, Outer loop unroll (4 times) and jam
     47, Loop unrolled 10 times
     53, Loop unrolled 10 times
     58, Call to __pgi_adotp4 generated
Linking:
%

Page 13:

32-bit Matrix Multiply Performance
PIII 550, 512KB, 100MHz bus

Size        MFLOPS (SSE)   MFLOPS (no SSE)
1024x1024   161            58
256x256     167            108 - 150
64x64       317            203

Page 14:

Other Processors
• AMD – 3D Now!, overlays FP stack
• Pentium 4 – can vectorize 64-bit doubles
• AMD x86-64 – will support SSE, add 8 new integer registers, and 16 128-bit SSE registers. More info: www.x86-64.org

Page 15:

SMP and Distributed Memory Parallelization with Multiple Processors

Page 16:

SMP
• Low-level thread calls
• Lightweight is good – Linux so-so
• Scaling presents a problem
• Threads = number of processors?

Page 17:

Low-Overhead Parallel Regions

SGI, Sun, Compaq Results from Measuring Synchronization and Scheduling Overheads in OpenMP, J.M. Bull, EPCC, http://www.epcc.ed.ac.uk/research/openmpbench;

PGI Results from runs of EPCC bmk on Quad-PIII 550 node at OSC (Linux 2.2.12)

PARALLEL Region Startup/Shutdown on 2 CPUs

[Bar chart: overhead in cycles, scale 0–7000, for SGI/MIPS, KAI/SPARC, Compaq/Alpha, and PGI/Linux]

Page 18:

Using the -Mconcur Option-Mconcur[=option[,option]] where option is:

altcode:n Instructs the parallelizer to generate alternate scalar code for parallelized loops

noaltcode Do not generate alternate scalar code

dist:block Parallelize with block distribution (default)

dist:cyclic Parallelize with cyclic distribution

cncall Loops containing calls are candidates for parallelization

noassoc Disables parallelization of loops with reductions

Page 19:

What is automatically parallelized?

• Countable loops
• No calls or conditionals
• No loop-carried dependences
• Automation must be conservative

Page 20:

Auto-Parallel LINPACK 100
Quad PIII 550, 512KB Cache, Linux 2.2.12

Compiler options: -fast –Mconcur -Minline=saxpy,sscal

Procs = 0 is the non-parallel version on 1 CPU (no –Mconcur)

Results courtesy of Ohio Supercomputer Center

[Bar chart, MFLOPS vs. processors: 123 MFLOPS with 0 procs (serial), 120 with 1, 185 with 2, 229 with 3, 263 with 4]

Page 21:

Using the -mp Option

• PGI supports full OpenMP in F77, F90, C, and C++, and is thus OpenMP-compliant

• Also supports SGI C$DOACROSS in Fortran and SGI pragmas in C/C++

• Use –Mnosgimp to ignore SGI, and –Mnoopenmp to ignore OpenMP directives in programs that have both (e.g. MM5)

Page 22:

OpenMP directives/pragmas
• Sentinel-based way to direct SMP parallel code (!$OMP or #pragma omp)
• C/C++ current version 1.0, October 98
• Fortran current version 1.1, September 99
• Fortran 2.0 version out for public comment
• More information: www.openmp.org

Page 23:

for (l=1; l<=NTIMES; l++) {
  #pragma omp parallel private (j,i,ii,k)
  {
    #pragma omp for
    for (j=0; j<p; j++) {
      for (i=0; i<m; i++) {
        c[j][i] = 0.0;
      }
    }
    for (i=0; i<m; i++) {
      #pragma omp for
      for (ii=0; ii<n; ii++) {
        arow[ii] = a[ii][i];
      }
      #pragma omp for
      for (j=0; j<p; j++) {
        for (k=0; k<n; k++) {
          c[j][i] = c[j][i] + arow[k] * b[j][k];
        }
      }
    }
  }
  (void) dummy(c);
}

Page 24:

Programming in OpenMP
• Concept of master thread
• Scoping of variables (private, shared)
• Private within local stack space of each processor (off the stack pointer)
• Shared in global memory (off frame pointer)

Page 25:

Key OMP constructs
• firstprivate
• lastprivate
• threadprivate
• shared
• private
• Loop iterators private by default
• Many others (barrier, critical sections)
• Task parallelism available using sections

Page 26:

main(){

int i, a[100];

#pragma omp parallel for

for(i=0; i< 100; i++)

a[i] = i;

}

/* identical semantics */

main(){

int i, a[100];

#pragma omp parallel private(i)

{

#pragma omp for

for(i=0; i< 100; i++)

a[i] = i;

}

}

Page 27:

pgcc -mp -Minfo par.c

main:

5, Parallel region activated

6, Parallel loop activated; static block iteration allocation

7, Parallel region terminated

Page 28:

EPCC OpenMP Microbenchmarks

Results from runs of EPCC Microbenchmarks on Quad-PIII 550 node at OSC (Linux 2.2.12)

[Bar chart: overhead in cycles (scale 0–4000) per OpenMP construct – parallel do, barrier, single, critical, lock/unlock, atomic, reduction – on 1, 2, 3, and 4 CPUs]

Page 29:

LS-DYNA on Linux/IA32 - OpenMP
• LS-DYNA = over 750,000 lines of Fortran and C source
• 1.74X speedup on 2 CPUs, 2.74X speedup on 4 CPUs
• Compiled using PGI Workstation Fortran and OpenMP

111,029 Elements
1 CPU Time = 119955 Seconds
2 CPU Time = 69113 Seconds
4 CPU Time = 43864 Seconds
4-CPU 500 MHz PIII Xeon

[Bar chart: speed-up (scale 0–3) vs. CPUs (1, 2, 4)]

Page 30:

Page 31:

Distributed Parallelism
• Linux clusters MIMD by definition
• Must use communication between nodes
• MPI – do-it-yourself message passing; must link in MPI libraries
• HPF – High Performance Fortran, puts in communication for you

Page 32:

MPI Examples
• PGI CDK uses MPICH
• mpirun script sends copies of executable to each node
• Sample program, ‘Who am I?’

Page 33:

#include <unistd.h>
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int len = 32;
    char hname[32];

    MPI_Init( &argc, &argv );
    gethostname(hname, len);
    printf("My name is %s\n", hname);
    MPI_Finalize( );
    return 0;
}

Page 34:

pgcc myname.c -lmpich -o myname

Linking:

mpirun -nolocal -np 4 myname

My name is bigfoot3.pgroup.com

My name is bigfoot2.pgroup.com

My name is bigfoot4.pgroup.com

My name is bigfoot1.pgroup.com

Page 35:

High Performance Fortran (HPF)

• Set of Fortran extensions expressing parallel execution at a high level
• Data-to-processor mapping done in two stages, using the DISTRIBUTE and ALIGN operations.

Page 36:

2-Level Parallelism
• Distribute work across nodes
• Split up work within each node
• Need a programming paradigm
• Must determine number of processors within each node

Page 37:

4-node cluster at PGI, named BIGFOOT

• Master node 500MHz Celeron, bigfoot0
• 4 compute nodes, dual 550MHz PIIIs with 256 MB of memory, named bigfoot1 – bigfoot4
• Connected with fast ethernet

Page 38:

#include <unistd.h>
#include <stdio.h>
#include "mpi.h"
#include <omp.h>

int main(int argc, char **argv)
{
    int len = 32;
    char hname[32];

    omp_set_num_threads( omp_get_num_procs() );
    MPI_Init( &argc, &argv );
    gethostname(hname, len);
    printf("My name is %s\n", hname);
    #pragma omp parallel
    {
        printf("...thread number %d\n", omp_get_thread_num() );
    }
    MPI_Finalize( );
    return 0;
}

Page 39:

[stoltz@bigfoot]$ pgcc -mp -Minfo myname.c -o myname -lmpich

main:

12, Parallel region activated

14, Parallel region terminated

Linking:

Page 40:

[stoltz@bigfoot]$ mpirun -np 4 myname

My name is bigfoot.pgroup.com

...thread number 0

My name is bigfoot2.pgroup.com

...thread number 0

...thread number 1

My name is bigfoot1.pgroup.com

...thread number 0

...thread number 1

My name is bigfoot3.pgroup.com

...thread number 0

...thread number 1

Page 41:

[stoltz@bigfoot]$ mpirun -np 4 -nolocal myname

My name is bigfoot3.pgroup.com

...thread number 0

...thread number 1

My name is bigfoot2.pgroup.com

...thread number 0

...thread number 1

My name is bigfoot4.pgroup.com

...thread number 0

...thread number 1

My name is bigfoot1.pgroup.com

...thread number 0

...thread number 1

Page 42:

A Real Problem
• Matrix multiply, arrays 1000x1000
• Use HPF Distribute and Align
• OMP calls to set number of threads
• Timer from master thread

Page 43:

program matmul

include 'lib3f.h'

integer size, ntimes, m, n, p, i, j, k

integer OMP_GET_NUM_PROCS

parameter (size=500, m=size,n=size,p=size,ntimes=5)

real*8 a, b, c, arow

dimension a(m,n), b(n,p), c(n,p), arow(n)

!hpf$ distribute (*,block) :: a, b

!hpf$ align c(:,:) with b(:,:)

integer l

real walltime, mflops

integer hz, clock0, clock1, clock2

real etime

real tarray(2), time0, time1, time2

Page 44:

do i = 1,m

do j = 1, n

a(i,j) = 1.0

enddo

enddo

do i = 1, n

do j = 1, p

b(i,j) = 1.0

enddo

enddo

time0 = etime(tarray)

Page 45:

do l = 1, ntimes
do j = 1, p

do i = 1, m

c(i,j) = 0.0

enddo

enddo

do i = 1, m

do ii = 1, n

arow(ii) = a(i,ii)

enddo

!hpf$ independent

do j = 1, p

do k = 1, n

c(i,j) = c(i,j) + arow(k) * b(k,j)

enddo

enddo

enddo

call dummy(c)
enddo

Page 46:

time1 = etime(tarray)

do i = 1, ntimes

call dummy(c)

enddo

time2 = etime(tarray)

walltime = ((time1 - time0) - (time2 - time1)) / real(ntimes)

mflops = (m*p*(2*n-1)) / (walltime * 1.0e+06)

print *, walltime, time0, time1, time2

print *, "M =",M,", N =",N,", P =",P

print *, "MFLOPS = ", mflops

print *, "c(1,1) = ", c(1,1)

end

subroutine dummy(c)

return

end

Page 47:

pghpf -fast -Mautopar -Minfo -Mmpi matmul.f -o matmul

matmul:

22, 1 FORALL generated

27, 1 FORALL generated

34, Invariant communication calls hoisted out of loop

35, 1 FORALL generated

40, Invariant communication calls hoisted out of loop

41, 1 FORALL generated

no parallelism: replicated array, arow

45, Independent loop parallelized

65, expensive communication: scalar communication (get_scalar)

matmul:

23, Loop unrolled 10 times

36, Loop unrolled 10 times

46, Loop unrolled 3 times

Page 48:

[stoltz@bigfoot]$ mpirun -nolocal -np 1 matmul

32.26400 0.1500000 161.4700 161.4700

M = 1000, N = 1000, P = 1000

MFLOPS = 61.95760

c(1,1) = 1000.000000000000

[stoltz@bigfoot]$ mpirun -nolocal -np 2 matmul

16.93800 7.9999998E-02 84.77000 84.77000

M = 1000, N = 1000, P = 1000

MFLOPS = 118.0187

c(1,1) = 1000.000000000000

Page 49:

[stoltz@bigfoot]$ mpirun -nolocal -np 4 matmul

9.502000 4.9999997E-02 47.56000 47.56000

M = 1000, N = 1000, P = 1000

MFLOPS = 210.3768

c(1,1) = 1000.000000000000

Page 50:

pghpf -fast -Mconcur -Mautopar -Minfo -Mmpi matmul.f -o matmul

22, 1 FORALL generated

27, 1 FORALL generated

34, Invariant communication calls hoisted out of loop

35, 1 FORALL generated

40, Invariant communication calls hoisted out of loop

41, 1 FORALL generated

45, Independent loop parallelized

23, Parallel code for non-innermost loop generated; block distribution

Loop unrolled 10 times

36, Parallel code for non-innermost loop generated; block distribution

Loop unrolled 10 times

45, Parallel code for non-innermost loop generated; block distribution

46, Loop unrolled 3 times

Page 51:

[stoltz@bigfoot]$ mpirun -nolocal -np 1 a.out

21.33000 0.1200000 106.7700 106.7700

M = 1000, N = 1000, P = 1000

MFLOPS = 93.71777

c(1,1) = 1000.000000000000

[stoltz@bigfoot]$ mpirun -nolocal -np 4 matmul

6.584001 2.9999999E-02 32.95000 32.95000

M = 1000, N = 1000, P = 1000

MFLOPS = 303.6148

c(1,1) = 1000.000000000000

Page 52:

do i = 1, m
!$omp do
  do ii = 1, n
    arow(ii) = a(i,ii)
  enddo
!hpf$ independent
!$omp do
  do j = 1, p
c$mem prefetch arow(1),b(1,j)
c$mem prefetch arow(5),b(5,j)
c$mem prefetch arow(9),b(9,j)
    do k = 1, n, 4
c$mem prefetch arow(k+12),b(k+12,j)
      c(i,j) = c(i,j)+arow(k)*b(k,j)
      c(i,j) = c(i,j)+arow(k+1)*b(k+1,j)
      c(i,j) = c(i,j)+arow(k+2)*b(k+2,j)
      c(i,j) = c(i,j)+arow(k+3)*b(k+3,j)
    enddo
  enddo
enddo

PGI compilers allow directive-based prefetch, or automatically generate prefetch instructions in patterns similar to the above.

Page 53:

Pentium III Prefetch Instruction
• First available on Pentium III, not backward-compatible
• Allows prefetch of a cache line so it will be in cache the next time a memory reference is made to an address w/in the line
• Can be used on any type of data, so applies to optimization of both 32-bit FP- and 64-bit FP-intensive loops
• Prefetching should occur about 25 cycles prior to the actual memory reference; prefetching too late can cause a big performance penalty
• Development version of PGI compilers support a directive c$mem prefetch(<data element>) that causes prefetches to be inserted in the generated assembly; not in rel 3.1!
• Code generator being enhanced to automatically use prefetch in combination with vectorization and loop unrolling

Page 54:

Using the -Minline Option

-Minline[=option[,option]] where option is:

size:n Instructs the inliner to inline functions or subroutines with n or fewer lines.

[name:]f1[,f2[…]]] Inline functions/subroutines with names f1, f2

levels:n Instructs the inliner to inline functions up to n levels deep. The default value of n is 2.

[lib:]<file.ext> Inline functions/subroutines from the inline library file.ext.

Page 55:

64-bit Matrix Multiply Performance
PIII 550, 512KB, 100MHz bus

Size        MFLOPS (no prefetch)   MFLOPS (prefetch too early)   MFLOPS (prefetch)
1024x1024   68                     77                            121
192x192     104                    130 - 133                     144
64x64       160                    150                           162