EFFECTIVE COMPILER UTILIZATION
Linux Supercluster User's Conference · September 12, 2000 · Albuquerque, NM
Eric Stoltz, PhD – Senior Compiler Engineer, The Portland Group, Inc. (PGI)
[email protected] · http://www.pgroup.com
Tutorial Overview
• The PGI compiler
• Vector parallelization
• Multiple processor parallelization
• Additional optimization techniques
The PGI Compilers: PGF77, PGF90, PGCC, PGC++, PGHPF
Optimizing core: global optimization, auto-parallelization, OpenMP parallelization, interprocedural opts, profile feedback, function inlining, loop unrolling, vectorization, pipelining
Targets (existing and in development): IA-32, IA-64, SPARC, DSPs
Other platforms (F77/F90): Compaq Tru64, CRAY T3E, HP Exemplar, IBM SP, NEC SX, SGI Origin, Sun E10K
PGI Linux/IA32 Fortran/C/C++
• Fortran 90 – robust, commercially-supported, current Fortran
• Native full HPF – high-level data parallel programming
• EDG 2.40 C++ – state-of-the-art C++ language support
• Auto-parallel for multi-CPU SMP nodes – F77, F90, C, C++
• Native OpenMP for multi-CPU SMP nodes – F77, F90, C, C++
• Pentium II/III optimizations – including Pentium III SSE
• Pentium III/Athlon prefetch – 32-bit and 64-bit vector data
• RISC/UNIX compatibility – byte-swapping I/O, symbols, options
• GNU interoperability – can cross-link gcc/g77 objects/libraries
• PGDBG graphical debugger and PGPROF graphical profiler
• Infrastructure – NAG, MPI-Pro, TotalView, VAMPIR, ISVs
Important Compiler Options
-O2           Scheduling, register allocation, global optimization
-tp p6        Pentium II/III specific optimizations/instructions
-Munroll      Unroll loops
-fast         Equivalent to "-O2 -Munroll -Mnoframe"
-Minline      Inline functions and subroutines
-Mconcur      Try to auto-parallelize loops for SMP systems
-mp           Process OpenMP/SGI directives and pragmas
-Mvect=sse    Try to use Pentium III SSE instructions
-byteswapio   Swap bytes from big-endian to little-endian or vice versa upon input/output of unformatted data files
-Minfo        Compile-time optimization/parallelization messages
-Mprof        Enable function or line-level profiling
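The transformation behind -byteswapio is easy to picture in code. A minimal sketch (the function name swap32 is ours, not a PGI routine): each 4-byte element of an unformatted record has its byte order reversed on the way in or out.

```c
#include <stdint.h>

/* Reverse the byte order of one 32-bit word -- the per-element
 * operation -byteswapio applies to unformatted data so files
 * written on big-endian RISC systems read correctly on IA-32. */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) <<  8) |
           ((v & 0x00FF0000u) >>  8) |
           ((v & 0xFF000000u) >> 24);
}
```

Applying the swap twice is the identity, which is why the same option works for both reading and writing.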
Important Compiler Options (cont)
-pc {32|64|80}, -Kieee       Set precision of ops on FP stack, strict IEEE
-Mcache_align                Align unconstrained objects >= 16 bytes to a cache line boundary
-Mneginfo={concur|loop}      Display messages that indicate why loops are not parallelized or vectorized
-Msecond_underscore, -g77libs    g77 compatibility
-Mx,4,4                      Use 3 FP stack locations as global registers (default is to use only 1); try it and see if it helps
Using the -Kieee and -pc Options
-pc {32 | 64 | 80}
Perform FP operations on the floating-point stack using the specified number of bits of precision; the default is 80, with values rounded when stored back to memory
-Kieee
Perform FP ops in strict compliance with the IEEE 754 standard; disables copy propagation and reciprocal division, and makes function calls to perform all transcendental operations
Generally, -pc 64, or -pc 64 in combination with -Kieee, produces arithmetic results that match most RISC/UNIX systems
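Why intermediate precision matters can be seen without any compiler flags at all. A small illustrative example (not PGI-specific; function names are ours): the width of the significand decides which values round, and the 80-bit x87 stack carries 64 significand bits, more than either the 24-bit float or 53-bit double memory formats.

```c
/* The significand width decides which integers are exact:
 * float (24 bits) cannot hold 2^24 + 1, double (53 bits) can.
 * The x87 stack at -pc 80 keeps 64 significand bits, so
 * intermediates can be more precise than either memory format --
 * the source of cross-platform differences that -pc 64 and
 * -Kieee remove. */
float  f_val(void) { return (float)16777217.0; }  /* rounds to 2^24 */
double d_val(void) { return 16777217.0; }         /* exact in double */
```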
Vector Parallelization
• PIII SSE
• Other processors: AMD 3D-Now! instructions, Pentium 4, AMD x86-64
Pentium III Streaming SIMD Enhancements (SSE)
• Eight 128-bit registers + new instructions
• Length-4 vector regs/ops for 32-bit FP data
• Instructions include add, subtract, multiply, divide, sqrt, max, min, mov, conversion; still no multiply/accumulate
• PGI compilers utilize SSE via vectorization – idioms and inline SSE code generation
• Some Linux distributions are not SSE-enabled; see http://sourceware.cygnus.com/gdb/papers/linux/linux-sse.html
SSE Registers
Current SSE Constraints
• Single precision
• Stride-1 references
• 16-byte alignment
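A loop that satisfies all three constraints looks like the following sketch (the function name saxpy is ours): 32-bit floats, unit stride, no dependences between iterations, so the vectorizer can pack four elements per SSE operation.

```c
/* A candidate for -Mvect=sse: 32-bit FP data, stride-1 references,
 * and a countable trip count with independent iterations.  With
 * -Mcache_align the arrays can also meet the 16-byte alignment
 * constraint, letting the compiler emit 4-wide packed operations. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)      /* stride-1, no dependences */
        y[i] = a * x[i] + y[i];
}
```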
Using the -Mvect=sse Option
Candidate vector loops operate on 32-bit FP data:

% pgcc -fast -Mvect=sse -Mcache_align -Minfo matmul32.c
main:
     29, Interchange produces reordered loop nest: 30, 29
         Loop unrolled 10 times
     30, Outer loop unroll (4 times) and jam
     34, Interchange produces reordered loop nest: 35, 34
         Loop unrolled 10 times
     35, Outer loop unroll (4 times) and jam
     46, Outer loop unroll (4 times) and jam
     47, Loop unrolled 10 times
     53, Loop unrolled 10 times
     58, Call to __pgi_adotp4 generated
Linking:
%
32-bit Matrix Multiply Performance
PIII 550, 512KB, 100MHz bus

Size        MFLOPs SSE    MFLOPs No SSE
64x64       317           203
256x256     167           108 - 150
1024x1024   161           58
Other Processors
• AMD – 3D Now!, overlays FP stack
• Pentium 4 – can vectorize 64-bit doubles
• AMD x86-64 – will support SSE, add 8 new integer registers, and 16 128-bit SSE registers. More info: www.x86-64.org
SMP and Distributed Memory Parallelization with Multiple Processors
SMP
• Low-level thread calls
• Lightweight is good – Linux so-so
• Scaling presents a problem
• Threads = number of processors?
Low-Overhead Parallel Regions
SGI, Sun, Compaq Results from Measuring Synchronization and Scheduling Overheads in OpenMP, J.M. Bull, EPCC, http://www.epcc.ed.ac.uk/research/openmpbench;
PGI Results from runs of EPCC bmk on Quad-PIII 550 node at OSC (Linux 2.2.12)
PARALLEL Region Startup/Shutdown on 2 CPUs
[Chart: overhead in cycles (0 to 7000) for SGI/MIPS, KAI/SPARC, Compaq/Alpha, and PGI/Linux]
Using the -Mconcur Option
-Mconcur[=option[,option]] where option is:
altcode:n Instructs the parallelizer to generate alternate scalar code for parallelized loops
noaltcode Do not generate alternate scalar code
dist:block Parallelize with block distribution (default)
dist:cyclic Parallelize with cyclic distribution
cncall Loops containing calls are candidates for parallelization
noassoc Disables parallelization of loops with reductions
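The reason noassoc exists can be shown in two lines: parallelizing a reduction reorders (reassociates) the additions, and FP addition is not associative. A minimal illustration (function names are ours):

```c
/* FP addition is not associative, so a parallelized (reassociated)
 * sum can change the result.  With 53-bit doubles, whether the
 * small addends survive depends entirely on grouping: */
double grouped_left(void)  { return (1.0e16 + 1.0) + 1.0; }  /* each add rounds away the 1.0 */
double grouped_right(void) { return 1.0e16 + (1.0 + 1.0); }  /* the 2.0 survives */
```

-Mconcur=noassoc keeps such loops serial so results stay bit-identical to the sequential run.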
What is automatically parallelized?
• Countable loops
• No calls or conditionals
• No loop-carried dependences
• Automation must be conservative
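The criteria above separate loops like the following pair (function names are ours, chosen for illustration):

```c
/* Countable, no calls, no loop-carried dependence:
 * a candidate for -Mconcur auto-parallelization. */
void scale(int n, double *a, double s)
{
    for (int i = 0; i < n; i++)
        a[i] = s * a[i];
}

/* Loop-carried dependence: a[i] needs a[i-1] from the previous
 * iteration, so a conservative auto-parallelizer must run it serially. */
void prefix_sum(int n, double *a)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}
```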
Auto-Parallel LINPACK 100
Quad PIII 550, 512KB Cache, Linux 2.2.12
Compiler options: -fast –Mconcur -Minline=saxpy,sscal
Procs = 0 is the non-parallel version on 1 CPU (no –Mconcur)
Results courtesy of Ohio Supercomputer Center
Processors   MFLOPs
0            123
1            120
2            185
3            229
4            263
Using the -mp Option
• PGI supports full OpenMP in F77, F90, C, and C++, and is thus OpenMP compliant
• Also supports SGI C$DOACROSS in Fortran and SGI pragmas in C/C++
• Use –Mnosgimp to ignore SGI, and –Mnoopenmp to ignore OpenMP directives in programs that have both (e.g. MM5)
OpenMP directives/pragmas
• Sentinel-based way to direct SMP parallel code (!$OMP or #pragma omp)
• C/C++ current version 1.0, October 98
• Fortran current version 1.1, September 99
• Fortran 2.0 version out for public comment
• More information: www.openmp.org
for (l=1; l<=NTIMES; l++) {
  #pragma omp parallel private (j,i,ii,k)
  {
    #pragma omp for
    for (j=0; j<p; j++) {
      for (i=0; i<m; i++) {
        c[j][i] = 0.0;
      }
    }
    for (i=0; i<m; i++) {
      #pragma omp for
      for (ii=0; ii<n; ii++) {
        arow[ii] = a[ii][i];
      }
      #pragma omp for
      for (j=0; j<p; j++) {
        for (k=0; k<n; k++) {
          c[j][i] = c[j][i] + arow[k] * b[j][k];
        }
      }
    }
  }
  (void) dummy(c);
}
Programming in OpenMP
• Concept of master thread
• Scoping of variables (private, shared)
• Private within local stack space of each processor (off the stack pointer)
• Shared in global memory (off frame pointer)
Key OMP constructs
• firstprivate
• lastprivate
• threadprivate
• shared
• private
• Loop iterators private by default
• Many others (barrier, critical sections)
• Task parallelism available using sections
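One construct worth showing alongside the list is reduction, since it combines private and shared scoping in one clause. A sketch (function name is ours; compiled without OpenMP support the pragma is simply ignored and the loop runs serially with the same result):

```c
/* OpenMP reduction: each thread accumulates into a private copy of
 * sum, and the partial sums are combined at the join point.  The
 * loop iterator is private by default. */
double dot(int n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```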
main(){
int i, a[100];
#pragma omp parallel for
for(i=0; i< 100; i++)
a[i] = i;
}
/* identical semantics */
main(){
int i, a[100];
#pragma omp parallel private(i)
{
#pragma omp for
for(i=0; i< 100; i++)
a[i] = i;
}
}
pgcc -mp -Minfo par.c
main:
5, Parallel region activated
6, Parallel loop activated; static block iteration allocation
7, Parallel region terminated
EPCC OpenMP Microbenchmarks
Results from runs of EPCC Microbenchmarks on Quad-PIII 550 node at OSC (Linux 2.2.12)
[Chart: overhead in cycles per OpenMP construct (parallel do, barrier, single, critical, lock/unlock, atomic, reduction) on 1, 2, 3, and 4 CPUs; range 0 to 4000 cycles]
LS-DYNA on Linux/IA32 - OpenMP
• LS-DYNA = over 750,000 lines of Fortran and C source
• 1.74X speedup on 2 CPUs, 2.74X speedup on 4 CPUs
• Compiled using PGI Workstation Fortran and OpenMP
• 111,029 elements; 4-CPU 500 MHz PIII Xeon
• 1 CPU time = 119955 seconds; 2 CPU time = 69113 seconds; 4 CPU time = 43864 seconds
[Chart: LS-DYNA speed-up on 1, 2, and 4 CPUs]
Distributed Parallelism
• Linux clusters MIMD by definition
• Must use communication between nodes
• MPI – do-it-yourself message passing; must link in MPI libraries
• HPF – High Performance Fortran, puts in communication for you
MPI Examples
• PGI CDK uses MPI-CH
• mpirun script sends copies of executable to each node
• Sample program, 'Who am I?'
#include <unistd.h>
#include "mpi.h"
main(int argc, char **argv){
int len,ierr;
char hname[32];
len = 32;
MPI_Init( &argc, &argv );
gethostname(hname,len);
printf("My name is %s\n",hname);
MPI_Finalize( );
}
pgcc myname.c -lmpich -o myname
Linking:
mpirun -nolocal -np 4 myname
My name is bigfoot3.pgroup.com
My name is bigfoot2.pgroup.com
My name is bigfoot4.pgroup.com
My name is bigfoot1.pgroup.com
High Performance Fortran (HPF)
• Set of Fortran extensions expressing parallel execution at a high level
• Data-to-processor mapping done in two stages, using the DISTRIBUTE and ALIGN operations
2-Level Parallelism
• Distribute work across nodes
• Split up work within each node
• Need a programming paradigm
• Must determine number of processors within each node
4-node cluster at PGI named BIGFOOT
• Master node 500 MHz Celeron, bigfoot0
• 4 compute nodes, dual 550 MHz PIIIs with 256 MB of memory, named bigfoot1 - bigfoot4
• Connected with fast ethernet
#include <unistd.h>
#include "mpi.h"
#include <omp.h>
main(int argc, char **argv){
int len = 32, ierr;
char hname[32];
omp_set_num_threads( omp_get_num_procs() );
MPI_Init( &argc, &argv );
gethostname(hname,len);
printf("My name is %s\n",hname);
#pragma omp parallel
{
printf("...thread number %d\n", omp_get_thread_num() );
}
MPI_Finalize( );
}
[stoltz@bigfoot]$ pgcc -mp -Minfo myname.c -o myname -lmpich
main:
12, Parallel region activated
14, Parallel region terminated
Linking:
[stoltz@bigfoot]$ mpirun -np 4 myname
My name is bigfoot.pgroup.com
...thread number 0
My name is bigfoot2.pgroup.com
...thread number 0
...thread number 1
My name is bigfoot1.pgroup.com
...thread number 0
...thread number 1
My name is bigfoot3.pgroup.com
...thread number 0
...thread number 1
[stoltz@bigfoot]$ mpirun -np 4 -nolocal myname
My name is bigfoot3.pgroup.com
...thread number 0
...thread number 1
My name is bigfoot2.pgroup.com
...thread number 0
...thread number 1
My name is bigfoot4.pgroup.com
...thread number 0
...thread number 1
My name is bigfoot1.pgroup.com
...thread number 0
...thread number 1
A Real Problem
• Matrix multiply, arrays 1000x1000
• Use HPF Distribute and Align
• OMP calls to set number of threads
• Timer from master thread
program matmul
include 'lib3f.h'
integer size, ntimes, m, n, p, i, j, k
integer OMP_GET_NUM_PROCS
parameter (size=1000, m=size,n=size,p=size,ntimes=5)
real*8 a, b, c, arow
dimension a(m,n), b(n,p), c(n,p), arow(n)
!hpf$ distribute (*,block) :: a, b
!hpf$ align c(:,:) with b(:,:)
integer l
real walltime, mflops
integer hz, clock0, clock1, clock2
real etime
real tarray(2), time0, time1, time2
do i = 1,m
do j = 1, n
a(i,j) = 1.0
enddo
enddo
do i = 1, n
do j = 1, p
b(i,j) = 1.0
enddo
enddo
time0 = etime(tarray)
do l = 1, ntimes
do j = 1, p
do i = 1, m
c(i,j) = 0.0
enddo
enddo
do i = 1, m
do ii = 1, n
arow(ii) = a(i,ii)
enddo
!hpf$ independent
do j = 1, p
do k = 1, n
c(i,j) = c(i,j) + arow(k) * b(k,j)
enddo
enddo
enddo
call dummy(c)
enddo
time1 = etime(tarray)
do i = 1, ntimes
call dummy(c)
enddo
time2 = etime(tarray)
walltime = ((time1 - time0) - (time2 - time1)) / real(ntimes)
mflops = (m*p*(2*n-1)) / (walltime * 1.0e+06)
print *, walltime, time0, time1, time2
print *, "M =",M,", N =",N,", P =",P
print *, "MFLOPS = ", mflops
print *, "c(1,1) = ", c(1,1)
end
subroutine dummy(c)
return
end
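The timing scheme in the listing subtracts the dummy-call overhead, and the rate follows from the flop count: each of the m*p result elements takes n multiplies and n-1 adds. A C transcription of that formula (function name is ours) reproduces the 1-processor number printed below, about 62 MFLOPS from the 32.264-second run:

```c
/* MFLOPS as computed in the listing: m*p*(2n-1) flops divided by
 * the measured wall time in microseconds. */
double mflops(int m, int n, int p, double walltime)
{
    double flops = (double)m * p * (2.0 * n - 1.0);
    return flops / (walltime * 1.0e6);
}
```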
pghpf -fast -Mautopar -Minfo -Mmpi matmul.f -o matmul
matmul:
22, 1 FORALL generated
27, 1 FORALL generated
34, Invariant communication calls hoisted out of loop
35, 1 FORALL generated
40, Invariant communication calls hoisted out of loop
41, 1 FORALL generated
no parallelism: replicated array, arow
45, Independent loop parallelized
65, expensive communication: scalar communication (get_scalar)
matmul:
23, Loop unrolled 10 times
36, Loop unrolled 10 times
46, Loop unrolled 3 times
[stoltz@bigfoot]$ mpirun -nolocal -np 1 matmul
32.26400 0.1500000 161.4700 161.4700
M = 1000, N = 1000, P = 1000
MFLOPS = 61.95760
c(1,1) = 1000.000000000000
[stoltz@bigfoot]$ mpirun -nolocal -np 2 matmul
16.93800 7.9999998E-02 84.77000 84.77000
M = 1000, N = 1000, P = 1000
MFLOPS = 118.0187
c(1,1) = 1000.000000000000
[stoltz@bigfoot]$ mpirun -nolocal -np 4 matmul
9.502000 4.9999997E-02 47.56000 47.56000
M = 1000, N = 1000, P = 1000
MFLOPS = 210.3768
c(1,1) = 1000.000000000000
pghpf -fast -Mconcur -Mautopar -Minfo -Mmpi matmul.f -o matmul
22, 1 FORALL generated
27, 1 FORALL generated
34, Invariant communication calls hoisted out of loop
35, 1 FORALL generated
40, Invariant communication calls hoisted out of loop
41, 1 FORALL generated
45, Independent loop parallelized
23, Parallel code for non-innermost loop generated; block distribution
Loop unrolled 10 times
36, Parallel code for non-innermost loop generated; block distribution
Loop unrolled 10 times
45, Parallel code for non-innermost loop generated; block distribution
46, Loop unrolled 3 times
[stoltz@bigfoot]$ mpirun -nolocal -np 1 a.out
21.33000 0.1200000 106.7700 106.7700
M = 1000, N = 1000, P = 1000
MFLOPS = 93.71777
c(1,1) = 1000.000000000000
[stoltz@bigfoot]$ mpirun -nolocal -np 4 matmul
6.584001 2.9999999E-02 32.95000 32.95000
M = 1000, N = 1000, P = 1000
MFLOPS = 303.6148
c(1,1) = 1000.000000000000
do i = 1, m
!$omp do
do ii = 1, n
arow(ii) = a(i,ii)
enddo
!hpf$ independent
!$omp do
do j = 1, p
c$mem prefetch arow(1),b(1,j)
c$mem prefetch arow(5),b(5,j)
c$mem prefetch arow(9),b(9,j)
do k = 1, n, 4
c$mem prefetch arow(k+12),b(k+12,j)
c(i,j) = c(i,j)+arow(k)*b(k,j)
c(i,j) = c(i,j)+arow(k+1)*b(k+1,j)
c(i,j) = c(i,j)+arow(k+2)*b(k+2,j)
c(i,j) = c(i,j)+arow(k+3)*b(k+3,j)
enddo
enddo
enddo
PGI compilers allow directive-based prefetch, or automatically generate prefetch instructions in patterns similar to the above
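The scheduling idea in the Fortran loop above, unroll the inner product by 4 so prefetches can run ahead, translates directly to C. A sketch (function name is ours; the c$mem prefetch directives have no portable C analog, so here a compiler would have to insert the prefetches itself, roughly 25 cycles ahead of each reference; n is assumed to be a multiple of 4, as in the 1000x1000 case):

```c
/* Inner product unrolled by 4, the same schedule as the Fortran
 * k-loop above.  The four independent multiply-adds per iteration
 * give the hardware prefetcher and scheduler room to work. */
double dot4(int n, const double *arow, const double *bcol)
{
    double sum = 0.0;
    for (int k = 0; k < n; k += 4) {
        sum += arow[k]     * bcol[k];
        sum += arow[k + 1] * bcol[k + 1];
        sum += arow[k + 2] * bcol[k + 2];
        sum += arow[k + 3] * bcol[k + 3];
    }
    return sum;
}
```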
Pentium III Prefetch Instruction
• First available on Pentium III, not backward-compatible
• Allows prefetch of a cache line so it will be in cache the next time a memory reference is made to an address w/in the line
• Can be used on any type of data, so applies to optimization of both 32-bit FP- and 64-bit FP-intensive loops
• Prefetching should occur about 25 cycles prior to the actual memory reference; prefetching too late can cause a big performance penalty
• Development version of PGI compilers supports a directive c$mem prefetch(<data element>) that causes prefetches to be inserted in the generated assembly; not in rel 3.1!
• Code generator being enhanced to automatically use prefetch in combination with vectorization and loop unrolling
Using the -Minline Option
-Minline[=option[,option]] where option is:
size:n Instructs the inliner to inline functions or subroutines with n or fewer lines.
[name:]f1[,f2[,…]] Inline functions/subroutines with names f1, f2
levels:n Instructs the inliner to inline functions up to n levels deep. The default value of n is 2.
[lib:]<file.ext> Inline functions/subroutines from the inline library file.ext.
64-bit Matrix Multiply Performance
PIII 550, 512KB, 100MHz bus

Size        MFLOPs Prefetch    MFLOPs Prefetch Too Early    MFLOPs No Prefetch
64x64       162                160                          150
192x192     144                104                          130 - 133
1024x1024   121                68                           77