Introduction to GPU Computing with OpenACC
Michael Wolfe Michael.Wolfe@pgroup.com
http://www.pgroup.com
November 2012
Updates to these Tutorial Notes
The latest updates to these tutorial notes are available at:
http://www.pgroup.com/lit/presentations/sc12_tutorial_openacc_intro.pdf
Also available:
Training at the Pittsburgh Supercomputing Center
January 15-16
www.psc.edu/index.php/training/openacc-gpu-programming
Outline
Introduction
GPU architecture vs. Host architecture
GPU parallel programming vs. Host parallel programming
Introduction to OpenACC Directives
Elements of Directives-based Model
Controlling Data Movement
Explicit Parallelism
Reductions, the Loop directive
Interacting with CUDA
Problems, Pitfalls, Wrapup
Appropriate algorithms and data structures
Obvious Performance Bottlenecks
Host Architecture Features
Register files (integer, float)
Functional units (integer, float, address), Icache, Dcache
Execution pipeline (fetch, decode, issue, execute, cache, commit)
branch prediction, hazards (control, data, structural)
pipelined functional units, superpipelining, register bypass
stall, scoreboarding, reservation stations, register renaming
Multiscalar execution (superscalar, control unit lookahead)
LIW (long instruction word)
Multithreading, Simultaneous multithreading
Vector instruction set
Multiprocessor, multicore, coherent caches (MESI protocol)
Making It Faster
Processor:
Faster clocks
More work per clock:
superscalar
VLIW
more cores
vector / SIMD instructions
Memory
Latency reduction
Latency tolerance
GPU Architecture Features
Optimized for high degree of regular parallelism
Classically optimized for low precision
High bandwidth memory
Highly multithreaded (slack parallelism)
Hardware thread scheduling
Non-coherent software-managed data caches
some hardware data caches as well
No multiprocessor memory model guarantees
some guarantees with fence operations
Tesla-10 Features Summary
Massively parallel thread processors
Organized into multiprocessors
up to 30, see deviceQuery or pgaccelinfo
Physically: 8 thread processors per multiprocessor
Logically: 32 threads per warp
Memory hierarchy
host memory, device memory, constant memory, shared memory,
register
Queue of operations (kernels) on device
Fermi (Tesla-20) Features Summary
Massively parallel thread processors
Organized into multiprocessors
up to 16, see deviceQuery or pgaccelinfo
Physically: two groups of 16 thread processors per multiprocessor
Logically: still 32 threads per warp
Memory hierarchy
host memory, device memory (two level hardware cache), constant
memory, (configurable) shared memory, register
Queue of operations (kernels) on device
ECC memory protection (supported, not default)
Much improved double precision performance
Hardware 32-bit integer multiply
Kepler (K20) Features Summary
Massively parallel thread processors
Organized into multiprocessors
8 or more, see deviceQuery or pgaccelinfo
Physically: 12 groups of 16 SP thread processors + 4 groups of
16 DP thread processors per SM
Logically: still 32 threads per warp
Memory hierarchy
host memory, device memory (hardware cache), constant memory,
(configurable) shared memory, register
Queue of operations (kernels) on device
HyperQ – more asynchronous compute operations
ECC memory protection
Less memory per "CUDA core"
Parallel Programming on CPUs
Instruction level parallelism (ILP)
Loop unrolling, instruction scheduling
Vector parallelism
Vectorized loops (or vector intrinsics)
Thread level / Multiprocessor / multicore parallelism
Parallel loops, parallel tasks
POSIX threads, OpenMP, Cilk, TBB, ...
Large scale cluster / multicomputer parallelism
MPI (& HPF, co-array Fortran, UPC, Titanium, X10, Fortress,
Chapel)
pthreads main routine (Fortran-style pseudocode)
call pthread_create( t1, NULL, jacobi, 1 )
call pthread_create( t2, NULL, jacobi, 2 )
call pthread_create( t3, NULL, jacobi, 3 )
call jacobi( 4 )
call pthread_join( t1, NULL )
call pthread_join( t2, NULL )
call pthread_join( t3, NULL )
pthreads subroutine
subroutine jacobi( threadnum )
lchange = 0
do j = threadnum+1, n-1, numthreads
do i = 2, m-1
newa(i,j) = w0*a(i,j) + &
w1 * (a(i-1,j) + a(i,j-1) + &
a(i+1,j) + a(i,j+1)) + &
w2 * (a(i-1,j-1) + a(i-1,j+1) + &
a(i+1,j-1) + a(i+1,j+1))
lchange = max(lchange,abs(newa(i,j)-a(i,j)))
enddo
enddo
call pthread_mutex_lock( lck )
change = max(change,lchange)
call pthread_mutex_unlock( lck )
end subroutine
Jacobi Relaxation with OpenMP directives
change = 0
!$omp parallel private(i,j)
!$omp do reduction(max:change)
do j = 2, n-1
do i = 2, m-1
newa(i,j) = w0*a(i,j) + &
w1 * (a(i-1,j) + a(i,j-1) + &
a(i+1,j) + a(i,j+1)) + &
w2 * (a(i-1,j-1) + a(i-1,j+1) + &
a(i+1,j-1) + a(i+1,j+1))
change = max(change,abs(newa(i,j)-a(i,j)))
enddo
enddo
!$omp end parallel
Behind the Scenes
Compiler generates code for N threads:
split up the iterations across N threads
accumulate N partial sums (no synchronization)
accumulate final sum as threads complete
Assumptions
uniform memory access costs
coherent cache mechanism
More Behind the Scenes
Virtualization penalties
load balancing
cache locality
vectorization within the threads
thread management
loop scheduling (which thread does what iteration)
NUMA memory access penalty
Parallel Programming on GPUs
High degree of regular parallelism
lots of scalar threads
threads organized into thread groups / blocks
SIMD, pseudo-SIMD
thread groups organized into grid
MIMD
Languages
CUDA, OpenCL, graphics: OpenGL, DirectX
may include vector datatypes (float4, int2)
CAPS HMPP
Platforms
ArBB (formerly RapidMind, now owned by Intel)
Thrust library
GPU Programming
Allocate data on the GPU
Move data from host, or initialize data on GPU
Launch kernel(s)
GPU driver can generate ISA code at runtime
preserves forward compatibility without requiring
ISA compatibility
Gather results from GPU
Deallocate data
Appropriate GPU programs
Characterized by nested parallel loops
High compute intensity
Regular data access
Isolated host/GPU data movement
Behind the Scenes
What you write is what you get
Implicitly parallel
threads into warps
warps into thread groups
thread groups into a grid
Hardware thread scheduler
Highly multithreaded
CUDA C and CUDA Fortran
Simple introductory program
Programming model
Low-level Programming with CUDA
Building CUDA programs
VADD on Host
subroutine host_vadd(A,B,C,N)
real(4) :: A(N), B(N), C(N)
integer :: N, i
do i = 1,N
C(i) = A(i) + B(i)
enddo
end subroutine
void host_vadd( float* A, float* B, float* C, int n ){
int i;
for( i = 0; i < n; ++i ){
C[i] = A[i] + B[i];
}
}
CUDA C VADD Device Code
__global__ void vaddkernel( float* A, float* B,
float* C, int n ){
int i;
i = blockIdx.x*blockDim.x + threadIdx.x;
if( i < n ) C[i] = A[i] + B[i];
}
CUDA Fortran VADD Device Code
module kmod
use cudafor
contains
attributes(global) subroutine vaddkernel(A,B,C,N)
real(4), device :: A(N), B(N), C(N)
integer, value :: N
integer :: i
i = (blockidx%x-1)*blockdim%x + threadidx%x
if( i <= N ) C(i) = A(i) + B(i)
end subroutine
end module
CUDA C VADD Host Code
void vadd( float* A, float* B, float* C, int n ){
float *Ad, *Bd, *Cd;
cudaMalloc( (void**)&Ad, n*sizeof(float) );
cudaMalloc( (void**)&Bd, n*sizeof(float) );
cudaMalloc( (void**)&Cd, n*sizeof(float) );
cudaMemcpy( Ad, A, n*sizeof(float),
cudaMemcpyHostToDevice );
cudaMemcpy( Bd, B, n*sizeof(float),
cudaMemcpyHostToDevice );
vaddkernel<<< (n+31)/32, 32 >>>( Ad, Bd, Cd, n );
cudaMemcpy( C, Cd, n*sizeof(float),
cudaMemcpyDeviceToHost );
cudaFree( Ad ); cudaFree( Bd ); cudaFree( Cd );
}
CUDA Fortran VADD Host Code
subroutine vadd( A, B, C )
use kmod
real(4), dimension(:) :: A, B, C
real(4), device, allocatable, dimension(:):: &
Ad, Bd, Cd
integer :: N
N = size( A, 1 )
allocate( Ad(N), Bd(N), Cd(N) )
Ad = A(1:N)
Bd = B(1:N)
call vaddkernel<<<(N+31)/32,32>>>( Ad, Bd, Cd, N )
C(1:N) = Cd
deallocate( Ad, Bd, Cd )
end subroutine
CUDA Programming
Host code
Optional: select a GPU
Allocate device memory
Copy data to device memory
Launch kernel(s)
Copy data from device memory
Deallocate device memory
Device code
Scalar thread code, limited operations
Implicitly parallel
threads within a thread block, essentially SIMD
thread blocks within a grid, MIMD parallelism
CUDA Programming: the GPU
A scalar program, runs on one thread
All threads run the same code
Executed in thread groups
grid may be 1D or 2D (max 65535x65535)
3D in CUDA 4.0 (Fermi)
thread block may be 1D, 2D, or 3D
max total size 512 (Tesla) or 1024 (Fermi)
blockidx gives block index in grid (x,y)
threadidx gives thread index within block (x,y,z)
Kernel runs implicitly in parallel
thread blocks scheduled by hardware on any multiprocessor
runs to completion before next kernel
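For example, a 2D kernel computes its global indices from blockidx and threadidx; a minimal CUDA Fortran sketch (the routine name, array, and bounds are illustrative, not from the tutorial; it belongs in a module such as kmod above):
! each thread computes one (i,j) element of a 2D array
attributes(global) subroutine scale2d( a, m, n )
  real(4), device :: a(m,n)
  integer, value :: m, n
  integer :: i, j
  i = (blockidx%x-1)*blockdim%x + threadidx%x   ! global index, 1st grid dimension
  j = (blockidx%y-1)*blockdim%y + threadidx%y   ! global index, 2nd grid dimension
  if( i <= m .and. j <= n ) a(i,j) = 2.0 * a(i,j)
end subroutine
It would be launched with a 2D grid and 2D thread blocks, e.g. call scale2d<<<dim3((m+15)/16,(n+15)/16,1),dim3(16,16,1)>>>( Ad, m, n ).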
Directive-Based GPU Programming
Simple introductory program
Programming model
Building Accelerator programs
High-level Programming
Jacobi Relaxation (Fortran)
change = tolerance + 1.0
do while(change > tolerance)
change = 0
do i = 2, m-1
do j = 2, n-1
newa(i,j) = w0*a(i,j) + &
w1 * (a(i-1,j)+a(i,j-1)+a(i+1,j)+a(i,j+1)) + &
w2 * (a(i-1,j-1)+a(i-1,j+1)+a(i+1,j-1)+a(i+1,j+1))
change = max(change,abs(newa(i,j)-a(i,j)))
enddo
enddo
do i = 2, m-1
do j = 2, n-1
a(i,j) = newa(i,j)
enddo
enddo
enddo
Jacobi Relaxation (C)
do{
change = 0;
for( j = 1; j < n-1; ++j ){
for( i = 1; i < m-1; ++i ){
newa[j][i] = w0*a[j][i] +
w1 * (a[j][i-1] + a[j-1][i] +
a[j][i+1] + a[j+1][i]) +
w2 * (a[j-1][i-1] + a[j+1][i-1] +
a[j-1][i+1] + a[j+1][i+1]);
change = fmax(change,fabs(newa[j][i]-a[j][i]));
}
}
tmp = a; a = newa; newa = tmp;
}while( change > tolerance );
Jacobi Relaxation (OpenMP Fortran)
change = tolerance + 1.0
!$omp parallel shared(change)
do while(change > tolerance)
change = 0
!$omp do reduction(max:change) private(i,j)
do i = 2, m-1
do j = 2, n-1
newa(i,j) = w0*a(i,j) + &
w1 * (a(i-1,j)+a(i,j-1)+a(i+1,j)+a(i,j+1)) + &
w2 * (a(i-1,j-1)+a(i-1,j+1)+a(i+1,j-1)+a(i+1,j+1))
change = max(change,abs(newa(i,j)-a(i,j)))
enddo
enddo
!$omp do private(i,j)
do i = 2, m-1
do j = 2, n-1
a(i,j) = newa(i,j)
enddo
enddo
enddo
Jacobi Relaxation (OpenMP C)
#pragma omp parallel shared(change) private(tmp)
do{
change = 0;
#pragma omp for private(i,j) reduction(max:change)
for( j = 1; j < n-1; ++j ){
for( i = 1; i < m-1; ++i ){
newa[j][i] = w0*a[j][i] +
w1 * (a[j][i-1] + a[j-1][i] +
a[j][i+1] + a[j+1][i]) +
w2 * (a[j-1][i-1] + a[j+1][i-1] +
a[j-1][i+1] + a[j+1][i+1]);
change = fmax(change,fabs(newa[j][i]-a[j][i]));
}
}
tmp = a; a = newa; newa = tmp;
}while( change > tolerance );
Jacobi Relaxation (OpenACC Fortran)
change = tolerance + 1.0
!$acc data create(newa(1:m,1:n)) copy(a(1:m,1:n))
do while(change > tolerance)
change = 0
!$acc kernels
!$acc loop reduction(max:change)
do i = 2, m-1
do j = 2, n-1
newa(i,j) = w0*a(i,j) + &
w1 * (a(i-1,j)+a(i,j-1)+a(i+1,j)+a(i,j+1)) + &
w2 * (a(i-1,j-1)+a(i-1,j+1)+a(i+1,j-1)+a(i+1,j+1))
change = max(change,abs(newa(i,j)-a(i,j)))
enddo
enddo
do i = 2, m-1
do j = 2, n-1
a(i,j) = newa(i,j)
enddo
enddo
!$acc end kernels
enddo
!$acc end data
Jacobi Relaxation (OpenACC C)
#pragma acc data create(newa[0:n][0:m])\
copy(a[0:n][0:m])
do{
change = 0;
#pragma acc kernels
#pragma acc loop private(i) reduction(max:change)
for( j = 1; j < n-1; ++j ){
for( i = 1; i < m-1; ++i ){
newa[j][i] = w0*a[j][i] +
w1 * (a[j][i-1] + a[j-1][i] +
a[j][i+1] + a[j+1][i]) +
w2 * (a[j-1][i-1] + a[j+1][i-1] +
a[j-1][i+1] + a[j+1][i+1]);
change = fmax(change,fabs(newa[j][i]-a[j][i]));
}
}
tmp = a; a = newa; newa = tmp;
}while( change > tolerance );
Why use Accelerator Directives?
Productivity
Higher level programming model
a la OpenMP
Portability
ignore directives, portable to the host
portable to other accelerators
performance portability
Performance feedback
Downsides
it’s not as easy as inserting a few directives
a good host algorithm is not necessarily a good GPU algorithm
Basic Syntactic Concepts
Fortran accelerator directive syntax
!$acc directive [clause]...
& continuation
Fortran-77 syntax rules
!$acc or C$acc or *$acc in columns 1-5
continuation with nonblank in column 6
C accelerator directive syntax
#pragma acc directive [clause]... eol
continue to next line with backslash
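For example, a continued free-form Fortran directive looks like this (a sketch; the arrays and clauses are placeholders):
!$acc data copyin(a(1:n)) &
!$acc      copyout(r(1:n))
!$acc kernels
do i = 1, n
  r(i) = 2.0 * a(i)
enddo
!$acc end kernels
!$acc end data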
Construct
a construct is a single-entry/single-exit region of code
in Fortran, delimited by begin/end directives
in C, a single statement, or {...} region
no jumps into/out of region, no return
compute region contains loops to send to GPU
loop iterations translated to GPU threads
loop indices become threadidx/blockidx indices
data region encloses compute regions
data moved at region boundaries
Construct vs. Region
construct is the lexical instance in the program
region is the dynamic instance when running the program
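A minimal sketch of a data construct enclosing a compute construct (array names are placeholders):
!$acc data copyin(a(1:n)) copyout(b(1:n)) ! data construct begins: a moved to GPU
!$acc kernels                             ! compute construct: loop becomes GPU threads
do i = 1, n
  b(i) = a(i) + 1.0
enddo
!$acc end kernels
!$acc end data                            ! data construct ends: b moved back to host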
Appropriate Algorithm
Nested parallel loops
iterations map to threads
parallelism means threads are independent
nested loops means lots of parallelism
Regular array indexing
allows for stride-1 array fetches
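In Fortran the first subscript is the stride-1 dimension, so the innermost loop should run over it; a sketch:
do j = 2, n-1
  do i = 2, m-1        ! i varies fastest: stride-1 fetches of a
    newa(i,j) = 0.25 * (a(i-1,j) + a(i+1,j) + a(i,j-1) + a(i,j+1))
  enddo
enddo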
Jacobi Relaxation (OpenACC kernels)
#pragma acc data create(newa[0:n][0:m])\
copy(a[0:n][0:m])
do{
change = 0;
#pragma acc kernels
#pragma acc loop private(i) reduction(max:change)
for( j = 1; j < n-1; ++j ){
for( i = 1; i < m-1; ++i ){
newa[j][i] = w0*a[j][i] +
w1 * (a[j][i-1] + a[j-1][i] +
a[j][i+1] + a[j+1][i]) +
w2 * (a[j-1][i-1] + a[j+1][i-1] +
a[j-1][i+1] + a[j+1][i+1]);
change = fmax(change,fabs(newa[j][i]-a[j][i]));
}
}
tmp = a; a = newa; newa = tmp;
}while( change > tolerance );
Jacobi Relaxation (OpenACC parallel)
#pragma acc data create(newa[0:n][0:m])\
copy(a[0:n][0:m])
do{
change = 0;
#pragma acc parallel private(i,j) reduction(max:change)
{
#pragma acc loop
for( j = 1; j < n-1; ++j ){
for( i = 1; i < m-1; ++i ){
newa[j][i] = w0*a[j][i] +
w1 * (a[j][i-1] + a[j-1][i] +
a[j][i+1] + a[j+1][i]) +
w2 * (a[j-1][i-1] + a[j+1][i-1] +
a[j-1][i+1] + a[j+1][i+1]);
change = fmax(change,fabs(newa[j][i]-a[j][i]));
}
}
}
tmp = a; a = newa; newa = tmp;
}while( change > tolerance );
Behind the Scenes
compiler determines parallelism (kernels)
or applies parallelism (parallel)
compiler generates thread code
split up the iterations into threads, thread groups
inserts code to use software data cache
accumulate partial sum
second kernel to combine final sum
compiler also inserts data movement
compiler or user determines what data to move
data moved at boundaries of data/compute region
Behind the Scenes
virtualization penalties
fine grain control
thread scheduling
shared memory usage
loop unrolling
Building Accelerator Programs
pgfortran -acc a.f90
pgcc -acc a.c
Other options:
-ta=nvidia[,cc1x|cc2x|cc3x]
default in siterc file:
set COMPUTECAP=30;
-ta=nvidia[,cuda5.0]
default in siterc file:
set DEFCUDAVERSION=5.0;
-ta=nvidia,fastmath,nofma
Enable compiler feedback with -Minfo or -Minfo=accel
Program Execution Model
Host
executes most of the program
allocates accelerator memory
initiates data copy from host memory to accelerator
sends kernel code to accelerator
queues kernels for execution on accelerator
waits for kernel completion
initiates data copy from accelerator to host memory
deallocates accelerator memory
Accelerator
executes kernels, one after another
concurrently, may transfer data between host and accelerator
Controlling Data Movement
data clauses
data construct and data region
update directive
asynchronous data movement, the wait directive
Data Clauses
C
#pragma acc data copyin(a[0:n]) copyout(r[0:n])
{
....
}
Fortran
!$acc data copyin(a(1:n)) copyout(r(1:n))
...
!$acc end data
Data Clauses
C
#pragma acc data copyin(a[0:n][0:m]) \
            copy(r[0:n][0:m])
{ ... }
Fortran
!$acc data copyin(a(1:m,1:n)) copy(r(:,:))
...
!$acc end data
Data Clauses
C
#pragma acc data copyin(a[0:n][0:m]) \
            create(r[0:n][0:m])
{ ... }
Fortran
!$acc data copyin(a(1:m,1:n)) create(r)
...
!$acc end data
Data Clauses
C
#pragma acc data copyin(a[0:n][0:m]) \
            create(r[1:n-2][1:m-2])
{ ... }
Fortran
!$acc data copyin(a(1:m,1:n)) &
!$acc      create(r(2:m-1,2:n-1))
...
!$acc end data
Data Clauses
C
#pragma acc data copyin(a[0:n][0:m])
/* data copied to Accelerator here */
{ ... }
/* device copy freed here (copyin does not copy back) */
Fortran
!$acc data copyin(a(1:m,1:n))
! data copied to Accelerator here
...
! device copy freed here (copyin does not copy back)
!$acc end data
Data Clauses
C
#pragma acc data copyin(a[0:n][0:m]) \
            present(r[1:n-2][1:m-2])
{ ... }
Fortran
!$acc data copyin(a(1:m,1:n)) &
!$acc      present(r(2:m-1,2:n-1))
...
!$acc end data
Data Clauses
C
#pragma acc data present_or_copyin(a[0:n]) \
            present(r[1:n-2])
{ ... }
Fortran
!$acc data present_or_copyin(a(1:m)) &
!$acc      present(r(2:m-1))
...
!$acc end data
Data Clauses
C
#pragma acc data pcopyin(a[0:n]) \
            present(r[1:n-2])
{ ... }
Fortran
!$acc data pcopyin(a(1:m)) &
!$acc      present(r(2:m-1))
...
!$acc end data
Data Clauses
copy( list )
copyin( list )
copyout( list )
create( list )
present( list )
present_or_copy( list ) pcopy( list )
present_or_copyin( list ) pcopyin( list )
present_or_copyout( list ) pcopyout( list )
present_or_create( list ) pcreate( list )
deviceptr( list )
Data Region
C
#pragma acc data data-clauses if( condition )
{
....
}
Fortran
!$acc data data-clauses if( condition )
....
!$acc end data
May be nested and may contain compute regions
May not be nested within a compute region
May contain procedure calls
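A sketch of the if clause, which keeps execution on the host when the problem is too small to pay for data movement (the threshold here is an assumption):
!$acc data copy(a(1:n)) if(n > 100000)
!$acc kernels if(n > 100000)
do i = 1, n
  a(i) = 2.0 * a(i)
enddo
!$acc end kernels
!$acc end data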
Update Directive
C
#pragma acc update host( list )
#pragma acc update device( list )
Fortran
!$acc update host( list )
!$acc update device( list )
data must be in a data clause for an enclosing data region
implies present( list )
both may be on a single line
update host( list ) device( list )
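For example, inside an iterative data region an update can move just a boundary column each step (a sketch; fix_boundary is a hypothetical host routine):
!$acc data copy(a(1:m,1:n))
do iter = 1, niter
  !$acc kernels
  do j = 2, n-1
    do i = 2, m-1
      a(i,j) = 0.5 * a(i,j)
    enddo
  enddo
  !$acc end kernels
  !$acc update host( a(1:m,1) )    ! copy one boundary column back to the host
  call fix_boundary( a, m, n )     ! hypothetical host-side boundary work
  !$acc update device( a(1:m,1) )  ! push the modified column to the device
enddo
!$acc end data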
Asynchronous Update Directive
C
#pragma acc update host( list ) async(1)
#pragma acc update device( list ) async(2)
Fortran
!$acc update host( list ) async(3)
!$acc update device( list ) async(n)
async value should be >= 0
mapped down to some number of actual async queues
updates with same value will execute in program order
async with no value is same as async(acc_async_noval)
async(acc_async_sync) is same as no async clause
host program may continue
Wait Directive
C
#pragma acc wait( 2 )
#pragma acc wait
Fortran
!$acc wait( n, n-1 )
!$acc wait
wait with no values waits for ALL asynchronous queues
wait with multiple values waits for each queue
NO implied wait at end of data construct
async updates in a data region require wait
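A sketch combining an asynchronous update with a wait (overlap_host_work is a hypothetical routine):
!$acc update device( a(1:n) ) async(1)  ! transfer is queued; host continues
call overlap_host_work()                ! host work overlapped with the copy
!$acc wait( 1 )                         ! block until queue 1 has drained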
Data Regions Across Procedures
subroutine sub( a, b )
real :: a(:), b(:)
!$acc kernels copyin(b)
do i = 1,n
a(i) = a(i) * b(i)
enddo
!$acc end kernels
...
end subroutine
subroutine bus(x, y)
real :: x(:), y(:)
!$acc data copy(x)
call sub( x, y )
!$acc end data
end subroutine
Data Regions Across Procedures
void sub( float* a, float* b, int n ){
int i;
#pragma acc kernels copyin(b[0:n])
for( i = 0; i < n; ++i )
a[i] *= b[i];
...
}
void bus( float* x, float* y, int n ){
#pragma acc data copy(x[0:n])
{
sub( x, y, n );
}
...
}
Loop Directive
C
#pragma acc loop clause...
for( i = 0; i < n; ++i ){
....
}
Fortran
!$acc loop clause...
do i = 1, n
Kernels Loop Scheduling Clauses
!$acc loop gang
runs in ‘gang’ mode only (blockIdx)
does not declare that the loop is in fact parallel (use independent)
!$acc loop gang(32)
runs in ‘gang’ mode only with gridDim == 32 (32 thread blocks)
!$acc loop vector(128)
runs in ‘vector’ mode (threadIdx) with blockDim == 128 (128 threads)
vector size, if present, must be a compile-time constant
!$acc loop gang vector(128)
strip-mines the loop
inner loop runs in vector mode, 128 threads (threadIdx)
outer loop runs across thread blocks (blockIdx)
Loop Scheduling Clauses
Want stride-1 loop to be in ‘vector’ mode (threadIdx)
look at -Minfo messages!
Want lots of parallelism
Loop Directive Clauses
Scheduling Clauses
vector or vector(n)
gang or gang(n)
worker or worker(n)
independent
use with care: overrides compiler analysis of dependences and private variables
private( list )
private data for each iteration of the loop
different from create/local: each iteration gets its own private copy
reduction( red:var )
reduction across the loop
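A sketch using independent, private, and reduction together (array names are placeholders):
!$acc kernels
!$acc loop independent private(t) reduction(max:err)
do i = 1, n
  t = abs(b(i) - a(i))   ! t is private: one copy per iteration
  err = max(err, t)      ! max-reduction combined across all iterations
enddo
!$acc end kernels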
Compiler Feedback Messages
Data related
Generating copyin(b(1:n,1:m))
Generating copyout(b(2:n-1,2:m-1))
Generating copy(a(1:n,1:n))
Generating create(c(1:n,1:n))
Loop or kernel related
Loop is parallelizable
Accelerator kernel generated
Barriers to GPU code generation
No parallel kernels found, accelerator region ignored
Loop carried dependence due to exposed use of ... prevents parallelization
Parallelization would require privatization of array ...
Compiler Messages Continued
Memory optimization related
Cached references to size [(x+2)x(y+2)] block of ‘b’
Non-stride-1 memory accesses for ‘a’
Combined Directives
C #pragma acc kernels loop copyin(a[0:n])\
gang vector(32)
for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f;
Fortran
!$acc kernels loop copyin(a(1:n)) &
gang vector(32)
do i = 1,n
r(i) = a(i) * 2.0
enddo
Parallel Region
C #pragma acc parallel loop copyin(a[0:n])
for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f;
Fortran
!$acc parallel loop copyin(a(1:n))
do i = 1,n
r(i) = a(i) * 2.0
enddo
Selecting Device
C
#include <accel.h>
n = acc_get_num_devices(acc_device_nvidia);
...
acc_set_device_num( 1, acc_device_nvidia);
Fortran
use accel_lib
n = acc_get_num_devices( acc_device_nvidia )
...
call acc_set_device_num( 0, acc_device_nvidia)
Environment Variable
setenv ACC_DEVICE_NUM 1
export ACC_DEVICE_NUM=1
Performance Profiling
TAU (Tuning and Analysis Utilities, University of Oregon)
collects performance information
cudaprof (NVIDIA)
gives a trace of kernel execution
PGI_ACC_TIME environment variable
dump of region-level and kernel-level performance
upload/download data movement time
kernel execution time
pgcollect a.out
PGI_ACC_PROFILE environment variable
enables profile data collection for accelerator regions
ACC_NOTIFY environment variable
prints one line for each kernel invocation
Performance Profiling
Accelerator Kernel Timing data
f3.f90
smooth
24: region entered 1 time
time(us): total=1116701 init=1115986 region=715
kernels=22 data=693
w/o init: total=715 max=715 min=715 avg=715
27: kernel launched 5 times
grid: [7x7] block: [16x16]
time(us): total=17 max=10 min=1 avg=3
34: kernel launched 5 times
grid: [7x7] block: [16x16]
time(us): total=5 max=1 min=1 avg=1
Directives Summary
acc kernels
if(condition)
copy(list) copyin(list) copyout(list) create(list)
acc parallel
if(condition)
num_gangs(n) num_workers(n) vector_length(n)
acc data
copy(list) copyin(list) copyout(list) create(list)
if(condition)
acc update device(list) host(list)
acc loop
gang(width) worker(width) vector(width)
independent
private(list)
reduction(red:var)
Runtime Summary
acc_get_num_devices( acc_device_nvidia )
acc_set_device_num( n, acc_device_nvidia )
acc_set_device( acc_device_nvidia | acc_device_host )
acc_get_device()
acc_init( acc_device_nvidia | acc_device_host )
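A sketch tying the runtime calls together (the device choice is an assumption):
use accel_lib
integer :: ndev
ndev = acc_get_num_devices( acc_device_nvidia )
if( ndev > 0 ) then
  call acc_init( acc_device_nvidia )               ! pay GPU startup cost up front
  call acc_set_device_num( 0, acc_device_nvidia )  ! select the first device
else
  call acc_init( acc_device_host )                 ! fall back to host execution
endif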
Command Line Summary
-ta=nvidia
-ta=nvidia,nofma
-ta=nvidia,fastmath
-ta=nvidia,[cc1x|cc2x|cc3x] (multiple allowed)
-ta=nvidia,maxregcount:n
-ta=nvidia,cuda5.0
Accelerator Programming Timeline
late 1990s – GPGPU Programming
early 2000s – Brook project (and others)
2007 – NVIDIA CUDA release, SC07 tutorial
HiCUDA, HMPP (and others)
EXOCHI: PLDI 2007 (Intel)
2008 – PGI Accelerator Programming Model
Larrabee paper at SIGGRAPH (Intel)
2009 – OpenMP BoF at SC09
Programming Model for Heterogeneous x86: PLDI 2009 (Intel)
Larrabee featured in Justin Rattner's (Intel) opening address
2010 – OpenMP Accelerator committee
2011 – OpenACC API 1.0
Language Extensions for Offload (LEO, Intel)
Where to get help
• PGI Customer Support - trs@pgroup.com
• PGI User's Forum - http://www.pgroup.com/userforum/index.php
• PGI Articles - http://www.pgroup.com/resources/articles.htm
http://www.pgroup.com/resources/accel.htm
• PGI User's Guide - http://www.pgroup.com/doc/pgiug.pdf
• CUDA Fortran Reference Guide - http://www.pgroup.com/doc/pgicudafortug.pdf
OpenACC Members
CAPS Entreprise
Cray, Inc.
NVIDIA
The Portland Group, Inc.
OpenACC Supporters
Allinea
CSCS (Switzerland)
Oak Ridge National Lab
Tech. Univ. Dresden
Univ. Houston
Sandia
Georgia Tech
Rogue Wave